All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
@ 2023-02-20 19:37 Alexander Mikhalitsyn
  2023-02-20 19:37 ` [RFC PATCH 1/9] fuse: move FUSE_DEFAULT_* defines to fuse common header Alexander Mikhalitsyn
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 5672 bytes --]

Hello everyone,

It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.

Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]

The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
restorable process tree, but instead use a "fake daemon".

This leads to many other technical problems:
* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
like no_open, no_flush, readahead, and so on).
* each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.

We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...

My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
because we have to restart all the containers (fuse bindmounts become invalid after the crash).
The solution is fairly easy:
allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
and other stuff, we are only interested in mounts.

Current PoC implementation was developed and tested with this "recovery case".
Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.

Brief API/design description.

I've added two ioctl's:
* ioctl(FUSE_DEV_IOC_REINIT)
Performs fuse connection abort and then reinitializes all internal fuse structures as "brand new".
Then sends a FUSE_INIT request, so a new userspace daemon can reply to it and start processing fuse reqs.

* ioctl(FUSE_DEV_IOC_BM_REVAL)
A little bit hacky thing. Traverses all the bindmounts of existing fuse mount and performs relookup
of (struct vfsmount)->mnt_root dentries with the new daemon and reset them to new dentries.
Pretty useful for the recovery case (for instance, LXCFS).

Now about the dentry/inode invalidation mechanism.
* added the "fuse connection generation" concept.
When reinit is performed then connection generation is increased by 1.
Each fuse inode stores the generation of the connection it was allocated with.

* perform dentry revalidation if it has an old generation [fuse_dentry_revalidate]
The current implementation of fuse_dentry_revalidate follows a simple and elegant idea. When we
want to revalidate the dentry we just send a FUSE_LOOKUP request to the userspace
for the parent dentry with the name of the current dentry and check which attributes/inode id
it gets. If inode ids are the same and attributes (provided by the userspace) are valid
then we mark dentry valid and it continues to live (with inode).
I've only added a connection generation check to the condition when we have to perform revalidation
and added an inode connection generation reset (to actual connection gen) if the new userspace
daemon has looked up the same inode id (important for the CRIU case!).

Thank you for your attention and I'm waiting for your feedback :)

References:
[1] Support FUSE mountpoints https://github.com/checkpoint-restore/criu/issues/53
[2] Bringing up FUSE mounts C/R support https://lpc.events/event/16/contributions/1243/
[3] LXCFS https://github.com/lxc/lxcfs

Kind regards,
Alex

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org

Alexander Mikhalitsyn (9):
  fuse: move FUSE_DEFAULT_* defines to fuse common header
  fuse: add const qualifiers to common fuse helpers
  fuse: add fuse connection generation
  fuse: handle stale inode connection in fuse_queue_forget
  fuse: move fuse connection flags to the separate structure
  fuse: take fuse connection generation into account
  fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
  namespace: add sb_revalidate_bindmounts helper
  fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL)

 fs/fuse/acl.c                 |   6 +-
 fs/fuse/dev.c                 | 167 +++++++++++++++++++-
 fs/fuse/dir.c                 |  38 ++---
 fs/fuse/file.c                |  85 +++++-----
 fs/fuse/fuse_i.h              | 281 ++++++++++++++++++++--------------
 fs/fuse/inode.c               |  62 ++++----
 fs/fuse/readdir.c             |   8 +-
 fs/fuse/xattr.c               |  18 +--
 fs/namespace.c                |  90 +++++++++++
 include/linux/mnt_namespace.h |   3 +
 include/uapi/linux/fuse.h     |   2 +
 11 files changed, 531 insertions(+), 229 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 1/9] fuse: move FUSE_DEFAULT_* defines to fuse common header
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-02-20 19:37 ` [RFC PATCH 2/9] fuse: add const qualifiers to common fuse helpers Alexander Mikhalitsyn
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 1714 bytes --]

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/fuse/fuse_i.h | 6 ++++++
 fs/fuse/inode.c  | 6 ------
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e13a8eff2e3d..4be7c404da4b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,6 +47,12 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
+/** Maximum number of outstanding background requests */
+#define FUSE_DEFAULT_MAX_BACKGROUND 12
+
+/** Congestion starts at 75% of maximum */
+#define FUSE_DEFAULT_CONGESTION_THRESHOLD (FUSE_DEFAULT_MAX_BACKGROUND * 3 / 4)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 114bdb3f7ccb..097b7049057e 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -53,12 +53,6 @@ MODULE_PARM_DESC(max_user_congthresh,
 
 #define FUSE_DEFAULT_BLKSIZE 512
 
-/** Maximum number of outstanding background requests */
-#define FUSE_DEFAULT_MAX_BACKGROUND 12
-
-/** Congestion starts at 75% of maximum */
-#define FUSE_DEFAULT_CONGESTION_THRESHOLD (FUSE_DEFAULT_MAX_BACKGROUND * 3 / 4)
-
 #ifdef CONFIG_BLOCK
 static struct file_system_type fuseblk_fs_type;
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 2/9] fuse: add const qualifiers to common fuse helpers
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
  2023-02-20 19:37 ` [RFC PATCH 1/9] fuse: move FUSE_DEFAULT_* defines to fuse common header Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-02-20 19:37 ` [RFC PATCH 3/9] fuse: add fuse connection generation Alexander Mikhalitsyn
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 2592 bytes --]

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/fuse/fuse_i.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4be7c404da4b..d5fc2d89ff1c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -870,32 +870,32 @@ struct fuse_mount {
 	struct list_head fc_entry;
 };
 
-static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
+static inline struct fuse_mount *get_fuse_mount_super(const struct super_block *sb)
 {
 	return sb->s_fs_info;
 }
 
-static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
+static inline struct fuse_conn *get_fuse_conn_super(const struct super_block *sb)
 {
 	return get_fuse_mount_super(sb)->fc;
 }
 
-static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
+static inline struct fuse_mount *get_fuse_mount(const struct inode *inode)
 {
 	return get_fuse_mount_super(inode->i_sb);
 }
 
-static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
+static inline struct fuse_conn *get_fuse_conn(const struct inode *inode)
 {
 	return get_fuse_mount_super(inode->i_sb)->fc;
 }
 
-static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
+static inline struct fuse_inode *get_fuse_inode(const struct inode *inode)
 {
 	return container_of(inode, struct fuse_inode, inode);
 }
 
-static inline u64 get_node_id(struct inode *inode)
+static inline u64 get_node_id(const struct inode *inode)
 {
 	return get_fuse_inode(inode)->nodeid;
 }
@@ -905,7 +905,7 @@ static inline int invalid_nodeid(u64 nodeid)
 	return !nodeid || nodeid == FUSE_ROOT_ID;
 }
 
-static inline u64 fuse_get_attr_version(struct fuse_conn *fc)
+static inline u64 fuse_get_attr_version(const struct fuse_conn *fc)
 {
 	return atomic64_read(&fc->attr_version);
 }
@@ -923,7 +923,7 @@ static inline void fuse_make_bad(struct inode *inode)
 	set_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state);
 }
 
-static inline bool fuse_is_bad(struct inode *inode)
+static inline bool fuse_is_bad(const struct inode *inode)
 {
 	return unlikely(test_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state));
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 3/9] fuse: add fuse connection generation
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
  2023-02-20 19:37 ` [RFC PATCH 1/9] fuse: move FUSE_DEFAULT_* defines to fuse common header Alexander Mikhalitsyn
  2023-02-20 19:37 ` [RFC PATCH 2/9] fuse: add const qualifiers to common fuse helpers Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-02-20 19:37 ` [RFC PATCH 4/9] fuse: handle stale inode connection in fuse_queue_forget Alexander Mikhalitsyn
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 3605 bytes --]

We will use connection generation to detect stale inodes
from the "old" fuse daemon and invalidate/revalidate them.

There is no functional changes.

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/fuse/file.c   |  1 +
 fs/fuse/fuse_i.h | 29 +++++++++++++++++++++++++++++
 fs/fuse/inode.c  |  2 ++
 3 files changed, 32 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3eb28aad5674..6d89d4dd4c55 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -78,6 +78,7 @@ struct fuse_file *fuse_file_alloc(struct fuse_mount *fm)
 	init_waitqueue_head(&ff->poll_wait);
 
 	ff->kh = atomic64_inc_return(&fm->fc->khctr);
+	ff->conn_gen = READ_ONCE(fm->fc->conn_gen);
 
 	return ff;
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d5fc2d89ff1c..ccd7c145de94 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -161,6 +161,9 @@ struct fuse_inode {
 	 */
 	struct fuse_inode_dax *dax;
 #endif
+
+	/** Fuse connection (fuse_conn) generation when inode was allocated */
+	u32 conn_gen;
 };
 
 /** FUSE inode state bits */
@@ -232,6 +235,9 @@ struct fuse_file {
 
 	/** Has flock been performed on this file? */
 	bool flock:1;
+
+	/** Fuse connection (fuse_conn) generation when file was allocated */
+	u32 conn_gen;
 };
 
 /** One input argument of a request */
@@ -847,6 +853,18 @@ struct fuse_conn {
 
 	/* New writepages go into this bucket */
 	struct fuse_sync_bucket __rcu *curr_bucket;
+
+	/**
+	 * Connection generation.
+	 * Used to determine if inodes/files were created with an "old"
+	 * fuse connection and have to be invalidated. So, all requests
+	 * related to these inodes should fail with -EIO.
+	 *
+	 * CHECKME: do we really need conn_gen for struct fuse_file?
+	 * Right now it's only needed for fuse_file_put(), where we have
+	 * no access to the inode in some cases.
+	 */
+	u32 conn_gen;
 };
 
 /*
@@ -910,6 +928,17 @@ static inline u64 fuse_get_attr_version(const struct fuse_conn *fc)
 	return atomic64_read(&fc->attr_version);
 }
 
+static inline bool fuse_stale_ff(const struct fuse_file *ff)
+{
+	return unlikely(READ_ONCE(ff->fm->fc->conn_gen) != ff->conn_gen);
+}
+
+static inline bool fuse_stale_inode_conn(const struct inode *inode)
+{
+	return unlikely(READ_ONCE(get_fuse_conn(inode)->conn_gen) !=
+			get_fuse_inode(inode)->conn_gen);
+}
+
 static inline bool fuse_stale_inode(const struct inode *inode, int generation,
 				    struct fuse_attr *attr)
 {
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 097b7049057e..c604434611d9 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -77,6 +77,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	fi->attr_version = 0;
 	fi->orig_ino = 0;
 	fi->state = 0;
+	fi->conn_gen = READ_ONCE(get_fuse_conn_super(sb)->conn_gen);
 	mutex_init(&fi->mutex);
 	spin_lock_init(&fi->lock);
 	fi->forget = fuse_alloc_forget();
@@ -841,6 +842,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
+	fc->conn_gen = 1;
 
 	INIT_LIST_HEAD(&fc->mounts);
 	list_add(&fm->fc_entry, &fc->mounts);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 4/9] fuse: handle stale inode connection in fuse_queue_forget
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
                   ` (2 preceding siblings ...)
  2023-02-20 19:37 ` [RFC PATCH 3/9] fuse: add fuse connection generation Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-03-03 18:47   ` Bernd Schubert
  2023-02-20 19:37 ` [RFC PATCH 5/9] fuse: move fuse connection flags to the separate structure Alexander Mikhalitsyn
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 4122 bytes --]

We don't want to send FUSE_FORGET request to the new
fuse daemon if inode was lookuped by the old fuse daemon
because it can confuse and break userspace (libfuse).

For now, just add a new argument to fuse_queue_forget and
handle it. Adjust all callers to match the old behaviour.

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/fuse/dev.c    | 4 ++--
 fs/fuse/dir.c    | 8 ++++----
 fs/fuse/fuse_i.h | 2 +-
 fs/fuse/inode.c  | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index eb4f88e3dc97..85f69629f34d 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -234,7 +234,7 @@ __releases(fiq->lock)
 }
 
 void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
-		       u64 nodeid, u64 nlookup)
+		       u64 nodeid, u64 nlookup, bool stale_inode_conn)
 {
 	struct fuse_iqueue *fiq = &fc->iq;
 
@@ -242,7 +242,7 @@ void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
 	forget->forget_one.nlookup = nlookup;
 
 	spin_lock(&fiq->lock);
-	if (fiq->connected) {
+	if (fiq->connected && !stale_inode_conn) {
 		fiq->forget_list_tail->next = forget;
 		fiq->forget_list_tail = forget;
 		fiq->ops->wake_forget_and_unlock(fiq);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 7f4b29aa7c79..49d91add08bc 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -250,7 +250,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 			if (outarg.nodeid != get_node_id(inode) ||
 			    (bool) IS_AUTOMOUNT(inode) != (bool) (outarg.attr.flags & FUSE_ATTR_SUBMOUNT)) {
 				fuse_queue_forget(fm->fc, forget,
-						  outarg.nodeid, 1);
+						  outarg.nodeid, 1, false);
 				goto invalid;
 			}
 			spin_lock(&fi->lock);
@@ -405,7 +405,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
 			   attr_version);
 	err = -ENOMEM;
 	if (!*inode) {
-		fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1);
+		fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1, false);
 		goto out;
 	}
 	err = 0;
@@ -692,7 +692,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
 	if (!inode) {
 		flags &= ~(O_CREAT | O_EXCL | O_TRUNC);
 		fuse_sync_release(NULL, ff, flags);
-		fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
+		fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1, false);
 		err = -ENOMEM;
 		goto out_err;
 	}
@@ -817,7 +817,7 @@ static int create_new_entry(struct fuse_mount *fm, struct fuse_args *args,
 	inode = fuse_iget(dir->i_sb, outarg.nodeid, outarg.generation,
 			  &outarg.attr, entry_attr_timeout(&outarg), 0);
 	if (!inode) {
-		fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1);
+		fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1, false);
 		return -ENOMEM;
 	}
 	kfree(forget);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ccd7c145de94..ce154e7ab715 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1008,7 +1008,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
  * Send FORGET command
  */
 void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
-		       u64 nodeid, u64 nlookup);
+		       u64 nodeid, u64 nlookup, bool stale_inode_conn);
 
 struct fuse_forget_link *fuse_alloc_forget(void);
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index c604434611d9..33a108cfcefe 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -124,7 +124,7 @@ static void fuse_evict_inode(struct inode *inode)
 			fuse_dax_inode_cleanup(inode);
 		if (fi->nlookup) {
 			fuse_queue_forget(fc, fi->forget, fi->nodeid,
-					  fi->nlookup);
+					  fi->nlookup, false);
 			fi->forget = NULL;
 		}
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 5/9] fuse: move fuse connection flags to the separate structure
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
                   ` (3 preceding siblings ...)
  2023-02-20 19:37 ` [RFC PATCH 4/9] fuse: handle stale inode connection in fuse_queue_forget Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-03-03 18:26   ` Bernd Schubert
  2023-02-20 19:37 ` [RFC PATCH 6/9] fuse: take fuse connection generation into account Alexander Mikhalitsyn
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 34227 bytes --]

Let's move all the fuse connection flags that can be safely zeroed
after connection reinitialization to the separate structure fuse_conn_flags.

All of these flags values are calculated dynamically basing on
the userspace daemon capabilities (like no_open, no_flush) or on the
response for FUSE_INIT request.

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/fuse/acl.c     |   6 +-
 fs/fuse/dev.c     |   4 +-
 fs/fuse/dir.c     |  26 +++---
 fs/fuse/file.c    |  80 ++++++++---------
 fs/fuse/fuse_i.h  | 225 ++++++++++++++++++++++++----------------------
 fs/fuse/inode.c   |  52 +++++------
 fs/fuse/readdir.c |   8 +-
 fs/fuse/xattr.c   |  18 ++--
 8 files changed, 215 insertions(+), 204 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index a4850aee2639..b27953739de4 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -25,7 +25,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type, bool rcu)
 	if (fuse_is_bad(inode))
 		return ERR_PTR(-EIO);
 
-	if (!fc->posix_acl || fc->no_getxattr)
+	if (!fc->posix_acl || fc->flags.no_getxattr)
 		return NULL;
 
 	if (type == ACL_TYPE_ACCESS)
@@ -42,7 +42,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type, bool rcu)
 	if (size > 0)
 		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
-		 (size == -EOPNOTSUPP && fc->no_getxattr))
+		 (size == -EOPNOTSUPP && fc->flags.no_getxattr))
 		acl = NULL;
 	else if (size == -ERANGE)
 		acl = ERR_PTR(-E2BIG);
@@ -64,7 +64,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct dentry *dentry,
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (!fc->posix_acl || fc->no_setxattr)
+	if (!fc->posix_acl || fc->flags.no_setxattr)
 		return -EOPNOTSUPP;
 
 	if (type == ACL_TYPE_ACCESS)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 85f69629f34d..737764c2295e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -367,7 +367,7 @@ static void request_wait_answer(struct fuse_req *req)
 	struct fuse_iqueue *fiq = &fc->iq;
 	int err;
 
-	if (!fc->no_interrupt) {
+	if (!fc->flags.no_interrupt) {
 		/* Any signal may interrupt this */
 		err = wait_event_interruptible(req->waitq,
 					test_bit(FR_FINISHED, &req->flags));
@@ -1901,7 +1901,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
 		if (nbytes != sizeof(struct fuse_out_header))
 			err = -EINVAL;
 		else if (oh.error == -ENOSYS)
-			fc->no_interrupt = 1;
+			fc->flags.no_interrupt = 1;
 		else if (oh.error == -EAGAIN)
 			err = queue_interrupt(req);
 
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 49d91add08bc..63625c29d6ef 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -707,7 +707,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
 	} else {
 		file->private_data = ff;
 		fuse_finish_open(inode, file);
-		if (fm->fc->atomic_o_trunc && trunc)
+		if (fm->fc->flags.atomic_o_trunc && trunc)
 			truncate_pagecache(inode, 0);
 		else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
 			invalidate_inode_pages2(inode->i_mapping);
@@ -750,12 +750,12 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
 	/* Only creates */
 	file->f_mode |= FMODE_CREATED;
 
-	if (fc->no_create)
+	if (fc->flags.no_create)
 		goto mknod;
 
 	err = fuse_create_open(dir, entry, file, flags, mode, FUSE_CREATE);
 	if (err == -ENOSYS) {
-		fc->no_create = 1;
+		fc->flags.no_create = 1;
 		goto mknod;
 	}
 out_dput:
@@ -1080,14 +1080,14 @@ static int fuse_rename2(struct user_namespace *mnt_userns, struct inode *olddir,
 		return -EINVAL;
 
 	if (flags) {
-		if (fc->no_rename2 || fc->minor < 23)
+		if (fc->flags.no_rename2 || fc->minor < 23)
 			return -EINVAL;
 
 		err = fuse_rename_common(olddir, oldent, newdir, newent, flags,
 					 FUSE_RENAME2,
 					 sizeof(struct fuse_rename2_in));
 		if (err == -ENOSYS) {
-			fc->no_rename2 = 1;
+			fc->flags.no_rename2 = 1;
 			err = -EINVAL;
 		}
 	} else {
@@ -1354,7 +1354,7 @@ static int fuse_access(struct inode *inode, int mask)
 
 	BUG_ON(mask & MAY_NOT_BLOCK);
 
-	if (fm->fc->no_access)
+	if (fm->fc->flags.no_access)
 		return 0;
 
 	memset(&inarg, 0, sizeof(inarg));
@@ -1366,7 +1366,7 @@ static int fuse_access(struct inode *inode, int mask)
 	args.in_args[0].value = &inarg;
 	err = fuse_simple_request(fm, &args);
 	if (err == -ENOSYS) {
-		fm->fc->no_access = 1;
+		fm->fc->flags.no_access = 1;
 		err = 0;
 	}
 	return err;
@@ -1503,7 +1503,7 @@ static const char *fuse_get_link(struct dentry *dentry, struct inode *inode,
 	if (fuse_is_bad(inode))
 		goto out_err;
 
-	if (fc->cache_symlinks)
+	if (fc->flags.cache_symlinks)
 		return page_get_link(dentry, inode, callback);
 
 	err = -ECHILD;
@@ -1551,13 +1551,13 @@ static int fuse_dir_fsync(struct file *file, loff_t start, loff_t end,
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (fc->no_fsyncdir)
+	if (fc->flags.no_fsyncdir)
 		return 0;
 
 	inode_lock(inode);
 	err = fuse_fsync_common(file, start, end, datasync, FUSE_FSYNCDIR);
 	if (err == -ENOSYS) {
-		fc->no_fsyncdir = 1;
+		fc->flags.no_fsyncdir = 1;
 		err = 0;
 	}
 	inode_unlock(inode);
@@ -1749,7 +1749,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
 	bool is_truncate = false;
-	bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode);
+	bool is_wb = fc->flags.writeback_cache && S_ISREG(inode->i_mode);
 	loff_t oldsize;
 	int err;
 	bool trust_local_cmtime = is_wb;
@@ -1782,7 +1782,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 		/* This is coming from open(..., ... | O_TRUNC); */
 		WARN_ON(!(attr->ia_valid & ATTR_SIZE));
 		WARN_ON(attr->ia_size != 0);
-		if (fc->atomic_o_trunc) {
+		if (fc->flags.atomic_o_trunc) {
 			/*
 			 * No need to send request to userspace, since actual
 			 * truncation has already been done by OPEN.  But still
@@ -1929,7 +1929,7 @@ static int fuse_setattr(struct user_namespace *mnt_userns, struct dentry *entry,
 		 *
 		 * This should be done on write(), truncate() and chown().
 		 */
-		if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {
+		if (!fc->flags.handle_killpriv && !fc->handle_killpriv_v2) {
 			/*
 			 * ia_mode calculation may have used stale i_mode.
 			 * Refresh and recalculate.
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6d89d4dd4c55..d5b30faff0b9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -29,7 +29,7 @@ static int fuse_send_open(struct fuse_mount *fm, u64 nodeid,
 
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.flags = open_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
-	if (!fm->fc->atomic_o_trunc)
+	if (!fm->fc->flags.atomic_o_trunc)
 		inarg.flags &= ~O_TRUNC;
 
 	if (fm->fc->handle_killpriv_v2 &&
@@ -110,7 +110,7 @@ static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
 	if (refcount_dec_and_test(&ff->count)) {
 		struct fuse_args *args = &ff->release_args->args;
 
-		if (isdir ? ff->fm->fc->no_opendir : ff->fm->fc->no_open) {
+		if (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open) {
 			/* Do nothing when client does not implement 'open' */
 			fuse_release_end(ff->fm, args, 0);
 		} else if (sync) {
@@ -140,7 +140,7 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
 	ff->fh = 0;
 	/* Default for no-open */
 	ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
-	if (isdir ? !fc->no_opendir : !fc->no_open) {
+	if (isdir ? !fc->flags.no_opendir : !fc->flags.no_open) {
 		struct fuse_open_out outarg;
 		int err;
 
@@ -154,9 +154,9 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
 			return ERR_PTR(err);
 		} else {
 			if (isdir)
-				fc->no_opendir = 1;
+				fc->flags.no_opendir = 1;
 			else
-				fc->no_open = 1;
+				fc->flags.no_open = 1;
 		}
 	}
 
@@ -205,7 +205,7 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 	else if (ff->open_flags & FOPEN_NONSEEKABLE)
 		nonseekable_open(inode, file);
 
-	if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
+	if (fc->flags.atomic_o_trunc && (file->f_flags & O_TRUNC)) {
 		struct fuse_inode *fi = get_fuse_inode(inode);
 
 		spin_lock(&fi->lock);
@@ -215,7 +215,7 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 		file_update_time(file);
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
 	}
-	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+	if ((file->f_mode & FMODE_WRITE) && fc->flags.writeback_cache)
 		fuse_link_write_file(file);
 }
 
@@ -225,10 +225,10 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
 	struct fuse_conn *fc = fm->fc;
 	int err;
 	bool is_wb_truncate = (file->f_flags & O_TRUNC) &&
-			  fc->atomic_o_trunc &&
-			  fc->writeback_cache;
+			  fc->flags.atomic_o_trunc &&
+			  fc->flags.writeback_cache;
 	bool dax_truncate = (file->f_flags & O_TRUNC) &&
-			  fc->atomic_o_trunc && FUSE_IS_DAX(inode);
+			  fc->flags.atomic_o_trunc && FUSE_IS_DAX(inode);
 
 	if (fuse_is_bad(inode))
 		return -EIO;
@@ -259,7 +259,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
 	if (!err) {
 		struct fuse_file *ff = file->private_data;
 
-		if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC))
+		if (fc->flags.atomic_o_trunc && (file->f_flags & O_TRUNC))
 			truncate_pagecache(inode, 0);
 		else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
 			invalidate_inode_pages2(inode->i_mapping);
@@ -350,7 +350,7 @@ static int fuse_release(struct inode *inode, struct file *file)
 	 * Dirty pages might remain despite write_inode_now() call from
 	 * fuse_flush() due to writes racing with the close.
 	 */
-	if (fc->writeback_cache)
+	if (fc->flags.writeback_cache)
 		write_inode_now(inode, 1);
 
 	fuse_release_common(file, false);
@@ -505,12 +505,12 @@ static int fuse_do_flush(struct fuse_flush_args *fa)
 		goto out;
 
 	err = 0;
-	if (fm->fc->no_flush)
+	if (fm->fc->flags.no_flush)
 		goto inval_attr_out;
 
 	err = fuse_simple_request(fm, &fa->args);
 	if (err == -ENOSYS) {
-		fm->fc->no_flush = 1;
+		fm->fc->flags.no_flush = 1;
 		err = 0;
 	}
 
@@ -519,7 +519,7 @@ static int fuse_do_flush(struct fuse_flush_args *fa)
 	 * In memory i_blocks is not maintained by fuse, if writeback cache is
 	 * enabled, i_blocks from cached attr may not be accurate.
 	 */
-	if (!err && fm->fc->writeback_cache)
+	if (!err && fm->fc->flags.writeback_cache)
 		fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
 
 out:
@@ -545,7 +545,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
+	if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->flags.writeback_cache)
 		return 0;
 
 	fa = kzalloc(sizeof(*fa), GFP_KERNEL);
@@ -628,12 +628,12 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
 	if (err)
 		goto out;
 
-	if (fc->no_fsync)
+	if (fc->flags.no_fsync)
 		goto out;
 
 	err = fuse_fsync_common(file, start, end, datasync, FUSE_FSYNC);
 	if (err == -ENOSYS) {
-		fc->no_fsync = 1;
+		fc->flags.no_fsync = 1;
 		err = 0;
 	}
 out:
@@ -858,7 +858,7 @@ static void fuse_short_read(struct inode *inode, u64 attr_ver, size_t num_read,
 	 * the file.  Some data after the hole is in page cache, but has not
 	 * reached the client fs yet.  So the hole is not present there.
 	 */
-	if (!fc->writeback_cache) {
+	if (!fc->flags.writeback_cache) {
 		loff_t pos = page_offset(ap->pages[0]) + num_read;
 		fuse_read_update_size(inode, pos, attr_ver);
 	}
@@ -989,7 +989,7 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
 
 	fuse_read_args_fill(ia, file, pos, count, FUSE_READ);
 	ia->read.attr_ver = fuse_get_attr_version(fm->fc);
-	if (fm->fc->async_read) {
+	if (fm->fc->flags.async_read) {
 		ia->ff = fuse_file_get(ff);
 		ap->args.end = fuse_readpages_end;
 		err = fuse_simple_background(fm, &ap->args, GFP_KERNEL);
@@ -1056,7 +1056,7 @@ static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	 * Otherwise, only update if we attempt to read past EOF (to ensure
 	 * i_size is up to date).
 	 */
-	if (fc->auto_inval_data ||
+	if (fc->flags.auto_inval_data ||
 	    (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
 		int err;
 		err = fuse_update_attributes(inode, iocb->ki_filp, STATX_SIZE);
@@ -1263,7 +1263,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 			ia->write.page_locked = true;
 			break;
 		}
-		if (!fc->big_writes)
+		if (!fc->flags.big_writes)
 			break;
 	} while (iov_iter_count(ii) && count < fc->max_write &&
 		 ap->num_pages < max_pages && offset == 0);
@@ -1343,7 +1343,7 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	loff_t endbyte = 0;
 
-	if (fc->writeback_cache) {
+	if (fc->flags.writeback_cache) {
 		/* Update size (EOF optimization) and mode (SUID clearing) */
 		err = fuse_update_attributes(mapping->host, file,
 					     STATX_SIZE | STATX_MODE);
@@ -1864,7 +1864,7 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 	 * Do this only if writeback_cache is not enabled.  If writeback_cache
 	 * is enabled, we trust local ctime/mtime.
 	 */
-	if (!fc->writeback_cache)
+	if (!fc->flags.writeback_cache)
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
 	spin_lock(&fi->lock);
 	rb_erase(&wpa->writepages_entry, &fi->writepages);
@@ -2368,7 +2368,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
 	loff_t fsize;
 	int err = -ENOMEM;
 
-	WARN_ON(!fc->writeback_cache);
+	WARN_ON(!fc->flags.writeback_cache);
 
 	page = grab_cache_page_write_begin(mapping, index);
 	if (!page)
@@ -2641,13 +2641,13 @@ static int fuse_file_lock(struct file *file, int cmd, struct file_lock *fl)
 	if (cmd == F_CANCELLK) {
 		err = 0;
 	} else if (cmd == F_GETLK) {
-		if (fc->no_lock) {
+		if (fc->flags.no_lock) {
 			posix_test_lock(file, fl);
 			err = 0;
 		} else
 			err = fuse_getlk(file, fl);
 	} else {
-		if (fc->no_lock)
+		if (fc->flags.no_lock)
 			err = posix_lock_file(file, fl, NULL);
 		else
 			err = fuse_setlk(file, fl, 0);
@@ -2661,7 +2661,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	int err;
 
-	if (fc->no_flock) {
+	if (fc->flags.no_flock) {
 		err = locks_lock_file_wait(file, fl);
 	} else {
 		struct fuse_file *ff = file->private_data;
@@ -2683,7 +2683,7 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
 	struct fuse_bmap_out outarg;
 	int err;
 
-	if (!inode->i_sb->s_bdev || fm->fc->no_bmap)
+	if (!inode->i_sb->s_bdev || fm->fc->flags.no_bmap)
 		return 0;
 
 	memset(&inarg, 0, sizeof(inarg));
@@ -2699,7 +2699,7 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
 	args.out_args[0].value = &outarg;
 	err = fuse_simple_request(fm, &args);
 	if (err == -ENOSYS)
-		fm->fc->no_bmap = 1;
+		fm->fc->flags.no_bmap = 1;
 
 	return err ? 0 : outarg.block;
 }
@@ -2718,7 +2718,7 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
 	struct fuse_lseek_out outarg;
 	int err;
 
-	if (fm->fc->no_lseek)
+	if (fm->fc->flags.no_lseek)
 		goto fallback;
 
 	args.opcode = FUSE_LSEEK;
@@ -2732,7 +2732,7 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
 	err = fuse_simple_request(fm, &args);
 	if (err) {
 		if (err == -ENOSYS) {
-			fm->fc->no_lseek = 1;
+			fm->fc->flags.no_lseek = 1;
 			goto fallback;
 		}
 		return err;
@@ -2839,7 +2839,7 @@ __poll_t fuse_file_poll(struct file *file, poll_table *wait)
 	FUSE_ARGS(args);
 	int err;
 
-	if (fm->fc->no_poll)
+	if (fm->fc->flags.no_poll)
 		return DEFAULT_POLLMASK;
 
 	poll_wait(file, &ff->poll_wait, wait);
@@ -2867,7 +2867,7 @@ __poll_t fuse_file_poll(struct file *file, poll_table *wait)
 	if (!err)
 		return demangle_poll(outarg.revents);
 	if (err == -ENOSYS) {
-		fm->fc->no_poll = 1;
+		fm->fc->flags.no_poll = 1;
 		return DEFAULT_POLLMASK;
 	}
 	return EPOLLERR;
@@ -2953,7 +2953,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * By default, we want to optimize all I/Os with async request
 	 * submission to the client filesystem if supported.
 	 */
-	io->async = ff->fm->fc->async_dio;
+	io->async = ff->fm->fc->flags.async_dio;
 	io->iocb = iocb;
 	io->blocking = is_sync_kiocb(iocb);
 
@@ -3046,7 +3046,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		     FALLOC_FL_ZERO_RANGE))
 		return -EOPNOTSUPP;
 
-	if (fm->fc->no_fallocate)
+	if (fm->fc->flags.no_fallocate)
 		return -EOPNOTSUPP;
 
 	inode_lock(inode);
@@ -3086,7 +3086,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	args.in_args[0].value = &inarg;
 	err = fuse_simple_request(fm, &args);
 	if (err == -ENOSYS) {
-		fm->fc->no_fallocate = 1;
+		fm->fc->flags.no_fallocate = 1;
 		err = -EOPNOTSUPP;
 	}
 	if (err)
@@ -3142,10 +3142,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	ssize_t err;
 	/* mark unstable when write-back is not used, and file_out gets
 	 * extended */
-	bool is_unstable = (!fc->writeback_cache) &&
+	bool is_unstable = (!fc->flags.writeback_cache) &&
 			   ((pos_out + len) > inode_out->i_size);
 
-	if (fc->no_copy_file_range)
+	if (fc->flags.no_copy_file_range)
 		return -EOPNOTSUPP;
 
 	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
@@ -3198,7 +3198,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	args.out_args[0].value = &outarg;
 	err = fuse_simple_request(fm, &args);
 	if (err == -ENOSYS) {
-		fc->no_copy_file_range = 1;
+		fc->flags.no_copy_file_range = 1;
 		err = -EOPNOTSUPP;
 	}
 	if (err)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ce154e7ab715..4f4a6f912c7c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -543,6 +543,122 @@ struct fuse_sync_bucket {
 	struct rcu_head rcu;
 };
 
+/**
+ * A Fuse connection.
+ *
+ * This structure is created, when the root filesystem is mounted, and
+ * is destroyed, when the client device is closed and the last
+ * fuse_mount is destroyed.
+ */
+struct fuse_conn_flags {
+	/** Do readahead asynchronously?  Only set in INIT */
+	unsigned async_read:1;
+
+	/** Do not send separate SETATTR request before open(O_TRUNC)  */
+	unsigned atomic_o_trunc:1;
+
+	/** Filesystem supports NFS exporting.  Only set in INIT */
+	unsigned export_support:1;
+
+	/** write-back cache policy (default is write-through) */
+	unsigned writeback_cache:1;
+
+	/** allow parallel lookups and readdir (default is serialized) */
+	unsigned parallel_dirops:1;
+
+	/** handle fs handles killing suid/sgid/cap on write/chown/trunc */
+	unsigned handle_killpriv:1;
+
+	/** cache READLINK responses in page cache */
+	unsigned cache_symlinks:1;
+
+	/*
+	 * The following bitfields are only for optimization purposes
+	 * and hence races in setting them will not cause malfunction
+	 */
+
+	/** Is open/release not implemented by fs? */
+	unsigned no_open:1;
+
+	/** Is opendir/releasedir not implemented by fs? */
+	unsigned no_opendir:1;
+
+	/** Is fsync not implemented by fs? */
+	unsigned no_fsync:1;
+
+	/** Is fsyncdir not implemented by fs? */
+	unsigned no_fsyncdir:1;
+
+	/** Is flush not implemented by fs? */
+	unsigned no_flush:1;
+
+	/** Is setxattr not implemented by fs? */
+	unsigned no_setxattr:1;
+
+	/** Does file server support extended setxattr */
+	unsigned setxattr_ext:1;
+
+	/** Is getxattr not implemented by fs? */
+	unsigned no_getxattr:1;
+
+	/** Is listxattr not implemented by fs? */
+	unsigned no_listxattr:1;
+
+	/** Is removexattr not implemented by fs? */
+	unsigned no_removexattr:1;
+
+	/** Are posix file locking primitives not implemented by fs? */
+	unsigned no_lock:1;
+
+	/** Is access not implemented by fs? */
+	unsigned no_access:1;
+
+	/** Is create not implemented by fs? */
+	unsigned no_create:1;
+
+	/** Is interrupt not implemented by fs? */
+	unsigned no_interrupt:1;
+
+	/** Is bmap not implemented by fs? */
+	unsigned no_bmap:1;
+
+	/** Is poll not implemented by fs? */
+	unsigned no_poll:1;
+
+	/** Do multi-page cached writes */
+	unsigned big_writes:1;
+
+	/** Are BSD file locking primitives not implemented by fs? */
+	unsigned no_flock:1;
+
+	/** Is fallocate not implemented by fs? */
+	unsigned no_fallocate:1;
+
+	/** Is rename with flags implemented by fs? */
+	unsigned no_rename2:1;
+
+	/** Use enhanced/automatic page cache invalidation. */
+	unsigned auto_inval_data:1;
+
+	/** Filesystem is fully responsible for page cache invalidation. */
+	unsigned explicit_inval_data:1;
+
+	/** Does the filesystem support readdirplus? */
+	unsigned do_readdirplus:1;
+
+	/** Does the filesystem want adaptive readdirplus? */
+	unsigned readdirplus_auto:1;
+
+	/** Does the filesystem support asynchronous direct-IO submission? */
+	unsigned async_dio:1;
+
+	/** Is lseek not implemented by fs? */
+	unsigned no_lseek:1;
+
+	/** Does the filesystem support copy_file_range? */
+	unsigned no_copy_file_range:1;
+};
+
 /**
  * A Fuse connection.
  *
@@ -641,30 +757,9 @@ struct fuse_conn {
 	/** Connection successful.  Only set in INIT */
 	unsigned conn_init:1;
 
-	/** Do readahead asynchronously?  Only set in INIT */
-	unsigned async_read:1;
-
 	/** Return an unique read error after abort.  Only set in INIT */
 	unsigned abort_err:1;
 
-	/** Do not send separate SETATTR request before open(O_TRUNC)  */
-	unsigned atomic_o_trunc:1;
-
-	/** Filesystem supports NFS exporting.  Only set in INIT */
-	unsigned export_support:1;
-
-	/** write-back cache policy (default is write-through) */
-	unsigned writeback_cache:1;
-
-	/** allow parallel lookups and readdir (default is serialized) */
-	unsigned parallel_dirops:1;
-
-	/** handle fs handles killing suid/sgid/cap on write/chown/trunc */
-	unsigned handle_killpriv:1;
-
-	/** cache READLINK responses in page cache */
-	unsigned cache_symlinks:1;
-
 	/* show legacy mount options */
 	unsigned int legacy_opts_show:1;
 
@@ -676,92 +771,9 @@ struct fuse_conn {
 	 */
 	unsigned handle_killpriv_v2:1;
 
-	/*
-	 * The following bitfields are only for optimization purposes
-	 * and hence races in setting them will not cause malfunction
-	 */
-
-	/** Is open/release not implemented by fs? */
-	unsigned no_open:1;
-
-	/** Is opendir/releasedir not implemented by fs? */
-	unsigned no_opendir:1;
-
-	/** Is fsync not implemented by fs? */
-	unsigned no_fsync:1;
-
-	/** Is fsyncdir not implemented by fs? */
-	unsigned no_fsyncdir:1;
-
-	/** Is flush not implemented by fs? */
-	unsigned no_flush:1;
-
-	/** Is setxattr not implemented by fs? */
-	unsigned no_setxattr:1;
-
-	/** Does file server support extended setxattr */
-	unsigned setxattr_ext:1;
-
-	/** Is getxattr not implemented by fs? */
-	unsigned no_getxattr:1;
-
-	/** Is listxattr not implemented by fs? */
-	unsigned no_listxattr:1;
-
-	/** Is removexattr not implemented by fs? */
-	unsigned no_removexattr:1;
-
-	/** Are posix file locking primitives not implemented by fs? */
-	unsigned no_lock:1;
-
-	/** Is access not implemented by fs? */
-	unsigned no_access:1;
-
-	/** Is create not implemented by fs? */
-	unsigned no_create:1;
-
-	/** Is interrupt not implemented by fs? */
-	unsigned no_interrupt:1;
-
-	/** Is bmap not implemented by fs? */
-	unsigned no_bmap:1;
-
-	/** Is poll not implemented by fs? */
-	unsigned no_poll:1;
-
-	/** Do multi-page cached writes */
-	unsigned big_writes:1;
-
 	/** Don't apply umask to creation modes */
 	unsigned dont_mask:1;
 
-	/** Are BSD file locking primitives not implemented by fs? */
-	unsigned no_flock:1;
-
-	/** Is fallocate not implemented by fs? */
-	unsigned no_fallocate:1;
-
-	/** Is rename with flags implemented by fs? */
-	unsigned no_rename2:1;
-
-	/** Use enhanced/automatic page cache invalidation. */
-	unsigned auto_inval_data:1;
-
-	/** Filesystem is fully responsible for page cache invalidation. */
-	unsigned explicit_inval_data:1;
-
-	/** Does the filesystem support readdirplus? */
-	unsigned do_readdirplus:1;
-
-	/** Does the filesystem want adaptive readdirplus? */
-	unsigned readdirplus_auto:1;
-
-	/** Does the filesystem support asynchronous direct-IO submission? */
-	unsigned async_dio:1;
-
-	/** Is lseek not implemented by fs? */
-	unsigned no_lseek:1;
-
 	/** Does the filesystem support posix acls? */
 	unsigned posix_acl:1;
 
@@ -771,9 +783,6 @@ struct fuse_conn {
 	/** Allow other than the mounter user to access the filesystem ? */
 	unsigned allow_other:1;
 
-	/** Does the filesystem support copy_file_range? */
-	unsigned no_copy_file_range:1;
-
 	/* Send DESTROY request */
 	unsigned int destroy:1;
 
@@ -804,6 +813,8 @@ struct fuse_conn {
 	/* Is tmpfile not implemented by fs? */
 	unsigned int no_tmpfile:1;
 
+	struct fuse_conn_flags flags;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 33a108cfcefe..c3109e016494 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -224,7 +224,7 @@ u32 fuse_get_cache_mask(struct inode *inode)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
-	if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
+	if (!fc->flags.writeback_cache || !S_ISREG(inode->i_mode))
 		return 0;
 
 	return STATX_MTIME | STATX_CTIME | STATX_SIZE;
@@ -282,9 +282,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 
 		if (oldsize != attr->size) {
 			truncate_pagecache(inode, attr->size);
-			if (!fc->explicit_inval_data)
+			if (!fc->flags.explicit_inval_data)
 				inval = true;
-		} else if (fc->auto_inval_data) {
+		} else if (fc->flags.auto_inval_data) {
 			struct timespec64 new_mtime = {
 				.tv_sec = attr->mtime,
 				.tv_nsec = attr->mtimensec,
@@ -380,7 +380,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 
 	if ((inode->i_state & I_NEW)) {
 		inode->i_flags |= S_NOATIME;
-		if (!fc->writeback_cache || !S_ISREG(attr->mode))
+		if (!fc->flags.writeback_cache || !S_ISREG(attr->mode))
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		fuse_init_inode(inode, attr);
@@ -459,7 +459,7 @@ bool fuse_lock_inode(struct inode *inode)
 {
 	bool locked = false;
 
-	if (!get_fuse_conn(inode)->parallel_dirops) {
+	if (!get_fuse_conn(inode)->flags.parallel_dirops) {
 		mutex_lock(&get_fuse_inode(inode)->mutex);
 		locked = true;
 	}
@@ -911,7 +911,7 @@ static struct dentry *fuse_get_dentry(struct super_block *sb,
 		struct fuse_entry_out outarg;
 		const struct qstr name = QSTR_INIT(".", 1);
 
-		if (!fc->export_support)
+		if (!fc->flags.export_support)
 			goto out_err;
 
 		err = fuse_lookup_name(sb, handle->nodeid, &name, &outarg,
@@ -1011,7 +1011,7 @@ static struct dentry *fuse_get_parent(struct dentry *child)
 	struct fuse_entry_out outarg;
 	int err;
 
-	if (!fc->export_support)
+	if (!fc->flags.export_support)
 		return ERR_PTR(-ESTALE);
 
 	err = fuse_lookup_name(child_inode->i_sb, get_node_id(child_inode),
@@ -1127,44 +1127,44 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 
 			ra_pages = arg->max_readahead / PAGE_SIZE;
 			if (flags & FUSE_ASYNC_READ)
-				fc->async_read = 1;
+				fc->flags.async_read = 1;
 			if (!(flags & FUSE_POSIX_LOCKS))
-				fc->no_lock = 1;
+				fc->flags.no_lock = 1;
 			if (arg->minor >= 17) {
 				if (!(flags & FUSE_FLOCK_LOCKS))
-					fc->no_flock = 1;
+					fc->flags.no_flock = 1;
 			} else {
 				if (!(flags & FUSE_POSIX_LOCKS))
-					fc->no_flock = 1;
+					fc->flags.no_flock = 1;
 			}
 			if (flags & FUSE_ATOMIC_O_TRUNC)
-				fc->atomic_o_trunc = 1;
+				fc->flags.atomic_o_trunc = 1;
 			if (arg->minor >= 9) {
 				/* LOOKUP has dependency on proto version */
 				if (flags & FUSE_EXPORT_SUPPORT)
-					fc->export_support = 1;
+					fc->flags.export_support = 1;
 			}
 			if (flags & FUSE_BIG_WRITES)
-				fc->big_writes = 1;
+				fc->flags.big_writes = 1;
 			if (flags & FUSE_DONT_MASK)
 				fc->dont_mask = 1;
 			if (flags & FUSE_AUTO_INVAL_DATA)
-				fc->auto_inval_data = 1;
+				fc->flags.auto_inval_data = 1;
 			else if (flags & FUSE_EXPLICIT_INVAL_DATA)
-				fc->explicit_inval_data = 1;
+				fc->flags.explicit_inval_data = 1;
 			if (flags & FUSE_DO_READDIRPLUS) {
-				fc->do_readdirplus = 1;
+				fc->flags.do_readdirplus = 1;
 				if (flags & FUSE_READDIRPLUS_AUTO)
-					fc->readdirplus_auto = 1;
+					fc->flags.readdirplus_auto = 1;
 			}
 			if (flags & FUSE_ASYNC_DIO)
-				fc->async_dio = 1;
+				fc->flags.async_dio = 1;
 			if (flags & FUSE_WRITEBACK_CACHE)
-				fc->writeback_cache = 1;
+				fc->flags.writeback_cache = 1;
 			if (flags & FUSE_PARALLEL_DIROPS)
-				fc->parallel_dirops = 1;
+				fc->flags.parallel_dirops = 1;
 			if (flags & FUSE_HANDLE_KILLPRIV)
-				fc->handle_killpriv = 1;
+				fc->flags.handle_killpriv = 1;
 			if (arg->time_gran && arg->time_gran <= 1000000000)
 				fm->sb->s_time_gran = arg->time_gran;
 			if ((flags & FUSE_POSIX_ACL)) {
@@ -1173,7 +1173,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 				fm->sb->s_xattr = fuse_acl_xattr_handlers;
 			}
 			if (flags & FUSE_CACHE_SYMLINKS)
-				fc->cache_symlinks = 1;
+				fc->flags.cache_symlinks = 1;
 			if (flags & FUSE_ABORT_ERROR)
 				fc->abort_err = 1;
 			if (flags & FUSE_MAX_PAGES) {
@@ -1194,15 +1194,15 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 				fm->sb->s_flags |= SB_NOSEC;
 			}
 			if (flags & FUSE_SETXATTR_EXT)
-				fc->setxattr_ext = 1;
+				fc->flags.setxattr_ext = 1;
 			if (flags & FUSE_SECURITY_CTX)
 				fc->init_security = 1;
 			if (flags & FUSE_CREATE_SUPP_GROUP)
 				fc->create_supp_group = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
-			fc->no_lock = 1;
-			fc->no_flock = 1;
+			fc->flags.no_lock = 1;
+			fc->flags.no_flock = 1;
 		}
 
 		fm->sb->s_bdi->ra_pages =
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index dc603479b30e..2a5bfb52ebf3 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -18,9 +18,9 @@ static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
 	struct fuse_conn *fc = get_fuse_conn(dir);
 	struct fuse_inode *fi = get_fuse_inode(dir);
 
-	if (!fc->do_readdirplus)
+	if (!fc->flags.do_readdirplus)
 		return false;
-	if (!fc->readdirplus_auto)
+	if (!fc->flags.readdirplus_auto)
 		return true;
 	if (test_and_clear_bit(FUSE_I_ADVISE_RDPLUS, &fi->state))
 		return true;
@@ -246,7 +246,7 @@ static int fuse_direntplus_link(struct file *file,
 		if (IS_ERR(dentry))
 			return PTR_ERR(dentry);
 	}
-	if (fc->readdirplus_auto)
+	if (fc->flags.readdirplus_auto)
 		set_bit(FUSE_I_INIT_RDPLUS, &get_fuse_inode(inode)->state);
 	fuse_change_entry_timeout(dentry, o);
 
@@ -455,7 +455,7 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
 	 * We're just about to start reading into the cache or reading the
 	 * cache; both cases require an up-to-date mtime value.
 	 */
-	if (!ctx->pos && fc->auto_inval_data) {
+	if (!ctx->pos && fc->flags.auto_inval_data) {
 		int err = fuse_update_attributes(inode, file, STATX_MTIME);
 
 		if (err)
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 0d3e7177fce0..13245c11ce25 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -19,7 +19,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
 	struct fuse_setxattr_in inarg;
 	int err;
 
-	if (fm->fc->no_setxattr)
+	if (fm->fc->flags.no_setxattr)
 		return -EOPNOTSUPP;
 
 	memset(&inarg, 0, sizeof(inarg));
@@ -30,7 +30,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
 	args.opcode = FUSE_SETXATTR;
 	args.nodeid = get_node_id(inode);
 	args.in_numargs = 3;
-	args.in_args[0].size = fm->fc->setxattr_ext ?
+	args.in_args[0].size = fm->fc->flags.setxattr_ext ?
 		sizeof(inarg) : FUSE_COMPAT_SETXATTR_IN_SIZE;
 	args.in_args[0].value = &inarg;
 	args.in_args[1].size = strlen(name) + 1;
@@ -39,7 +39,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
 	args.in_args[2].value = value;
 	err = fuse_simple_request(fm, &args);
 	if (err == -ENOSYS) {
-		fm->fc->no_setxattr = 1;
+		fm->fc->flags.no_setxattr = 1;
 		err = -EOPNOTSUPP;
 	}
 	if (!err)
@@ -57,7 +57,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 	struct fuse_getxattr_out outarg;
 	ssize_t ret;
 
-	if (fm->fc->no_getxattr)
+	if (fm->fc->flags.no_getxattr)
 		return -EOPNOTSUPP;
 
 	memset(&inarg, 0, sizeof(inarg));
@@ -83,7 +83,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 	if (!ret && !size)
 		ret = min_t(ssize_t, outarg.size, XATTR_SIZE_MAX);
 	if (ret == -ENOSYS) {
-		fm->fc->no_getxattr = 1;
+		fm->fc->flags.no_getxattr = 1;
 		ret = -EOPNOTSUPP;
 	}
 	return ret;
@@ -121,7 +121,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size)
 	if (!fuse_allow_current_process(fm->fc))
 		return -EACCES;
 
-	if (fm->fc->no_listxattr)
+	if (fm->fc->flags.no_listxattr)
 		return -EOPNOTSUPP;
 
 	memset(&inarg, 0, sizeof(inarg));
@@ -147,7 +147,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size)
 	if (ret > 0 && size)
 		ret = fuse_verify_xattr_list(list, ret);
 	if (ret == -ENOSYS) {
-		fm->fc->no_listxattr = 1;
+		fm->fc->flags.no_listxattr = 1;
 		ret = -EOPNOTSUPP;
 	}
 	return ret;
@@ -159,7 +159,7 @@ int fuse_removexattr(struct inode *inode, const char *name)
 	FUSE_ARGS(args);
 	int err;
 
-	if (fm->fc->no_removexattr)
+	if (fm->fc->flags.no_removexattr)
 		return -EOPNOTSUPP;
 
 	args.opcode = FUSE_REMOVEXATTR;
@@ -169,7 +169,7 @@ int fuse_removexattr(struct inode *inode, const char *name)
 	args.in_args[0].value = name;
 	err = fuse_simple_request(fm, &args);
 	if (err == -ENOSYS) {
-		fm->fc->no_removexattr = 1;
+		fm->fc->flags.no_removexattr = 1;
 		err = -EOPNOTSUPP;
 	}
 	if (!err)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 6/9] fuse: take fuse connection generation into account
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
                   ` (4 preceding siblings ...)
  2023-02-20 19:37 ` [RFC PATCH 5/9] fuse: move fuse connection flags to the separate structure Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-03-03 18:45   ` Bernd Schubert
  2023-02-20 19:37 ` [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT) Alexander Mikhalitsyn
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 3894 bytes --]

- modify dentry revalidation algorithm to check inode/connection
generations. If them are not equal then perform revalidation.

remark: during forced dentry revalidation we are sending FUSE_LOOKUP
request to the userspace daemon and if it return the same inode after
lookup then we can "upgrade" inode connection generation without
invalidating it.

- don't send FUSE_FSYNC, FUSE_RELEASE, etc requests to the userspace
daemon about stale inodes (this can confuse libfuse)

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/fuse/dir.c    | 4 +++-
 fs/fuse/file.c   | 6 ++++--
 fs/fuse/fuse_i.h | 3 ++-
 fs/fuse/inode.c  | 2 +-
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 63625c29d6ef..5ac7d88e5dfc 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -213,7 +213,8 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 	inode = d_inode_rcu(entry);
 	if (inode && fuse_is_bad(inode))
 		goto invalid;
-	else if (time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
+	else if ((inode && fuse_stale_inode_conn(inode)) ||
+		 time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
 		 (flags & (LOOKUP_EXCL | LOOKUP_REVAL | LOOKUP_RENAME_TARGET))) {
 		struct fuse_entry_out outarg;
 		FUSE_ARGS(args);
@@ -255,6 +256,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 			}
 			spin_lock(&fi->lock);
 			fi->nlookup++;
+			fi->conn_gen = READ_ONCE(get_fuse_conn(inode)->conn_gen);
 			spin_unlock(&fi->lock);
 		}
 		kfree(forget);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d5b30faff0b9..be9086a1868d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -110,7 +110,8 @@ static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
 	if (refcount_dec_and_test(&ff->count)) {
 		struct fuse_args *args = &ff->release_args->args;
 
-		if (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open) {
+		if (fuse_stale_ff(ff) ||
+		    (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open)) {
 			/* Do nothing when client does not implement 'open' */
 			fuse_release_end(ff->fm, args, 0);
 		} else if (sync) {
@@ -597,9 +598,10 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
 {
 	struct inode *inode = file->f_mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_file *ff = file->private_data;
 	int err;
 
-	if (fuse_is_bad(inode))
+	if (fuse_stale_ff(ff) || fuse_is_bad(inode))
 		return -EIO;
 
 	inode_lock(inode);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4f4a6f912c7c..0643de31674d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -954,7 +954,8 @@ static inline bool fuse_stale_inode(const struct inode *inode, int generation,
 				    struct fuse_attr *attr)
 {
 	return inode->i_generation != generation ||
-		inode_wrong_type(inode, attr->mode);
+		inode_wrong_type(inode, attr->mode) ||
+		fuse_stale_inode_conn(inode);
 }
 
 static inline void fuse_make_bad(struct inode *inode)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index c3109e016494..f9dc8971274d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -124,7 +124,7 @@ static void fuse_evict_inode(struct inode *inode)
 			fuse_dax_inode_cleanup(inode);
 		if (fi->nlookup) {
 			fuse_queue_forget(fc, fi->forget, fi->nodeid,
-					  fi->nlookup, false);
+					  fi->nlookup, fuse_stale_inode_conn(inode));
 			fi->forget = NULL;
 		}
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
                   ` (5 preceding siblings ...)
  2023-02-20 19:37 ` [RFC PATCH 6/9] fuse: take fuse connection generation into account Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-03-03 19:19   ` Bernd Schubert
  2023-03-03 19:26   ` Bernd Schubert
  2023-02-20 19:37 ` [RFC PATCH 8/9] namespace: add sb_revalidate_bindmounts helper Alexander Mikhalitsyn
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 4461 bytes --]

This ioctl aborts fuse connection and then reinitializes it,
sends FUSE_INIT request to allow a new userspace daemon
to pick up the fuse connection.

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/fuse/dev.c             | 132 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |   1 +
 2 files changed, 133 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 737764c2295e..0f53ffd63957 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
 }
 EXPORT_SYMBOL_GPL(fuse_abort_conn);
 
+static int fuse_reinit_conn(struct fuse_conn *fc)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+	struct fuse_dev *fud;
+	unsigned int i;
+
+	if (fc->conn_gen + 1 < fc->conn_gen)
+		return -EOVERFLOW;
+
+	fuse_abort_conn(fc);
+	fuse_wait_aborted(fc);
+
+	spin_lock(&fc->lock);
+	if (fc->connected) {
+		spin_unlock(&fc->lock);
+		return -EINVAL;
+	}
+
+	if (fc->conn_gen + 1 < fc->conn_gen) {
+		spin_unlock(&fc->lock);
+		return -EOVERFLOW;
+	}
+
+	fc->conn_gen++;
+
+	spin_lock(&fiq->lock);
+	if (request_pending(fiq) || fiq->forget_list_tail != &fiq->forget_list_head) {
+		spin_unlock(&fiq->lock);
+		spin_unlock(&fc->lock);
+		return -EINVAL;
+	}
+
+	if (&fuse_dev_fiq_ops != fiq->ops) {
+		spin_unlock(&fiq->lock);
+		spin_unlock(&fc->lock);
+		return -EOPNOTSUPP;
+	}
+
+	fiq->connected = 1;
+	spin_unlock(&fiq->lock);
+
+	spin_lock(&fc->bg_lock);
+	if (!list_empty(&fc->bg_queue)) {
+		spin_unlock(&fc->bg_lock);
+		spin_unlock(&fc->lock);
+		return -EINVAL;
+	}
+
+	fc->blocked = 0;
+	fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
+	spin_unlock(&fc->bg_lock);
+
+	list_for_each_entry(fud, &fc->devices, entry) {
+		struct fuse_pqueue *fpq = &fud->pq;
+
+		spin_lock(&fpq->lock);
+		if (!list_empty(&fpq->io)) {
+			spin_unlock(&fpq->lock);
+			spin_unlock(&fc->lock);
+			return -EINVAL;
+		}
+
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			if (!list_empty(&fpq->processing[i])) {
+				spin_unlock(&fpq->lock);
+				spin_unlock(&fc->lock);
+				return -EINVAL;
+			}
+		}
+
+		fpq->connected = 1;
+		spin_unlock(&fpq->lock);
+	}
+
+	fuse_set_initialized(fc);
+
+	/* Background queuing checks fc->connected under bg_lock */
+	spin_lock(&fc->bg_lock);
+	fc->connected = 1;
+	spin_unlock(&fc->bg_lock);
+
+	fc->aborted = false;
+	fc->abort_err = 0;
+
+	/* nullify all the flags */
+	memset(&fc->flags, 0, sizeof(struct fuse_conn_flags));
+
+	spin_unlock(&fc->lock);
+
+	down_read(&fc->killsb);
+	if (!list_empty(&fc->mounts)) {
+		struct fuse_mount *fm;
+
+		fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
+		if (!fm->sb) {
+			up_read(&fc->killsb);
+			return -EINVAL;
+		}
+
+		fuse_send_init(fm);
+	}
+	up_read(&fc->killsb);
+
+	return 0;
+}
+
 void fuse_wait_aborted(struct fuse_conn *fc)
 {
 	/* matches implicit memory barrier in fuse_drop_waiting() */
@@ -2282,6 +2388,32 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 			}
 		}
 		break;
+	case FUSE_DEV_IOC_REINIT:
+		struct fuse_conn *fc;
+
+		if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
+			return -EPERM;
+
+		res = -EINVAL;
+		fud = fuse_get_dev(file);
+
+		/*
+		 * Only fuse mounts with an already initialized fuse
+		 * connection are supported
+		 */
+		if (file->f_op == &fuse_dev_operations && fud) {
+			mutex_lock(&fuse_mutex);
+			fc = fud->fc;
+			if (fc)
+				fc = fuse_conn_get(fc);
+			mutex_unlock(&fuse_mutex);
+
+			if (fc) {
+				res = fuse_reinit_conn(fc);
+				fuse_conn_put(fc);
+			}
+		}
+		break;
 	default:
 		res = -ENOTTY;
 		break;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1b9d0dfae72d..3dac67b25eae 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -989,6 +989,7 @@ struct fuse_notify_retrieve_in {
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
+#define FUSE_DEV_IOC_REINIT		_IO(FUSE_DEV_IOC_MAGIC, 0)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 8/9] namespace: add sb_revalidate_bindmounts helper
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
                   ` (6 preceding siblings ...)
  2023-02-20 19:37 ` [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT) Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-02-20 19:37 ` [RFC PATCH 9/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL) Alexander Mikhalitsyn
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 4017 bytes --]

Useful if for some reason bindmounts root dentries get invalidated
but it's needed to revalidate existing bindmounts without remounting.

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/namespace.c                | 90 +++++++++++++++++++++++++++++++++++
 include/linux/mnt_namespace.h |  3 ++
 2 files changed, 93 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab467ee58341..88491f9c8bbd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -682,6 +682,96 @@ static int mnt_make_readonly(struct mount *mnt)
 	return ret;
 }
 
+struct bind_mount_list_item {
+	struct list_head list;
+	struct vfsmount *mnt;
+};
+
+/*
+ * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
+ *
+ * Useful if for some reason bindmount root dentries get invalidated
+ * but it's needed to revalidate existing bindmounts without remounting.
+ */
+int sb_revalidate_bindmounts(struct super_block *sb)
+{
+	struct mount *mnt;
+	struct bind_mount_list_item *bmnt, *next;
+	int err = 0;
+	struct vfsmount *root_mnt = NULL;
+	LIST_HEAD(mnt_to_update);
+	char *buf;
+
+	buf = (char *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	lock_mount_hash();
+	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
+		/* we only want to touch bindmounts */
+		if (mnt->mnt.mnt_root == sb->s_root) {
+			if (!root_mnt)
+				root_mnt = mntget(&mnt->mnt);
+
+			continue;
+		}
+
+		bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);
+		if (!bmnt) {
+			err = -ENOMEM;
+			goto exit;
+		}
+
+		bmnt->mnt = mntget(&mnt->mnt);
+		list_add_tail(&bmnt->list, &mnt_to_update);
+	}
+	unlock_mount_hash();
+
+	/* TODO: get rid of this limitation */
+	if (!root_mnt) {
+		err = -ENOENT;
+		goto exit;
+	}
+
+	list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
+		struct vfsmount *cur_mnt = bmnt->mnt;
+		struct path path;
+		struct dentry *old_root;
+		char *p;
+
+		p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
+		if (IS_ERR(p))
+			goto exit;
+
+		/* TODO: are these lookup flags fully safe and correct? */
+		err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
+				p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
+		if (err)
+			goto exit;
+
+		/* replace bindmount root dentry */
+		lock_mount_hash();
+		old_root = cur_mnt->mnt_root;
+		cur_mnt->mnt_root = dget(path.dentry);
+		dput(old_root);
+		unlock_mount_hash();
+
+		path_put(&path);
+	}
+
+exit:
+	free_page((unsigned long) buf);
+	mntput(root_mnt);
+	list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
+		list_del(&bmnt->list);
+		mntput(bmnt->mnt);
+		kfree(bmnt);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);
+
 int sb_prepare_remount_readonly(struct super_block *sb)
 {
 	struct mount *mnt;
diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h
index 8f882f5881e8..20ac29e702f5 100644
--- a/include/linux/mnt_namespace.h
+++ b/include/linux/mnt_namespace.h
@@ -3,6 +3,7 @@
 #define _NAMESPACE_H_
 #ifdef __KERNEL__
 
+struct super_block;
 struct mnt_namespace;
 struct fs_struct;
 struct user_namespace;
@@ -13,6 +14,8 @@ extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
 extern void put_mnt_ns(struct mnt_namespace *ns);
 extern struct ns_common *from_mnt_ns(struct mnt_namespace *);
 
+extern int sb_revalidate_bindmounts(struct super_block *sb);
+
 extern const struct file_operations proc_mounts_operations;
 extern const struct file_operations proc_mountinfo_operations;
 extern const struct file_operations proc_mountstats_operations;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 9/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL)
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
                   ` (7 preceding siblings ...)
  2023-02-20 19:37 ` [RFC PATCH 8/9] namespace: add sb_revalidate_bindmounts helper Alexander Mikhalitsyn
@ 2023-02-20 19:37 ` Alexander Mikhalitsyn
  2023-03-03 19:42 ` [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Bernd Schubert
  2023-03-06 16:14 ` Miklos Szeredi
  10 siblings, 0 replies; 27+ messages in thread
From: Alexander Mikhalitsyn @ 2023-02-20 19:37 UTC (permalink / raw)
  To: mszeredi
  Cc: Alexander Mikhalitsyn, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=ASCII, Size: 2939 bytes --]

This ioctl allows to revalidate all the existing fuse bindmounts
by performing relookup of all root dentries and resetting them.

Useful if it's needed to make fuse bindmounts work without
remounting them after fuse connection reinitialization.

Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Seth Forshee <sforshee@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: criu@openvz.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 fs/fuse/dev.c             | 29 ++++++++++++++++++++++++++++-
 include/uapi/linux/fuse.h |  1 +
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0f53ffd63957..ae301cb8486b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -13,6 +13,7 @@
 #include <linux/poll.h>
 #include <linux/sched/signal.h>
 #include <linux/uio.h>
+#include <linux/mnt_namespace.h>
 #include <linux/miscdevice.h>
 #include <linux/pagemap.h>
 #include <linux/file.h>
@@ -2293,6 +2294,27 @@ static int fuse_reinit_conn(struct fuse_conn *fc)
 	return 0;
 }
 
+static ssize_t fuse_revalidate_bindmounts(struct fuse_conn *fc)
+{
+	int ret = 0;
+
+	down_read(&fc->killsb);
+	if (!list_empty(&fc->mounts)) {
+		struct fuse_mount *fm;
+
+		fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
+		if (!fm->sb) {
+			up_read(&fc->killsb);
+			return -EINVAL;
+		}
+
+		ret = sb_revalidate_bindmounts(fm->sb);
+	}
+	up_read(&fc->killsb);
+
+	return ret;
+}
+
 void fuse_wait_aborted(struct fuse_conn *fc)
 {
 	/* matches implicit memory barrier in fuse_drop_waiting() */
@@ -2389,6 +2411,7 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 		}
 		break;
 	case FUSE_DEV_IOC_REINIT:
+	case FUSE_DEV_IOC_BM_REVAL:
 		struct fuse_conn *fc;
 
 		if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
@@ -2409,7 +2432,11 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 			mutex_unlock(&fuse_mutex);
 
 			if (fc) {
-				res = fuse_reinit_conn(fc);
+				if (cmd == FUSE_DEV_IOC_REINIT)
+					res = fuse_reinit_conn(fc);
+				else if (cmd == FUSE_DEV_IOC_BM_REVAL)
+					res = fuse_revalidate_bindmounts(fc);
+
 				fuse_conn_put(fc);
 			}
 		}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 3dac67b25eae..14f7fe4a99cf 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -990,6 +990,7 @@ struct fuse_notify_retrieve_in {
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
 #define FUSE_DEV_IOC_REINIT		_IO(FUSE_DEV_IOC_MAGIC, 0)
+#define FUSE_DEV_IOC_BM_REVAL		_IO(FUSE_DEV_IOC_MAGIC, 1)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 5/9] fuse: move fuse connection flags to the separate structure
  2023-02-20 19:37 ` [RFC PATCH 5/9] fuse: move fuse connection flags to the separate structure Alexander Mikhalitsyn
@ 2023-03-03 18:26   ` Bernd Schubert
  0 siblings, 0 replies; 27+ messages in thread
From: Bernd Schubert @ 2023-03-03 18:26 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, mszeredi
  Cc: Al Viro, Amir Goldstein, Stéphane Graber, Seth Forshee,
	Christian Brauner, Andrei Vagin, Pavel Tikhomirov, linux-fsdevel,
	linux-kernel, criu



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> Let's move all the fuse connection flags that can be safely zeroed
> after connection reinitialization to the separate structure fuse_conn_flags.
> 
> All of these flags values are calculated dynamically basing on
> the userspace daemon capabilities (like no_open, no_flush) or on the
> response for FUSE_INIT request.
> 

 From my point of view this makes the code a bit better readable, in 
general.

[...]

>   };
>   
> +/**
> + * A Fuse connection.
> + *
> + * This structure is created, when the root filesystem is mounted, and
> + * is destroyed, when the client device is closed and the last
> + * fuse_mount is destroyed.
> + */
> +struct fuse_conn_flags {
> +	/** Do readahead asynchronously?  Only set in INIT */
> +	unsigned async_read:1;


The comment does not match the struct?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 6/9] fuse: take fuse connection generation into account
  2023-02-20 19:37 ` [RFC PATCH 6/9] fuse: take fuse connection generation into account Alexander Mikhalitsyn
@ 2023-03-03 18:45   ` Bernd Schubert
  0 siblings, 0 replies; 27+ messages in thread
From: Bernd Schubert @ 2023-03-03 18:45 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, mszeredi
  Cc: Al Viro, Amir Goldstein, Stéphane Graber, Seth Forshee,
	Christian Brauner, Andrei Vagin, Pavel Tikhomirov, linux-fsdevel,
	linux-kernel, criu


[...]
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index d5b30faff0b9..be9086a1868d 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -110,7 +110,8 @@ static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
>   	if (refcount_dec_and_test(&ff->count)) {
>   		struct fuse_args *args = &ff->release_args->args;
>   
> -		if (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open) {
> +		if (fuse_stale_ff(ff) ||
> +		    (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open)) {
>   			/* Do nothing when client does not implement 'open' */

The comment does not match anymore.

>   			fuse_release_end(ff->fm, args, 0);
>   		} else if (sync) {
> @@ -597,9 +598,10 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
>   {
>   	struct inode *inode = file->f_mapping->host;
>   	struct fuse_conn *fc = get_fuse_conn(inode);
> +	struct fuse_file *ff = file->private_data;
>   	int err;
>   
> -	if (fuse_is_bad(inode))
> +	if (fuse_stale_ff(ff) || fuse_is_bad(inode))
>   		return -EIO;
>   
>   	inode_lock(inode);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 4f4a6f912c7c..0643de31674d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -954,7 +954,8 @@ static inline bool fuse_stale_inode(const struct inode *inode, int generation,
>   				    struct fuse_attr *attr)
>   {
>   	return inode->i_generation != generation ||
> -		inode_wrong_type(inode, attr->mode);
> +		inode_wrong_type(inode, attr->mode) ||
> +		fuse_stale_inode_conn(inode);
>   }
>   
>   static inline void fuse_make_bad(struct inode *inode)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index c3109e016494..f9dc8971274d 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -124,7 +124,7 @@ static void fuse_evict_inode(struct inode *inode)
>   			fuse_dax_inode_cleanup(inode);
>   		if (fi->nlookup) {
>   			fuse_queue_forget(fc, fi->forget, fi->nodeid,
> -					  fi->nlookup, false);
> +					  fi->nlookup, fuse_stale_inode_conn(inode));
>   			fi->forget = NULL;
>   		}
>   	}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 4/9] fuse: handle stale inode connection in fuse_queue_forget
  2023-02-20 19:37 ` [RFC PATCH 4/9] fuse: handle stale inode connection in fuse_queue_forget Alexander Mikhalitsyn
@ 2023-03-03 18:47   ` Bernd Schubert
  0 siblings, 0 replies; 27+ messages in thread
From: Bernd Schubert @ 2023-03-03 18:47 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, mszeredi
  Cc: Al Viro, Amir Goldstein, Stéphane Graber, Seth Forshee,
	Christian Brauner, Andrei Vagin, Pavel Tikhomirov, linux-fsdevel,
	linux-kernel, criu



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> We don't want to send FUSE_FORGET request to the new
> fuse daemon if inode was lookuped by the old fuse daemon
> because it can confuse and break userspace (libfuse).
> 
> For now, just add a new argument to fuse_queue_forget and
> handle it. Adjust all callers to match the old behaviour.
> 
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Stéphane Graber <stgraber@ubuntu.com>
> Cc: Seth Forshee <sforshee@kernel.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: criu@openvz.org
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> ---
>   fs/fuse/dev.c    | 4 ++--
>   fs/fuse/dir.c    | 8 ++++----
>   fs/fuse/fuse_i.h | 2 +-
>   fs/fuse/inode.c  | 2 +-
>   4 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index eb4f88e3dc97..85f69629f34d 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -234,7 +234,7 @@ __releases(fiq->lock)
>   }
>   
>   void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
> -		       u64 nodeid, u64 nlookup)
> +		       u64 nodeid, u64 nlookup, bool stale_inode_conn)
>   {
>   	struct fuse_iqueue *fiq = &fc->iq;
>   
> @@ -242,7 +242,7 @@ void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
>   	forget->forget_one.nlookup = nlookup;
>   
>   	spin_lock(&fiq->lock);
> -	if (fiq->connected) {
> +	if (fiq->connected && !stale_inode_conn) {
>   		fiq->forget_list_tail->next = forget;
>   		fiq->forget_list_tail = forget;
>   		fiq->ops->wake_forget_and_unlock(fiq);

I'm not sure about kernel coding rules here - for me that is unlikely 
rare event - I would have added unlikely() here.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
  2023-02-20 19:37 ` [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT) Alexander Mikhalitsyn
@ 2023-03-03 19:19   ` Bernd Schubert
  2023-04-03 14:56     ` Aleksandr Mikhalitsyn
  2023-03-03 19:26   ` Bernd Schubert
  1 sibling, 1 reply; 27+ messages in thread
From: Bernd Schubert @ 2023-03-03 19:19 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, mszeredi
  Cc: Al Viro, Amir Goldstein, Stéphane Graber, Seth Forshee,
	Christian Brauner, Andrei Vagin, Pavel Tikhomirov, linux-fsdevel,
	linux-kernel, criu



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> This ioctl aborts fuse connection and then reinitializes it,
> sends FUSE_INIT request to allow a new userspace daemon
> to pick up the fuse connection.
> 
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Stéphane Graber <stgraber@ubuntu.com>
> Cc: Seth Forshee <sforshee@kernel.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: criu@openvz.org
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> ---
>   fs/fuse/dev.c             | 132 ++++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/fuse.h |   1 +
>   2 files changed, 133 insertions(+)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 737764c2295e..0f53ffd63957 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
>   }
>   EXPORT_SYMBOL_GPL(fuse_abort_conn);
>   
> +static int fuse_reinit_conn(struct fuse_conn *fc)
> +{
> +	struct fuse_iqueue *fiq = &fc->iq;
> +	struct fuse_dev *fud;
> +	unsigned int i;

Assuming you have a malicious daemon that tries to cause bad behavior, 
only allow one ioctl at at time? I.e. add a value that reinit is in 
progress? And unset at the end of the function?

> +
> +	if (fc->conn_gen + 1 < fc->conn_gen)
> +		return -EOVERFLOW;
> +

Add a comment, like

/* Unsets fc->connected and fiq->connected and ensures that no new 
requests can be queued */

?

> +	fuse_abort_conn(fc);
> +	fuse_wait_aborted(fc);
> +
> +	spin_lock(&fc->lock);
> +	if (fc->connected) {
> +		spin_unlock(&fc->lock);
> +		return -EINVAL;
> +	}
> +
> +	if (fc->conn_gen + 1 < fc->conn_gen) {
> +		spin_unlock(&fc->lock);
> +		return -EOVERFLOW;
> +	}
> +
> +	fc->conn_gen++;
> +
> +	spin_lock(&fiq->lock);
> +	if (request_pending(fiq) || fiq->forget_list_tail != &fiq->forget_list_head) {
> +		spin_unlock(&fiq->lock);
> +		spin_unlock(&fc->lock);
> +		return -EINVAL;
> +	}
> +
> +	if (&fuse_dev_fiq_ops != fiq->ops) {
> +		spin_unlock(&fiq->lock);
> +		spin_unlock(&fc->lock);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	fiq->connected = 1;
> +	spin_unlock(&fiq->lock);
> +
> +	spin_lock(&fc->bg_lock);
> +	if (!list_empty(&fc->bg_queue)) {
> +		spin_unlock(&fc->bg_lock);
> +		spin_unlock(&fc->lock);
> +		return -EINVAL;
> +	}
> +
> +	fc->blocked = 0;
> +	fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> +	spin_unlock(&fc->bg_lock);
> +
> +	list_for_each_entry(fud, &fc->devices, entry) {
> +		struct fuse_pqueue *fpq = &fud->pq;
> +
> +		spin_lock(&fpq->lock);
> +		if (!list_empty(&fpq->io)) {
> +			spin_unlock(&fpq->lock);
> +			spin_unlock(&fc->lock);
> +			return -EINVAL;
> +		}
> +
> +		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> +			if (!list_empty(&fpq->processing[i])) {
> +				spin_unlock(&fpq->lock);
> +				spin_unlock(&fc->lock);
> +				return -EINVAL;
> +			}
> +		}
> +
> +		fpq->connected = 1;
> +		spin_unlock(&fpq->lock);
> +	}
> +
> +	fuse_set_initialized(fc);

I'm not sure about this, why not the common way via FUSE_INIT reply?

> +
> +	/* Background queuing checks fc->connected under bg_lock */
> +	spin_lock(&fc->bg_lock);
> +	fc->connected = 1;
> +	spin_unlock(&fc->bg_lock);
> +
> +	fc->aborted = false;
> +	fc->abort_err = 0;
> +
> +	/* nullify all the flags */
> +	memset(&fc->flags, 0, sizeof(struct fuse_conn_flags));
> +
> +	spin_unlock(&fc->lock);
> +
> +	down_read(&fc->killsb);
> +	if (!list_empty(&fc->mounts)) {
> +		struct fuse_mount *fm;
> +
> +		fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
> +		if (!fm->sb) {
> +			up_read(&fc->killsb);
> +			return -EINVAL;
> +		}
> +
> +		fuse_send_init(fm);
> +	}
> +	up_read(&fc->killsb);
> +
> +	return 0;
> +}
> +
>   void fuse_wait_aborted(struct fuse_conn *fc)
>   {
>   	/* matches implicit memory barrier in fuse_drop_waiting() */
> @@ -2282,6 +2388,32 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
>   			}
>   		}
>   		break;
> +	case FUSE_DEV_IOC_REINIT:
> +		struct fuse_conn *fc;
> +
> +		if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
> +			return -EPERM;
> +
> +		res = -EINVAL;
> +		fud = fuse_get_dev(file);
> +
> +		/*
> +		 * Only fuse mounts with an already initialized fuse
> +		 * connection are supported
> +		 */
> +		if (file->f_op == &fuse_dev_operations && fud) {
> +			mutex_lock(&fuse_mutex);
> +			fc = fud->fc;
> +			if (fc)
> +				fc = fuse_conn_get(fc);
> +			mutex_unlock(&fuse_mutex);
> +
> +			if (fc) {
> +				res = fuse_reinit_conn(fc);
> +				fuse_conn_put(fc);
> +			}
> +		}
> +		break;
>   	default:
>   		res = -ENOTTY;
>   		break;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 1b9d0dfae72d..3dac67b25eae 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -989,6 +989,7 @@ struct fuse_notify_retrieve_in {
>   /* Device ioctls: */
>   #define FUSE_DEV_IOC_MAGIC		229
>   #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
> +#define FUSE_DEV_IOC_REINIT		_IO(FUSE_DEV_IOC_MAGIC, 0)
>   
>   struct fuse_lseek_in {
>   	uint64_t	fh;

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
  2023-02-20 19:37 ` [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT) Alexander Mikhalitsyn
  2023-03-03 19:19   ` Bernd Schubert
@ 2023-03-03 19:26   ` Bernd Schubert
  2023-03-06 14:09     ` Aleksandr Mikhalitsyn
  1 sibling, 1 reply; 27+ messages in thread
From: Bernd Schubert @ 2023-03-03 19:26 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, mszeredi
  Cc: Al Viro, Amir Goldstein, Stéphane Graber, Seth Forshee,
	Christian Brauner, Andrei Vagin, Pavel Tikhomirov, linux-fsdevel,
	linux-kernel, criu



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> This ioctl aborts fuse connection and then reinitializes it,
> sends FUSE_INIT request to allow a new userspace daemon
> to pick up the fuse connection.
> 
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Stéphane Graber <stgraber@ubuntu.com>
> Cc: Seth Forshee <sforshee@kernel.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: criu@openvz.org
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> ---
>   fs/fuse/dev.c             | 132 ++++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/fuse.h |   1 +
>   2 files changed, 133 insertions(+)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 737764c2295e..0f53ffd63957 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
>   }
>   EXPORT_SYMBOL_GPL(fuse_abort_conn);
>   
> +static int fuse_reinit_conn(struct fuse_conn *fc)
> +{
> +	struct fuse_iqueue *fiq = &fc->iq;
> +	struct fuse_dev *fud;
> +	unsigned int i;
> +
> +	if (fc->conn_gen + 1 < fc->conn_gen)
> +		return -EOVERFLOW;
> +
> +	fuse_abort_conn(fc);
> +	fuse_wait_aborted(fc);

Shouldn't this also try to flush all data first?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
                   ` (8 preceding siblings ...)
  2023-02-20 19:37 ` [RFC PATCH 9/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL) Alexander Mikhalitsyn
@ 2023-03-03 19:42 ` Bernd Schubert
  2023-03-06 14:06   ` Aleksandr Mikhalitsyn
  2023-03-06 16:14 ` Miklos Szeredi
  10 siblings, 1 reply; 27+ messages in thread
From: Bernd Schubert @ 2023-03-03 19:42 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, mszeredi
  Cc: Al Viro, Amir Goldstein, Stéphane Graber, Seth Forshee,
	Christian Brauner, Andrei Vagin, Pavel Tikhomirov, linux-fsdevel,
	linux-kernel, criu



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> Hello everyone,
> 
> It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
> 
> Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
> 
> The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> restorable process tree, but instead use a "fake daemon".
> 
> This leads to many other technical problems:
> * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> like no_open, no_flush, readahead, and so on).
> * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
> 
> We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
> 
> My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> The solution is fairly easy:
> allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> and other stuff, we are only interested in mounts.


I like your patches, small and easy to read :)
So this basically fails all existing open files - our (future) needs go 
beyond that. I wonder if we can extend it later and re-init the new 
daemon with something like "fuse_queue_recall" - basically the opposite 
of fuse_queue_forget. Not sure if fuse can access the vfs dentry cache 
to know for which files that would need to be done - if not, it would 
need to do its own book-keeping.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
  2023-03-03 19:42 ` [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Bernd Schubert
@ 2023-03-06 14:06   ` Aleksandr Mikhalitsyn
  0 siblings, 0 replies; 27+ messages in thread
From: Aleksandr Mikhalitsyn @ 2023-03-06 14:06 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu

On Fri, Mar 3, 2023 at 8:43 PM Bernd Schubert <bschubert@ddn.com> wrote:
>
>
>
> On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> > Hello everyone,
> >
> > It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
> >
> > Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> > about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
> >
> > The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> > It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> > restorable process tree, but instead use a "fake daemon".
> >
> > This leads to many other technical problems:
> > * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > like no_open, no_flush, readahead, and so on).
> > * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
> >
> > We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> > Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
> >
> > My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> > LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> > one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> > files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> > because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> > The solution is fairly easy:
> > allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> > This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> > and other stuff, we are only interested in mounts.
>

Hello, Bernd!

Thanks a lot for your attention/review to this patch series!

>
> I like your patches, small and easy to read :)

Glad to hear, thanks! ;-)

> So this basically fails all existing open files - our (future) needs go
> beyond that. I wonder if we can extend it later and re-init the new
> daemon with something like "fuse_queue_recall" - basically the opposite
> of fuse_queue_forget. Not sure if fuse can access the vfs dentry cache
> to know for which files that would need to be done - if not, it would
> need to do its own book-keeping.
>

I thought about this (problem with existing opened FDs) too, it's just
a first approach to the problem and as far as I mentioned I have no
CRIU-side implementation (so it's not tested with full
Checkpoint/Restore),
but it works well for "fuse reinitialization" (which is needed for
LXCFS and can be useful for other fuse filesystems).

I think we can easily extend this later if we come up with some
agreement about generic UAPI.

I hope that more people will react to this and post their opinions.

Kind regards,
Alex

>
> Thanks,
> Bernd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
  2023-03-03 19:26   ` Bernd Schubert
@ 2023-03-06 14:09     ` Aleksandr Mikhalitsyn
  2023-04-03 14:51       ` Aleksandr Mikhalitsyn
  0 siblings, 1 reply; 27+ messages in thread
From: Aleksandr Mikhalitsyn @ 2023-03-06 14:09 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu

On Fri, Mar 3, 2023 at 8:26 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> > This ioctl aborts fuse connection and then reinitializes it,
> > sends FUSE_INIT request to allow a new userspace daemon
> > to pick up the fuse connection.
> >
> > Cc: Miklos Szeredi <mszeredi@redhat.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Stéphane Graber <stgraber@ubuntu.com>
> > Cc: Seth Forshee <sforshee@kernel.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: criu@openvz.org
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> > ---
> >   fs/fuse/dev.c             | 132 ++++++++++++++++++++++++++++++++++++++
> >   include/uapi/linux/fuse.h |   1 +
> >   2 files changed, 133 insertions(+)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 737764c2295e..0f53ffd63957 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
> >   }
> >   EXPORT_SYMBOL_GPL(fuse_abort_conn);
> >
> > +static int fuse_reinit_conn(struct fuse_conn *fc)
> > +{
> > +     struct fuse_iqueue *fiq = &fc->iq;
> > +     struct fuse_dev *fud;
> > +     unsigned int i;
> > +
> > +     if (fc->conn_gen + 1 < fc->conn_gen)
> > +             return -EOVERFLOW;
> > +
> > +     fuse_abort_conn(fc);
> > +     fuse_wait_aborted(fc);
>
> Shouldn't this also try to flush all data first?

I think we should. Thanks for pointing to that!

I've read all your comments and I'll prepare -v2 series soon.

Thanks a lot, Bernd!

>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
  2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
                   ` (9 preceding siblings ...)
  2023-03-03 19:42 ` [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Bernd Schubert
@ 2023-03-06 16:14 ` Miklos Szeredi
  2023-03-06 16:44   ` Aleksandr Mikhalitsyn
  10 siblings, 1 reply; 27+ messages in thread
From: Miklos Szeredi @ 2023-03-06 16:14 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu

On Mon, 20 Feb 2023 at 20:38, Alexander Mikhalitsyn
<aleksandr.mikhalitsyn@canonical.com> wrote:
>
> Hello everyone,
>
> It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
>
> Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
>
> The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> restorable process tree, but instead use a "fake daemon".
>
> This leads to many other technical problems:
> * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> like no_open, no_flush, readahead, and so on).
> * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
>
> We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
>
> My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> The solution is fairly easy:
> allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> and other stuff, we are only interested in mounts.
>
> Current PoC implementation was developed and tested with this "recovery case".
> Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.

Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
are crash recovery related, and not useful for C/R.  Why is this being
advertised as a precursor for CRIU support?

BTW here's some earlier attempt at partial recovery, which might be interesting:

  https://lore.kernel.org/all/CAPm50a+j8UL9g3UwpRsye5e+a=M0Hy7Tf1FdfwOrUUBWMyosNg@mail.gmail.com/

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
  2023-03-06 16:14 ` Miklos Szeredi
@ 2023-03-06 16:44   ` Aleksandr Mikhalitsyn
  2023-03-06 19:18     ` Miklos Szeredi
  0 siblings, 1 reply; 27+ messages in thread
From: Aleksandr Mikhalitsyn @ 2023-03-06 16:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu, flyingpeng

On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 20 Feb 2023 at 20:38, Alexander Mikhalitsyn
> <aleksandr.mikhalitsyn@canonical.com> wrote:
> >
> > Hello everyone,
> >
> > It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
> >
> > Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> > about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
> >
> > The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> > It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> > restorable process tree, but instead use a "fake daemon".
> >
> > This leads to many other technical problems:
> > * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > like no_open, no_flush, readahead, and so on).
> > * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
> >
> > We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> > Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
> >
> > My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> > LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> > one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> > files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> > because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> > The solution is fairly easy:
> > allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> > This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> > and other stuff, we are only interested in mounts.
> >
> > Current PoC implementation was developed and tested with this "recovery case".
> > Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.
>

Hi Miklos,

> Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
> are crash recovery related, and not useful for C/R.  Why is this being
> advertised as a precursor for CRIU support?

It's because I'm doing this with CRIU in mind too, I think it's a good
way to make a universal interface
which can address not only the recovery case but also the C/R, cause
in some sense it's a close problem.
But of course, Checkpoint/Restore is a way more trickier. But before
doing all the work with CRIU PoC,
I wanted to consult with you and folks if there are any serious
objections to this interface/feature or, conversely,
if there is someone else who is interested in it.

Now about interfaces REINIT, BM_REVAL.

I think it will be useful for CRIU case, but probably I need to extend
it a little bit, as I mentioned earlier in the cover letter:
> >* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > like no_open, no_flush, readahead, and so on).

So, after the "fake" demon has done its job during CRIU restore, we
need to replace it with the actual demon from
the dumpee tree and performing REINIT looks like a sanner way.

The next point is that if we use REINIT during CRIU restore, then we
automatically need to have BM_REINIT too,
otherwise all restored bind mounts become invalid.

Conn generation is not a problem for CRIU if we are not exposing it to
the userspace. It's just a technical thing to distinguish
old and new inodes/struct file's.

>
> BTW here's some earlier attempt at partial recovery, which might be interesting:
>
>   https://lore.kernel.org/all/CAPm50a+j8UL9g3UwpRsye5e+a=M0Hy7Tf1FdfwOrUUBWMyosNg@mail.gmail.com/

Oh, that's interesting. Thanks for mentioning this! And the most
interesting thing is that Peng Hao mentioned LXCFS as a use case :-)

Added Peng Hao to CC

Kind regards,
Alex

>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
  2023-03-06 16:44   ` Aleksandr Mikhalitsyn
@ 2023-03-06 19:18     ` Miklos Szeredi
  2023-03-06 21:05       ` Bernd Schubert
  0 siblings, 1 reply; 27+ messages in thread
From: Miklos Szeredi @ 2023-03-06 19:18 UTC (permalink / raw)
  To: Aleksandr Mikhalitsyn
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu, flyingpeng

On Mon, 6 Mar 2023 at 17:44, Aleksandr Mikhalitsyn
<aleksandr.mikhalitsyn@canonical.com> wrote:
>
> On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <miklos@szeredi.hu> wrote:

> > Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
> > are crash recovery related, and not useful for C/R.  Why is this being
> > advertised as a precursor for CRIU support?
>
> It's because I'm doing this with CRIU in mind too, I think it's a good
> way to make a universal interface
> which can address not only the recovery case but also the C/R, cause
> in some sense it's a close problem.

That's what I'm wondering about...

Crash recovery is about restoring (or at least regenerating) state in
the userspace server.

In CRIU restoring the state of the userspace server is a solved
problem, the issue is restoring state in the kernel part of fuse.  In
a sense it's the exact opposite problem that crash recovery is doing.

> But of course, Checkpoint/Restore is a way more trickier. But before
> doing all the work with CRIU PoC,
> I wanted to consult with you and folks if there are any serious
> objections to this interface/feature or, conversely,
> if there is someone else who is interested in it.
>
> Now about interfaces REINIT, BM_REVAL.
>
> I think it will be useful for CRIU case, but probably I need to extend
> it a little bit, as I mentioned earlier in the cover letter:
> > >* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > > like no_open, no_flush, readahead, and so on).
>
> So, after the "fake" demon has done its job during CRIU restore, we
> need to replace it with the actual demon from
> the dumpee tree and performing REINIT looks like a sanner way.

I don't get it.  How does REINIT help with switching to the real daemon?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
  2023-03-06 19:18     ` Miklos Szeredi
@ 2023-03-06 21:05       ` Bernd Schubert
  2023-03-06 22:16         ` Aleksandr Mikhalitsyn
  0 siblings, 1 reply; 27+ messages in thread
From: Bernd Schubert @ 2023-03-06 21:05 UTC (permalink / raw)
  To: Miklos Szeredi, Aleksandr Mikhalitsyn
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu, flyingpeng



On 3/6/23 20:18, Miklos Szeredi wrote:
> On Mon, 6 Mar 2023 at 17:44, Aleksandr Mikhalitsyn
> <aleksandr.mikhalitsyn@canonical.com> wrote:
>>
>> On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
>>> Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
>>> are crash recovery related, and not useful for C/R.  Why is this being
>>> advertised as a precursor for CRIU support?
>>
>> It's because I'm doing this with CRIU in mind too, I think it's a good
>> way to make a universal interface
>> which can address not only the recovery case but also the C/R, cause
>> in some sense it's a close problem.
> 
> That's what I'm wondering about...
> 
> Crash recovery is about restoring (or at least regenerating) state in
> the userspace server.
> 
> In CRIU restoring the state of the userspace server is a solved
> problem, the issue is restoring state in the kernel part of fuse.  In
> a sense it's the exact opposite problem that crash recovery is doing.
> 
>> But of course, Checkpoint/Restore is a way more trickier. But before
>> doing all the work with CRIU PoC,
>> I wanted to consult with you and folks if there are any serious
>> objections to this interface/feature or, conversely,
>> if there is someone else who is interested in it.
>>
>> Now about interfaces REINIT, BM_REVAL.
>>
>> I think it will be useful for CRIU case, but probably I need to extend
>> it a little bit, as I mentioned earlier in the cover letter:
>>>> * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
>>>> This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
>>>> like no_open, no_flush, readahead, and so on).
>>
>> So, after the "fake" demon has done its job during CRIU restore, we
>> need to replace it with the actual demon from
>> the dumpee tree and performing REINIT looks like a sanner way.
> 
> I don't get it.  How does REINIT help with switching to the real daemon?

The way I read the patches, the new daemon sends FUSE_INIT to advertise 
all of its features.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
  2023-03-06 21:05       ` Bernd Schubert
@ 2023-03-06 22:16         ` Aleksandr Mikhalitsyn
  0 siblings, 0 replies; 27+ messages in thread
From: Aleksandr Mikhalitsyn @ 2023-03-06 22:16 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, mszeredi, Al Viro, Amir Goldstein,
	Stéphane Graber, Seth Forshee, Christian Brauner,
	Andrei Vagin, Pavel Tikhomirov, linux-fsdevel, linux-kernel,
	criu, flyingpeng

On Mon, Mar 6, 2023 at 10:05 PM Bernd Schubert <bschubert@ddn.com> wrote:
>
>
>
> On 3/6/23 20:18, Miklos Szeredi wrote:
> > On Mon, 6 Mar 2023 at 17:44, Aleksandr Mikhalitsyn
> > <aleksandr.mikhalitsyn@canonical.com> wrote:
> >>
> >> On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> >>> Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
> >>> are crash recovery related, and not useful for C/R.  Why is this being
> >>> advertised as a precursor for CRIU support?
> >>
> >> It's because I'm doing this with CRIU in mind too, I think it's a good
> >> way to make a universal interface
> >> which can address not only the recovery case but also the C/R, cause
> >> in some sense it's a close problem.
> >
> > That's what I'm wondering about...
> >
> > Crash recovery is about restoring (or at least regenerating) state in
> > the userspace server.
> >
> > In CRIU restoring the state of the userspace server is a solved
> > problem, the issue is restoring state in the kernel part of fuse.  In
> > a sense it's the exact opposite problem that crash recovery is doing.

I can't argue, you're right. In the "recover" case we don't care about userspace
state, we just want to forget everything in the kernel but only keep
mounts (someone may want to keep opened FDs too).
In the C/R case we want to recreate full userspace and kernel states.

These are different problems, but in some parts they require the same UAPIs.
I think I need to write a detailed motivation for the CRIU part in the
-v2 cover letter, so we can discuss it. What do you think?

> >
> >> But of course, Checkpoint/Restore is a way more trickier. But before
> >> doing all the work with CRIU PoC,
> >> I wanted to consult with you and folks if there are any serious
> >> objections to this interface/feature or, conversely,
> >> if there is someone else who is interested in it.
> >>
> >> Now about interfaces REINIT, BM_REVAL.
> >>
> >> I think it will be useful for CRIU case, but probably I need to extend
> >> it a little bit, as I mentioned earlier in the cover letter:
> >>>> * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> >>>> This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> >>>> like no_open, no_flush, readahead, and so on).
> >>
> >> So, after the "fake" demon has done its job during CRIU restore, we
> >> need to replace it with the actual demon from
> >> the dumpee tree and performing REINIT looks like a sanner way.
> >
> > I don't get it.  How does REINIT help with switching to the real daemon?
>
> The way I read the patches, the new daemon sends FUSE_INIT to advertise
> all of its features.

Yes, thanks, Bernd!

Theoretically, we can implement some basic C/R without using reinit.
It was my first idea and I've described it in my LPC 2022 talk,
but this approach is not fully safe and universal because CRIU fake
daemon will implement a particular fuse protocol version (and define a
particular set on fuse ops/features),
but the dumpee fuse daemon can use a different set of fuse ops and
fuse protocol version. So, changing fuse daemon fully transparently to
the kernel is not fully safe.

Thank you guys for your attention to this!

Kind regards,
Alex

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
  2023-03-06 14:09     ` Aleksandr Mikhalitsyn
@ 2023-04-03 14:51       ` Aleksandr Mikhalitsyn
  2023-04-11 22:18         ` Bernd Schubert
  0 siblings, 1 reply; 27+ messages in thread
From: Aleksandr Mikhalitsyn @ 2023-04-03 14:51 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu

On Mon, Mar 6, 2023 at 3:09 PM Aleksandr Mikhalitsyn
<aleksandr.mikhalitsyn@canonical.com> wrote:
>
> On Fri, Mar 3, 2023 at 8:26 PM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
> >
> >
> >
> > On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> > > This ioctl aborts fuse connection and then reinitializes it,
> > > sends FUSE_INIT request to allow a new userspace daemon
> > > to pick up the fuse connection.
> > >
> > > Cc: Miklos Szeredi <mszeredi@redhat.com>
> > > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Stéphane Graber <stgraber@ubuntu.com>
> > > Cc: Seth Forshee <sforshee@kernel.org>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Andrei Vagin <avagin@gmail.com>
> > > Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: criu@openvz.org
> > > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> > > ---
> > >   fs/fuse/dev.c             | 132 ++++++++++++++++++++++++++++++++++++++
> > >   include/uapi/linux/fuse.h |   1 +
> > >   2 files changed, 133 insertions(+)
> > >
> > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > index 737764c2295e..0f53ffd63957 100644
> > > --- a/fs/fuse/dev.c
> > > +++ b/fs/fuse/dev.c
> > > @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
> > >   }
> > >   EXPORT_SYMBOL_GPL(fuse_abort_conn);
> > >
> > > +static int fuse_reinit_conn(struct fuse_conn *fc)
> > > +{
> > > +     struct fuse_iqueue *fiq = &fc->iq;
> > > +     struct fuse_dev *fud;
> > > +     unsigned int i;
> > > +
> > > +     if (fc->conn_gen + 1 < fc->conn_gen)
> > > +             return -EOVERFLOW;
> > > +
> > > +     fuse_abort_conn(fc);
> > > +     fuse_wait_aborted(fc);
> >
> > Shouldn't this also try to flush all data first?

Dear Bernd,

I've reviewed this place 2nd time and I'm not sure that we have to
perform any flushing there, because userspace daemon can be dead or
stuck.
Technically, if userspace knows that daemon is alive then it can call
fsync/sync before doing reinit.

What do you think about it?

Kind regards,
Alex

>
> I think we should. Thanks for pointing to that!
>
> I've read all your comments and I'll prepare -v2 series soon.
>
> Thanks a lot, Bernd!
>
> >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
  2023-03-03 19:19   ` Bernd Schubert
@ 2023-04-03 14:56     ` Aleksandr Mikhalitsyn
  0 siblings, 0 replies; 27+ messages in thread
From: Aleksandr Mikhalitsyn @ 2023-04-03 14:56 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu

On Fri, Mar 3, 2023 at 8:19 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> > This ioctl aborts fuse connection and then reinitializes it,
> > sends FUSE_INIT request to allow a new userspace daemon
> > to pick up the fuse connection.
> >
> > Cc: Miklos Szeredi <mszeredi@redhat.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Stéphane Graber <stgraber@ubuntu.com>
> > Cc: Seth Forshee <sforshee@kernel.org>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: criu@openvz.org
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> > ---
> >   fs/fuse/dev.c             | 132 ++++++++++++++++++++++++++++++++++++++
> >   include/uapi/linux/fuse.h |   1 +
> >   2 files changed, 133 insertions(+)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 737764c2295e..0f53ffd63957 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
> >   }
> >   EXPORT_SYMBOL_GPL(fuse_abort_conn);
> >
> > +static int fuse_reinit_conn(struct fuse_conn *fc)
> > +{
> > +     struct fuse_iqueue *fiq = &fc->iq;
> > +     struct fuse_dev *fud;
> > +     unsigned int i;
>
> Assuming you have a malicious daemon that tries to cause bad behavior,
> only allow one ioctl at at time? I.e. add a value that reinit is in
> progress? And unset at the end of the function?

Have done. Thanks!
>
> > +
> > +     if (fc->conn_gen + 1 < fc->conn_gen)
> > +             return -EOVERFLOW;
> > +
>
> Add a comment, like
>
> /* Unsets fc->connected and fiq->connected and ensures that no new
> requests can be queued */
>
> ?

Have done.

>
> > +     fuse_abort_conn(fc);
> > +     fuse_wait_aborted(fc);
> > +
> > +     spin_lock(&fc->lock);
> > +     if (fc->connected) {
> > +             spin_unlock(&fc->lock);
> > +             return -EINVAL;
> > +     }
> > +
> > +     if (fc->conn_gen + 1 < fc->conn_gen) {
> > +             spin_unlock(&fc->lock);
> > +             return -EOVERFLOW;
> > +     }
> > +
> > +     fc->conn_gen++;
> > +
> > +     spin_lock(&fiq->lock);
> > +     if (request_pending(fiq) || fiq->forget_list_tail != &fiq->forget_list_head) {
> > +             spin_unlock(&fiq->lock);
> > +             spin_unlock(&fc->lock);
> > +             return -EINVAL;
> > +     }
> > +
> > +     if (&fuse_dev_fiq_ops != fiq->ops) {
> > +             spin_unlock(&fiq->lock);
> > +             spin_unlock(&fc->lock);
> > +             return -EOPNOTSUPP;
> > +     }
> > +
> > +     fiq->connected = 1;
> > +     spin_unlock(&fiq->lock);
> > +
> > +     spin_lock(&fc->bg_lock);
> > +     if (!list_empty(&fc->bg_queue)) {
> > +             spin_unlock(&fc->bg_lock);
> > +             spin_unlock(&fc->lock);
> > +             return -EINVAL;
> > +     }
> > +
> > +     fc->blocked = 0;
> > +     fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> > +     spin_unlock(&fc->bg_lock);
> > +
> > +     list_for_each_entry(fud, &fc->devices, entry) {
> > +             struct fuse_pqueue *fpq = &fud->pq;
> > +
> > +             spin_lock(&fpq->lock);
> > +             if (!list_empty(&fpq->io)) {
> > +                     spin_unlock(&fpq->lock);
> > +                     spin_unlock(&fc->lock);
> > +                     return -EINVAL;
> > +             }
> > +
> > +             for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> > +                     if (!list_empty(&fpq->processing[i])) {
> > +                             spin_unlock(&fpq->lock);
> > +                             spin_unlock(&fc->lock);
> > +                             return -EINVAL;
> > +                     }
> > +             }
> > +
> > +             fpq->connected = 1;
> > +             spin_unlock(&fpq->lock);
> > +     }
> > +
> > +     fuse_set_initialized(fc);
>
> I'm not sure about this, why not the common way via FUSE_INIT reply?

fuse_send_init will fail, if !fc->initialized (see fuse_block_alloc <-
fuse_get_req <- fuse_simple_background).

>
> > +
> > +     /* Background queuing checks fc->connected under bg_lock */
> > +     spin_lock(&fc->bg_lock);
> > +     fc->connected = 1;
> > +     spin_unlock(&fc->bg_lock);
> > +
> > +     fc->aborted = false;
> > +     fc->abort_err = 0;
> > +
> > +     /* nullify all the flags */
> > +     memset(&fc->flags, 0, sizeof(struct fuse_conn_flags));
> > +
> > +     spin_unlock(&fc->lock);
> > +
> > +     down_read(&fc->killsb);
> > +     if (!list_empty(&fc->mounts)) {
> > +             struct fuse_mount *fm;
> > +
> > +             fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
> > +             if (!fm->sb) {
> > +                     up_read(&fc->killsb);
> > +                     return -EINVAL;
> > +             }
> > +
> > +             fuse_send_init(fm);
> > +     }
> > +     up_read(&fc->killsb);
> > +
> > +     return 0;
> > +}
> > +
> >   void fuse_wait_aborted(struct fuse_conn *fc)
> >   {
> >       /* matches implicit memory barrier in fuse_drop_waiting() */
> > @@ -2282,6 +2388,32 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> >                       }
> >               }
> >               break;
> > +     case FUSE_DEV_IOC_REINIT:
> > +             struct fuse_conn *fc;
> > +
> > +             if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
> > +                     return -EPERM;
> > +
> > +             res = -EINVAL;
> > +             fud = fuse_get_dev(file);
> > +
> > +             /*
> > +              * Only fuse mounts with an already initialized fuse
> > +              * connection are supported
> > +              */
> > +             if (file->f_op == &fuse_dev_operations && fud) {
> > +                     mutex_lock(&fuse_mutex);
> > +                     fc = fud->fc;
> > +                     if (fc)
> > +                             fc = fuse_conn_get(fc);
> > +                     mutex_unlock(&fuse_mutex);
> > +
> > +                     if (fc) {
> > +                             res = fuse_reinit_conn(fc);
> > +                             fuse_conn_put(fc);
> > +                     }
> > +             }
> > +             break;
> >       default:
> >               res = -ENOTTY;
> >               break;
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 1b9d0dfae72d..3dac67b25eae 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -989,6 +989,7 @@ struct fuse_notify_retrieve_in {
> >   /* Device ioctls: */
> >   #define FUSE_DEV_IOC_MAGIC          229
> >   #define FUSE_DEV_IOC_CLONE          _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
> > +#define FUSE_DEV_IOC_REINIT          _IO(FUSE_DEV_IOC_MAGIC, 0)
> >
> >   struct fuse_lseek_in {
> >       uint64_t        fh;

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
  2023-04-03 14:51       ` Aleksandr Mikhalitsyn
@ 2023-04-11 22:18         ` Bernd Schubert
  0 siblings, 0 replies; 27+ messages in thread
From: Bernd Schubert @ 2023-04-11 22:18 UTC (permalink / raw)
  To: Aleksandr Mikhalitsyn
  Cc: mszeredi, Al Viro, Amir Goldstein, Stéphane Graber,
	Seth Forshee, Christian Brauner, Andrei Vagin, Pavel Tikhomirov,
	linux-fsdevel, linux-kernel, criu



On 4/3/23 16:51, Aleksandr Mikhalitsyn wrote:
> On Mon, Mar 6, 2023 at 3:09 PM Aleksandr Mikhalitsyn
> <aleksandr.mikhalitsyn@canonical.com> wrote:
>>
>> On Fri, Mar 3, 2023 at 8:26 PM Bernd Schubert
>> <bernd.schubert@fastmail.fm> wrote:
>>>
>>>
>>>
>>> On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
>>>> This ioctl aborts fuse connection and then reinitializes it,
>>>> sends FUSE_INIT request to allow a new userspace daemon
>>>> to pick up the fuse connection.
>>>>
>>>> Cc: Miklos Szeredi <mszeredi@redhat.com>
>>>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>>>> Cc: Amir Goldstein <amir73il@gmail.com>
>>>> Cc: Stéphane Graber <stgraber@ubuntu.com>
>>>> Cc: Seth Forshee <sforshee@kernel.org>
>>>> Cc: Christian Brauner <brauner@kernel.org>
>>>> Cc: Andrei Vagin <avagin@gmail.com>
>>>> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>>>> Cc: linux-fsdevel@vger.kernel.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Cc: criu@openvz.org
>>>> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
>>>> ---
>>>>    fs/fuse/dev.c             | 132 ++++++++++++++++++++++++++++++++++++++
>>>>    include/uapi/linux/fuse.h |   1 +
>>>>    2 files changed, 133 insertions(+)
>>>>
>>>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>>>> index 737764c2295e..0f53ffd63957 100644
>>>> --- a/fs/fuse/dev.c
>>>> +++ b/fs/fuse/dev.c
>>>> @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(fuse_abort_conn);
>>>>
>>>> +static int fuse_reinit_conn(struct fuse_conn *fc)
>>>> +{
>>>> +     struct fuse_iqueue *fiq = &fc->iq;
>>>> +     struct fuse_dev *fud;
>>>> +     unsigned int i;
>>>> +
>>>> +     if (fc->conn_gen + 1 < fc->conn_gen)
>>>> +             return -EOVERFLOW;
>>>> +
>>>> +     fuse_abort_conn(fc);
>>>> +     fuse_wait_aborted(fc);
>>>
>>> Shouldn't this also try to flush all data first?
> 
> Dear Bernd,
> 
> I've reviewed this place 2nd time and I'm not sure that we have to
> perform any flushing there, because userspace daemon can be dead or
> stuck.
> Technically, if userspace knows that daemon is alive then it can call
> fsync/sync before doing reinit.
> 
> What do you think about it?

Hello Alex,

sorry for my late reply.

Hmm, I just fear that fsync/sync is a bit racy, what is if a user would 
write data after the sync and that would get silently removed by 
fuse_abort_conn()? Isn't what we want:

ioctl
    refuse new requests -> unset fc->initialized
    flush all fc queues (fc->iq.pending, fc->bg_queue, I guess with your 
current patches we do not need to handle forget)
    fuse_abort_conn


So what is missing is the information if the daemon is still running - 
take a daemon reference and then check for PF_EXITING, as in my uring 
patches? Miklos has some objections for that, though.


The alternative would be to mount read-only, then sync, then do the 
ioctl and remount back. I don't know what needs to be done to get 
remount working, though. Just handle it in libfuse mount.fuse and send 
the mount syscall?


As I wrote before, at DDN we want to have run time daemon restart - I'm 
also not opposed to entirely give up on the flush and to just work on a 
restart protocol to make the new daemon to the old state (opened files 
and lookup/forget count). In principle we could even transfer that in 
userspace from one daemon to the other?


Thanks,
Bernd

PS: Will look at the new patches later this week.


Thanks,
Bernd



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore
@ 2023-04-03 16:54 Askar Safin
  0 siblings, 0 replies; 27+ messages in thread
From: Askar Safin @ 2023-04-03 16:54 UTC (permalink / raw)
  To: aleksandr.mikhalitsyn; +Cc: criu, linux-fsdevel, linux-kernel, ptikhomirov

These patches seem to be related to the well-known suspend+fuse problem
(see my comment here:
https://bugzilla.kernel.org/show_bug.cgi?id=34932#c12 ). Still
reproducible on modern kernels. Please, somehow address this
long-standing problem.

CC me when answering

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-04-11 22:18 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-20 19:37 [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Alexander Mikhalitsyn
2023-02-20 19:37 ` [RFC PATCH 1/9] fuse: move FUSE_DEFAULT_* defines to fuse common header Alexander Mikhalitsyn
2023-02-20 19:37 ` [RFC PATCH 2/9] fuse: add const qualifiers to common fuse helpers Alexander Mikhalitsyn
2023-02-20 19:37 ` [RFC PATCH 3/9] fuse: add fuse connection generation Alexander Mikhalitsyn
2023-02-20 19:37 ` [RFC PATCH 4/9] fuse: handle stale inode connection in fuse_queue_forget Alexander Mikhalitsyn
2023-03-03 18:47   ` Bernd Schubert
2023-02-20 19:37 ` [RFC PATCH 5/9] fuse: move fuse connection flags to the separate structure Alexander Mikhalitsyn
2023-03-03 18:26   ` Bernd Schubert
2023-02-20 19:37 ` [RFC PATCH 6/9] fuse: take fuse connection generation into account Alexander Mikhalitsyn
2023-03-03 18:45   ` Bernd Schubert
2023-02-20 19:37 ` [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT) Alexander Mikhalitsyn
2023-03-03 19:19   ` Bernd Schubert
2023-04-03 14:56     ` Aleksandr Mikhalitsyn
2023-03-03 19:26   ` Bernd Schubert
2023-03-06 14:09     ` Aleksandr Mikhalitsyn
2023-04-03 14:51       ` Aleksandr Mikhalitsyn
2023-04-11 22:18         ` Bernd Schubert
2023-02-20 19:37 ` [RFC PATCH 8/9] namespace: add sb_revalidate_bindmounts helper Alexander Mikhalitsyn
2023-02-20 19:37 ` [RFC PATCH 9/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL) Alexander Mikhalitsyn
2023-03-03 19:42 ` [RFC PATCH 0/9] fuse: API for Checkpoint/Restore Bernd Schubert
2023-03-06 14:06   ` Aleksandr Mikhalitsyn
2023-03-06 16:14 ` Miklos Szeredi
2023-03-06 16:44   ` Aleksandr Mikhalitsyn
2023-03-06 19:18     ` Miklos Szeredi
2023-03-06 21:05       ` Bernd Schubert
2023-03-06 22:16         ` Aleksandr Mikhalitsyn
2023-04-03 16:54 Askar Safin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.