linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker
@ 2023-10-02 15:24 Krister Johansen
  2023-10-02 15:24 ` [resend PATCH v2 1/2] fuse: revalidate: move lookup into a separate function Krister Johansen
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Krister Johansen @ 2023-10-02 15:24 UTC (permalink / raw)
  To: Miklos Szeredi, linux-fsdevel
  Cc: Miklos Szeredi, linux-kernel, German Maglione, Greg Kurz,
	Max Reitz, Bernd Schubert

Hi,
I recently ran into a situation where a virtiofs client began
encountering EBADF after the client / guest system had an OOM.  After
reproducing the issue and debugging, the problem is caused by a
virtiofsd submount having the nodeid of its root dentry fogotten.  This
occurs because it borrows the reference for this dentry from the parent
that is passed into the function.

In this particular case, the submount had been bind mounted into a
container's mount namespace.  The reference count on the original parent
dentry was 0, making it eligible for eviction.  However, because this
dentry was also the last reference the fuse client knew it had, it sent
a forget message to the server.  This caused all future references to
the FUSE node-id from virtiofsd perspective to become invalid.
Subsequent attempts to use the node-id for operations against the
submount's root received an EBADF from the server.

This pair of patches modifies the virtiofs submount code to perform a
lookup on the nodeid that forms the root of the submount.  The patch
before this pulls the revalidate lookup code into a helper function that
can be used both in revalidate and submount superblock fill.

Tested via:

- fstests for virtiofs
- fstests for fuse (against passthrough_ll)
- manual testing to watch how refcounts change between client and server
  in response to filesytem access, umount, and eviction by the shrinker.

This resend has rebased against the latest tip of fuse/for-next and
massaged the commit messages in the patches, but hasn't made any
functional modifications since the original v2.

There's also been an issue opened with the project that uses this
functionality.  More details on that can be found at [1].

Changes since v1:

- Cleanups to pacify test robot

Changes since RFC:

- Modified fuse_fill_super_submount to always fail if dentry cannot be
  revalidated.  (Feedback from Bernd Schubert)
- Fixed up an edge case where looked up but subsequently declared
  invalid dentries were not correctly tracking nlookup.  (Error was
  introduced in my RFC).

Thanks,

-K

[1] https://github.com/kata-containers/kata-containers/issues/8040

Krister Johansen (2):
  fuse: revalidate: move lookup into a separate function
  fuse: ensure that submounts lookup their parent

 fs/fuse/dir.c    | 85 +++++++++++++++++++++++++++++++++---------------
 fs/fuse/fuse_i.h |  6 ++++
 fs/fuse/inode.c  | 43 ++++++++++++++++++++----
 3 files changed, 101 insertions(+), 33 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [resend PATCH v2 1/2] fuse: revalidate: move lookup into a separate function
  2023-10-02 15:24 [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker Krister Johansen
@ 2023-10-02 15:24 ` Krister Johansen
  2023-10-02 15:24 ` [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent Krister Johansen
  2023-10-02 22:18 ` [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker Bernd Schubert
  2 siblings, 0 replies; 25+ messages in thread
From: Krister Johansen @ 2023-10-02 15:24 UTC (permalink / raw)
  To: Miklos Szeredi, linux-fsdevel
  Cc: Miklos Szeredi, linux-kernel, German Maglione, Greg Kurz,
	Max Reitz, Bernd Schubert

Move the lookup parts of fuse_dentry_revalidate into a common function.
This function will be used elsewhere in a separate commit.  In the
meantime, the new function fuse_dentry_revalidate_lookup is responsible
for just the lookup and validation portions of the revalidate dance.
The fuse_dentry_revalidate function retains the responsibility for
invalidating and mutating any state associated with the origial
fuse_inode and dentry.

The rationale for this refactoring is to allow a lookup to be triggered
as part of creating a submount.  The submount code lives in another
module.  A subsequent commit will utilize the common function that is
created by this commit.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 fs/fuse/dir.c | 85 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 58 insertions(+), 27 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index d707e6987da9..5e01946d7531 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -183,6 +183,57 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
 	args->out_args[0].value = outarg;
 }
 
+static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
+					 struct dentry *entry,
+					 struct inode *inode,
+					 struct fuse_entry_out *outarg,
+					 bool *lookedup)
+{
+	struct dentry *parent;
+	struct fuse_forget_link *forget;
+	FUSE_ARGS(args);
+	int ret;
+
+	forget = fuse_alloc_forget();
+	ret = -ENOMEM;
+	if (!forget)
+		goto out;
+
+	parent = dget_parent(entry);
+	fuse_lookup_init(fm->fc, &args, get_node_id(d_inode(parent)),
+			 &entry->d_name, outarg);
+	ret = fuse_simple_request(fm, &args);
+	dput(parent);
+
+	/* Zero nodeid is same as -ENOENT */
+	if (!ret && !outarg->nodeid)
+		ret = -ENOENT;
+	if (!ret) {
+		if (outarg->nodeid != get_node_id(inode) ||
+		    (bool) IS_AUTOMOUNT(inode) != (bool) (outarg->attr.flags & FUSE_ATTR_SUBMOUNT)) {
+			fuse_queue_forget(fm->fc, forget,
+					  outarg->nodeid, 1);
+			goto invalid;
+		}
+		*lookedup = true;
+	}
+	kfree(forget);
+	if (ret == -ENOMEM || ret == -EINTR)
+		goto out;
+	if (ret || fuse_invalid_attr(&outarg->attr) ||
+	    fuse_stale_inode(inode, outarg->generation, &outarg->attr)) {
+		goto invalid;
+	}
+
+	ret = 1;
+out:
+	return ret;
+
+invalid:
+	ret = 0;
+	goto out;
+}
+
 /*
  * Check whether the dentry is still valid
  *
@@ -206,9 +257,8 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 	else if (time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
 		 (flags & (LOOKUP_EXCL | LOOKUP_REVAL | LOOKUP_RENAME_TARGET))) {
 		struct fuse_entry_out outarg;
-		FUSE_ARGS(args);
-		struct fuse_forget_link *forget;
 		u64 attr_version;
+		bool lookedup = false;
 
 		/* For negative dentries, always do a fresh lookup */
 		if (!inode)
@@ -220,38 +270,19 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 
 		fm = get_fuse_mount(inode);
 
-		forget = fuse_alloc_forget();
-		ret = -ENOMEM;
-		if (!forget)
-			goto out;
-
 		attr_version = fuse_get_attr_version(fm->fc);
 
-		parent = dget_parent(entry);
-		fuse_lookup_init(fm->fc, &args, get_node_id(d_inode(parent)),
-				 &entry->d_name, &outarg);
-		ret = fuse_simple_request(fm, &args);
-		dput(parent);
-		/* Zero nodeid is same as -ENOENT */
-		if (!ret && !outarg.nodeid)
-			ret = -ENOENT;
-		if (!ret) {
+		ret = fuse_dentry_revalidate_lookup(fm, entry, inode, &outarg,
+						    &lookedup);
+		if (ret == -ENOMEM || ret == -EINTR)
+			goto out;
+		if (lookedup) {
 			fi = get_fuse_inode(inode);
-			if (outarg.nodeid != get_node_id(inode) ||
-			    (bool) IS_AUTOMOUNT(inode) != (bool) (outarg.attr.flags & FUSE_ATTR_SUBMOUNT)) {
-				fuse_queue_forget(fm->fc, forget,
-						  outarg.nodeid, 1);
-				goto invalid;
-			}
 			spin_lock(&fi->lock);
 			fi->nlookup++;
 			spin_unlock(&fi->lock);
 		}
-		kfree(forget);
-		if (ret == -ENOMEM || ret == -EINTR)
-			goto out;
-		if (ret || fuse_invalid_attr(&outarg.attr) ||
-		    fuse_stale_inode(inode, outarg.generation, &outarg.attr))
+		if (ret <= 0)
 			goto invalid;
 
 		forget_all_cached_acls(inode);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-02 15:24 [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker Krister Johansen
  2023-10-02 15:24 ` [resend PATCH v2 1/2] fuse: revalidate: move lookup into a separate function Krister Johansen
@ 2023-10-02 15:24 ` Krister Johansen
  2023-10-06 17:13   ` Bernd Schubert
  2023-10-09 19:45   ` Miklos Szeredi
  2023-10-02 22:18 ` [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker Bernd Schubert
  2 siblings, 2 replies; 25+ messages in thread
From: Krister Johansen @ 2023-10-02 15:24 UTC (permalink / raw)
  To: Miklos Szeredi, linux-fsdevel
  Cc: Miklos Szeredi, linux-kernel, German Maglione, Greg Kurz,
	Max Reitz, Bernd Schubert

The submount code uses the parent nodeid passed into the function in
order to create the root dentry for the new submount.  This nodeid does
not get its remote reference count incremented by a lookup option.

If the parent inode is evicted from its superblock, due to memory
pressure for example, it can result in a forget opertation being sent to
the server.  Should this nodeid be forgotten while it is still in use in
a submount, users of the submount get an error from the server on any
subsequent access.  In the author's case, this was an EBADF on all
subsequent operations that needed to reference the root.

Debugging the problem revealed that the dentry shrinker triggered a forget
after killing the dentry with the last reference, despite the root
dentry in another superblock still using the nodeid.

As a result, a container that was also using this submount failed to
access its filesystem because it had borrowed the reference instead of
taking its own when setting up its superblock for the submount.

This commit fixes the problem by having the new submount trigger a
lookup for the parent as part of creating a new root dentry for the
virtiofsd submount superblock.  This allows each superblock to have its
inodes removed by the shrinker when unreferenced, while keeping the
nodeid reference count accurate and active with the server.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 fs/fuse/dir.c    | 10 +++++-----
 fs/fuse/fuse_i.h |  6 ++++++
 fs/fuse/inode.c  | 43 +++++++++++++++++++++++++++++++++++++------
 3 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5e01946d7531..333730c74619 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
 	args->out_args[0].value = outarg;
 }
 
-static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
-					 struct dentry *entry,
-					 struct inode *inode,
-					 struct fuse_entry_out *outarg,
-					 bool *lookedup)
+int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
+				  struct dentry *entry,
+				  struct inode *inode,
+				  struct fuse_entry_out *outarg,
+				  bool *lookedup)
 {
 	struct dentry *parent;
 	struct fuse_forget_link *forget;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 405252bb51f2..a66fcf50a4cc 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags);
 bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment);
 void fuse_dax_cancel_work(struct fuse_conn *fc);
 
+/* dir.c */
+int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry,
+				  struct inode *inode,
+				  struct fuse_entry_out *outarg,
+				  bool *lookedup);
+
 /* ioctl.c */
 long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
 long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 444418e240c8..79a31cb55512 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb,
 	struct fuse_mount *fm = get_fuse_mount_super(sb);
 	struct super_block *parent_sb = parent_fi->inode.i_sb;
 	struct fuse_attr root_attr;
+	struct fuse_inode *fi;
 	struct inode *root;
+	struct inode *parent;
+	struct dentry *pdent;
+	struct fuse_entry_out outarg;
+	bool lookedup = false;
+	int ret;
 
 	fuse_sb_defaults(sb);
 	fm->sb = sb;
@@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb,
 	if (parent_sb->s_subtype && !sb->s_subtype)
 		return -ENOMEM;
 
-	fuse_fill_attr_from_inode(&root_attr, parent_fi);
-	root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0);
 	/*
-	 * This inode is just a duplicate, so it is not looked up and
-	 * its nlookup should not be incremented.  fuse_iget() does
-	 * that, though, so undo it here.
+	 * It is necessary to lookup the parent_if->nodeid in case the dentry
+	 * that triggered the automount of the submount is later evicted.
+	 * If this dentry is evicted without the lookup count getting increased
+	 * on the submount root, then the server can subsequently forget this
+	 * nodeid which leads to errors when trying to access the root of the
+	 * submount.
 	 */
-	get_fuse_inode(root)->nlookup--;
+	parent = &parent_fi->inode;
+	pdent = d_find_alias(parent);
+	if (!pdent)
+		return -EINVAL;
+
+	ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg,
+	    &lookedup);
+	dput(pdent);
+	/*
+	 * The new root owns this nlookup on success, and it is incremented by
+	 * fuse_iget().  In the case the lookup succeeded but revalidate fails,
+	 * ensure that the lookup count is tracked by the parent.
+	 */
+	if (ret <= 0) {
+		if (lookedup) {
+			fi = get_fuse_inode(parent);
+			spin_lock(&fi->lock);
+			fi->nlookup++;
+			spin_unlock(&fi->lock);
+		}
+		return ret ? ret : -EINVAL;
+	}
+
+	fuse_fill_attr_from_inode(&root_attr, parent_fi);
+	root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0);
 	sb->s_d_op = &fuse_dentry_operations;
 	sb->s_root = d_make_root(root);
 	if (!sb->s_root)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker
  2023-10-02 15:24 [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker Krister Johansen
  2023-10-02 15:24 ` [resend PATCH v2 1/2] fuse: revalidate: move lookup into a separate function Krister Johansen
  2023-10-02 15:24 ` [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent Krister Johansen
@ 2023-10-02 22:18 ` Bernd Schubert
  2023-10-03 16:48   ` Krister Johansen
  2 siblings, 1 reply; 25+ messages in thread
From: Bernd Schubert @ 2023-10-02 22:18 UTC (permalink / raw)
  To: Krister Johansen, Miklos Szeredi, linux-fsdevel
  Cc: Miklos Szeredi, linux-kernel, German Maglione, Greg Kurz, Max Reitz



On 10/2/23 17:24, Krister Johansen wrote:
> Hi,
> I recently ran into a situation where a virtiofs client began
> encountering EBADF after the client / guest system had an OOM.  After
> reproducing the issue and debugging, the problem is caused by a
> virtiofsd submount having the nodeid of its root dentry fogotten.  This
> occurs because it borrows the reference for this dentry from the parent
> that is passed into the function.


Sorry, I didn't forget you, just didn't manage to review the 2nd version 
yet. Will definitely do this week.
Please also note that there will be merge conflicts with atomic open 
patches from Dharmendra/me. Although probably not too difficult to resolve.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker
  2023-10-02 22:18 ` [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker Bernd Schubert
@ 2023-10-03 16:48   ` Krister Johansen
  2023-10-03 22:54     ` Bernd Schubert
  0 siblings, 1 reply; 25+ messages in thread
From: Krister Johansen @ 2023-10-03 16:48 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Krister Johansen, Miklos Szeredi, linux-fsdevel, Miklos Szeredi,
	linux-kernel, German Maglione, Greg Kurz, Max Reitz

On Tue, Oct 03, 2023 at 12:18:42AM +0200, Bernd Schubert wrote:
> 
> 
> On 10/2/23 17:24, Krister Johansen wrote:
> > Hi,
> > I recently ran into a situation where a virtiofs client began
> > encountering EBADF after the client / guest system had an OOM.  After
> > reproducing the issue and debugging, the problem is caused by a
> > virtiofsd submount having the nodeid of its root dentry fogotten.  This
> > occurs because it borrows the reference for this dentry from the parent
> > that is passed into the function.
> 
> 
> Sorry, I didn't forget you, just didn't manage to review the 2nd version
> yet. Will definitely do this week.

Thanks; I appreciate the feedback you've provided so far.

> Please also note that there will be merge conflicts with atomic open patches
> from Dharmendra/me. Although probably not too difficult to resolve.

Sure. I'm happy to reparent, resolve those conflicts, re-test, and send
another revision when we're ready.  I suspect there are going to be
additional changes requested on the v2.  With that in mind, I'll hold
off for the moment unless it is going to cause headaches for you.

For the atomic-open-revalidate changes: should I be working from what's
on the list?  This is the most recent patchset I see:

https://lore.kernel.org/linux-fsdevel/20230920173445.3943581-1-bschubert@ddn.com/

I found a 6.5 relative tree of yours on GitHub by following the libfuse
pull request, but nothing that seemed in sync with fuse/for-next.

Thanks,

-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker
  2023-10-03 16:48   ` Krister Johansen
@ 2023-10-03 22:54     ` Bernd Schubert
  2023-10-04 13:58       ` Krister Johansen
  0 siblings, 1 reply; 25+ messages in thread
From: Bernd Schubert @ 2023-10-03 22:54 UTC (permalink / raw)
  To: Krister Johansen
  Cc: Miklos Szeredi, linux-fsdevel, Miklos Szeredi, linux-kernel,
	German Maglione, Greg Kurz, Max Reitz



On 10/3/23 18:48, Krister Johansen wrote:
> On Tue, Oct 03, 2023 at 12:18:42AM +0200, Bernd Schubert wrote:
>>
>>
>> On 10/2/23 17:24, Krister Johansen wrote:
>>> Hi,
>>> I recently ran into a situation where a virtiofs client began
>>> encountering EBADF after the client / guest system had an OOM.  After
>>> reproducing the issue and debugging, the problem is caused by a
>>> virtiofsd submount having the nodeid of its root dentry fogotten.  This
>>> occurs because it borrows the reference for this dentry from the parent
>>> that is passed into the function.
>>
>>
>> Sorry, I didn't forget you, just didn't manage to review the 2nd version
>> yet. Will definitely do this week.
> 
> Thanks; I appreciate the feedback you've provided so far.
> 
>> Please also note that there will be merge conflicts with atomic open patches
>> from Dharmendra/me. Although probably not too difficult to resolve.
> 
> Sure. I'm happy to reparent, resolve those conflicts, re-test, and send
> another revision when we're ready.  I suspect there are going to be
> additional changes requested on the v2.  With that in mind, I'll hold
> off for the moment unless it is going to cause headaches for you.

I certainly also didn't mean that you should check for merge conflicts, 
it was more an annotation that it might come up - depending on the merge 
order. Please don't stop to do improvements, resolving merge conflicts 
shouldn't be difficult.
I'm going to add you to the atomic open patch series to keep you 
updated, if you don't mind.


> 
> For the atomic-open-revalidate changes: should I be working from what's
> on the list?  This is the most recent patchset I see:
> 
> https://lore.kernel.org/linux-fsdevel/20230920173445.3943581-1-bschubert@ddn.com/
> 
> I found a 6.5 relative tree of yours on GitHub by following the libfuse
> pull request, but nothing that seemed in sync with fuse/for-next.

I don't think there are conflicts with fuse-next right now, but I can 
check.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker
  2023-10-03 22:54     ` Bernd Schubert
@ 2023-10-04 13:58       ` Krister Johansen
  0 siblings, 0 replies; 25+ messages in thread
From: Krister Johansen @ 2023-10-04 13:58 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, linux-fsdevel, Miklos Szeredi, linux-kernel,
	German Maglione, Greg Kurz, Max Reitz

On Wed, Oct 04, 2023 at 12:54:49AM +0200, Bernd Schubert wrote:
> 
> 
> On 10/3/23 18:48, Krister Johansen wrote:
> > On Tue, Oct 03, 2023 at 12:18:42AM +0200, Bernd Schubert wrote:
> > > 
> > > 
> > > On 10/2/23 17:24, Krister Johansen wrote:
> > > > Hi,
> > > > I recently ran into a situation where a virtiofs client began
> > > > encountering EBADF after the client / guest system had an OOM.  After
> > > > reproducing the issue and debugging, the problem is caused by a
> > > > virtiofsd submount having the nodeid of its root dentry fogotten.  This
> > > > occurs because it borrows the reference for this dentry from the parent
> > > > that is passed into the function.
> > > 
> > > Please also note that there will be merge conflicts with atomic open patches
> > > from Dharmendra/me. Although probably not too difficult to resolve.
> > 
> > Sure. I'm happy to reparent, resolve those conflicts, re-test, and send
> > another revision when we're ready.  I suspect there are going to be
> > additional changes requested on the v2.  With that in mind, I'll hold
> > off for the moment unless it is going to cause headaches for you.
> 
> I certainly also didn't mean that you should check for merge conflicts, it
> was more an annotation that it might come up - depending on the merge order.
> Please don't stop to do improvements, resolving merge conflicts shouldn't be
> difficult.
> I'm going to add you to the atomic open patch series to keep you updated, if
> you don't mind.

Thanks, no objections from me.  I'm willing to help with any conflict
resolution or retesting tasks, if anything turns out to be non-trivial.
My goal is to get these patches to the state where they're acceptable.
I'm happy to make additional changes, or work against a different
branch.


-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-02 15:24 ` [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent Krister Johansen
@ 2023-10-06 17:13   ` Bernd Schubert
  2023-10-07  0:41     ` Krister Johansen
  2023-10-09 19:45   ` Miklos Szeredi
  1 sibling, 1 reply; 25+ messages in thread
From: Bernd Schubert @ 2023-10-06 17:13 UTC (permalink / raw)
  To: Krister Johansen, Miklos Szeredi, linux-fsdevel
  Cc: Miklos Szeredi, linux-kernel, German Maglione, Greg Kurz, Max Reitz



On 10/2/23 17:24, Krister Johansen wrote:
> The submount code uses the parent nodeid passed into the function in
> order to create the root dentry for the new submount.  This nodeid does
> not get its remote reference count incremented by a lookup option.
> 
> If the parent inode is evicted from its superblock, due to memory
> pressure for example, it can result in a forget opertation being sent to
> the server.  Should this nodeid be forgotten while it is still in use in
> a submount, users of the submount get an error from the server on any
> subsequent access.  In the author's case, this was an EBADF on all
> subsequent operations that needed to reference the root.
> 
> Debugging the problem revealed that the dentry shrinker triggered a forget
> after killing the dentry with the last reference, despite the root
> dentry in another superblock still using the nodeid.
> 
> As a result, a container that was also using this submount failed to
> access its filesystem because it had borrowed the reference instead of
> taking its own when setting up its superblock for the submount.
> 
> This commit fixes the problem by having the new submount trigger a
> lookup for the parent as part of creating a new root dentry for the
> virtiofsd submount superblock.  This allows each superblock to have its
> inodes removed by the shrinker when unreferenced, while keeping the
> nodeid reference count accurate and active with the server.
> 
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> ---
>   fs/fuse/dir.c    | 10 +++++-----
>   fs/fuse/fuse_i.h |  6 ++++++
>   fs/fuse/inode.c  | 43 +++++++++++++++++++++++++++++++++++++------
>   3 files changed, 48 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 5e01946d7531..333730c74619 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
>   	args->out_args[0].value = outarg;
>   }
>   
> -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
> -					 struct dentry *entry,
> -					 struct inode *inode,
> -					 struct fuse_entry_out *outarg,
> -					 bool *lookedup)
> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
> +				  struct dentry *entry,
> +				  struct inode *inode,
> +				  struct fuse_entry_out *outarg,
> +				  bool *lookedup)
>   {
>   	struct dentry *parent;
>   	struct fuse_forget_link *forget;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 405252bb51f2..a66fcf50a4cc 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags);
>   bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment);
>   void fuse_dax_cancel_work(struct fuse_conn *fc);
>   
> +/* dir.c */
> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry,
> +				  struct inode *inode,
> +				  struct fuse_entry_out *outarg,
> +				  bool *lookedup);
> +
>   /* ioctl.c */
>   long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
>   long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 444418e240c8..79a31cb55512 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb,
>   	struct fuse_mount *fm = get_fuse_mount_super(sb);
>   	struct super_block *parent_sb = parent_fi->inode.i_sb;
>   	struct fuse_attr root_attr;
> +	struct fuse_inode *fi;
>   	struct inode *root;
> +	struct inode *parent;
> +	struct dentry *pdent;
> +	struct fuse_entry_out outarg;
> +	bool lookedup = false;
> +	int ret;
>   
>   	fuse_sb_defaults(sb);
>   	fm->sb = sb;
> @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb,
>   	if (parent_sb->s_subtype && !sb->s_subtype)
>   		return -ENOMEM;
>   
> -	fuse_fill_attr_from_inode(&root_attr, parent_fi);
> -	root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0);
>   	/*
> -	 * This inode is just a duplicate, so it is not looked up and
> -	 * its nlookup should not be incremented.  fuse_iget() does
> -	 * that, though, so undo it here.
> +	 * It is necessary to lookup the parent_if->nodeid in case the dentry
> +	 * that triggered the automount of the submount is later evicted.
> +	 * If this dentry is evicted without the lookup count getting increased
> +	 * on the submount root, then the server can subsequently forget this
> +	 * nodeid which leads to errors when trying to access the root of the
> +	 * submount.
>   	 */
> -	get_fuse_inode(root)->nlookup--;
> +	parent = &parent_fi->inode;
> +	pdent = d_find_alias(parent);
> +	if (!pdent)
> +		return -EINVAL;
> +
> +	ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg,
> +	    &lookedup);
> +	dput(pdent);
> +	/*
> +	 * The new root owns this nlookup on success, and it is incremented by
> +	 * fuse_iget().  In the case the lookup succeeded but revalidate fails,
> +	 * ensure that the lookup count is tracked by the parent.
> +	 */
> +	if (ret <= 0) {
> +		if (lookedup) {
> +			fi = get_fuse_inode(parent);
> +			spin_lock(&fi->lock);
> +			fi->nlookup++;
> +			spin_unlock(&fi->lock);
> +		}

I might be wrong, but doesn't that mean that 
"get_fuse_inode(root)->nlookup--" needs to be called?



Thanks,
Bernd

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-06 17:13   ` Bernd Schubert
@ 2023-10-07  0:41     ` Krister Johansen
  2023-10-09 12:52       ` Bernd Schubert
  0 siblings, 1 reply; 25+ messages in thread
From: Krister Johansen @ 2023-10-07  0:41 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, linux-fsdevel, Miklos Szeredi, linux-kernel,
	German Maglione, Greg Kurz, Max Reitz

On Fri, Oct 06, 2023 at 07:13:06PM +0200, Bernd Schubert wrote:
> 
> 
> On 10/2/23 17:24, Krister Johansen wrote:
> > The submount code uses the parent nodeid passed into the function in
> > order to create the root dentry for the new submount.  This nodeid does
> > not get its remote reference count incremented by a lookup option.
> > 
> > If the parent inode is evicted from its superblock, due to memory
> > pressure for example, it can result in a forget opertation being sent to
> > the server.  Should this nodeid be forgotten while it is still in use in
> > a submount, users of the submount get an error from the server on any
> > subsequent access.  In the author's case, this was an EBADF on all
> > subsequent operations that needed to reference the root.
> > 
> > Debugging the problem revealed that the dentry shrinker triggered a forget
> > after killing the dentry with the last reference, despite the root
> > dentry in another superblock still using the nodeid.
> > 
> > As a result, a container that was also using this submount failed to
> > access its filesystem because it had borrowed the reference instead of
> > taking its own when setting up its superblock for the submount.
> > 
> > This commit fixes the problem by having the new submount trigger a
> > lookup for the parent as part of creating a new root dentry for the
> > virtiofsd submount superblock.  This allows each superblock to have its
> > inodes removed by the shrinker when unreferenced, while keeping the
> > nodeid reference count accurate and active with the server.
> > 
> > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> > ---
> >   fs/fuse/dir.c    | 10 +++++-----
> >   fs/fuse/fuse_i.h |  6 ++++++
> >   fs/fuse/inode.c  | 43 +++++++++++++++++++++++++++++++++++++------
> >   3 files changed, 48 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 5e01946d7531..333730c74619 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
> >   	args->out_args[0].value = outarg;
> >   }
> > -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
> > -					 struct dentry *entry,
> > -					 struct inode *inode,
> > -					 struct fuse_entry_out *outarg,
> > -					 bool *lookedup)
> > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
> > +				  struct dentry *entry,
> > +				  struct inode *inode,
> > +				  struct fuse_entry_out *outarg,
> > +				  bool *lookedup)
> >   {
> >   	struct dentry *parent;
> >   	struct fuse_forget_link *forget;
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 405252bb51f2..a66fcf50a4cc 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags);
> >   bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment);
> >   void fuse_dax_cancel_work(struct fuse_conn *fc);
> > +/* dir.c */
> > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry,
> > +				  struct inode *inode,
> > +				  struct fuse_entry_out *outarg,
> > +				  bool *lookedup);
> > +
> >   /* ioctl.c */
> >   long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
> >   long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 444418e240c8..79a31cb55512 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb,
> >   	struct fuse_mount *fm = get_fuse_mount_super(sb);
> >   	struct super_block *parent_sb = parent_fi->inode.i_sb;
> >   	struct fuse_attr root_attr;
> > +	struct fuse_inode *fi;
> >   	struct inode *root;
> > +	struct inode *parent;
> > +	struct dentry *pdent;
> > +	struct fuse_entry_out outarg;
> > +	bool lookedup = false;
> > +	int ret;
> >   	fuse_sb_defaults(sb);
> >   	fm->sb = sb;
> > @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb,
> >   	if (parent_sb->s_subtype && !sb->s_subtype)
> >   		return -ENOMEM;
> > -	fuse_fill_attr_from_inode(&root_attr, parent_fi);
> > -	root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0);
> >   	/*
> > -	 * This inode is just a duplicate, so it is not looked up and
> > -	 * its nlookup should not be incremented.  fuse_iget() does
> > -	 * that, though, so undo it here.
> > +	 * It is necessary to lookup the parent_if->nodeid in case the dentry
> > +	 * that triggered the automount of the submount is later evicted.
> > +	 * If this dentry is evicted without the lookup count getting increased
> > +	 * on the submount root, then the server can subsequently forget this
> > +	 * nodeid which leads to errors when trying to access the root of the
> > +	 * submount.
> >   	 */
> > -	get_fuse_inode(root)->nlookup--;
> > +	parent = &parent_fi->inode;
> > +	pdent = d_find_alias(parent);
> > +	if (!pdent)
> > +		return -EINVAL;
> > +
> > +	ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg,
> > +	    &lookedup);
> > +	dput(pdent);
> > +	/*
> > +	 * The new root owns this nlookup on success, and it is incremented by
> > +	 * fuse_iget().  In the case the lookup succeeded but revalidate fails,
> > +	 * ensure that the lookup count is tracked by the parent.
> > +	 */
> > +	if (ret <= 0) {
> > +		if (lookedup) {
> > +			fi = get_fuse_inode(parent);
> > +			spin_lock(&fi->lock);
> > +			fi->nlookup++;
> > +			spin_unlock(&fi->lock);
> > +		}
> 
> I might be wrong, but doesn't that mean that
> "get_fuse_inode(root)->nlookup--" needs to be called?

In the case where ret > 0, the nlookup on get_fuse_inode(root) is set to
1 by fuse_iget().  That ensures that the root is forgotten when later
unmounted.  The code that handles the forget uses the count of nlookup
to tell the server-side how many references to forget.  (That's in
fuse_evict_inode()). 

However, if the fuse_dentry_revalidate_lookup() call performs a valid
lookup but returns an error, this function will return before it fills
out s_root in the superblock or calls fuse_iget().  If the superblock
doesn't have a s_root set, then the code in generic_kill_super() won't
dput() the root dentry and trigger the forget.

The intention of this code was to handle the case where the lookup had
succeeded, but the code determined it was still necessary to return an
error.  In that situation, the reference taken by the lookup has to be
accounted somewhere, and the parent seemed like a plausible candidate.

However, after writing up this response, I can see that there's still a
problem here if d_make_root(root) returns NULL, because we'll also lose
track of the nlookup in that case.

If you agree that charging this to the parent on error makes sense, I'll
re-work the error handling here so that the right thing happens when
either fuse_dentry_revalidate_lookup() or d_make_root() encounter an
error.

Thanks for the feedback.

-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-07  0:41     ` Krister Johansen
@ 2023-10-09 12:52       ` Bernd Schubert
  2023-10-09 17:15         ` Krister Johansen
  0 siblings, 1 reply; 25+ messages in thread
From: Bernd Schubert @ 2023-10-09 12:52 UTC (permalink / raw)
  To: Krister Johansen
  Cc: Miklos Szeredi, linux-fsdevel, Miklos Szeredi, linux-kernel,
	German Maglione, Greg Kurz, Max Reitz



On 10/7/23 02:41, Krister Johansen wrote:
> On Fri, Oct 06, 2023 at 07:13:06PM +0200, Bernd Schubert wrote:
>>
>>
>> On 10/2/23 17:24, Krister Johansen wrote:
>>> The submount code uses the parent nodeid passed into the function in
>>> order to create the root dentry for the new submount.  This nodeid does
>>> not get its remote reference count incremented by a lookup option.
>>>
>>> If the parent inode is evicted from its superblock, due to memory
>>> pressure for example, it can result in a forget opertation being sent to
>>> the server.  Should this nodeid be forgotten while it is still in use in
>>> a submount, users of the submount get an error from the server on any
>>> subsequent access.  In the author's case, this was an EBADF on all
>>> subsequent operations that needed to reference the root.
>>>
>>> Debugging the problem revealed that the dentry shrinker triggered a forget
>>> after killing the dentry with the last reference, despite the root
>>> dentry in another superblock still using the nodeid.
>>>
>>> As a result, a container that was also using this submount failed to
>>> access its filesystem because it had borrowed the reference instead of
>>> taking its own when setting up its superblock for the submount.
>>>
>>> This commit fixes the problem by having the new submount trigger a
>>> lookup for the parent as part of creating a new root dentry for the
>>> virtiofsd submount superblock.  This allows each superblock to have its
>>> inodes removed by the shrinker when unreferenced, while keeping the
>>> nodeid reference count accurate and active with the server.
>>>
>>> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
>>> ---
>>>    fs/fuse/dir.c    | 10 +++++-----
>>>    fs/fuse/fuse_i.h |  6 ++++++
>>>    fs/fuse/inode.c  | 43 +++++++++++++++++++++++++++++++++++++------
>>>    3 files changed, 48 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>>> index 5e01946d7531..333730c74619 100644
>>> --- a/fs/fuse/dir.c
>>> +++ b/fs/fuse/dir.c
>>> @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
>>>    	args->out_args[0].value = outarg;
>>>    }
>>> -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
>>> -					 struct dentry *entry,
>>> -					 struct inode *inode,
>>> -					 struct fuse_entry_out *outarg,
>>> -					 bool *lookedup)
>>> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
>>> +				  struct dentry *entry,
>>> +				  struct inode *inode,
>>> +				  struct fuse_entry_out *outarg,
>>> +				  bool *lookedup)
>>>    {
>>>    	struct dentry *parent;
>>>    	struct fuse_forget_link *forget;
>>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>>> index 405252bb51f2..a66fcf50a4cc 100644
>>> --- a/fs/fuse/fuse_i.h
>>> +++ b/fs/fuse/fuse_i.h
>>> @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags);
>>>    bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment);
>>>    void fuse_dax_cancel_work(struct fuse_conn *fc);
>>> +/* dir.c */
>>> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry,
>>> +				  struct inode *inode,
>>> +				  struct fuse_entry_out *outarg,
>>> +				  bool *lookedup);
>>> +
>>>    /* ioctl.c */
>>>    long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
>>>    long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
>>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>>> index 444418e240c8..79a31cb55512 100644
>>> --- a/fs/fuse/inode.c
>>> +++ b/fs/fuse/inode.c
>>> @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb,
>>>    	struct fuse_mount *fm = get_fuse_mount_super(sb);
>>>    	struct super_block *parent_sb = parent_fi->inode.i_sb;
>>>    	struct fuse_attr root_attr;
>>> +	struct fuse_inode *fi;
>>>    	struct inode *root;
>>> +	struct inode *parent;
>>> +	struct dentry *pdent;
>>> +	struct fuse_entry_out outarg;
>>> +	bool lookedup = false;
>>> +	int ret;
>>>    	fuse_sb_defaults(sb);
>>>    	fm->sb = sb;
>>> @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb,
>>>    	if (parent_sb->s_subtype && !sb->s_subtype)
>>>    		return -ENOMEM;
>>> -	fuse_fill_attr_from_inode(&root_attr, parent_fi);
>>> -	root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0);
>>>    	/*
>>> -	 * This inode is just a duplicate, so it is not looked up and
>>> -	 * its nlookup should not be incremented.  fuse_iget() does
>>> -	 * that, though, so undo it here.
>>> +	 * It is necessary to lookup the parent_if->nodeid in case the dentry
>>> +	 * that triggered the automount of the submount is later evicted.
>>> +	 * If this dentry is evicted without the lookup count getting increased
>>> +	 * on the submount root, then the server can subsequently forget this
>>> +	 * nodeid which leads to errors when trying to access the root of the
>>> +	 * submount.
>>>    	 */
>>> -	get_fuse_inode(root)->nlookup--;
>>> +	parent = &parent_fi->inode;
>>> +	pdent = d_find_alias(parent);
>>> +	if (!pdent)
>>> +		return -EINVAL;
>>> +
>>> +	ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg,
>>> +	    &lookedup);
>>> +	dput(pdent);
>>> +	/*
>>> +	 * The new root owns this nlookup on success, and it is incremented by
>>> +	 * fuse_iget().  In the case the lookup succeeded but revalidate fails,
>>> +	 * ensure that the lookup count is tracked by the parent.
>>> +	 */
>>> +	if (ret <= 0) {
>>> +		if (lookedup) {
>>> +			fi = get_fuse_inode(parent);
>>> +			spin_lock(&fi->lock);
>>> +			fi->nlookup++;
>>> +			spin_unlock(&fi->lock);
>>> +		}
>>
>> I might be wrong, but doesn't that mean that
>> "get_fuse_inode(root)->nlookup--" needs to be called?
> 
> In the case where ret > 0, the nlookup on get_fuse_inode(root) is set to
> 1 by fuse_iget().  That ensures that the root is forgotten when later
> unmounted.  The code that handles the forget uses the count of nlookup
> to tell the server-side how many references to forget.  (That's in
> fuse_evict_inode()).
> 
> However, if the fuse_dentry_revalidate_lookup() call performs a valid
> lookup but returns an error, this function will return before it fills
> out s_root in the superblock or calls fuse_iget().  If the superblock
> doesn't have a s_root set, then the code in generic_kill_super() won't
> dput() the root dentry and trigger the forget.
> 
> The intention of this code was to handle the case where the lookup had
> succeeded, but the code determined it was still necessary to return an
> error.  In that situation, the reference taken by the lookup has to be
> accounted somewhere, and the parent seemed like a plausible candidate.

Yeah sorry, I had just missed that fuse_iget() also moved and then 
thought it would have increased fi->nlookup already.

> 
> However, after writing up this response, I can see that there's still a
> problem here if d_make_root(root) returns NULL, because we'll also lose
> track of the nlookup in that case.
> 
> If you agree that charging this to the parent on error makes sense, I'll
> re-work the error handling here so that the right thing happens when
> either fuse_dentry_revalidate_lookup() or d_make_root() encounter an
> error.

Oh yeah, I also missed that. Although, iput() calls iput_final, which 
then calls evict and sends the fuse forget - isn't that the right action 
already?

> 
> Thanks for the feedback.

Well, false alarm from my side, sorry again!


Bernd

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-09 12:52       ` Bernd Schubert
@ 2023-10-09 17:15         ` Krister Johansen
  2023-10-09 18:43           ` Bernd Schubert
  0 siblings, 1 reply; 25+ messages in thread
From: Krister Johansen @ 2023-10-09 17:15 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Krister Johansen, Miklos Szeredi, linux-fsdevel, Miklos Szeredi,
	linux-kernel, German Maglione, Greg Kurz, Max Reitz

On Mon, Oct 09, 2023 at 02:52:42PM +0200, Bernd Schubert wrote:
> 
> 
> On 10/7/23 02:41, Krister Johansen wrote:
> > On Fri, Oct 06, 2023 at 07:13:06PM +0200, Bernd Schubert wrote:
> > > 
> > > 
> > > On 10/2/23 17:24, Krister Johansen wrote:
> > > > The submount code uses the parent nodeid passed into the function in
> > > > order to create the root dentry for the new submount.  This nodeid does
> > > > not get its remote reference count incremented by a lookup option.
> > > > 
> > > > If the parent inode is evicted from its superblock, due to memory
> > > > pressure for example, it can result in a forget opertation being sent to
> > > > the server.  Should this nodeid be forgotten while it is still in use in
> > > > a submount, users of the submount get an error from the server on any
> > > > subsequent access.  In the author's case, this was an EBADF on all
> > > > subsequent operations that needed to reference the root.
> > > > 
> > > > Debugging the problem revealed that the dentry shrinker triggered a forget
> > > > after killing the dentry with the last reference, despite the root
> > > > dentry in another superblock still using the nodeid.
> > > > 
> > > > As a result, a container that was also using this submount failed to
> > > > access its filesystem because it had borrowed the reference instead of
> > > > taking its own when setting up its superblock for the submount.
> > > > 
> > > > This commit fixes the problem by having the new submount trigger a
> > > > lookup for the parent as part of creating a new root dentry for the
> > > > virtiofsd submount superblock.  This allows each superblock to have its
> > > > inodes removed by the shrinker when unreferenced, while keeping the
> > > > nodeid reference count accurate and active with the server.
> > > > 
> > > > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> > > > ---
> > > >    fs/fuse/dir.c    | 10 +++++-----
> > > >    fs/fuse/fuse_i.h |  6 ++++++
> > > >    fs/fuse/inode.c  | 43 +++++++++++++++++++++++++++++++++++++------
> > > >    3 files changed, 48 insertions(+), 11 deletions(-)
> > > > 
> > > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > > > index 5e01946d7531..333730c74619 100644
> > > > --- a/fs/fuse/dir.c
> > > > +++ b/fs/fuse/dir.c
> > > > @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
> > > >    	args->out_args[0].value = outarg;
> > > >    }
> > > > -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
> > > > -					 struct dentry *entry,
> > > > -					 struct inode *inode,
> > > > -					 struct fuse_entry_out *outarg,
> > > > -					 bool *lookedup)
> > > > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
> > > > +				  struct dentry *entry,
> > > > +				  struct inode *inode,
> > > > +				  struct fuse_entry_out *outarg,
> > > > +				  bool *lookedup)
> > > >    {
> > > >    	struct dentry *parent;
> > > >    	struct fuse_forget_link *forget;
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > index 405252bb51f2..a66fcf50a4cc 100644
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags);
> > > >    bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment);
> > > >    void fuse_dax_cancel_work(struct fuse_conn *fc);
> > > > +/* dir.c */
> > > > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry,
> > > > +				  struct inode *inode,
> > > > +				  struct fuse_entry_out *outarg,
> > > > +				  bool *lookedup);
> > > > +
> > > >    /* ioctl.c */
> > > >    long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
> > > >    long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
> > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > index 444418e240c8..79a31cb55512 100644
> > > > --- a/fs/fuse/inode.c
> > > > +++ b/fs/fuse/inode.c
> > > > @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb,
> > > >    	struct fuse_mount *fm = get_fuse_mount_super(sb);
> > > >    	struct super_block *parent_sb = parent_fi->inode.i_sb;
> > > >    	struct fuse_attr root_attr;
> > > > +	struct fuse_inode *fi;
> > > >    	struct inode *root;
> > > > +	struct inode *parent;
> > > > +	struct dentry *pdent;
> > > > +	struct fuse_entry_out outarg;
> > > > +	bool lookedup = false;
> > > > +	int ret;
> > > >    	fuse_sb_defaults(sb);
> > > >    	fm->sb = sb;
> > > > @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb,
> > > >    	if (parent_sb->s_subtype && !sb->s_subtype)
> > > >    		return -ENOMEM;
> > > > -	fuse_fill_attr_from_inode(&root_attr, parent_fi);
> > > > -	root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0);
> > > >    	/*
> > > > -	 * This inode is just a duplicate, so it is not looked up and
> > > > -	 * its nlookup should not be incremented.  fuse_iget() does
> > > > -	 * that, though, so undo it here.
> > > > +	 * It is necessary to lookup the parent_if->nodeid in case the dentry
> > > > +	 * that triggered the automount of the submount is later evicted.
> > > > +	 * If this dentry is evicted without the lookup count getting increased
> > > > +	 * on the submount root, then the server can subsequently forget this
> > > > +	 * nodeid which leads to errors when trying to access the root of the
> > > > +	 * submount.
> > > >    	 */
> > > > -	get_fuse_inode(root)->nlookup--;
> > > > +	parent = &parent_fi->inode;
> > > > +	pdent = d_find_alias(parent);
> > > > +	if (!pdent)
> > > > +		return -EINVAL;
> > > > +
> > > > +	ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg,
> > > > +	    &lookedup);
> > > > +	dput(pdent);
> > > > +	/*
> > > > +	 * The new root owns this nlookup on success, and it is incremented by
> > > > +	 * fuse_iget().  In the case the lookup succeeded but revalidate fails,
> > > > +	 * ensure that the lookup count is tracked by the parent.
> > > > +	 */
> > > > +	if (ret <= 0) {
> > > > +		if (lookedup) {
> > > > +			fi = get_fuse_inode(parent);
> > > > +			spin_lock(&fi->lock);
> > > > +			fi->nlookup++;
> > > > +			spin_unlock(&fi->lock);
> > > > +		}
> > > 
> > > I might be wrong, but doesn't that mean that
> > > "get_fuse_inode(root)->nlookup--" needs to be called?
> > 
> > In the case where ret > 0, the nlookup on get_fuse_inode(root) is set to
> > 1 by fuse_iget().  That ensures that the root is forgotten when later
> > unmounted.  The code that handles the forget uses the count of nlookup
> > to tell the server-side how many references to forget.  (That's in
> > fuse_evict_inode()).
> > 
> > However, if the fuse_dentry_revalidate_lookup() call performs a valid
> > lookup but returns an error, this function will return before it fills
> > out s_root in the superblock or calls fuse_iget().  If the superblock
> > doesn't have a s_root set, then the code in generic_kill_super() won't
> > dput() the root dentry and trigger the forget.
> > 
> > The intention of this code was to handle the case where the lookup had
> > succeeded, but the code determined it was still necessary to return an
> > error.  In that situation, the reference taken by the lookup has to be
> > accounted somewhere, and the parent seemed like a plausible candidate.
> 
> Yeah sorry, I had just missed that fuse_iget() also moved and then thought
> it would have increased fi->nlookup already.

No worries; I'd much rather get feedback if something doesn't look
right, even if it turns out okay in the end.

> > However, after writing up this response, I can see that there's still a
> > problem here if d_make_root(root) returns NULL, because we'll also lose
> > track of the nlookup in that case.
> > 
> > If you agree that charging this to the parent on error makes sense, I'll
> > re-work the error handling here so that the right thing happens when
> > either fuse_dentry_revalidate_lookup() or d_make_root() encounter an
> > error.
> 
> Oh yeah, I also missed that. Although, iput() calls iput_final, which then
> calls evict and sends the fuse forget - isn't that the right action already?

Thanks, I had forgotten that d_make_root() would call iput() for me if
d_alloc_anon() fails.  Let me restate this to suggest that I account the
nlookup to the parent if fuse_dentry_revalidate_lookup() or fuse_iget()
fail instead.  Does that sound right?

> > Thanks for the feedback.
> 
> Well, false alarm from my side, sorry again!

No apology necessary; I appreciate you spending the time to look and ask
questions.

-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-09 17:15         ` Krister Johansen
@ 2023-10-09 18:43           ` Bernd Schubert
  2023-10-10  2:35             ` Krister Johansen
  0 siblings, 1 reply; 25+ messages in thread
From: Bernd Schubert @ 2023-10-09 18:43 UTC (permalink / raw)
  To: Krister Johansen
  Cc: Miklos Szeredi, linux-fsdevel, Miklos Szeredi, linux-kernel,
	German Maglione, Greg Kurz, Max Reitz



On 10/9/23 19:15, Krister Johansen wrote:
> On Mon, Oct 09, 2023 at 02:52:42PM +0200, Bernd Schubert wrote:
>>
>>
>> On 10/7/23 02:41, Krister Johansen wrote:
>>> On Fri, Oct 06, 2023 at 07:13:06PM +0200, Bernd Schubert wrote:
>>>>
>>>>
>>>> On 10/2/23 17:24, Krister Johansen wrote:
>>>>> The submount code uses the parent nodeid passed into the function in
>>>>> order to create the root dentry for the new submount.  This nodeid does
>>>>> not get its remote reference count incremented by a lookup option.
>>>>>
>>>>> If the parent inode is evicted from its superblock, due to memory
>>>>> pressure for example, it can result in a forget opertation being sent to
>>>>> the server.  Should this nodeid be forgotten while it is still in use in
>>>>> a submount, users of the submount get an error from the server on any
>>>>> subsequent access.  In the author's case, this was an EBADF on all
>>>>> subsequent operations that needed to reference the root.
>>>>>
>>>>> Debugging the problem revealed that the dentry shrinker triggered a forget
>>>>> after killing the dentry with the last reference, despite the root
>>>>> dentry in another superblock still using the nodeid.
>>>>>
>>>>> As a result, a container that was also using this submount failed to
>>>>> access its filesystem because it had borrowed the reference instead of
>>>>> taking its own when setting up its superblock for the submount.
>>>>>
>>>>> This commit fixes the problem by having the new submount trigger a
>>>>> lookup for the parent as part of creating a new root dentry for the
>>>>> virtiofsd submount superblock.  This allows each superblock to have its
>>>>> inodes removed by the shrinker when unreferenced, while keeping the
>>>>> nodeid reference count accurate and active with the server.
>>>>>
>>>>> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
>>>>> ---
>>>>>     fs/fuse/dir.c    | 10 +++++-----
>>>>>     fs/fuse/fuse_i.h |  6 ++++++
>>>>>     fs/fuse/inode.c  | 43 +++++++++++++++++++++++++++++++++++++------
>>>>>     3 files changed, 48 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>>>>> index 5e01946d7531..333730c74619 100644
>>>>> --- a/fs/fuse/dir.c
>>>>> +++ b/fs/fuse/dir.c
>>>>> @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
>>>>>     	args->out_args[0].value = outarg;
>>>>>     }
>>>>> -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
>>>>> -					 struct dentry *entry,
>>>>> -					 struct inode *inode,
>>>>> -					 struct fuse_entry_out *outarg,
>>>>> -					 bool *lookedup)
>>>>> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm,
>>>>> +				  struct dentry *entry,
>>>>> +				  struct inode *inode,
>>>>> +				  struct fuse_entry_out *outarg,
>>>>> +				  bool *lookedup)
>>>>>     {
>>>>>     	struct dentry *parent;
>>>>>     	struct fuse_forget_link *forget;
>>>>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>>>>> index 405252bb51f2..a66fcf50a4cc 100644
>>>>> --- a/fs/fuse/fuse_i.h
>>>>> +++ b/fs/fuse/fuse_i.h
>>>>> @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags);
>>>>>     bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment);
>>>>>     void fuse_dax_cancel_work(struct fuse_conn *fc);
>>>>> +/* dir.c */
>>>>> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry,
>>>>> +				  struct inode *inode,
>>>>> +				  struct fuse_entry_out *outarg,
>>>>> +				  bool *lookedup);
>>>>> +
>>>>>     /* ioctl.c */
>>>>>     long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
>>>>>     long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
>>>>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>>>>> index 444418e240c8..79a31cb55512 100644
>>>>> --- a/fs/fuse/inode.c
>>>>> +++ b/fs/fuse/inode.c
>>>>> @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb,
>>>>>     	struct fuse_mount *fm = get_fuse_mount_super(sb);
>>>>>     	struct super_block *parent_sb = parent_fi->inode.i_sb;
>>>>>     	struct fuse_attr root_attr;
>>>>> +	struct fuse_inode *fi;
>>>>>     	struct inode *root;
>>>>> +	struct inode *parent;
>>>>> +	struct dentry *pdent;
>>>>> +	struct fuse_entry_out outarg;
>>>>> +	bool lookedup = false;
>>>>> +	int ret;
>>>>>     	fuse_sb_defaults(sb);
>>>>>     	fm->sb = sb;
>>>>> @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb,
>>>>>     	if (parent_sb->s_subtype && !sb->s_subtype)
>>>>>     		return -ENOMEM;
>>>>> -	fuse_fill_attr_from_inode(&root_attr, parent_fi);
>>>>> -	root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0);
>>>>>     	/*
>>>>> -	 * This inode is just a duplicate, so it is not looked up and
>>>>> -	 * its nlookup should not be incremented.  fuse_iget() does
>>>>> -	 * that, though, so undo it here.
>>>>> +	 * It is necessary to lookup the parent_if->nodeid in case the dentry
>>>>> +	 * that triggered the automount of the submount is later evicted.
>>>>> +	 * If this dentry is evicted without the lookup count getting increased
>>>>> +	 * on the submount root, then the server can subsequently forget this
>>>>> +	 * nodeid which leads to errors when trying to access the root of the
>>>>> +	 * submount.
>>>>>     	 */
>>>>> -	get_fuse_inode(root)->nlookup--;
>>>>> +	parent = &parent_fi->inode;
>>>>> +	pdent = d_find_alias(parent);
>>>>> +	if (!pdent)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg,
>>>>> +	    &lookedup);
>>>>> +	dput(pdent);
>>>>> +	/*
>>>>> +	 * The new root owns this nlookup on success, and it is incremented by
>>>>> +	 * fuse_iget().  In the case the lookup succeeded but revalidate fails,
>>>>> +	 * ensure that the lookup count is tracked by the parent.
>>>>> +	 */
>>>>> +	if (ret <= 0) {
>>>>> +		if (lookedup) {
>>>>> +			fi = get_fuse_inode(parent);
>>>>> +			spin_lock(&fi->lock);
>>>>> +			fi->nlookup++;
>>>>> +			spin_unlock(&fi->lock);
>>>>> +		}
>>>>
>>>> I might be wrong, but doesn't that mean that
>>>> "get_fuse_inode(root)->nlookup--" needs to be called?
>>>
>>> In the case where ret > 0, the nlookup on get_fuse_inode(root) is set to
>>> 1 by fuse_iget().  That ensures that the root is forgotten when later
>>> unmounted.  The code that handles the forget uses the count of nlookup
>>> to tell the server-side how many references to forget.  (That's in
>>> fuse_evict_inode()).
>>>
>>> However, if the fuse_dentry_revalidate_lookup() call performs a valid
>>> lookup but returns an error, this function will return before it fills
>>> out s_root in the superblock or calls fuse_iget().  If the superblock
>>> doesn't have a s_root set, then the code in generic_kill_super() won't
>>> dput() the root dentry and trigger the forget.
>>>
>>> The intention of this code was to handle the case where the lookup had
>>> succeeded, but the code determined it was still necessary to return an
>>> error.  In that situation, the reference taken by the lookup has to be
>>> accounted somewhere, and the parent seemed like a plausible candidate.
>>
>> Yeah sorry, I had just missed that fuse_iget() also moved and then thought
>> it would have increased fi->nlookup already.
> 
> No worries; I'd much rather get feedback if something doesn't look
> right, even if it turns out okay in the end.
> 
>>> However, after writing up this response, I can see that there's still a
>>> problem here if d_make_root(root) returns NULL, because we'll also lose
>>> track of the nlookup in that case.
>>>
>>> If you agree that charging this to the parent on error makes sense, I'll
>>> re-work the error handling here so that the right thing happens when
>>> either fuse_dentry_revalidate_lookup() or d_make_root() encounter an
>>> error.
>>
>> Oh yeah, I also missed that. Although, iput() calls iput_final, which then
>> calls evict and sends the fuse forget - isn't that the right action already?
> 
> Thanks, I had forgotten that d_make_root() would call iput() for me if
> d_alloc_anon() fails.  Let me restate this to suggest that I account the
> nlookup to the parent if fuse_dentry_revalidate_lookup() or fuse_iget()
> fail instead.  Does that sound right?

Hmm, so server/daemon side uses the lookup count to have an inode 
reference - are you sure that parent is the right inode for the forget 
call? And what is the probability for such failures? I.e. is that 
performance critical? Wouldn't be much simpler and clearer to just avoid 
and doubt and to send an immediate forget?


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-02 15:24 ` [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent Krister Johansen
  2023-10-06 17:13   ` Bernd Schubert
@ 2023-10-09 19:45   ` Miklos Szeredi
  2023-10-10  2:35     ` Krister Johansen
  1 sibling, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2023-10-09 19:45 UTC (permalink / raw)
  To: Krister Johansen
  Cc: linux-fsdevel, Miklos Szeredi, linux-kernel, German Maglione,
	Greg Kurz, Max Reitz, Bernd Schubert

On Mon, 2 Oct 2023 at 17:24, Krister Johansen <kjlx@templeofstupid.com> wrote:
>
> The submount code uses the parent nodeid passed into the function in
> order to create the root dentry for the new submount.  This nodeid does
> not get its remote reference count incremented by a lookup option.
>
> If the parent inode is evicted from its superblock, due to memory
> pressure for example, it can result in a forget opertation being sent to
> the server.  Should this nodeid be forgotten while it is still in use in
> a submount, users of the submount get an error from the server on any
> subsequent access.  In the author's case, this was an EBADF on all
> subsequent operations that needed to reference the root.
>
> Debugging the problem revealed that the dentry shrinker triggered a forget
> after killing the dentry with the last reference, despite the root
> dentry in another superblock still using the nodeid.

There's some context missing here.  There are two dentries: a mount
point in the parent mount and the root of the submount.

The server indicates that the looked up inode is a submount using
FUSE_ATTR_SUBMOUNT.  Then AFAICS the following happens:

 1) the mountpoint dentry is created with nlookup = 1.  The
S_AUTOMOUNT flag is set on the mountpoint inode.

 2) the lookup code sees S_AUTOMOUNT and triggers the submount
operation, which sets up the new super_block and the root dentry with
the user supplied nodeid and with nlookup = 0 (because it wasn't
actually looked up).

How the automount gets torn down is another story.  You say that the
mount point gets evicted due to memory pressure.  But it can't get
evicted while the submount is attached.  So the submount must first
get detached, and then the mount point can be reclaimed.   The
question is:  how does the submount gets detached.  Do you have an
idea?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-09 19:45   ` Miklos Szeredi
@ 2023-10-10  2:35     ` Krister Johansen
  2023-10-10  8:15       ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Krister Johansen @ 2023-10-10  2:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Krister Johansen, linux-fsdevel, Miklos Szeredi, linux-kernel,
	German Maglione, Greg Kurz, Max Reitz, Bernd Schubert

On Mon, Oct 09, 2023 at 09:45:08PM +0200, Miklos Szeredi wrote:
> On Mon, 2 Oct 2023 at 17:24, Krister Johansen <kjlx@templeofstupid.com> wrote:
> >
> > The submount code uses the parent nodeid passed into the function in
> > order to create the root dentry for the new submount.  This nodeid does
> > not get its remote reference count incremented by a lookup option.
> >
> > If the parent inode is evicted from its superblock, due to memory
> > pressure for example, it can result in a forget opertation being sent to
> > the server.  Should this nodeid be forgotten while it is still in use in
> > a submount, users of the submount get an error from the server on any
> > subsequent access.  In the author's case, this was an EBADF on all
> > subsequent operations that needed to reference the root.
> >
> > Debugging the problem revealed that the dentry shrinker triggered a forget
> > after killing the dentry with the last reference, despite the root
> > dentry in another superblock still using the nodeid.
> 
> There's some context missing here.  There are two dentries: a mount
> point in the parent mount and the root of the submount.
> 
> The server indicates that the looked up inode is a submount using
> FUSE_ATTR_SUBMOUNT.  Then AFAICS the following happens:
> 
>  1) the mountpoint dentry is created with nlookup = 1.  The
> S_AUTOMOUNT flag is set on the mountpoint inode.
> 
>  2) the lookup code sees S_AUTOMOUNT and triggers the submount
> operation, which sets up the new super_block and the root dentry with
> the user supplied nodeid and with nlookup = 0 (because it wasn't
> actually looked up).
> 
> How the automount gets torn down is another story.  You say that the
> mount point gets evicted due to memory pressure.  But it can't get
> evicted while the submount is attached.  So the submount must first
> get detached, and then the mount point can be reclaimed.   The
> question is:  how does the submount gets detached.  Do you have an
> idea?

Apologies for not stating this clearly.  The use case is a container
running in a VM, and the container's root is provided to the guest via
virtiofs.  I believe the submount is getting detached as part of the
container setup, either via a umount2(MNT_DETACH) of the old root
filesystem, or as part of pivot_root() itself.  By the time I'm able to
inspect the dentry associated with the submount in the initial mount ns
(case #1) its d_lockref.count is 0, and /proc/mountinfo doesn't show an
active mount for the submount in that mount namespace.

If I manually traverse the path to the submount via something like cd
and ls from the initial mount namespace, it'll stay referenced until I
run a umount for the automounted path.  I'm reasonably sure it's the
container setup that's causing the detaching.

I'm happy to go debug this some more, though, if you're skeptical of the
explanation.

-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-09 18:43           ` Bernd Schubert
@ 2023-10-10  2:35             ` Krister Johansen
  0 siblings, 0 replies; 25+ messages in thread
From: Krister Johansen @ 2023-10-10  2:35 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Krister Johansen, Miklos Szeredi, linux-fsdevel, Miklos Szeredi,
	linux-kernel, German Maglione, Greg Kurz, Max Reitz

On Mon, Oct 09, 2023 at 08:43:02PM +0200, Bernd Schubert wrote:
> On 10/9/23 19:15, Krister Johansen wrote:
> > Thanks, I had forgotten that d_make_root() would call iput() for me if
> > d_alloc_anon() fails.  Let me restate this to suggest that I account the
> > nlookup to the parent if fuse_dentry_revalidate_lookup() or fuse_iget()
> > fail instead.  Does that sound right?
> 
> Hmm, so server/daemon side uses the lookup count to have an inode reference
> - are you sure that parent is the right inode for the forget call? And what
> is the probability for such failures? I.e. is that performance critical?
> Wouldn't be much simpler and clearer to just avoid and doubt and to send an
> immediate forget?

Yeah, the server / daemon side need to track the lookup count so that it
knows when it can close the fd for the file on the server-side. (At
least for virtiofsd, anyway.).

The reason I had avoided doing the forget in the submount code is that
it needs a forget linkage in order to call fuse_queue_forget().  One of
these is allocated by fuse_alloc_inode().  A well formed parent should
always have one.  However, if the fuse_iget() for the submount root
fails, then there's no linkage to borrow from the new inode.  The code
could always call fuse_alloc_forget() directly, like is done elsewhere.
I thought it might be hard to get the memory for this allocation if
fuse_iget() also can't allocate enough, but I could move the allocation
earlier in the function and just free it if it's not used.

I'm not confident that would reduce the amount of code in the function,
but if you'd find it clearer, I'm happy to modify it accordingly.

-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-10  2:35     ` Krister Johansen
@ 2023-10-10  8:15       ` Miklos Szeredi
  2023-10-11  1:25         ` Krister Johansen
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2023-10-10  8:15 UTC (permalink / raw)
  To: Krister Johansen
  Cc: linux-fsdevel, Miklos Szeredi, linux-kernel, German Maglione,
	Greg Kurz, Max Reitz, Bernd Schubert

[-- Attachment #1: Type: text/plain, Size: 480 bytes --]

On Tue, 10 Oct 2023 at 04:35, Krister Johansen <kjlx@templeofstupid.com> wrote:

> If I manually traverse the path to the submount via something like cd
> and ls from the initial mount namespace, it'll stay referenced until I
> run a umount for the automounted path.  I'm reasonably sure it's the
> container setup that's causing the detaching.

Okay.  Can you please try the attached test patch.  It's not a proper
solution, but I think it's the right direction.

Thanks,
Miklos

[-- Attachment #2: fuse-inc-nlookup-on-automount-root.patch --]
[-- Type: text/x-patch, Size: 613 bytes --]

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2e4eb7cf26fb..d5f47203dfbc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1524,6 +1524,18 @@ static int fuse_get_tree_submount(struct fs_context *fsc)
 		return err;
 	}
 
+	spin_lock(&mp_fi->lock);
+	if (mp_fi->nlookup) {
+		struct fuse_inode *fi = get_fuse_inode(d_inode(sb->s_root));
+		mp_fi->nlookup--;
+		spin_unlock(&mp_fi->lock);
+		spin_lock(&fi->lock);
+		fi->nlookup++;
+		spin_unlock(&fi->lock);
+	} else {
+		spin_unlock(&mp_fi->lock);
+	}
+
 	down_write(&fc->killsb);
 	list_add_tail(&fm->fc_entry, &fc->mounts);
 	up_write(&fc->killsb);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-10  8:15       ` Miklos Szeredi
@ 2023-10-11  1:25         ` Krister Johansen
  2023-10-11  7:07           ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Krister Johansen @ 2023-10-11  1:25 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, Miklos Szeredi, linux-kernel, German Maglione,
	Greg Kurz, Max Reitz, Bernd Schubert

Hi Miklos,

On Tue, Oct 10, 2023 at 10:15:38AM +0200, Miklos Szeredi wrote:
> On Tue, 10 Oct 2023 at 04:35, Krister Johansen <kjlx@templeofstupid.com> wrote:
> 
> > If I manually traverse the path to the submount via something like cd
> > and ls from the initial mount namespace, it'll stay referenced until I
> > run a umount for the automounted path.  I'm reasonably sure it's the
> > container setup that's causing the detaching.
> 
> Okay.  Can you please try the attached test patch.  It's not a proper
> solution, but I think it's the right direction.

Thanks, I tested this patch and was unable to reproduce the scenario
where an OOM resulted in the submount from the guest in the guest
getting an EBADF from virtiofsd.  (I did generate OOMs, though).

I am curious what you have in mind in order to move this towards a
proper fix?  I shied away from the approach of stealing a nlookup from
mp_fi beacuse it wasn't clear that I could always count on the nlookup
in the parent staying positive.  E.g. I was afraid I was either going to
not have enough nlookups to move to submounts, or trigger a forget from
an exiting container that leads to an EBADF from the initial mount
namespace.

-K

> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2e4eb7cf26fb..d5f47203dfbc 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1524,6 +1524,18 @@ static int fuse_get_tree_submount(struct fs_context *fsc)
>  		return err;
>  	}
>  
> +	spin_lock(&mp_fi->lock);
> +	if (mp_fi->nlookup) {
> +		struct fuse_inode *fi = get_fuse_inode(d_inode(sb->s_root));
> +		mp_fi->nlookup--;
> +		spin_unlock(&mp_fi->lock);
> +		spin_lock(&fi->lock);
> +		fi->nlookup++;
> +		spin_unlock(&fi->lock);
> +	} else {
> +		spin_unlock(&mp_fi->lock);
> +	}
> +
>  	down_write(&fc->killsb);
>  	list_add_tail(&fm->fc_entry, &fc->mounts);
>  	up_write(&fc->killsb);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-11  1:25         ` Krister Johansen
@ 2023-10-11  7:07           ` Miklos Szeredi
  2023-10-11 16:32             ` Krister Johansen
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2023-10-11  7:07 UTC (permalink / raw)
  To: Krister Johansen
  Cc: linux-fsdevel, Miklos Szeredi, linux-kernel, German Maglione,
	Greg Kurz, Max Reitz, Bernd Schubert

On Wed, 11 Oct 2023 at 03:26, Krister Johansen <kjlx@templeofstupid.com> wrote:

> I am curious what you have in mind in order to move this towards a
> proper fix?  I shied away from the approach of stealing a nlookup from
> mp_fi beacuse it wasn't clear that I could always count on the nlookup
> in the parent staying positive.  E.g. I was afraid I was either going to
> not have enough nlookups to move to submounts, or trigger a forget from
> an exiting container that leads to an EBADF from the initial mount
> namespace.

One idea is to transfer the nlookup to a separately refcounted object
that is referenced from mp_fi as well as all the submounts.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-11  7:07           ` Miklos Szeredi
@ 2023-10-11 16:32             ` Krister Johansen
  2023-10-11 18:27               ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Krister Johansen @ 2023-10-11 16:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, Miklos Szeredi, linux-kernel, German Maglione,
	Greg Kurz, Max Reitz, Bernd Schubert

On Wed, Oct 11, 2023 at 09:07:33AM +0200, Miklos Szeredi wrote:
> On Wed, 11 Oct 2023 at 03:26, Krister Johansen <kjlx@templeofstupid.com> wrote:
> 
> > I am curious what you have in mind in order to move this towards a
> > proper fix?  I shied away from the approach of stealing a nlookup from
> > mp_fi beacuse it wasn't clear that I could always count on the nlookup
> > in the parent staying positive.  E.g. I was afraid I was either going to
> > not have enough nlookups to move to submounts, or trigger a forget from
> > an exiting container that leads to an EBADF from the initial mount
> > namespace.
> 
> One idea is to transfer the nlookup to a separately refcounted object
> that is referenced from mp_fi as well as all the submounts.

That seems possible.  Would the idea be to move all tracking of nlookup
to a separate refcounted object for the particular nodeid, or just do
this for the first lookup of a submount?

Would you like me to put together a v3 that heads this direction?

Thanks,

-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-11 16:32             ` Krister Johansen
@ 2023-10-11 18:27               ` Miklos Szeredi
  2023-10-18  1:33                 ` Krister Johansen
  2023-10-18  1:33                 ` [PATCH v3] fuse: share lookup state between submount and its parent Krister Johansen
  0 siblings, 2 replies; 25+ messages in thread
From: Miklos Szeredi @ 2023-10-11 18:27 UTC (permalink / raw)
  To: Krister Johansen
  Cc: linux-fsdevel, Miklos Szeredi, linux-kernel, German Maglione,
	Greg Kurz, Max Reitz, Bernd Schubert

On Wed, 11 Oct 2023 at 18:32, Krister Johansen <kjlx@templeofstupid.com> wrote:
>
> On Wed, Oct 11, 2023 at 09:07:33AM +0200, Miklos Szeredi wrote:
> > On Wed, 11 Oct 2023 at 03:26, Krister Johansen <kjlx@templeofstupid.com> wrote:
> >
> > > I am curious what you have in mind in order to move this towards a
> > > proper fix?  I shied away from the approach of stealing a nlookup from
> > > mp_fi beacuse it wasn't clear that I could always count on the nlookup
> > > in the parent staying positive.  E.g. I was afraid I was either going to
> > > not have enough nlookups to move to submounts, or trigger a forget from
> > > an exiting container that leads to an EBADF from the initial mount
> > > namespace.
> >
> > One idea is to transfer the nlookup to a separately refcounted object
> > that is referenced from mp_fi as well as all the submounts.
>
> That seems possible.  Would the idea be to move all tracking of nlookup
> to a separate refcounted object for the particular nodeid, or just do
> this for the first lookup of a submount?

Just for submounts.  And yes, it should work if the count from the
first lookup is transferred to this object (fuse_iget()) and
subsequent counts (fuse_dentry_revalidate()) go to the mountpoint
inode as usual.  This will result in more than one FORGET in most
cases, but that's okay.

> Would you like me to put together a v3 that heads this direction?

That would be great, thanks.

Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent
  2023-10-11 18:27               ` Miklos Szeredi
@ 2023-10-18  1:33                 ` Krister Johansen
  2023-10-18  1:33                 ` [PATCH v3] fuse: share lookup state between submount and its parent Krister Johansen
  1 sibling, 0 replies; 25+ messages in thread
From: Krister Johansen @ 2023-10-18  1:33 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Krister Johansen, linux-fsdevel, Miklos Szeredi, linux-kernel,
	German Maglione, Greg Kurz, Max Reitz, Bernd Schubert

Hi Miklos,

On Wed, Oct 11, 2023 at 08:27:34PM +0200, Miklos Szeredi wrote:
> On Wed, 11 Oct 2023 at 18:32, Krister Johansen <kjlx@templeofstupid.com> wrote:
> >
> > On Wed, Oct 11, 2023 at 09:07:33AM +0200, Miklos Szeredi wrote:
> > > On Wed, 11 Oct 2023 at 03:26, Krister Johansen <kjlx@templeofstupid.com> wrote:
> > >
> > > > I am curious what you have in mind in order to move this towards a
> > > > proper fix?  I shied away from the approach of stealing a nlookup from
> > > > mp_fi beacuse it wasn't clear that I could always count on the nlookup
> > > > in the parent staying positive.  E.g. I was afraid I was either going to
> > > > not have enough nlookups to move to submounts, or trigger a forget from
> > > > an exiting container that leads to an EBADF from the initial mount
> > > > namespace.
> > >
> > > One idea is to transfer the nlookup to a separately refcounted object
> > > that is referenced from mp_fi as well as all the submounts.
> >
> > That seems possible.  Would the idea be to move all tracking of nlookup
> > to a separate refcounted object for the particular nodeid, or just do
> > this for the first lookup of a submount?
> 
> Just for submounts.  And yes, it should work if the count from the
> first lookup is transferred to this object (fuse_iget()) and
> subsequent counts (fuse_dentry_revalidate()) go to the mountpoint
> inode as usual.  This will result in more than one FORGET in most
> cases, but that's okay.
> 
> > Would you like me to put together a v3 that heads this direction?
> 
> That would be great, thanks.

Thanks for the pointers here.  I started over and followed the approach
that you suggested.  It condensed to a single patch, so I'll send it as
a follow-up to this thread.

-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3] fuse: share lookup state between submount and its parent
  2023-10-11 18:27               ` Miklos Szeredi
  2023-10-18  1:33                 ` Krister Johansen
@ 2023-10-18  1:33                 ` Krister Johansen
  2023-10-19 12:39                   ` Miklos Szeredi
  1 sibling, 1 reply; 25+ messages in thread
From: Krister Johansen @ 2023-10-18  1:33 UTC (permalink / raw)
  To: Miklos Szeredi, linux-fsdevel
  Cc: Miklos Szeredi, linux-kernel, German Maglione, Greg Kurz,
	Max Reitz, Bernd Schubert

Fuse submounts do not perform a lookup for the nodeid that they inherit
from their parent.  Instead, the code decrements the nlookup on the
submount's fuse_inode when it is instantiated, and no forget is
performed when a submount root is evicted.

Trouble arises when the submount's parent is evicted despite the
submount itself being in use.  In this author's case, the submount was
in a container and deatched from the initial mount namespace via a
MNT_DEATCH operation.  When memory pressure triggered the shrinker, the
inode from the parent was evicted, which triggered enough forgets to
render the submount's nodeid invalid.

Since submounts should still function, even if their parent goes away,
solve this problem by sharing refcounted state between the parent and
its submount.  When all of the references on this shared state reach
zero, it's safe to forget the final lookup of the fuse nodeid.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: stable@vger.kernel.org
Fixes: 1866d779d5d2 ("fuse: Allow fuse_fill_super_common() for submounts")
---
 fs/fuse/fuse_i.h | 20 +++++++++++
 fs/fuse/inode.c  | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 105 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 405252bb51f2..0d1659c5016b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -63,6 +63,24 @@ struct fuse_forget_link {
 	struct fuse_forget_link *next;
 };
 
+/* Submount lookup tracking */
+struct fuse_submount_lookup {
+	/** Refcount */
+	refcount_t count;
+
+	/** Unique ID, which identifies the inode between userspace
+	 * and kernel */
+	u64 nodeid;
+
+	/** Number of lookups on this inode */
+	u64 nlookup;
+
+	/** The request used for sending the FORGET message */
+	struct fuse_forget_link *forget;
+
+	struct rcu_head rcu;
+};
+
 /** FUSE inode */
 struct fuse_inode {
 	/** Inode data */
@@ -158,6 +176,8 @@ struct fuse_inode {
 	 */
 	struct fuse_inode_dax *dax;
 #endif
+	/** Submount specific lookup tracking */
+	struct fuse_submount_lookup *submount_lookup;
 };
 
 /** FUSE inode state bits */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 444418e240c8..dc1499e2074f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -68,6 +68,24 @@ struct fuse_forget_link *fuse_alloc_forget(void)
 	return kzalloc(sizeof(struct fuse_forget_link), GFP_KERNEL_ACCOUNT);
 }
 
+static struct fuse_submount_lookup *fuse_alloc_submount_lookup(void)
+{
+	struct fuse_submount_lookup *sl;
+
+	sl = kzalloc(sizeof(struct fuse_submount_lookup), GFP_KERNEL_ACCOUNT);
+	if (!sl)
+		return NULL;
+	sl->forget = fuse_alloc_forget();
+	if (!sl->forget)
+		goto out_free;
+
+	return sl;
+
+out_free:
+	kfree(sl);
+	return NULL;
+}
+
 static struct inode *fuse_alloc_inode(struct super_block *sb)
 {
 	struct fuse_inode *fi;
@@ -113,9 +131,24 @@ static void fuse_free_inode(struct inode *inode)
 	kmem_cache_free(fuse_inode_cachep, fi);
 }
 
+static void fuse_cleanup_submount_lookup(struct fuse_conn *fc,
+					 struct fuse_submount_lookup *sl)
+{
+	if (!refcount_dec_and_test(&sl->count))
+		return;
+
+	if (sl->nlookup) {
+		fuse_queue_forget(fc, sl->forget, sl->nodeid, sl->nlookup);
+		sl->forget = NULL;
+	}
+	kfree(sl->forget);
+	kfree_rcu(sl, rcu);
+}
+
 static void fuse_evict_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_submount_lookup *sl = NULL;
 
 	/* Will write inode on close/munmap and in all other dirtiers */
 	WARN_ON(inode->i_state & I_DIRTY_INODE);
@@ -132,6 +165,15 @@ static void fuse_evict_inode(struct inode *inode)
 					  fi->nlookup);
 			fi->forget = NULL;
 		}
+
+		spin_lock(&fi->lock);
+		if (fi->submount_lookup) {
+			sl = fi->submount_lookup;
+			fi->submount_lookup = NULL;
+		}
+		spin_unlock(&fi->lock);
+		if (sl)
+			fuse_cleanup_submount_lookup(fc, sl);
 	}
 	if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode)) {
 		WARN_ON(!list_empty(&fi->write_files));
@@ -332,6 +374,14 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 		fuse_dax_dontcache(inode, attr->flags);
 }
 
+static void fuse_init_submount_lookup(struct fuse_submount_lookup *sl,
+				      u64 nodeid)
+{
+	sl->nodeid = nodeid;
+	sl->nlookup = 1;
+	refcount_set(&sl->count, 1);
+}
+
 static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr,
 			    struct fuse_conn *fc)
 {
@@ -395,12 +445,22 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 	 */
 	if (fc->auto_submounts && (attr->flags & FUSE_ATTR_SUBMOUNT) &&
 	    S_ISDIR(attr->mode)) {
+		struct fuse_inode *fi;
+
 		inode = new_inode(sb);
 		if (!inode)
 			return NULL;
 
 		fuse_init_inode(inode, attr, fc);
-		get_fuse_inode(inode)->nodeid = nodeid;
+		fi = get_fuse_inode(inode);
+		fi->nodeid = nodeid;
+		fi->submount_lookup = fuse_alloc_submount_lookup();
+		if (!fi->submount_lookup) {
+			iput(inode);
+			return NULL;
+		}
+		/* Sets nlookup = 1 on fi->submount_lookup->nlookup */
+		fuse_init_submount_lookup(fi->submount_lookup, nodeid);
 		inode->i_flags |= S_AUTOMOUNT;
 		goto done;
 	}
@@ -423,11 +483,11 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		iput(inode);
 		goto retry;
 	}
-done:
 	fi = get_fuse_inode(inode);
 	spin_lock(&fi->lock);
 	fi->nlookup++;
 	spin_unlock(&fi->lock);
+done:
 	fuse_change_attributes(inode, attr, NULL, attr_valid, attr_version);
 
 	return inode;
@@ -1465,6 +1525,8 @@ static int fuse_fill_super_submount(struct super_block *sb,
 	struct super_block *parent_sb = parent_fi->inode.i_sb;
 	struct fuse_attr root_attr;
 	struct inode *root;
+	struct fuse_submount_lookup *sl;
+	struct fuse_inode *fi;
 
 	fuse_sb_defaults(sb);
 	fm->sb = sb;
@@ -1487,12 +1549,32 @@ static int fuse_fill_super_submount(struct super_block *sb,
 	 * its nlookup should not be incremented.  fuse_iget() does
 	 * that, though, so undo it here.
 	 */
-	get_fuse_inode(root)->nlookup--;
+	fi = get_fuse_inode(root);
+	fi->nlookup--;
+
 	sb->s_d_op = &fuse_dentry_operations;
 	sb->s_root = d_make_root(root);
 	if (!sb->s_root)
 		return -ENOMEM;
 
+	/*
+	 * Grab the parent's submount_lookup pointer and take a
+	 * reference on the shared nlookup from the parent.  This is to
+	 * prevent the last forget for this nodeid from getting
+	 * triggered until all users have finished with it.
+	 */
+	spin_lock(&parent_fi->lock);
+	sl = parent_fi->submount_lookup;
+	if (sl) {
+		refcount_inc(&sl->count);
+		spin_unlock(&parent_fi->lock);
+		spin_lock(&fi->lock);
+		fi->submount_lookup = sl;
+		spin_unlock(&fi->lock);
+	} else {
+		spin_unlock(&parent_fi->lock);
+	}
+
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3] fuse: share lookup state between submount and its parent
  2023-10-18  1:33                 ` [PATCH v3] fuse: share lookup state between submount and its parent Krister Johansen
@ 2023-10-19 12:39                   ` Miklos Szeredi
  2023-10-20 21:33                     ` Krister Johansen
  2023-10-20 21:34                     ` [PATCH v4] " Krister Johansen
  0 siblings, 2 replies; 25+ messages in thread
From: Miklos Szeredi @ 2023-10-19 12:39 UTC (permalink / raw)
  To: Krister Johansen
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, German Maglione,
	Greg Kurz, Max Reitz, Bernd Schubert

On Wed, Oct 18, 2023 at 3:34 AM Krister Johansen
<kjlx@templeofstupid.com> wrote:
>
> Fuse submounts do not perform a lookup for the nodeid that they inherit
> from their parent.  Instead, the code decrements the nlookup on the
> submount's fuse_inode when it is instantiated, and no forget is
> performed when a submount root is evicted.
>
> Trouble arises when the submount's parent is evicted despite the
> submount itself being in use.  In this author's case, the submount was
> in a container and deatched from the initial mount namespace via a
> MNT_DEATCH operation.  When memory pressure triggered the shrinker, the
> inode from the parent was evicted, which triggered enough forgets to
> render the submount's nodeid invalid.
>
> Since submounts should still function, even if their parent goes away,
> solve this problem by sharing refcounted state between the parent and
> its submount.  When all of the references on this shared state reach
> zero, it's safe to forget the final lookup of the fuse nodeid.
>
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> Cc: stable@vger.kernel.org
> Fixes: 1866d779d5d2 ("fuse: Allow fuse_fill_super_common() for submounts")
> ---
>  fs/fuse/fuse_i.h | 20 +++++++++++
>  fs/fuse/inode.c  | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 105 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 405252bb51f2..0d1659c5016b 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -63,6 +63,24 @@ struct fuse_forget_link {
>         struct fuse_forget_link *next;
>  };
>
> +/* Submount lookup tracking */
> +struct fuse_submount_lookup {
> +       /** Refcount */
> +       refcount_t count;
> +
> +       /** Unique ID, which identifies the inode between userspace
> +        * and kernel */
> +       u64 nodeid;
> +
> +       /** Number of lookups on this inode */
> +       u64 nlookup;

sl->nlookup will always be one.  So that can just be implicit and this
field can just go away.

> +
> +       /** The request used for sending the FORGET message */
> +       struct fuse_forget_link *forget;
> +
> +       struct rcu_head rcu;

RCU would be needed if any fields could be accessed from RCU protected
code.  But AFAICS there's no such access, so this shouldn't be needed.
  Am I missing something?

> +};
> +
>  /** FUSE inode */
>  struct fuse_inode {
>         /** Inode data */
> @@ -158,6 +176,8 @@ struct fuse_inode {
>          */
>         struct fuse_inode_dax *dax;
>  #endif
> +       /** Submount specific lookup tracking */
> +       struct fuse_submount_lookup *submount_lookup;
>  };
>
>  /** FUSE inode state bits */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 444418e240c8..dc1499e2074f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -68,6 +68,24 @@ struct fuse_forget_link *fuse_alloc_forget(void)
>         return kzalloc(sizeof(struct fuse_forget_link), GFP_KERNEL_ACCOUNT);
>  }
>
> +static struct fuse_submount_lookup *fuse_alloc_submount_lookup(void)
> +{
> +       struct fuse_submount_lookup *sl;
> +
> +       sl = kzalloc(sizeof(struct fuse_submount_lookup), GFP_KERNEL_ACCOUNT);
> +       if (!sl)
> +               return NULL;
> +       sl->forget = fuse_alloc_forget();
> +       if (!sl->forget)
> +               goto out_free;
> +
> +       return sl;
> +
> +out_free:
> +       kfree(sl);
> +       return NULL;
> +}
> +
>  static struct inode *fuse_alloc_inode(struct super_block *sb)
>  {
>         struct fuse_inode *fi;
> @@ -113,9 +131,24 @@ static void fuse_free_inode(struct inode *inode)
>         kmem_cache_free(fuse_inode_cachep, fi);
>  }
>
> +static void fuse_cleanup_submount_lookup(struct fuse_conn *fc,
> +                                        struct fuse_submount_lookup *sl)
> +{
> +       if (!refcount_dec_and_test(&sl->count))
> +               return;
> +
> +       if (sl->nlookup) {
> +               fuse_queue_forget(fc, sl->forget, sl->nodeid, sl->nlookup);
> +               sl->forget = NULL;
> +       }
> +       kfree(sl->forget);
> +       kfree_rcu(sl, rcu);
> +}
> +
>  static void fuse_evict_inode(struct inode *inode)
>  {
>         struct fuse_inode *fi = get_fuse_inode(inode);
> +       struct fuse_submount_lookup *sl = NULL;
>
>         /* Will write inode on close/munmap and in all other dirtiers */
>         WARN_ON(inode->i_state & I_DIRTY_INODE);
> @@ -132,6 +165,15 @@ static void fuse_evict_inode(struct inode *inode)
>                                           fi->nlookup);
>                         fi->forget = NULL;
>                 }
> +
> +               spin_lock(&fi->lock);
> +               if (fi->submount_lookup) {
> +                       sl = fi->submount_lookup;
> +                       fi->submount_lookup = NULL;
> +               }
> +               spin_unlock(&fi->lock);

I don't think locking is needed.  Eviction happens only once and at
that point nobody else should be touching the inode.

> +               if (sl)
> +                       fuse_cleanup_submount_lookup(fc, sl);
>         }
>         if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode)) {
>                 WARN_ON(!list_empty(&fi->write_files));
> @@ -332,6 +374,14 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>                 fuse_dax_dontcache(inode, attr->flags);
>  }
>
> +static void fuse_init_submount_lookup(struct fuse_submount_lookup *sl,
> +                                     u64 nodeid)
> +{
> +       sl->nodeid = nodeid;
> +       sl->nlookup = 1;
> +       refcount_set(&sl->count, 1);
> +}
> +
>  static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr,
>                             struct fuse_conn *fc)
>  {
> @@ -395,12 +445,22 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>          */
>         if (fc->auto_submounts && (attr->flags & FUSE_ATTR_SUBMOUNT) &&
>             S_ISDIR(attr->mode)) {
> +               struct fuse_inode *fi;
> +
>                 inode = new_inode(sb);
>                 if (!inode)
>                         return NULL;
>
>                 fuse_init_inode(inode, attr, fc);
> -               get_fuse_inode(inode)->nodeid = nodeid;
> +               fi = get_fuse_inode(inode);
> +               fi->nodeid = nodeid;
> +               fi->submount_lookup = fuse_alloc_submount_lookup();
> +               if (!fi->submount_lookup) {
> +                       iput(inode);
> +                       return NULL;
> +               }
> +               /* Sets nlookup = 1 on fi->submount_lookup->nlookup */
> +               fuse_init_submount_lookup(fi->submount_lookup, nodeid);
>                 inode->i_flags |= S_AUTOMOUNT;
>                 goto done;
>         }
> @@ -423,11 +483,11 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>                 iput(inode);
>                 goto retry;
>         }
> -done:
>         fi = get_fuse_inode(inode);
>         spin_lock(&fi->lock);
>         fi->nlookup++;
>         spin_unlock(&fi->lock);
> +done:
>         fuse_change_attributes(inode, attr, NULL, attr_valid, attr_version);
>
>         return inode;
> @@ -1465,6 +1525,8 @@ static int fuse_fill_super_submount(struct super_block *sb,
>         struct super_block *parent_sb = parent_fi->inode.i_sb;
>         struct fuse_attr root_attr;
>         struct inode *root;
> +       struct fuse_submount_lookup *sl;
> +       struct fuse_inode *fi;
>
>         fuse_sb_defaults(sb);
>         fm->sb = sb;
> @@ -1487,12 +1549,32 @@ static int fuse_fill_super_submount(struct super_block *sb,
>          * its nlookup should not be incremented.  fuse_iget() does
>          * that, though, so undo it here.
>          */
> -       get_fuse_inode(root)->nlookup--;
> +       fi = get_fuse_inode(root);
> +       fi->nlookup--;
> +
>         sb->s_d_op = &fuse_dentry_operations;
>         sb->s_root = d_make_root(root);
>         if (!sb->s_root)
>                 return -ENOMEM;
>
> +       /*
> +        * Grab the parent's submount_lookup pointer and take a
> +        * reference on the shared nlookup from the parent.  This is to
> +        * prevent the last forget for this nodeid from getting
> +        * triggered until all users have finished with it.
> +        */
> +       spin_lock(&parent_fi->lock);

Root has just been allocated, no locking needed.

> +       sl = parent_fi->submount_lookup;
> +       if (sl) {

WARN_ON(!sl);

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3] fuse: share lookup state between submount and its parent
  2023-10-19 12:39                   ` Miklos Szeredi
@ 2023-10-20 21:33                     ` Krister Johansen
  2023-10-20 21:34                     ` [PATCH v4] " Krister Johansen
  1 sibling, 0 replies; 25+ messages in thread
From: Krister Johansen @ 2023-10-20 21:33 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, German Maglione,
	Greg Kurz, Max Reitz, Bernd Schubert

Hi Miklos,
Thanks for all the feedback. I've made all the changes you requested and
was pleased to find that this reduced the overall size of the patch.

On Thu, Oct 19, 2023 at 02:39:34PM +0200, Miklos Szeredi wrote:
> On Wed, Oct 18, 2023 at 3:34 AM Krister Johansen
> <kjlx@templeofstupid.com> wrote:
> >
> > Fuse submounts do not perform a lookup for the nodeid that they inherit
> > from their parent.  Instead, the code decrements the nlookup on the
> > submount's fuse_inode when it is instantiated, and no forget is
> > performed when a submount root is evicted.
> >
> > Trouble arises when the submount's parent is evicted despite the
> > submount itself being in use.  In this author's case, the submount was
> > in a container and deatched from the initial mount namespace via a
> > MNT_DEATCH operation.  When memory pressure triggered the shrinker, the
> > inode from the parent was evicted, which triggered enough forgets to
> > render the submount's nodeid invalid.
> >
> > Since submounts should still function, even if their parent goes away,
> > solve this problem by sharing refcounted state between the parent and
> > its submount.  When all of the references on this shared state reach
> > zero, it's safe to forget the final lookup of the fuse nodeid.
> >
> > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> > Cc: stable@vger.kernel.org
> > Fixes: 1866d779d5d2 ("fuse: Allow fuse_fill_super_common() for submounts")
> > ---
> >  fs/fuse/fuse_i.h | 20 +++++++++++
> >  fs/fuse/inode.c  | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 105 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 405252bb51f2..0d1659c5016b 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -63,6 +63,24 @@ struct fuse_forget_link {
> >         struct fuse_forget_link *next;
> >  };
> >
> > +/* Submount lookup tracking */
> > +struct fuse_submount_lookup {
> > +       /** Refcount */
> > +       refcount_t count;
> > +
> > +       /** Unique ID, which identifies the inode between userspace
> > +        * and kernel */
> > +       u64 nodeid;
> > +
> > +       /** Number of lookups on this inode */
> > +       u64 nlookup;
> 
> sl->nlookup will always be one.  So that can just be implicit and this
> field can just go away.
> 
> > +
> > +       /** The request used for sending the FORGET message */
> > +       struct fuse_forget_link *forget;
> > +
> > +       struct rcu_head rcu;
> 
> RCU would be needed if any fields could be accessed from RCU protected
> code.  But AFAICS there's no such access, so this shouldn't be needed.
>   Am I missing something?

No, you're correct and not missing anything.  I've cleaned this up.

Thanks again,

-K

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4] fuse: share lookup state between submount and its parent
  2023-10-19 12:39                   ` Miklos Szeredi
  2023-10-20 21:33                     ` Krister Johansen
@ 2023-10-20 21:34                     ` Krister Johansen
  1 sibling, 0 replies; 25+ messages in thread
From: Krister Johansen @ 2023-10-20 21:34 UTC (permalink / raw)
  To: Miklos Szeredi, linux-fsdevel
  Cc: Miklos Szeredi, linux-kernel, German Maglione, Greg Kurz,
	Max Reitz, Bernd Schubert

Fuse submounts do not perform a lookup for the nodeid that they inherit
from their parent.  Instead, the code decrements the nlookup on the
submount's fuse_inode when it is instantiated, and no forget is
performed when a submount root is evicted.

Trouble arises when the submount's parent is evicted despite the
submount itself being in use.  In this author's case, the submount was
in a container and deatched from the initial mount namespace via a
MNT_DEATCH operation.  When memory pressure triggered the shrinker, the
inode from the parent was evicted, which triggered enough forgets to
render the submount's nodeid invalid.

Since submounts should still function, even if their parent goes away,
solve this problem by sharing refcounted state between the parent and
its submount.  When all of the references on this shared state reach
zero, it's safe to forget the final lookup of the fuse nodeid.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: stable@vger.kernel.org
Fixes: 1866d779d5d2 ("fuse: Allow fuse_fill_super_common() for submounts")
---
Changes since v3:

- Remove rcu head from lookup tracking struct along with unnecessary
  kfree_rcu call. (Feedback from Miklos Szeredi)
- Make nlookup one implicitly.  Remove from struct and simplify places
  where it was being used. (Feedback from Miklos Szeredi)
- Remove unnecessary spinlock acquisition. (Feedback from Miklos
  Szeredi)
- Add a WARN_ON if the lookup tracking cookie cannot be found during
  fuse_fill_super_submount.  (Feedback from Miklos Szeredi)

Changes since v2:

- Move to an approach where the lookup is shared between the submount's
  parent and children.  Use a reference counted lookup cookie to decide
  when it is safe to perform the forget of the final reference.
  (Feedback from Miklos Szeredi)

Changes since v1:

- Cleanups to pacify test robot

Changes since RFC:

- Modified fuse_fill_super_submount to always fail if dentry cannot be
  revalidated.  (Feedback from Bernd Schubert)
- Fixed up an edge case where looked up but subsequently declared
  invalid dentries were not correctly tracking nlookup.  (Error was
  introduced in my RFC).
---
 fs/fuse/fuse_i.h | 15 ++++++++++
 fs/fuse/inode.c  | 74 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 405252bb51f2..9377c46f14c4 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -63,6 +63,19 @@ struct fuse_forget_link {
 	struct fuse_forget_link *next;
 };
 
+/* Submount lookup tracking */
+struct fuse_submount_lookup {
+	/** Refcount */
+	refcount_t count;
+
+	/** Unique ID, which identifies the inode between userspace
+	 * and kernel */
+	u64 nodeid;
+
+	/** The request used for sending the FORGET message */
+	struct fuse_forget_link *forget;
+};
+
 /** FUSE inode */
 struct fuse_inode {
 	/** Inode data */
@@ -158,6 +171,8 @@ struct fuse_inode {
 	 */
 	struct fuse_inode_dax *dax;
 #endif
+	/** Submount specific lookup tracking */
+	struct fuse_submount_lookup *submount_lookup;
 };
 
 /** FUSE inode state bits */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 444418e240c8..243bda3cfdf6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -68,6 +68,24 @@ struct fuse_forget_link *fuse_alloc_forget(void)
 	return kzalloc(sizeof(struct fuse_forget_link), GFP_KERNEL_ACCOUNT);
 }
 
+static struct fuse_submount_lookup *fuse_alloc_submount_lookup(void)
+{
+	struct fuse_submount_lookup *sl;
+
+	sl = kzalloc(sizeof(struct fuse_submount_lookup), GFP_KERNEL_ACCOUNT);
+	if (!sl)
+		return NULL;
+	sl->forget = fuse_alloc_forget();
+	if (!sl->forget)
+		goto out_free;
+
+	return sl;
+
+out_free:
+	kfree(sl);
+	return NULL;
+}
+
 static struct inode *fuse_alloc_inode(struct super_block *sb)
 {
 	struct fuse_inode *fi;
@@ -113,6 +131,17 @@ static void fuse_free_inode(struct inode *inode)
 	kmem_cache_free(fuse_inode_cachep, fi);
 }
 
+static void fuse_cleanup_submount_lookup(struct fuse_conn *fc,
+					 struct fuse_submount_lookup *sl)
+{
+	if (!refcount_dec_and_test(&sl->count))
+		return;
+
+	fuse_queue_forget(fc, sl->forget, sl->nodeid, 1);
+	sl->forget = NULL;
+	kfree(sl);
+}
+
 static void fuse_evict_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -132,6 +161,11 @@ static void fuse_evict_inode(struct inode *inode)
 					  fi->nlookup);
 			fi->forget = NULL;
 		}
+
+		if (fi->submount_lookup) {
+			fuse_cleanup_submount_lookup(fc, fi->submount_lookup);
+			fi->submount_lookup = NULL;
+		}
 	}
 	if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode)) {
 		WARN_ON(!list_empty(&fi->write_files));
@@ -332,6 +366,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 		fuse_dax_dontcache(inode, attr->flags);
 }
 
+static void fuse_init_submount_lookup(struct fuse_submount_lookup *sl,
+				      u64 nodeid)
+{
+	sl->nodeid = nodeid;
+	refcount_set(&sl->count, 1);
+}
+
 static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr,
 			    struct fuse_conn *fc)
 {
@@ -395,12 +436,22 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 	 */
 	if (fc->auto_submounts && (attr->flags & FUSE_ATTR_SUBMOUNT) &&
 	    S_ISDIR(attr->mode)) {
+		struct fuse_inode *fi;
+
 		inode = new_inode(sb);
 		if (!inode)
 			return NULL;
 
 		fuse_init_inode(inode, attr, fc);
-		get_fuse_inode(inode)->nodeid = nodeid;
+		fi = get_fuse_inode(inode);
+		fi->nodeid = nodeid;
+		fi->submount_lookup = fuse_alloc_submount_lookup();
+		if (!fi->submount_lookup) {
+			iput(inode);
+			return NULL;
+		}
+		/* Sets nlookup = 1 on fi->submount_lookup->nlookup */
+		fuse_init_submount_lookup(fi->submount_lookup, nodeid);
 		inode->i_flags |= S_AUTOMOUNT;
 		goto done;
 	}
@@ -423,11 +474,11 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		iput(inode);
 		goto retry;
 	}
-done:
 	fi = get_fuse_inode(inode);
 	spin_lock(&fi->lock);
 	fi->nlookup++;
 	spin_unlock(&fi->lock);
+done:
 	fuse_change_attributes(inode, attr, NULL, attr_valid, attr_version);
 
 	return inode;
@@ -1465,6 +1516,8 @@ static int fuse_fill_super_submount(struct super_block *sb,
 	struct super_block *parent_sb = parent_fi->inode.i_sb;
 	struct fuse_attr root_attr;
 	struct inode *root;
+	struct fuse_submount_lookup *sl;
+	struct fuse_inode *fi;
 
 	fuse_sb_defaults(sb);
 	fm->sb = sb;
@@ -1487,12 +1540,27 @@ static int fuse_fill_super_submount(struct super_block *sb,
 	 * its nlookup should not be incremented.  fuse_iget() does
 	 * that, though, so undo it here.
 	 */
-	get_fuse_inode(root)->nlookup--;
+	fi = get_fuse_inode(root);
+	fi->nlookup--;
+
 	sb->s_d_op = &fuse_dentry_operations;
 	sb->s_root = d_make_root(root);
 	if (!sb->s_root)
 		return -ENOMEM;
 
+	/*
+	 * Grab the parent's submount_lookup pointer and take a
+	 * reference on the shared nlookup from the parent.  This is to
+	 * prevent the last forget for this nodeid from getting
+	 * triggered until all users have finished with it.
+	 */
+	sl = parent_fi->submount_lookup;
+	WARN_ON(!sl);
+	if (sl) {
+		refcount_inc(&sl->count);
+		fi->submount_lookup = sl;
+	}
+
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-10-20 21:35 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-02 15:24 [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker Krister Johansen
2023-10-02 15:24 ` [resend PATCH v2 1/2] fuse: revalidate: move lookup into a separate function Krister Johansen
2023-10-02 15:24 ` [resend PATCH v2 2/2] fuse: ensure that submounts lookup their parent Krister Johansen
2023-10-06 17:13   ` Bernd Schubert
2023-10-07  0:41     ` Krister Johansen
2023-10-09 12:52       ` Bernd Schubert
2023-10-09 17:15         ` Krister Johansen
2023-10-09 18:43           ` Bernd Schubert
2023-10-10  2:35             ` Krister Johansen
2023-10-09 19:45   ` Miklos Szeredi
2023-10-10  2:35     ` Krister Johansen
2023-10-10  8:15       ` Miklos Szeredi
2023-10-11  1:25         ` Krister Johansen
2023-10-11  7:07           ` Miklos Szeredi
2023-10-11 16:32             ` Krister Johansen
2023-10-11 18:27               ` Miklos Szeredi
2023-10-18  1:33                 ` Krister Johansen
2023-10-18  1:33                 ` [PATCH v3] fuse: share lookup state between submount and its parent Krister Johansen
2023-10-19 12:39                   ` Miklos Szeredi
2023-10-20 21:33                     ` Krister Johansen
2023-10-20 21:34                     ` [PATCH v4] " Krister Johansen
2023-10-02 22:18 ` [resend PATCH v2 0/2] virtiofs submounts that are still in use forgotten by shrinker Bernd Schubert
2023-10-03 16:48   ` Krister Johansen
2023-10-03 22:54     ` Bernd Schubert
2023-10-04 13:58       ` Krister Johansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).