* [Virtio-fs] 'FORGET' ordering semantics (vs unlink & NFS)
@ 2021-01-04 16:00 Dr. David Alan Gilbert
  2021-01-04 18:45 ` Vivek Goyal
  2021-01-05 10:11 ` Nikolaus Rath
  0 siblings, 2 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-04 16:00 UTC (permalink / raw)
  To: fuse-devel; +Cc: virtio-fs, vgoyal

Hi,
  On virtio-fs we're hitting a problem with NFS, where
unlinking a file in a directory and then rmdir'ing that
directory fails, complaining that the directory is not empty.

The problem here is that if a file has an open fd, NFS doesn't
actually delete the file on unlink; it just renames it to
a hidden file (e.g. .nfs*******).  That open file is there because
the 'FORGET' hasn't completed yet by the time the rmdir is issued.

Questions:
  a) In the FUSE protocol, are requests assumed to complete in order?
I.e., given unlink, forget, rmdir, is it required that 'forget' completes
before the rmdir is processed?
     (In virtiofs we've been processing requests in parallel and
have sent forgets down a separate queue to keep them out of the way.)

  b) 'forget' doesn't send a reply - so the kernel can't wait for the
client to have finished it;  do we need a synchronous forget here?

  c) Has this problem been hit by any other FUSE users (with NFS or otherwise)?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-04 16:00 [Virtio-fs] 'FORGET' ordering semantics (vs unlink & NFS) Dr. David Alan Gilbert
@ 2021-01-04 18:45 ` Vivek Goyal
  2021-01-04 18:56   ` Dr. David Alan Gilbert
  2021-01-05 10:11 ` Nikolaus Rath
  1 sibling, 1 reply; 26+ messages in thread
From: Vivek Goyal @ 2021-01-04 18:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: fuse-devel, virtio-fs

On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> Hi,
>   On virtio-fs we're hitting a problem with NFS, where
> unlinking a file in a directory and then rmdir'ing that
> directory fails complaining about the directory not being empty.
> 
> The problem here is that if a file has an open fd, NFS doesn't
> actually delete the file on unlink, it just renames it to
> a hidden file (e.g. .nfs*******).  That open file is there because
> the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> 
> Question:
>   a) In the FUSE protocol, are requests assumed to complete in order;
> i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> before the rmdir is processed?
>      (In virtiofs we've been processing requests, in parallel, and
> have sent forgets down a separate queue to keep them out of the way).
> 
>   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> client to have finished it;  do we need a synchronous forget here?

Even if we introduce a synchronous forget, will that really fix the
issue? For example, this could also happen if a file has been unlinked
but is still open while the directory is being removed:

fd = open(foo/bar.txt)
unlink foo/bar.txt
rmdir foo
close(fd)

In this case, the final forget should go after the fd has been closed. It's
not a forget race.

I wrote a test case for this and it works on regular file systems.

https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c

I suspect it will fail on NFS, because I am assuming that the temporary
file will be there until the final close(fd) happens. If that's the
case, this is an NFS-specific issue, because its behavior is different
from other file systems.
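
For illustration, a minimal standalone sketch of that sequence (untested
here; the real test is the rmdir.c linked above):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd;

    if (mkdir("foo", 0700) == -1) {
        perror("mkdir");
        return 1;
    }
    fd = open("foo/bar.txt", O_CREAT | O_RDWR, 0600);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    if (unlink("foo/bar.txt") == -1) {
        perror("unlink");
        return 1;
    }
    /* Succeeds on most local filesystems; expected to fail with
     * ENOTEMPTY on NFS because of the silly-renamed .nfsXXXX file. */
    if (rmdir("foo") == -1)
        perror("rmdir");
    close(fd);
    return 0;
}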

Vivek

> 
>   c) Has this problem been hit by any other fuse users (with NFS or otherwise)?
> 
> Dave
> 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-04 18:45 ` Vivek Goyal
@ 2021-01-04 18:56   ` Dr. David Alan Gilbert
  2021-01-04 19:04     ` Vivek Goyal
                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-04 18:56 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: fuse-devel, virtio-fs

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > Hi,
> >   On virtio-fs we're hitting a problem with NFS, where
> > unlinking a file in a directory and then rmdir'ing that
> > directory fails complaining about the directory not being empty.
> > 
> > The problem here is that if a file has an open fd, NFS doesn't
> > actually delete the file on unlink, it just renames it to
> > a hidden file (e.g. .nfs*******).  That open file is there because
> > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > 
> > Question:
> >   a) In the FUSE protocol, are requests assumed to complete in order;
> > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > before the rmdir is processed?
> >      (In virtiofs we've been processing requests, in parallel, and
> > have sent forgets down a separate queue to keep them out of the way).
> > 
> >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > client to have finished it;  do we need a synchronous forget here?
> 
> Even if we introduce a synchronous forget, will that really fix the
> issue. For example, this could also happen if file has been unlinked
> but it is still open and directory is being removed.
> 
> fd = open(foo/bar.txt)
> unlink foo/bar.txt
> rmdir foo
> close(fd).
> 
> In this case, final forget should go after fd has been closed. Its
> not a forget race. 
> 
> I wrote a test case for this and it works on regular file systems.
> 
> https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> 
> I suspect it will fail on nfs because I am assuming that temporary
> file will be there till final close(fd) happens. If that's the
> case this is a NFS specific issue because its behavior is different
> from other file systems.

That's true, but that's NFS just being NFS; in our case we're keeping
an fd open even though the guest has been smart enough not to, so we're
causing the NFS oddity when it wouldn't normally happen.

Dave

> Vivek
> 
> > 
> >   c) Has this problem been hit by any other fuse users (with NFS or otherwise)?
> > 
> > Dave
> > 
> > -- 
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-04 18:56   ` Dr. David Alan Gilbert
@ 2021-01-04 19:04     ` Vivek Goyal
  2021-01-04 19:16       ` Vivek Goyal
  2021-01-05 11:24     ` [Virtio-fs] [fuse-devel] " Miklos Szeredi
  2021-01-06  4:29     ` Amir Goldstein
  2 siblings, 1 reply; 26+ messages in thread
From: Vivek Goyal @ 2021-01-04 19:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: fuse-devel, virtio-fs

On Mon, Jan 04, 2021 at 06:56:55PM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > Hi,
> > >   On virtio-fs we're hitting a problem with NFS, where
> > > unlinking a file in a directory and then rmdir'ing that
> > > directory fails complaining about the directory not being empty.
> > > 
> > > The problem here is that if a file has an open fd, NFS doesn't
> > > actually delete the file on unlink, it just renames it to
> > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > > 
> > > Question:
> > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > before the rmdir is processed?
> > >      (In virtiofs we've been processing requests, in parallel, and
> > > have sent forgets down a separate queue to keep them out of the way).
> > > 
> > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > client to have finished it;  do we need a synchronous forget here?
> > 
> > Even if we introduce a synchronous forget, will that really fix the
> > issue. For example, this could also happen if file has been unlinked
> > but it is still open and directory is being removed.
> > 
> > fd = open(foo/bar.txt)
> > unlink foo/bar.txt
> > rmdir foo
> > close(fd).
> > 
> > In this case, final forget should go after fd has been closed. Its
> > not a forget race. 
> > 
> > I wrote a test case for this and it works on regular file systems.
> > 
> > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> > 
> > I suspect it will fail on nfs because I am assuming that temporary
> > file will be there till final close(fd) happens. If that's the
> > case this is a NFS specific issue because its behavior is different
> > from other file systems.
> 
> That's true; but that's NFS just being NFS; in our case we're keeping
> an fd open even though the guest has been smart enough not to; so we're
> causing the NFS oddity when it wouldn't normally happen.

So which file descriptor is that? Is it because of the O_PATH fd we have
stashed away in lo_inode?

Will NFS keep the .nfsXXXX file around even if only an O_PATH fd is open
for the unlinked file?
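
One quick way to check would be something like this (a sketch, untested,
to be run on an NFS mount):

#define _GNU_SOURCE         /* for O_PATH */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_CREAT | O_WRONLY, 0600);

    close(fd);
    fd = open("testfile", O_PATH);      /* hold only an O_PATH reference */
    if (fd == -1) {
        perror("open(O_PATH)");
        return 1;
    }
    if (unlink("testfile") == -1) {
        perror("unlink");
        return 1;
    }
    system("ls -a");                    /* did a .nfsXXXX file appear? */
    close(fd);
    return 0;
}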

Vivek



* Re: [Virtio-fs] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-04 19:04     ` Vivek Goyal
@ 2021-01-04 19:16       ` Vivek Goyal
  0 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2021-01-04 19:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: fuse-devel, virtio-fs

On Mon, Jan 04, 2021 at 02:04:39PM -0500, Vivek Goyal wrote:
> On Mon, Jan 04, 2021 at 06:56:55PM +0000, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > > Hi,
> > > >   On virtio-fs we're hitting a problem with NFS, where
> > > > unlinking a file in a directory and then rmdir'ing that
> > > > directory fails complaining about the directory not being empty.
> > > > 
> > > > The problem here is that if a file has an open fd, NFS doesn't
> > > > actually delete the file on unlink, it just renames it to
> > > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > > > 
> > > > Question:
> > > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > > before the rmdir is processed?
> > > >      (In virtiofs we've been processing requests, in parallel, and
> > > > have sent forgets down a separate queue to keep them out of the way).
> > > > 
> > > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > > client to have finished it;  do we need a synchronous forget here?
> > > 
> > > Even if we introduce a synchronous forget, will that really fix the
> > > issue. For example, this could also happen if file has been unlinked
> > > but it is still open and directory is being removed.
> > > 
> > > fd = open(foo/bar.txt)
> > > unlink foo/bar.txt
> > > rmdir foo
> > > close(fd).
> > > 
> > > In this case, final forget should go after fd has been closed. Its
> > > not a forget race. 
> > > 
> > > I wrote a test case for this and it works on regular file systems.
> > > 
> > > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> > > 
> > > I suspect it will fail on nfs because I am assuming that temporary
> > > file will be there till final close(fd) happens. If that's the
> > > case this is a NFS specific issue because its behavior is different
> > > from other file systems.
> > 
> > That's true; but that's NFS just being NFS; in our case we're keeping
> > an fd open even though the guest has been smart enough not to; so we're
> > causing the NFS oddity when it wouldn't normally happen.
> 
> So what file descriptor is that? Is it because of O_PATH fd we have
> stashed away in lo_inode?
> 
> Will NFS keep .nfsXXXX file around even if O_PATH fd is open for
> unlinked file?

If this issue is due to the O_PATH fd opened by virtiofsd, then I guess
we will have to look into introducing a synchronous forget and using that
instead when the inode is being dropped (I guess in fuse_evict_inode()).

Vivek



* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-04 16:00 [Virtio-fs] 'FORGET' ordering semantics (vs unlink & NFS) Dr. David Alan Gilbert
  2021-01-04 18:45 ` Vivek Goyal
@ 2021-01-05 10:11 ` Nikolaus Rath
  2021-01-05 12:28   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 26+ messages in thread
From: Nikolaus Rath @ 2021-01-05 10:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: fuse-devel, virtio-fs, vgoyal

On Jan 04 2021, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> Hi,
>   On virtio-fs we're hitting a problem with NFS, where
> unlinking a file in a directory and then rmdir'ing that
> directory fails complaining about the directory not being empty.
>
> The problem here is that if a file has an open fd, NFS doesn't
> actually delete the file on unlink, it just renames it to
> a hidden file (e.g. .nfs*******).  That open file is there because
> the 'FORGET' hasn't completed yet by the time the rmdir is issued.


Are you really talking about FORGET and file descriptors here? I always
assumed that the kernel only drops dentries (aka emits FORGET) when
all fds are closed, and I never saw otherwise in practice.

Do you mean RELEASE rather than FORGET, or dentry rather than file descriptor?

Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«




* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-04 18:56   ` Dr. David Alan Gilbert
  2021-01-04 19:04     ` Vivek Goyal
@ 2021-01-05 11:24     ` Miklos Szeredi
  2021-01-05 15:42       ` Vivek Goyal
  2021-01-06  4:29     ` Amir Goldstein
  2 siblings, 1 reply; 26+ messages in thread
From: Miklos Szeredi @ 2021-01-05 11:24 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: fuse-devel, virtio-fs-list, Vivek Goyal


On Mon, Jan 4, 2021 at 7:57 PM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > Hi,
> > >   On virtio-fs we're hitting a problem with NFS, where
> > > unlinking a file in a directory and then rmdir'ing that
> > > directory fails complaining about the directory not being empty.
> > >
> > > The problem here is that if a file has an open fd, NFS doesn't
> > > actually delete the file on unlink, it just renames it to
> > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > >
> > > Question:
> > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > before the rmdir is processed?
> > >      (In virtiofs we've been processing requests, in parallel, and
> > > have sent forgets down a separate queue to keep them out of the way).
> > >
> > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > client to have finished it;  do we need a synchronous forget here?
> >
> > Even if we introduce a synchronous forget, will that really fix the
> > issue. For example, this could also happen if file has been unlinked
> > but it is still open and directory is being removed.
> >
> > fd = open(foo/bar.txt)
> > unlink foo/bar.txt
> > rmdir foo
> > close(fd).
> >
> > In this case, final forget should go after fd has been closed. Its
> > not a forget race.
> >
> > I wrote a test case for this and it works on regular file systems.
> >
> > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> >
> > I suspect it will fail on nfs because I am assuming that temporary
> > file will be there till final close(fd) happens. If that's the
> > case this is a NFS specific issue because its behavior is different
> > from other file systems.
>
> That's true; but that's NFS just being NFS; in our case we're keeping
> an fd open even though the guest has been smart enough not to; so we're
> causing the NFS oddity when it wouldn't normally happen.

I can't think of anything better than a synchronous forget.  A
compile-tested-only patch is attached.
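
On the daemon side, handling the new opcode would be just like FORGET
except that a reply is sent, so the kernel can wait for the inode (and
its O_PATH fd) to actually be dropped.  Roughly (a hypothetical sketch;
do_forget_one() and send_reply() are assumed helpers here, not existing
virtiofsd functions):

/* Hypothetical server-side handler for FUSE_SYNC_FORGET */
static void do_sync_forget(struct fuse_in_header *in,
                           struct fuse_forget_in *arg)
{
    struct fuse_out_header out = {
        .len    = sizeof(out),
        .error  = 0,
        .unique = in->unique,  /* unlike FORGET, this request is replied to */
    };

    do_forget_one(in->nodeid, arg->nlookup);  /* assumed: drops the lo_inode
                                                 refs, closing the O_PATH fd */
    send_reply(&out);                         /* assumed transport write */
}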

Thanks,
Miklos

[-- Attachment #2: fuse-sync-forget.patch --]
[-- Type: text/x-patch, Size: 3319 bytes --]

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 78f9f209078c..daa4e669441d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -373,6 +373,26 @@ static struct vfsmount *fuse_dentry_automount(struct path *path)
 	return ERR_PTR(err);
 }
 
+static void fuse_dentry_iput(struct dentry *dentry, struct inode *inode)
+{
+	if (!__lockref_is_dead(&dentry->d_lockref)) {
+		/*
+		 * This is an unlink/rmdir and removing the last ref to the
+		 * dentry.  Use synchronous FORGET in case filesystem requests
+		 * it.
+		 *
+		 * FIXME: This is racy!  Two or more instances of
+		 * fuse_dentry_iput() could be running concurrently (unlink of
+		 * several aliases in different directories).
+		 */
+		set_bit(FUSE_I_SYNC_FORGET, &get_fuse_inode(inode)->state);
+		iput(inode);
+		clear_bit(FUSE_I_SYNC_FORGET, &get_fuse_inode(inode)->state);
+	} else {
+		iput(inode);
+	}
+}
+
 const struct dentry_operations fuse_dentry_operations = {
 	.d_revalidate	= fuse_dentry_revalidate,
 	.d_delete	= fuse_dentry_delete,
@@ -381,6 +401,7 @@ const struct dentry_operations fuse_dentry_operations = {
 	.d_release	= fuse_dentry_release,
 #endif
 	.d_automount	= fuse_dentry_automount,
+	.d_iput		= fuse_dentry_iput,
 };
 
 const struct dentry_operations fuse_root_dentry_operations = {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7c4b8cb93f9f..0820b7a63ca7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -174,6 +174,8 @@ enum {
 	FUSE_I_SIZE_UNSTABLE,
 	/* Bad inode */
 	FUSE_I_BAD,
+	/* Synchronous forget requested */
+	FUSE_I_SYNC_FORGET,
 };
 
 struct fuse_conn;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b0e18b470e91..a49ff30d1ecc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -115,6 +115,26 @@ static void fuse_free_inode(struct inode *inode)
 	kmem_cache_free(fuse_inode_cachep, fi);
 }
 
+static void fuse_sync_forget(struct inode *inode)
+{
+	struct fuse_mount *fm = get_fuse_mount(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_forget_in inarg;
+	FUSE_ARGS(args);
+
+	memset(&inarg, 0, sizeof(inarg));
+	inarg.nlookup = fi->nlookup;
+	args.opcode = FUSE_SYNC_FORGET;
+	args.nodeid = fi->nodeid;
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	args.force = true;
+
+	fuse_simple_request(fm, &args);
+	/* ignore errors */
+}
+
 static void fuse_evict_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -127,9 +147,13 @@ static void fuse_evict_inode(struct inode *inode)
 		if (FUSE_IS_DAX(inode))
 			fuse_dax_inode_cleanup(inode);
 		if (fi->nlookup) {
-			fuse_queue_forget(fc, fi->forget, fi->nodeid,
-					  fi->nlookup);
-			fi->forget = NULL;
+			if (test_bit(FUSE_I_SYNC_FORGET, &fi->state)) {
+				fuse_sync_forget(inode);
+			} else {
+				fuse_queue_forget(fc, fi->forget, fi->nodeid,
+						  fi->nlookup);
+				fi->forget = NULL;
+			}
 		}
 	}
 	if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode)) {
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 98ca64d1beb6..cfcf95cfde76 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -499,6 +499,7 @@ enum fuse_opcode {
 	FUSE_COPY_FILE_RANGE	= 47,
 	FUSE_SETUPMAPPING	= 48,
 	FUSE_REMOVEMAPPING	= 49,
+	FUSE_SYNC_FORGET	= 50,
 
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,


* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-05 10:11 ` Nikolaus Rath
@ 2021-01-05 12:28   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-05 12:28 UTC (permalink / raw)
  To: fuse-devel, virtio-fs, mszeredi, vgoyal

* Nikolaus Rath (Nikolaus@rath.org) wrote:
> On Jan 04 2021, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > Hi,
> >   On virtio-fs we're hitting a problem with NFS, where
> > unlinking a file in a directory and then rmdir'ing that
> > directory fails complaining about the directory not being empty.
> >
> > The problem here is that if a file has an open fd, NFS doesn't
> > actually delete the file on unlink, it just renames it to
> > a hidden file (e.g. .nfs*******).  That open file is there because
> > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> 
> 
> Are you really talking about FORGET and file descriptors here? I always
> assumed that the kernel will only drop dentries (aka emit FORGET) when
> all fds are closed and never saw otherwise in practice.
> 
> Do you mean RELEASE rather than FORGET, or dentry rather than file descriptor?

<reads code again> I still think I mean FORGET, and the daemon's fd.

In the (passthrough_ll) daemon we have two fds; if I've got this right,
then RELEASE closes fi->fh (i.e. a normal fd), whereas FORGET
will close inode->fd, which is the O_PATH fd.

If the user were to do:

    1 touch a/x
    2 rm a/x
    3 rmdir a

They would close(2) the fd at the end of the touch; so I think that
would cause us to see the RELEASE - but we don't see the FORGET yet.
Only after the 'rm', when a/x shouldn't exist any more, do I think we
get a FORGET and close our inode->fd.
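
Schematically (a simplified sketch; field names as in passthrough_ll.c):

struct lo_inode {
    int fd;            /* O_PATH fd: pins the host inode, closed on FORGET */
    uint64_t nlookup;  /* kernel's lookup count; FORGET decrements this */
    /* ... */
};

/* OPEN:    fi->fh maps to a normal fd from openat(lo->proc_self_fd, ...)
 * RELEASE: closes that normal fd
 * FORGET:  (once nlookup drops to 0) closes inode->fd, the O_PATH fd */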

Dave

> Best,
> -Nikolaus
> 
> -- 
> GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
> 
>              »Time flies like an arrow, fruit flies like a Banana.«
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-05 11:24     ` [Virtio-fs] [fuse-devel] " Miklos Szeredi
@ 2021-01-05 15:42       ` Vivek Goyal
  0 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2021-01-05 15:42 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: fuse-devel, virtio-fs-list

On Tue, Jan 05, 2021 at 12:24:48PM +0100, Miklos Szeredi wrote:
> On Mon, Jan 4, 2021 at 7:57 PM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > > Hi,
> > > >   On virtio-fs we're hitting a problem with NFS, where
> > > > unlinking a file in a directory and then rmdir'ing that
> > > > directory fails complaining about the directory not being empty.
> > > >
> > > > The problem here is that if a file has an open fd, NFS doesn't
> > > > actually delete the file on unlink, it just renames it to
> > > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > > >
> > > > Question:
> > > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > > before the rmdir is processed?
> > > >      (In virtiofs we've been processing requests, in parallel, and
> > > > have sent forgets down a separate queue to keep them out of the way).
> > > >
> > > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > > client to have finished it;  do we need a synchronous forget here?
> > >
> > > Even if we introduce a synchronous forget, will that really fix the
> > > issue. For example, this could also happen if file has been unlinked
> > > but it is still open and directory is being removed.
> > >
> > > fd = open(foo/bar.txt)
> > > unlink foo/bar.txt
> > > rmdir foo
> > > close(fd).
> > >
> > > In this case, final forget should go after fd has been closed. Its
> > > not a forget race.
> > >
> > > I wrote a test case for this and it works on regular file systems.
> > >
> > > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> > >
> > > I suspect it will fail on nfs because I am assuming that temporary
> > > file will be there till final close(fd) happens. If that's the
> > > case this is a NFS specific issue because its behavior is different
> > > from other file systems.
> >
> > That's true; but that's NFS just being NFS; in our case we're keeping
> > an fd open even though the guest has been smart enough not to; so we're
> > causing the NFS oddity when it wouldn't normally happen.
> 
> Can't think of anything better than a synchronous forget.   Compile
> only tested patch attached.
> 
> Thanks,
> Miklos

> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 78f9f209078c..daa4e669441d 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -373,6 +373,26 @@ static struct vfsmount *fuse_dentry_automount(struct path *path)
>  	return ERR_PTR(err);
>  }
>  
> +static void fuse_dentry_iput(struct dentry *dentry, struct inode *inode)
> +{
> +	if (!__lockref_is_dead(&dentry->d_lockref)) {
> +		/*
> +		 * This is an unlink/rmdir and removing the last ref to the
> +		 * dentry.  Use synchronous FORGET in case filesystem requests
> +		 * it.
> +		 *
> +		 * FIXME: This is racy!  Two or more instances of
> +		 * fuse_dentry_iput() could be running concurrently (unlink of
> +		 * several aliases in different directories).
> +		 */
> +		set_bit(FUSE_I_SYNC_FORGET, &get_fuse_inode(inode)->state);
> +		iput(inode);
> +		clear_bit(FUSE_I_SYNC_FORGET, &get_fuse_inode(inode)->state);

Hi Miklos,

Can the above iput() drop the final reference on the inode and free it? If
yes, then it is not safe to call clear_bit() after iput(), since the
fuse_inode may already have been freed at that point.

Vivek

> +	} else {
> +		iput(inode);
> +	}
> +}
> +
>  const struct dentry_operations fuse_dentry_operations = {
>  	.d_revalidate	= fuse_dentry_revalidate,
>  	.d_delete	= fuse_dentry_delete,
> @@ -381,6 +401,7 @@ const struct dentry_operations fuse_dentry_operations = {
>  	.d_release	= fuse_dentry_release,
>  #endif
>  	.d_automount	= fuse_dentry_automount,
> +	.d_iput		= fuse_dentry_iput,
>  };
>  
>  const struct dentry_operations fuse_root_dentry_operations = {
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 7c4b8cb93f9f..0820b7a63ca7 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -174,6 +174,8 @@ enum {
>  	FUSE_I_SIZE_UNSTABLE,
>  	/* Bad inode */
>  	FUSE_I_BAD,
> +	/* Synchronous forget requested */
> +	FUSE_I_SYNC_FORGET,
>  };
>  
>  struct fuse_conn;
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b0e18b470e91..a49ff30d1ecc 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -115,6 +115,26 @@ static void fuse_free_inode(struct inode *inode)
>  	kmem_cache_free(fuse_inode_cachep, fi);
>  }
>  
> +static void fuse_sync_forget(struct inode *inode)
> +{
> +	struct fuse_mount *fm = get_fuse_mount(inode);
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_forget_in inarg;
> +	FUSE_ARGS(args);
> +
> +	memset(&inarg, 0, sizeof(inarg));
> +	inarg.nlookup = fi->nlookup;
> +	args.opcode = FUSE_SYNC_FORGET;
> +	args.nodeid = fi->nodeid;
> +	args.in_numargs = 1;
> +	args.in_args[0].size = sizeof(inarg);
> +	args.in_args[0].value = &inarg;
> +	args.force = true;
> +
> +	fuse_simple_request(fm, &args);
> +	/* ignore errors */
> +}
> +
>  static void fuse_evict_inode(struct inode *inode)
>  {
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> @@ -127,9 +147,13 @@ static void fuse_evict_inode(struct inode *inode)
>  		if (FUSE_IS_DAX(inode))
>  			fuse_dax_inode_cleanup(inode);
>  		if (fi->nlookup) {
> -			fuse_queue_forget(fc, fi->forget, fi->nodeid,
> -					  fi->nlookup);
> -			fi->forget = NULL;
> +			if (test_bit(FUSE_I_SYNC_FORGET, &fi->state)) {
> +				fuse_sync_forget(inode);
> +			} else {
> +				fuse_queue_forget(fc, fi->forget, fi->nodeid,
> +						  fi->nlookup);
> +				fi->forget = NULL;
> +			}
>  		}
>  	}
>  	if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode)) {
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 98ca64d1beb6..cfcf95cfde76 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -499,6 +499,7 @@ enum fuse_opcode {
>  	FUSE_COPY_FILE_RANGE	= 47,
>  	FUSE_SETUPMAPPING	= 48,
>  	FUSE_REMOVEMAPPING	= 49,
> +	FUSE_SYNC_FORGET	= 50,
>  
>  	/* CUSE specific operations */
>  	CUSE_INIT		= 4096,



* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-04 18:56   ` Dr. David Alan Gilbert
  2021-01-04 19:04     ` Vivek Goyal
  2021-01-05 11:24     ` [Virtio-fs] [fuse-devel] " Miklos Szeredi
@ 2021-01-06  4:29     ` Amir Goldstein
  2021-01-06  8:01       ` Miklos Szeredi
  2 siblings, 1 reply; 26+ messages in thread
From: Amir Goldstein @ 2021-01-06  4:29 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: fuse-devel, virtio-fs-list, Vivek Goyal

On Mon, Jan 4, 2021 at 8:57 PM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > Hi,
> > >   On virtio-fs we're hitting a problem with NFS, where
> > > unlinking a file in a directory and then rmdir'ing that
> > > directory fails complaining about the directory not being empty.
> > >
> > > The problem here is that if a file has an open fd, NFS doesn't
> > > actually delete the file on unlink, it just renames it to
> > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > >
> > > Question:
> > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > before the rmdir is processed?
> > >      (In virtiofs we've been processing requests, in parallel, and
> > > have sent forgets down a separate queue to keep them out of the way).
> > >
> > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > client to have finished it;  do we need a synchronous forget here?
> >
> > Even if we introduce a synchronous forget, will that really fix the
> > issue. For example, this could also happen if file has been unlinked
> > but it is still open and directory is being removed.
> >
> > fd = open(foo/bar.txt)
> > unlink foo/bar.txt
> > rmdir foo
> > close(fd).
> >
> > In this case, final forget should go after fd has been closed. Its
> > not a forget race.
> >
> > I wrote a test case for this and it works on regular file systems.
> >
> > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> >
> > I suspect it will fail on nfs because I am assuming that temporary
> > file will be there till final close(fd) happens. If that's the
> > case this is a NFS specific issue because its behavior is different
> > from other file systems.
>
> That's true; but that's NFS just being NFS; in our case we're keeping
> an fd open even though the guest has been smart enough not to; so we're
> causing the NFS oddity when it wouldn't normally happen.
>

Are you sure that you really need this oddity?

My sense from looking at virtiofsd is that the open O_PATH fd
in InodeData for non-directories is overkill, and even the need
for an open fd for all directories is questionable.

If you store a FileHandle (name_to_handle_at(2)) instead of an open fd
for non-directories, you won't be keeping a reference on the underlying
inode, so there is no unlink issue.

open_by_handle_at(2) is very cheap for a non-directory when the underlying
inode is cached, and as cheap as it can get even when the inode is not in
cache, so no performance penalty is expected.
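
A minimal sketch of the idea (assumed flow; error handling and handle
sizing simplified, mount_fd is any fd on the same mount):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>

/* Sketch: store a file handle at lookup time instead of an O_PATH fd.
 * No fd is held, so the underlying inode is not pinned and unlink can
 * really delete the file. */
struct stored_handle {
    int mount_fd;            /* any fd on the mount, for open_by_handle_at() */
    struct file_handle *fh;  /* filled in by name_to_handle_at() */
};

static int stash_handle(int dirfd, const char *name, struct stored_handle *sh)
{
    int mount_id;

    sh->fh = malloc(sizeof(*sh->fh) + MAX_HANDLE_SZ);
    sh->fh->handle_bytes = MAX_HANDLE_SZ;
    return name_to_handle_at(dirfd, name, sh->fh, &mount_id, 0);
}

static int reopen_by_handle(const struct stored_handle *sh, int flags)
{
    /* Note: needs CAP_DAC_READ_SEARCH -- see the follow-ups below. */
    return open_by_handle_at(sh->mount_fd, sh->fh, flags);
}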

Essentially, all the open fd gives you is pinning the underlying inode in
the cache, and I am not at all sure this is what you want to achieve.

Note that if virtiofs wants to provide "proper" NFS support, where NFS
file handles survive host reboot, you will need FUSE support for
LOOKUP_HANDLE [1], and then you don't even have to keep InodeData around
until FORGET; but IMO there is no reason to wait for LOOKUP_HANDLE before
making the change above.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/



* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-06  4:29     ` Amir Goldstein
@ 2021-01-06  8:01       ` Miklos Szeredi
  2021-01-06  9:16         ` Amir Goldstein
  0 siblings, 1 reply; 26+ messages in thread
From: Miklos Szeredi @ 2021-01-06  8:01 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: fuse-devel, Max Reitz, virtio-fs-list, Vivek Goyal

On Wed, Jan 6, 2021 at 5:29 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Jan 4, 2021 at 8:57 PM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > > Hi,
> > > >   On virtio-fs we're hitting a problem with NFS, where
> > > > unlinking a file in a directory and then rmdir'ing that
> > > > directory fails complaining about the directory not being empty.
> > > >
> > > > The problem here is that if a file has an open fd, NFS doesn't
> > > > actually delete the file on unlink, it just renames it to
> > > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > > >
> > > > Question:
> > > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > > before the rmdir is processed?
> > > >      (In virtiofs we've been processing requests, in parallel, and
> > > > have sent forgets down a separate queue to keep them out of the way).
> > > >
> > > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > > client to have finished it;  do we need a synchronous forget here?
> > >
> > > Even if we introduce a synchronous forget, will that really fix the
> > > issue. For example, this could also happen if file has been unlinked
> > > but it is still open and directory is being removed.
> > >
> > > fd = open(foo/bar.txt)
> > > unlink foo/bar.txt
> > > rmdir foo
> > > close(fd).
> > >
> > > In this case, final forget should go after fd has been closed. Its
> > > not a forget race.
> > >
> > > I wrote a test case for this and it works on regular file systems.
> > >
> > > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> > >
> > > I suspect it will fail on nfs because I am assuming that temporary
> > > file will be there till final close(fd) happens. If that's the
> > > case this is a NFS specific issue because its behavior is different
> > > from other file systems.
> >
> > That's true; but that's NFS just being NFS; in our case we're keeping
> > an fd open even though the guest has been smart enough not to; so we're
> > causing the NFS oddity when it wouldn't normally happen.
> >
>
> Are you sure that you really need this oddity?
>
> My sense from looking virtiofsd is that the open O_PATH fd
> in InodeData for non-directories are an overkill and even the need
> for open fd for all directories is questionable.
>
> If you store a FileHandle (name_to_handle_at(2)) instead of an open fd
> for non-directories, you won't be keeping a reference on the underlying inode
> so no unlink issue.
>
> open_by_handle_at(2) is very cheap for non-directory when underlying inode
> is cached and as cheap as it can get even when inode is not in cache, so no
> performance penalty is expected.

You are perfectly right that using file handles would solve a number
of issues, one being too many open file descriptors.

The issue with open_by_handle_at(2) is that it needs
CAP_DAC_READ_SEARCH in the initial user namespace. That currently
makes it impossible to use in containers and such.

Not sure if there have been proposals for making open_by_handle_at(2)
usable in user namespaces.  The difficulty is in verifying that an
open file would have been obtainable by path lookup by the given user.

Thanks,
Miklos



* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-06  8:01       ` Miklos Szeredi
@ 2021-01-06  9:16         ` Amir Goldstein
  2021-01-06  9:27           ` Amir Goldstein
                             ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Amir Goldstein @ 2021-01-06  9:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: fuse-devel, Max Reitz, virtio-fs-list, Vivek Goyal

On Wed, Jan 6, 2021 at 10:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, Jan 6, 2021 at 5:29 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Jan 4, 2021 at 8:57 PM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > >
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > > > Hi,
> > > > >   On virtio-fs we're hitting a problem with NFS, where
> > > > > unlinking a file in a directory and then rmdir'ing that
> > > > > directory fails complaining about the directory not being empty.
> > > > >
> > > > > The problem here is that if a file has an open fd, NFS doesn't
> > > > > actually delete the file on unlink, it just renames it to
> > > > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > > > >
> > > > > Question:
> > > > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > > > before the rmdir is processed?
> > > > >      (In virtiofs we've been processing requests, in parallel, and
> > > > > have sent forgets down a separate queue to keep them out of the way).
> > > > >
> > > > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > > > client to have finished it;  do we need a synchronous forget here?
> > > >
> > > > Even if we introduce a synchronous forget, will that really fix the
> > > > issue. For example, this could also happen if file has been unlinked
> > > > but it is still open and directory is being removed.
> > > >
> > > > fd = open(foo/bar.txt)
> > > > unlink foo/bar.txt
> > > > rmdir foo
> > > > close(fd).
> > > >
> > > > In this case, final forget should go after fd has been closed. Its
> > > > not a forget race.
> > > >
> > > > I wrote a test case for this and it works on regular file systems.
> > > >
> > > > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> > > >
> > > > I suspect it will fail on nfs because I am assuming that temporary
> > > > file will be there till final close(fd) happens. If that's the
> > > > case this is a NFS specific issue because its behavior is different
> > > > from other file systems.
> > >
> > > That's true; but that's NFS just being NFS; in our case we're keeping
> > > an fd open even though the guest has been smart enough not to; so we're
> > > causing the NFS oddity when it wouldn't normally happen.
> > >
> >
> > Are you sure that you really need this oddity?
> >
> > My sense from looking virtiofsd is that the open O_PATH fd
> > in InodeData for non-directories are an overkill and even the need
> > for open fd for all directories is questionable.
> >
> > If you store a FileHandle (name_to_handle_at(2)) instead of an open fd
> > for non-directories, you won't be keeping a reference on the underlying inode
> > so no unlink issue.
> >
> > open_by_handle_at(2) is very cheap for non-directory when underlying inode
> > is cached and as cheap as it can get even when inode is not in cache, so no
> > performance penalty is expected.
>
> You are perfectly right that using file handles would solve a number
> of issues, one being too many open file descriptors.
>
> The issue with open_by_handle_at(2) is that it needs
> CAP_DAC_READ_SEARCH in the initial user namespace. That currently
> makes it impossible to use in containers and such.

Is that a problem for virtiofsd? Does it also run inside a container?

Please note that NFS doesn't do "silly rename" for directories,
so mitigation is mostly needed for non-dir.

An alternative method, if the daemon is not capable, is to store the
parent dirfd in addition to the filehandle and implement an
open_child_by_handle_at(int parent_fd, ...) helper (sketched below):
- readdir(parent_fd)
- search for a match on d_ino
- name_to_handle_at() and verify the match against the stored filehandle

This is essentially what open_by_handle_at(2) does under the covers
with a "connectable" non-dir filehandle, after having resolved the
parent file handle part. And "connectable" file handles are used by nfsd
to enforce "subtree_check", to make sure that the file wasn't moved outside
an obtainable path after the initial lookup.
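
A hypothetical sketch of that helper (untested; error handling mostly
elided, and open_child_by_handle_at() is the proposed name, not an
existing syscall):

#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>

/* Scan the parent directory for an entry with a matching inode number,
 * then confirm it is really the same file by comparing file handles
 * before opening it. */
static int open_child_by_handle_at(int parent_fd, ino_t ino,
                                   const struct file_handle *want)
{
    DIR *dir = fdopendir(dup(parent_fd));  /* dup: fdopendir takes ownership */
    struct dirent *de;
    int fd = -1;

    if (!dir)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        char buf[sizeof(struct file_handle) + MAX_HANDLE_SZ];
        struct file_handle *fh = (struct file_handle *)buf;
        int mount_id;

        if (de->d_ino != ino)
            continue;
        fh->handle_bytes = MAX_HANDLE_SZ;
        if (name_to_handle_at(dirfd(dir), de->d_name, fh, &mount_id, 0) == -1)
            continue;
        if (fh->handle_type == want->handle_type &&
            fh->handle_bytes == want->handle_bytes &&
            memcmp(fh->f_handle, want->f_handle, fh->handle_bytes) == 0) {
            fd = openat(dirfd(dir), de->d_name, O_PATH | O_NOFOLLOW);
            break;
        }
    }
    closedir(dir);
    return fd;
}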

>
> Not sure if there has been proposals for making open_by_handle_at(2)
> usable in user namespaces.

I don't remember seeing any.

> The difficulty is in verifying that an
> open file would have been obtainable by path lookup by the given user.

I think it could be allowed for a user with CAP_DAC_READ_SEARCH in a
userns, for an FS_USERNS_MOUNT filesystem mounted in that userns.

Thanks,
Amir.



* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-06  9:16         ` Amir Goldstein
@ 2021-01-06  9:27           ` Amir Goldstein
  2021-01-06 13:40           ` Miklos Szeredi
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Amir Goldstein @ 2021-01-06  9:27 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: fuse-devel, Max Reitz, virtio-fs-list, Vivek Goyal

On Wed, Jan 6, 2021 at 11:16 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Jan 6, 2021 at 10:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, Jan 6, 2021 at 5:29 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Mon, Jan 4, 2021 at 8:57 PM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > >
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > Hi,
> > > > > >   On virtio-fs we're hitting a problem with NFS, where
> > > > > > unlinking a file in a directory and then rmdir'ing that
> > > > > > directory fails complaining about the directory not being empty.
> > > > > >
> > > > > > The problem here is that if a file has an open fd, NFS doesn't
> > > > > > actually delete the file on unlink, it just renames it to
> > > > > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > > > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > > > > >
> > > > > > Question:
> > > > > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > > > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > > > > before the rmdir is processed?
> > > > > >      (In virtiofs we've been processing requests, in parallel, and
> > > > > > have sent forgets down a separate queue to keep them out of the way).
> > > > > >
> > > > > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > > > > client to have finished it;  do we need a synchronous forget here?
> > > > >
> > > > > Even if we introduce a synchronous forget, will that really fix the
> > > > > issue. For example, this could also happen if file has been unlinked
> > > > > but it is still open and directory is being removed.
> > > > >
> > > > > fd = open(foo/bar.txt)
> > > > > unlink foo/bar.txt
> > > > > rmdir foo
> > > > > close(fd).
> > > > >
> > > > > In this case, final forget should go after fd has been closed. Its
> > > > > not a forget race.
> > > > >
> > > > > I wrote a test case for this and it works on regular file systems.
> > > > >
> > > > > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> > > > >
> > > > > I suspect it will fail on nfs because I am assuming that temporary
> > > > > file will be there till final close(fd) happens. If that's the
> > > > > case this is a NFS specific issue because its behavior is different
> > > > > from other file systems.
> > > >
> > > > That's true; but that's NFS just being NFS; in our case we're keeping
> > > > an fd open even though the guest has been smart enough not to; so we're
> > > > causing the NFS oddity when it wouldn't normally happen.
> > > >
> > >
> > > Are you sure that you really need this oddity?
> > >
> > > My sense from looking virtiofsd is that the open O_PATH fd
> > > in InodeData for non-directories are an overkill and even the need
> > > for open fd for all directories is questionable.
> > >
> > > If you store a FileHandle (name_to_handle_at(2)) instead of an open fd
> > > for non-directories, you won't be keeping a reference on the underlying inode
> > > so no unlink issue.
> > >
> > > open_by_handle_at(2) is very cheap for non-directory when underlying inode
> > > is cached and as cheap as it can get even when inode is not in cache, so no
> > > performance penalty is expected.
> >
> > You are perfectly right that using file handles would solve a number
> > of issues, one being too many open file descriptors.
> >
> > The issue with open_by_handle_at(2) is that it needs
> > CAP_DAC_READ_SEARCH in the initial user namespace. That currently
> > makes it impossible to use in containers and such.
>
> Is that a problem for virtiofsd? does it also run inside a container??
>
> Please note that NFS doesn't do "silly rename" for directories,
> so mitigation is mostly needed for non-dir.
>
> An alternative method if daemon is not capable, is to store parent dirfd
> in addition to filehandle and implement open_child_by_handle_at(int
> parent_fd, ...):
> - readdir(parend_fd)
> - search a match for d_ino
> - name_to_handle_at() and verify match to stored filehandle

That would have to be AT_EMPTY_PATH after opening the file
by name.
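
I.e., roughly (a sketch; _GNU_SOURCE, <fcntl.h> and <unistd.h> assumed,
fh sized for MAX_HANDLE_SZ by the caller):

/* Open the child by name first, then derive the handle from the open fd
 * with AT_EMPTY_PATH, so the handle describes the file actually opened. */
static int handle_from_open(int parent_fd, const char *name,
                            struct file_handle *fh)
{
    int fd = openat(parent_fd, name, O_PATH | O_NOFOLLOW);
    int mount_id;

    if (fd == -1)
        return -1;
    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(fd, "", fh, &mount_id, AT_EMPTY_PATH) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}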

>
> This is essentially what open_by_handle_at(2) does under the covers
> with a "connectable" non-dir filehandle after having resolved the
> parent file handle part.

I meant this is what's done under the covers for the non-cached inode case
of "connectable" non-dir filehandles. Of course we cannot do that on
every lookup.

What I forgot to say is that with this method we can close file fds if
we have too many open files, or on unlink(), and open the file again
this way if needed.

Thanks,
Amir.



* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-06  9:16         ` Amir Goldstein
  2021-01-06  9:27           ` Amir Goldstein
@ 2021-01-06 13:40           ` Miklos Szeredi
  2021-01-06 16:57             ` Vivek Goyal
  2021-01-08 15:55           ` Vivek Goyal
  2021-01-11 15:48           ` Dr. David Alan Gilbert
  3 siblings, 1 reply; 26+ messages in thread
From: Miklos Szeredi @ 2021-01-06 13:40 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: fuse-devel, Max Reitz, virtio-fs-list, Vivek Goyal


On Wed, Jan 6, 2021 at 10:16 AM Amir Goldstein <amir73il@gmail.com> wrote:

> Please note that NFS doesn't do "silly rename" for directories,
> so mitigation is mostly needed for non-dir.

Okay.

> An alternative method if daemon is not capable, is to store parent dirfd
> in addition to filehandle and implement open_child_by_handle_at(int
> parent_fd, ...):
> - readdir(parend_fd)
> - search a match for d_ino
> - name_to_handle_at() and verify match to stored filehandle
>
> This is essentially what open_by_handle_at(2) does under the covers
> with a "connectable" non-dir filehandle after having resolved the
> parent file handle part. And "connectable" file handles are used by nfsd
> to enforce "subtree_check" to make sure that file wasn't moved outside
> obtainable path after initial lookup.

Yes, that sort of makes sense, but it will have corner cases for hard links
in different directories, open files after unlink, etc.

Also back to the original problem: what we really want is to close the
O_PATH descriptor on unlink().  This should be possible, regardless of
any FORGET, assuming:

1) no open files exist that reference the inode
2) no aliases have been looked up (i.e. just one cached dentry)

The attached untested patch tries to do this.

Thanks,
Miklos

[-- Attachment #2: virtiofsd-close-on-unlink.patch --]
[-- Type: text/x-patch, Size: 3920 bytes --]

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index ec1008bceba8..d9c03e87e57d 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -107,6 +107,11 @@ struct lo_inode {
      */
     gint refcount;
 
+    /*
+     * Number of open instances
+     */
+    gint opencount;
+
     struct lo_key key;
 
     /*
@@ -901,6 +906,18 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
 
     inode = lo_find(lo, &e->attr, mnt_id);
     if (inode) {
+        char buf1[PATH_MAX + 1], buf2[PATH_MAX + 1], procname[64];
+        ssize_t siz1, siz2;
+
+        sprintf(procname, "%i", inode->fd);
+        siz1 = readlinkat(lo->proc_self_fd, procname, buf1, sizeof(buf1));
+        sprintf(procname, "%i", newfd);
+        siz2 = readlinkat(lo->proc_self_fd, procname, buf2, sizeof(buf2));
+
+        /* disable close on unlink if alias is detected */
+        if (siz1 != siz2 || memcmp(buf1, buf2, siz1))
+            g_atomic_int_inc(&inode->opencount);
+
         close(newfd);
     } else {
         inode = calloc(1, sizeof(struct lo_inode));
@@ -917,6 +934,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
          */
         g_atomic_int_set(&inode->refcount, 2);
 
+        g_atomic_int_set(&inode->opencount, 0);
         inode->nlookup = 1;
         inode->fd = newfd;
         inode->key.ino = e->attr.st_ino;
@@ -1295,6 +1313,10 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
     res = unlinkat(lo_fd(req, parent), name, 0);
 
     fuse_reply_err(req, res == -1 ? errno : 0);
+    if (!g_atomic_int_get(&inode->opencount)) {
+        close(inode->fd);
+        inode->fd = -1;
+    }
     unref_inode_lolocked(lo, inode, 1);
     lo_inode_put(lo, &inode);
 }
@@ -1904,25 +1926,30 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
     ssize_t fh;
     char buf[64];
     struct lo_data *lo = lo_data(req);
+    struct lo_inode *inode = lo_inode(req, ino);
+    int err = EBADF;
+
+    if (!inode)
+        goto out_err;
 
     fuse_log(FUSE_LOG_DEBUG, "lo_open(ino=%" PRIu64 ", flags=%d)\n", ino,
              fi->flags);
 
     update_open_flags(lo->writeback, lo->allow_direct_io, fi);
 
-    sprintf(buf, "%i", lo_fd(req, ino));
+    sprintf(buf, "%i", inode->fd);
     fd = openat(lo->proc_self_fd, buf, fi->flags & ~O_NOFOLLOW);
-    if (fd == -1) {
-        return (void)fuse_reply_err(req, errno);
-    }
+    err = errno;
+    if (fd == -1)
+        goto out_err;
 
     pthread_mutex_lock(&lo->mutex);
     fh = lo_add_fd_mapping(req, fd);
     pthread_mutex_unlock(&lo->mutex);
     if (fh == -1) {
         close(fd);
-        fuse_reply_err(req, ENOMEM);
-        return;
+        err = ENOMEM;
+        goto out_err;
     }
 
     fi->fh = fh;
@@ -1931,18 +1958,26 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
     } else if (lo->cache == CACHE_ALWAYS) {
         fi->keep_cache = 1;
     }
+    g_atomic_int_inc(&inode->opencount);
+    lo_inode_put(lo, &inode);
     fuse_reply_open(req, fi);
+    return;
+
+out_err:
+    lo_inode_put(lo, &inode);
+    fuse_reply_err(req, err);
 }
 
 static void lo_release(fuse_req_t req, fuse_ino_t ino,
                        struct fuse_file_info *fi)
 {
     struct lo_data *lo = lo_data(req);
+    struct lo_inode *inode = lo_inode(req, ino);
     struct lo_map_elem *elem;
     int fd = -1;
 
-    (void)ino;
-
+    if (inode)
+        g_atomic_int_dec_and_test(&inode->opencount);
     pthread_mutex_lock(&lo->mutex);
     elem = lo_map_get(&lo->fd_map, fi->fh);
     if (elem) {
@@ -1951,6 +1986,7 @@ static void lo_release(fuse_req_t req, fuse_ino_t ino,
         lo_map_remove(&lo->fd_map, fi->fh);
     }
     pthread_mutex_unlock(&lo->mutex);
+    lo_inode_put(lo, &inode);
 
     close(fd);
     fuse_reply_err(req, 0);


* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-06 13:40           ` Miklos Szeredi
@ 2021-01-06 16:57             ` Vivek Goyal
  2021-01-07  8:44               ` Miklos Szeredi
  0 siblings, 1 reply; 26+ messages in thread
From: Vivek Goyal @ 2021-01-06 16:57 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: fuse-devel, Amir Goldstein, Max Reitz, virtio-fs-list

On Wed, Jan 06, 2021 at 02:40:45PM +0100, Miklos Szeredi wrote:
> On Wed, Jan 6, 2021 at 10:16 AM Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > Please note that NFS doesn't do "silly rename" for directories,
> > so mitigation is mostly needed for non-dir.
> 
> Okay.
> 
> > An alternative method if daemon is not capable, is to store parent dirfd
> > in addition to filehandle and implement open_child_by_handle_at(int
> > parent_fd, ...):
> > - readdir(parend_fd)
> > - search a match for d_ino
> > - name_to_handle_at() and verify match to stored filehandle
> >
> > This is essentially what open_by_handle_at(2) does under the covers
> > with a "connectable" non-dir filehandle after having resolved the
> > parent file handle part. And "connectable" file handles are used by nfsd
> > to enforce "subtree_check" to make sure that file wasn't moved outside
> > obtainable path after initial lookup.
> 
> Yes, sort of makes sense, but will have corner cases for hard links in
> different directories, open files after unlink, etc..
> 
> Also back to the original problem: what we really want is to close the
> O_PATH descriptor on unlink().  This should be possible, regardless of
> any FORGET, assuming
> 
> 1) no open files exist that reference the inode
> 2) no aliases have been looked up (i.e. just one cached dentry)
> 
> The attached untested patch tries to do this.
> 
> Thanks,
> Miklos

> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index ec1008bceba8..d9c03e87e57d 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -107,6 +107,11 @@ struct lo_inode {
>       */
>      gint refcount;
>  
> +    /*
> +     * Number of open instances
> +     */
> +    gint opencount;
> +
>      struct lo_key key;
>  
>      /*
> @@ -901,6 +906,18 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>  
>      inode = lo_find(lo, &e->attr, mnt_id);
>      if (inode) {
> +        char buf1[PATH_MAX + 1], buf2[PATH_MAX + 1], procname[64];
> +        ssize_t siz1, siz2;
> +
> +        sprintf(procname, "%i", inode->fd);
> +        siz1 = readlinkat(lo->proc_self_fd, procname, buf1, sizeof(buf1));
> +        sprintf(procname, "%i", newfd);
> +        siz2 = readlinkat(lo->proc_self_fd, procname, buf2, sizeof(buf2));
> +
> +        /* disable close on unlink if alias is detected */
> +        if (siz1 != siz2 || memcmp(buf1, buf2, siz1))
> +            g_atomic_int_inc(&inode->opencount);
> +

Hi Miklos,

So if I have a hard links to a file in a separate directories, then this
path can hit. (Say dir1/file.txt and dir2/file-link.txt). IIUC, we will
disable automatic inode->fd closing on these hardlinked files. That
means this solution will not solve problem at hand for hard linked files.
Am I missing something.

>          close(newfd);
>      } else {
>          inode = calloc(1, sizeof(struct lo_inode));
> @@ -917,6 +934,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>           */
>          g_atomic_int_set(&inode->refcount, 2);
>  
> +        g_atomic_int_set(&inode->opencount, 0);
>          inode->nlookup = 1;
>          inode->fd = newfd;
>          inode->key.ino = e->attr.st_ino;
> @@ -1295,6 +1313,10 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
>      res = unlinkat(lo_fd(req, parent), name, 0);
>  
>      fuse_reply_err(req, res == -1 ? errno : 0);
> +    if (!g_atomic_int_get(&inode->opencount)) {
> +        close(inode->fd);
> +        inode->fd = -1;
> +    }

Can this be racy w.r.t. lo_lookup()? IOW, say dir1/file.txt is being
unlinked and we have closed inode->fd. If, before we can execute
unref_inode_lolocked(), a parallel lookup of dir2/file-link.txt
happens and gets the lo->mutex lock first, it can still find this
inode and bump up its reference count. That means lo_unlink() will
not free the inode (but will have closed inode->fd), and we will then use
an inode with a closed O_PATH fd, which leads to other failures later.

Thanks
Vivek

>      unref_inode_lolocked(lo, inode, 1);
>      lo_inode_put(lo, &inode);
>  }
> @@ -1904,25 +1926,30 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>      ssize_t fh;
>      char buf[64];
>      struct lo_data *lo = lo_data(req);
> +    struct lo_inode *inode = lo_inode(req, ino);
> +    int err = EBADF;
> +
> +    if (!inode)
> +        goto out_err;
>  
>      fuse_log(FUSE_LOG_DEBUG, "lo_open(ino=%" PRIu64 ", flags=%d)\n", ino,
>               fi->flags);
>  
>      update_open_flags(lo->writeback, lo->allow_direct_io, fi);
>  
> -    sprintf(buf, "%i", lo_fd(req, ino));
> +    sprintf(buf, "%i", inode->fd);
>      fd = openat(lo->proc_self_fd, buf, fi->flags & ~O_NOFOLLOW);
> -    if (fd == -1) {
> -        return (void)fuse_reply_err(req, errno);
> -    }
> +    err = errno;
> +    if (fd == -1)
> +        goto out_err;
>  
>      pthread_mutex_lock(&lo->mutex);
>      fh = lo_add_fd_mapping(req, fd);
>      pthread_mutex_unlock(&lo->mutex);
>      if (fh == -1) {
>          close(fd);
> -        fuse_reply_err(req, ENOMEM);
> -        return;
> +        err = ENOMEM;
> +        goto out_err;
>      }
>  
>      fi->fh = fh;
> @@ -1931,18 +1958,26 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>      } else if (lo->cache == CACHE_ALWAYS) {
>          fi->keep_cache = 1;
>      }
> +    g_atomic_int_inc(&inode->opencount);
> +    lo_inode_put(lo, &inode);
>      fuse_reply_open(req, fi);
> +    return;
> +
> +out_err:
> +    lo_inode_put(lo, &inode);
> +    fuse_reply_err(req, err);
>  }
>  
>  static void lo_release(fuse_req_t req, fuse_ino_t ino,
>                         struct fuse_file_info *fi)
>  {
>      struct lo_data *lo = lo_data(req);
> +    struct lo_inode *inode = lo_inode(req, ino);
>      struct lo_map_elem *elem;
>      int fd = -1;
>  
> -    (void)ino;
> -
> +    if (inode)
> +        g_atomic_int_dec_and_test(&inode->opencount);
>      pthread_mutex_lock(&lo->mutex);
>      elem = lo_map_get(&lo->fd_map, fi->fh);
>      if (elem) {
> @@ -1951,6 +1986,7 @@ static void lo_release(fuse_req_t req, fuse_ino_t ino,
>          lo_map_remove(&lo->fd_map, fi->fh);
>      }
>      pthread_mutex_unlock(&lo->mutex);
> +    lo_inode_put(lo, &inode);
>  
>      close(fd);
>      fuse_reply_err(req, 0);


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-06 16:57             ` Vivek Goyal
@ 2021-01-07  8:44               ` Miklos Szeredi
  2021-01-07 10:42                 ` Amir Goldstein
  0 siblings, 1 reply; 26+ messages in thread
From: Miklos Szeredi @ 2021-01-07  8:44 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: fuse-devel, Amir Goldstein, Max Reitz, virtio-fs-list

On Wed, Jan 6, 2021 at 5:58 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Jan 06, 2021 at 02:40:45PM +0100, Miklos Szeredi wrote:

> > @@ -901,6 +906,18 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
> >
> >      inode = lo_find(lo, &e->attr, mnt_id);
> >      if (inode) {
> > +        char buf1[PATH_MAX + 1], buf2[PATH_MAX + 1], procname[64];
> > +        ssize_t siz1, siz2;
> > +
> > +        sprintf(procname, "%i", inode->fd);
> > +        siz1 = readlinkat(lo->proc_self_fd, procname, buf1, sizeof(buf1));
> > +        sprintf(procname, "%i", newfd);
> > +        siz2 = readlinkat(lo->proc_self_fd, procname, buf2, sizeof(buf2));
> > +
> > +        /* disable close on unlink if alias is detected */
> > +        if (siz1 != siz2 || memcmp(buf1, buf2, siz1))
> > +            g_atomic_int_inc(&inode->opencount);
> > +
>
> Hi Miklos,
>
> So if I have hard links to a file in separate directories, then this
> path can be hit (say dir1/file.txt and dir2/file-link.txt). IIUC, we will
> disable automatic inode->fd closing on these hardlinked files. That
> means this solution will not solve the problem at hand for hard-linked files.
> Am I missing something?

You are correct, this is a limited solution.  The one above is a
corner case, and probably not worth worrying about too much.  However
it  can be fixed by having separate O_PATH fd's open for each alias
(or at least for each alias that resides in a distinct directory).
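
For illustration, one hypothetical shape for such per-alias tracking
(none of this exists in the posted patch):

/* Hypothetical per-alias state: one O_PATH fd per directory the inode
 * has been looked up under; unlink would then close only the matching
 * alias fd instead of the single shared inode->fd. */
#include <stdint.h>

struct lo_alias {
    uint64_t parent_ino;       /* directory this alias was seen in */
    int o_path_fd;             /* O_PATH fd opened via that directory */
    struct lo_alias *next;     /* the inode keeps a list of these */
};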

NB: doing a synchronous FORGET on unlink does not solve this case
either, since the inode will not receive a FORGET until the last alias
is dropped.

> >          close(newfd);
> >      } else {
> >          inode = calloc(1, sizeof(struct lo_inode));
> > @@ -917,6 +934,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
> >           */
> >          g_atomic_int_set(&inode->refcount, 2);
> >
> > +        g_atomic_int_set(&inode->opencount, 0);
> >          inode->nlookup = 1;
> >          inode->fd = newfd;
> >          inode->key.ino = e->attr.st_ino;
> > @@ -1295,6 +1313,10 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
> >      res = unlinkat(lo_fd(req, parent), name, 0);
> >
> >      fuse_reply_err(req, res == -1 ? errno : 0);
> > +    if (!g_atomic_int_get(&inode->opencount)) {
> > +        close(inode->fd);
> > +        inode->fd = -1;
> > +    }
>
> Can this be racy w.r.t. lo_lookup()? IOW, say dir1/file.txt is being
> unlinked and we have closed inode->fd. If, before we can execute
> unref_inode_lolocked(), a parallel lookup of dir2/file-link.txt
> happens and gets the lo->mutex lock first, it can still find this
> inode and bump up its reference count. That means lo_unlink() will
> not free the inode (but will have closed inode->fd), and we will then use
> an inode with a closed O_PATH fd, which leads to other failures later.

Yes, that's a bug.   Also needs serialization against all access to inode->fd.
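
Something along these lines, perhaps (an untested sketch of the kind of
serialization meant; the lock field is invented and not in the posted
patch):

#include <pthread.h>
#include <unistd.h>
#include <glib.h>    /* g_atomic_int_get(), as used in passthrough_ll.c */

/* All access to inode->fd funnels through a per-inode lock, so a
 * concurrent lookup can never observe and reuse an fd that unlink is
 * in the middle of closing. */
struct inode_fd {
    pthread_mutex_t lock;      /* invented for this sketch */
    gint opencount;
    int fd;                    /* O_PATH fd, -1 once closed */
};

static void inode_fd_close_on_unlink(struct inode_fd *i)
{
    pthread_mutex_lock(&i->lock);
    if (!g_atomic_int_get(&i->opencount) && i->fd != -1) {
        close(i->fd);
        i->fd = -1;
    }
    pthread_mutex_unlock(&i->lock);
}

/* Every reader (lo_do_lookup(), lo_open(), ...) would likewise take
 * i->lock, check i->fd != -1 before use, and fall back to a fresh
 * lookup when it finds the fd already gone. */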

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-07  8:44               ` Miklos Szeredi
@ 2021-01-07 10:42                 ` Amir Goldstein
  2021-01-07 20:10                   ` Dr. David Alan Gilbert
  2021-01-08  4:12                   ` Eryu Guan
  0 siblings, 2 replies; 26+ messages in thread
From: Amir Goldstein @ 2021-01-07 10:42 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: fuse-devel, Max Reitz, virtio-fs-list, Vivek Goyal

On Thu, Jan 7, 2021 at 10:44 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, Jan 6, 2021 at 5:58 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Wed, Jan 06, 2021 at 02:40:45PM +0100, Miklos Szeredi wrote:
>
> > > @@ -901,6 +906,18 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
> > >
> > >      inode = lo_find(lo, &e->attr, mnt_id);
> > >      if (inode) {
> > > +        char buf1[PATH_MAX + 1], buf2[PATH_MAX + 1], procname[64];
> > > +        ssize_t siz1, siz2;
> > > +
> > > +        sprintf(procname, "%i", inode->fd);
> > > +        siz1 = readlinkat(lo->proc_self_fd, procname, buf1, sizeof(buf1));
> > > +        sprintf(procname, "%i", newfd);
> > > +        siz2 = readlinkat(lo->proc_self_fd, procname, buf2, sizeof(buf2));
> > > +
> > > +        /* disable close on unlink if alias is detected */
> > > +        if (siz1 != siz2 || memcmp(buf1, buf2, siz1))
> > > +            g_atomic_int_inc(&inode->opencount);
> > > +
> >
> > Hi Miklos,
> >
> > So if I have hard links to a file in separate directories, then this
> > path can be hit (say dir1/file.txt and dir2/file-link.txt). IIUC, we will
> > disable automatic inode->fd closing on these hardlinked files. That
> > means this solution will not solve the problem at hand for hard-linked files.
> > Am I missing something?
>
> You are correct, this is a limited solution.  The one above is a
> corner case, and probably not worth worrying about too much.  However
> it  can be fixed by having separate O_PATH fd's open for each alias
> (or at least for each alias that resides in a distinct directory).
>
> NB: doing a synchronous FORGET on unlink does not solve this case
> either, since the inode will not receive a FORGET until the last alias
> is dropped.
>
> > >          close(newfd);
> > >      } else {
> > >          inode = calloc(1, sizeof(struct lo_inode));
> > > @@ -917,6 +934,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
> > >           */
> > >          g_atomic_int_set(&inode->refcount, 2);
> > >
> > > +        g_atomic_int_set(&inode->opencount, 0);
> > >          inode->nlookup = 1;
> > >          inode->fd = newfd;
> > >          inode->key.ino = e->attr.st_ino;
> > > @@ -1295,6 +1313,10 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
> > >      res = unlinkat(lo_fd(req, parent), name, 0);
> > >
> > >      fuse_reply_err(req, res == -1 ? errno : 0);
> > > +    if (!g_atomic_int_get(&inode->opencount)) {
> > > +        close(inode->fd);
> > > +        inode->fd = -1;
> > > +    }
> >
> > Can this be racy w.r.t. lo_lookup()? IOW, say dir1/file.txt is being
> > unlinked and we have closed inode->fd. If, before we can execute
> > unref_inode_lolocked(), a parallel lookup of dir2/file-link.txt
> > happens and gets the lo->mutex lock first, it can still find this
> > inode and bump up its reference count. That means lo_unlink() will
> > not free the inode (but will have closed inode->fd), and we will then use
> > an inode with a closed O_PATH fd, which leads to other failures later.
>
> Yes, that's a bug.   Also needs serialization against all access to inode->fd.
>

Miklos,

I would like to point out that this discussion is relevant to any low level
fuse filesystem, but especially those that are proxying a real filesystem.

I have raised this question before in the FUSE_PASSTHROUGH threads.
There is a lot of code duplication among various low-level fuse projects and
as we get to NFS export support and complex issues like this one, it is
getting unlikely that all projects will handle this correctly.

Do you think there is room for some more generic code in libfuse and if
so how? Following an example implementation (assuming it gets fixed)
and hand picking fixes to projects cannot scale for long.

The challenge is that most of the generic code would be in lookup and
maintaining the internal inode cache, but each filesystem may need
different representations of the Inode object.

I was thinking of something along the lines of an OO library that
implements the generic lookup/inode cache for a base FuseInode class
that implementers can inherit from.
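
In C terms the shape could be something like this (purely hypothetical
names; a sketch of the idea, not a proposed API):

#include <stdint.h>

/* Library-owned base "class": the generic nodeid/nlookup bookkeeping
 * and the inode cache live here, shared by all implementations. */
struct fuse_inode {
    uint64_t nodeid;
    uint64_t nlookup;                    /* FORGET accounting */
    const struct fuse_inode_ops *ops;
};

struct fuse_inode_ops {
    void (*drop)(struct fuse_inode *);   /* final FORGET callback */
};

/* Implementer's "subclass": embeds the base as its first member and
 * adds whatever representation it needs (O_PATH fd, file handle, ...). */
struct passthrough_inode {
    struct fuse_inode base;
    int o_path_fd;
};

static void passthrough_drop(struct fuse_inode *bi)
{
    struct passthrough_inode *pi = (struct passthrough_inode *)bi;
    /* close(pi->o_path_fd); free(pi); */
    (void)pi;
}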

This doesn't need to be in the libfuse project at all.
Seeing that virtiofsd already has a Rust implementation that is BSD
licensed, maybe that can be used as a starting point?

David,

How hard do you think it would be to re-factor virtiofsd-rs to
an implementation that inherits from a base passthroughfsd-rs?

BTW, is virtiofsd-rs the official virtiofsd or an experimental one?
Which tree does Miklos' patch apply to?

Does anyone have other thoughts about how to reduce fragmentation in
implementations?

Thanks,
Amir.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-07 10:42                 ` Amir Goldstein
@ 2021-01-07 20:10                   ` Dr. David Alan Gilbert
  2021-01-08  4:12                   ` Eryu Guan
  1 sibling, 0 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-07 20:10 UTC (permalink / raw)
  To: Amir Goldstein, slp
  Cc: Miklos Szeredi, fuse-devel, Max Reitz, virtio-fs-list, Vivek Goyal

* Amir Goldstein (amir73il@gmail.com) wrote:
> On Thu, Jan 7, 2021 at 10:44 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, Jan 6, 2021 at 5:58 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Wed, Jan 06, 2021 at 02:40:45PM +0100, Miklos Szeredi wrote:
> >
> > > > @@ -901,6 +906,18 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
> > > >
> > > >      inode = lo_find(lo, &e->attr, mnt_id);
> > > >      if (inode) {
> > > > +        char buf1[PATH_MAX + 1], buf2[PATH_MAX + 1], procname[64];
> > > > +        ssize_t siz1, siz2;
> > > > +
> > > > +        sprintf(procname, "%i", inode->fd);
> > > > +        siz1 = readlinkat(lo->proc_self_fd, procname, buf1, sizeof(buf1));
> > > > +        sprintf(procname, "%i", newfd);
> > > > +        siz2 = readlinkat(lo->proc_self_fd, procname, buf2, sizeof(buf2));
> > > > +
> > > > +        /* disable close on unlink if alias is detected */
> > > > +        if (siz1 != siz2 || memcmp(buf1, buf2, siz1))
> > > > +            g_atomic_int_inc(&inode->opencount);
> > > > +
> > >
> > > Hi Miklos,
> > >
> > > So if I have hard links to a file in separate directories, then this
> > > path can be hit (say dir1/file.txt and dir2/file-link.txt). IIUC, we will
> > > disable automatic inode->fd closing on these hardlinked files. That
> > > means this solution will not solve the problem at hand for hard-linked files.
> > > Am I missing something?
> >
> > You are correct, this is a limited solution.  The one above is a
> > corner case, and probably not worth worrying about too much.  However
> > it  can be fixed by having separate O_PATH fd's open for each alias
> > (or at least for each alias that resides in a distinct directory).
> >
> > NB: doing a synchronous FORGET on unlink does not solve this case
> > either, since the inode will not receive a FORGET until the last alias
> > is dropped.
> >
> > > >          close(newfd);
> > > >      } else {
> > > >          inode = calloc(1, sizeof(struct lo_inode));
> > > > @@ -917,6 +934,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
> > > >           */
> > > >          g_atomic_int_set(&inode->refcount, 2);
> > > >
> > > > +        g_atomic_int_set(&inode->opencount, 0);
> > > >          inode->nlookup = 1;
> > > >          inode->fd = newfd;
> > > >          inode->key.ino = e->attr.st_ino;
> > > > @@ -1295,6 +1313,10 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
> > > >      res = unlinkat(lo_fd(req, parent), name, 0);
> > > >
> > > >      fuse_reply_err(req, res == -1 ? errno : 0);
> > > > +    if (!g_atomic_int_get(&inode->opencount)) {
> > > > +        close(inode->fd);
> > > > +        inode->fd = -1;
> > > > +    }
> > >
> > > Can this be racy w.r.t. lo_lookup()? IOW, say dir1/file.txt is being
> > > unlinked and we have closed inode->fd. If, before we can execute
> > > unref_inode_lolocked(), a parallel lookup of dir2/file-link.txt
> > > happens and gets the lo->mutex lock first, it can still find this
> > > inode and bump up its reference count. That means lo_unlink() will
> > > not free the inode (but will have closed inode->fd), and we will then use
> > > an inode with a closed O_PATH fd, which leads to other failures later.
> >
> > Yes, that's a bug.   Also needs serialization against all access to inode->fd.
> >
> 
> Miklos,
> 
> I would like to point out that this discussion is relevant to any low level
> fuse filesystem, but especially those that are proxying a real filesystem.
> 
> I have raised this question before in the FUSE_PASSTHROUGH threads.
> There is a lot of code duplication among various low-level fuse projects and
> as we get to NFS export support and complex issues like this one, it is
> getting unlikely that all projects will handle this correctly.
> 
> Do you think there is room for some more generic code in libfuse and if
> so how? Following an example implementation (assuming it gets fixed)
> and hand picking fixes to projects cannot scale for long.
> 
> The challenge is that most of the generic code would be in lookup and
> maintaining the internal inode cache, but each filesystem may need
> different representations of the Inode object.
> 
> I was thinking of something along the lines of an OO library that
> implements the generic lookup/inode cache for a base FuseInode class
> that implementers can inherit from.
> 
> This doesn't need to be in the libfuse project at all.
> Seeing that virtiofsd already has a Rust implementation that is BSD
> licensed, maybe that can be used as a starting point?
> 
> David,
> 
> How hard do you think it would be to re-factor virtiofsd-rs to
> an implementation that inherits from a base passthroughfsd-rs?
> 
> BTW, is virtiofsd-rs the official virtiofsd or an experimental one?
> Which tree does Miklos' patch apply to?

The C version is the current main one, but the Rust version is catching
up fast and I hope it will displace the C one sometime this year.
Sergio owns the Rust version, cc'd (slp@).

> Does anyone have other thoughts about how to reduce fragmentation in
> implementations?
> 
> Thanks,
> Amir.

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-07 10:42                 ` Amir Goldstein
  2021-01-07 20:10                   ` Dr. David Alan Gilbert
@ 2021-01-08  4:12                   ` Eryu Guan
  2021-01-08  9:08                     ` Amir Goldstein
  2021-01-08 15:26                     ` Vivek Goyal
  1 sibling, 2 replies; 26+ messages in thread
From: Eryu Guan @ 2021-01-08  4:12 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, fuse-devel, Max Reitz, virtio-fs-list, bergwolf,
	gerry, Vivek Goyal

On Thu, Jan 07, 2021 at 12:42:00PM +0200, Amir Goldstein wrote:
> On Thu, Jan 7, 2021 at 10:44 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, Jan 6, 2021 at 5:58 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Wed, Jan 06, 2021 at 02:40:45PM +0100, Miklos Szeredi wrote:
> >
> > > > @@ -901,6 +906,18 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
> > > >
> > > >      inode = lo_find(lo, &e->attr, mnt_id);
> > > >      if (inode) {
> > > > +        char buf1[PATH_MAX + 1], buf2[PATH_MAX + 1], procname[64];
> > > > +        ssize_t siz1, siz2;
> > > > +
> > > > +        sprintf(procname, "%i", inode->fd);
> > > > +        siz1 = readlinkat(lo->proc_self_fd, procname, buf1, sizeof(buf1));
> > > > +        sprintf(procname, "%i", newfd);
> > > > +        siz2 = readlinkat(lo->proc_self_fd, procname, buf2, sizeof(buf2));
> > > > +
> > > > +        /* disable close on unlink if alias is detected */
> > > > +        if (siz1 != siz2 || memcmp(buf1, buf2, siz1))
> > > > +            g_atomic_int_inc(&inode->opencount);
> > > > +
> > >
> > > Hi Miklos,
> > >
> > > So if I have hard links to a file in separate directories, then this
> > > path can be hit (say dir1/file.txt and dir2/file-link.txt). IIUC, we will
> > > disable automatic inode->fd closing on these hardlinked files. That
> > > means this solution will not solve the problem at hand for hard-linked files.
> > > Am I missing something?
> >
> > You are correct, this is a limited solution.  The one above is a
> > corner case, and probably not worth worrying about too much.  However
> > it  can be fixed by having separate O_PATH fd's open for each alias
> > (or at least for each alias that resides in a distinct directory).
> >
> > NB: doing a synchronous FORGET on unlink does not solve this case
> > either, since the inode will not receive a FORGET until the last alias
> > is dropped.
> >
> > > >          close(newfd);
> > > >      } else {
> > > >          inode = calloc(1, sizeof(struct lo_inode));
> > > > @@ -917,6 +934,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
> > > >           */
> > > >          g_atomic_int_set(&inode->refcount, 2);
> > > >
> > > > +        g_atomic_int_set(&inode->opencount, 0);
> > > >          inode->nlookup = 1;
> > > >          inode->fd = newfd;
> > > >          inode->key.ino = e->attr.st_ino;
> > > > @@ -1295,6 +1313,10 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
> > > >      res = unlinkat(lo_fd(req, parent), name, 0);
> > > >
> > > >      fuse_reply_err(req, res == -1 ? errno : 0);
> > > > +    if (!g_atomic_int_get(&inode->opencount)) {
> > > > +        close(inode->fd);
> > > > +        inode->fd = -1;
> > > > +    }
> > >
> > > Can this be racy w.r.t. lo_lookup()? IOW, say dir1/file.txt is being
> > > unlinked and we have closed inode->fd. If, before we can execute
> > > unref_inode_lolocked(), a parallel lookup of dir2/file-link.txt
> > > happens and gets the lo->mutex lock first, it can still find this
> > > inode and bump up its reference count. That means lo_unlink() will
> > > not free the inode (but will have closed inode->fd), and we will then use
> > > an inode with a closed O_PATH fd, which leads to other failures later.
> >
> > Yes, that's a bug.   Also needs serialization against all access to inode->fd.
> >
> 
> Miklos,
> 
> I would like to point out that this discussion is relevant to any low level
> fuse filesystem, but especially those that are proxying a real filesystem.
> 
> I have raised this question before in the FUSE_PASSTHROUGH threads.
> There is a lot of code duplication among various low-level fuse projects and
> > as we get to NFS export support and complex issues like this one, it is
> getting unlikely that all projects will handle this correctly.
> 
> Do you think there is room for some more generic code in libfuse and if
> so how? Following an example implementation (assuming it gets fixed)
> and hand picking fixes to projects cannot scale for long.
> 
> The challenge is that most of the generic code would be in lookup and
> maintaining the internal inode cache, but each filesystem may need
> different representations of the Inode object.
> 
> I was thinking of something along the lines of an OO library that
> implements the generic lookup/inode cache for a base FuseInode class
> that implementers can inherit from.
> 
> This doesn't need to be in the libfuse project at all.
> > Seeing that virtiofsd already has a Rust implementation that is BSD
> licensed, maybe that can be used as a starting point?
> 
> David,
> 
> How hard do you think it would be to re-factor virtiofsd-rs to
> an implementation that inherits from a base passthroughfsd-rs?
> 
> > BTW, is virtiofsd-rs the official virtiofsd or an experimental one?
> Which tree does Miklos' patch apply to?
> 
> > Does anyone have other thoughts about how to reduce fragmentation in
> implementations?

There's a fuse-backend-rs[1] project hosted on cloud-hypervisor; it is
a library to communicate with the Linux FUSE clients, which includes:

- ABI layer, which defines all data structures shared between linux Fuse
  framework and Fuse daemons.
- API layer, defines the interfaces for Fuse daemons to implement a
  userspace file system.
- Transport layer, which supports both the Linux Fuse device and
  virtio-fs protocol.
- VFS/pseudo_fs, an abstraction layer to support multiple file systems
  by a single virtio-fs device.
- A sample passthrough file system implementation, which passes through
  files from daemons to clients.

I'm wondering if fuse-backend-rs is a proper project to work on, and
maybe virtiofsd-rs could be switched to use it as well in the future.

Thanks,
Eryu

[1] https://github.com/cloud-hypervisor/fuse-backend-rs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-08  4:12                   ` Eryu Guan
@ 2021-01-08  9:08                     ` Amir Goldstein
  2021-01-08  9:25                       ` Liu, Jiang
  2021-01-08 10:18                       ` Eryu Guan
  2021-01-08 15:26                     ` Vivek Goyal
  1 sibling, 2 replies; 26+ messages in thread
From: Amir Goldstein @ 2021-01-08  9:08 UTC (permalink / raw)
  To: Eryu Guan
  Cc: Miklos Szeredi, fuse-devel, Max Reitz, virtio-fs-list, bergwolf,
	gerry, Vivek Goyal

> > Miklos,
> >
> > I would like to point out that this discussion is relevant to any low level
> > fuse filesystem, but especially those that are proxying a real filesystem.
> >
> > I have raised this question before in the FUSE_PASSTHROUGH threads.
> > There is a lot of code duplication among various low-level fuse projects and
> > as we get to NFS export support and complex issues like this one, it is
> > getting unlikely that all projects will handle this correctly.
> >
> > Do you think there is room for some more generic code in libfuse and if
> > so how? Following an example implementation (assuming it gets fixed)
> > and hand picking fixes to projects cannot scale for long.
> >
> > The challenge is that most of the generic code would be in lookup and
> > maintaining the internal inode cache, but each filesystem may need
> > different representations of the Inode object.
> >
> > I was thinking of something along the lines of an OO library that
> > implements the generic lookup/inode cache for a base FuseInode class
> > that implementers can inherit from.
> >
> > This doesn't need to be in the libfuse project at all.
> > Seeing that virtiofsd already has a Rust implementation that is BSD
> > licensed, maybe that can be used as a starting point?
> >
> > David,
> >
> > How hard do you think it would be to re-factor virtiofsd-rs to
> > an implementation that inherits from a base passthroughfsd-rs?
> >
> > BTW, is virtiofsd-rs the official virtiofsd or an experimental one?
> > Which tree does Miklos' patch apply to?
> >
> > Does anyone have other thoughts about how to reduce fragmentation in
> > implementations?
>
> There's a fuse-backend-rs[1] project hosted on cloud-hypervisor; it is
> a library to communicate with the Linux FUSE clients, which includes:
>
> - ABI layer, which defines all data structures shared between linux Fuse
>   framework and Fuse daemons.
> - API layer, defines the interfaces for Fuse daemons to implement a
>   userspace file system.
> - Transport layer, which supports both the Linux Fuse device and
>   virtio-fs protocol.
> - VFS/pseudo_fs, an abstraction layer to support multiple file systems
>   by a single virtio-fs device.
> - A sample passthrough file system implementation, which passes through
>   files from daemons to clients.
>
> I'm wondering if fuse-backend-rs is a proper project to work on, and
> maybe virtiofsd-rs could be switched to use it as well in the future.
>
> Thanks,
> Eryu
>
> [1] https://github.com/cloud-hypervisor/fuse-backend-rs


Hi Eryu!

This looks very interesting.
Can you guys say a few words about the maturity of this project?
Does it have any CI? Any beta/production workloads that use it?
I would be happy to contribute the open_by_handle_at() changes
if I know they will get properly tested.

As demonstrated in this demo fs [1], with xfs/ext4 as the underlying filesystem,
full NFS support can be implemented by implementing lookup-by-inode in the
filesystem, before fuse adds support for LOOKUP_HANDLE in the protocol.
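
The core of the handle-based approach is just a name_to_handle_at()/
open_by_handle_at() round trip, roughly like the sketch below (invented
helper names, error handling pared down; not the actual cachegwfs code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>

/* Store a file handle instead of an O_PATH fd: it pins no inode
 * reference, so unlink/rmdir underneath us behave normally. */
static struct file_handle *handle_remember(int dirfd, const char *name)
{
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    int mount_id;

    if (!fh)
        return NULL;
    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(dirfd, name, fh, &mount_id, 0) == -1) {
        free(fh);
        return NULL;
    }
    return fh;
}

/* Re-open on demand; needs CAP_DAC_READ_SEARCH in the initial userns,
 * which is exactly the containerization problem discussed in this thread. */
static int handle_reopen(int mount_fd, struct file_handle *fh, int flags)
{
    return open_by_handle_at(mount_fd, fh, flags);
}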

Thanks,
Amir.

[1] https://github.com/amir73il/libfuse/commits/cachegwfs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-08  9:08                     ` Amir Goldstein
@ 2021-01-08  9:25                       ` Liu, Jiang
  2021-01-08 10:18                       ` Eryu Guan
  1 sibling, 0 replies; 26+ messages in thread
From: Liu, Jiang @ 2021-01-08  9:25 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, fuse-devel, Max Reitz, virtio-fs-list, bergwolf,
	Vivek Goyal



> On Jan 8, 2021, at 5:08 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> 
>>> Miklos,
>>> 
>>> I would like to point out that this discussion is relevant to any low level
>>> fuse filesystem, but especially those that are proxying a real filesystem.
>>> 
>>> I have raised this question before in the FUSE_PASSTHROUGH threads.
>>> There is a lot of code duplication among various low-level fuse projects and
>>> as we get to NFS export support and complex issues like this one, it is
>>> getting unlikely that all projects will handle this correctly.
>>> 
>>> Do you think there is room for some more generic code in libfuse and if
>>> so how? Following an example implementation (assuming it gets fixed)
>>> and hand picking fixes to projects cannot scale for long.
>>> 
>>> The challenge is that most of the generic code would be in lookup and
>>> maintaining the internal inode cache, but each filesystem may need
>>> different representations of the Inode object.
>>> 
>>> I was thinking of something along the lines of an OO library that
>>> implements the generic lookup/inode cache for a base FuseInode class
>>> that implementers can inherit from.
>>> 
>>> This doesn't need to be in the libfuse project at all.
>>> Seeing that virtiofsd already has a Rust implementation that is BSD
>>> licensed, maybe that can be used as a starting point?
>>> 
>>> David,
>>> 
>>> How hard do you think it would be to re-factor virtiofsd-rs to
>>> an implementation that inherits from a base passthroughfsd-rs?
>>> 
>>> BTW, is virtiofsd-rs the official virtiofsd or an experimental one?
>>> Which tree does Miklos' patch apply to?
>>> 
>>> Does anyone have other thoughts about how to reduce fragmentation in
>>> implementations?
>> 
>> There's a fuse-backend-rs[1] project hosted on cloud-hypervisor; it is
>> a library to communicate with the Linux FUSE clients, which includes:
>> 
>> - ABI layer, which defines all data structures shared between linux Fuse
>>  framework and Fuse daemons.
>> - API layer, defines the interfaces for Fuse daemons to implement a
>>  userspace file system.
>> - Transport layer, which supports both the Linux Fuse device and
>>  virtio-fs protocol.
>> - VFS/pseudo_fs, an abstraction layer to support multiple file systems
>>  by a single virtio-fs device.
>> - A sample passthrough file system implementation, which passes through
>>  files from daemons to clients.
>> 
>> I'm wondering if fuse-backend-rs is a proper project to work on, and
>> maybe virtiofsd-rs could be switched to use it as well in the future.
>> 
>> Thanks,
>> Eryu
>> 
>> [1] https://github.com/cloud-hypervisor/fuse-backend-rs
> 
> 
> Hi Eryu!
> 
> This looks very interesting.
> Can you guys say a few words about the maturity of this project?
> Does it have any CI? Any beta/production workloads that use it?
> I would be happy to contribute the open_by_handle_at() changes
> if I know they will get properly tested.
> 
> As demonstrated in this demo fs [1], with xfs/ext4 as the underlying filesystem,
> full NFS support can be implemented by implementing lookup-by-inode in the
> filesystem, before fuse adds support for LOOKUP_HANDLE in the protocol.
> 
> Thanks,
> Amir.
Hi Amir,
	This project is derived from crosvm and Cloud Hypervisor, and we have
extended it to support some container-centric usage modes. About quality, I
think it may be considered some sort of “product”, but the project itself still
lacks a robust CI system.
	The proposal you mentioned about open_by_handle_at() is very attractive :)
We have encountered some cases of running out of file descriptors, and we have
also encountered issues enabling virtio-fs over NFS. The open_by_handle_at()
proposal absolutely helps here :)
	And the next big step for the project is to enable io_uring and Rust async I/O :)
Thanks!

> 
> [1] https://github.com/amir73il/libfuse/commits/cachegwfs



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-08  9:08                     ` Amir Goldstein
  2021-01-08  9:25                       ` Liu, Jiang
@ 2021-01-08 10:18                       ` Eryu Guan
  1 sibling, 0 replies; 26+ messages in thread
From: Eryu Guan @ 2021-01-08 10:18 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, fuse-devel, Max Reitz, virtio-fs-list, bergwolf,
	gerry, Vivek Goyal

On Fri, Jan 08, 2021 at 11:08:41AM +0200, Amir Goldstein wrote:
> > > Miklos,
> > >
> > > I would like to point out that this discussion is relevant to any low level
> > > fuse filesystem, but especially those that are proxying a real filesystem.
> > >
> > > I have raised this question before in the FUSE_PASSTHROUGH threads.
> > > There is a lot of code duplication among various low-level fuse projects and
> > > as we get to NFS export support and complex issues like this one, it is
> > > getting unlikely that all projects will handle this correctly.
> > >
> > > Do you think there is room for some more generic code in libfuse and if
> > > so how? Following an example implementation (assuming it gets fixed)
> > > and hand picking fixes to projects cannot scale for long.
> > >
> > > The challenge is that most of the generic code would be in lookup and
> > > maintaining the internal inode cache, but each filesystem may need
> > > different representations of the Inode object.
> > >
> > > I was thinking of something along the lines of an OO library that
> > > implements the generic lookup/inode cache for a base FuseInode class
> > > that implementers can inherit from.
> > >
> > > This doesn't need to be in the libfuse project at all.
> > > Seeing that virtiofsd already has a Rust implementation that is BSD
> > > licensed, maybe that can be used as a starting point?
> > >
> > > David,
> > >
> > > How hard do you think it would be to re-factor virtiofsd-rs to
> > > an implementation that inherits from a base passthroughfsd-rs?
> > >
> > > BTW, is virtiofsd-rs the official virtiofsd or an experimental one?
> > > Which tree does Miklos' patch apply to?
> > >
> > > Does anyone have other thoughts about how to reduce fragmentation in
> > > implementations?
> >
> > There's a fuse-backend-rs[1] project hosted on cloud-hypervisor; it is
> > a library to communicate with the Linux FUSE clients, which includes:
> >
> > - ABI layer, which defines all data structures shared between linux Fuse
> >   framework and Fuse daemons.
> > - API layer, defines the interfaces for Fuse daemons to implement a
> >   userspace file system.
> > - Transport layer, which supports both the Linux Fuse device and
> >   virtio-fs protocol.
> > - VFS/pseudo_fs, an abstraction layer to support multiple file systems
> >   by a single virtio-fs device.
> > - A sample passthrough file system implementation, which passes through
> >   files from daemons to clients.
> >
> > I'm wondering if fuse-backend-rs is a proper project to work on, and
> > maybe virtiofsd-rs could be switched to use it as well in the future.
> >
> > Thanks,
> > Eryu
> >
> > [1] https://github.com/cloud-hypervisor/fuse-backend-rs
> 
> 
> Hi Eryu!
> 
> This looks very interesting.
> Can you guys say a few words about the maturity of this project?
> Does it have any CI? Any beta/production workloads that use it?

We do have secure container workloads running in a production environment
(at limited scale for testing, though) that use fuse-backend-rs to pass
through fs operations.

> I would be happy to contribute the open_by_handle_at() changes
> if I know they will get properly tested.

Yes, as Liu Jiang said, we have similar issues when using virtiofs, and
are happy to test the open_by_handle code!

Thanks,
Eryu

> 
> As demonstrated in this demo fs [1], with xfs/ext4 as the underlying filesystem,
> full NFS support can be implemented by implementing lookup-by-inode in the
> filesystem, before fuse adds support for LOOKUP_HANDLE in the protocol.
> 
> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/libfuse/commits/cachegwfs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-08  4:12                   ` Eryu Guan
  2021-01-08  9:08                     ` Amir Goldstein
@ 2021-01-08 15:26                     ` Vivek Goyal
  2021-01-15 10:20                       ` Peng Tao
  1 sibling, 1 reply; 26+ messages in thread
From: Vivek Goyal @ 2021-01-08 15:26 UTC (permalink / raw)
  To: Eryu Guan
  Cc: Miklos Szeredi, fuse-devel, Amir Goldstein, Max Reitz,
	virtio-fs-list, gerry, bergwolf

On Fri, Jan 08, 2021 at 12:12:52PM +0800, Eryu Guan wrote:

[..]
> > Does anyone have other thoughts about how to reduce fragmentation in
> > implementations?
> 
> There's a fuse-backend-rs[1] project hosted on cloud-hypervisor; it is
> a library to communicate with the Linux FUSE clients, which includes:
> 
> - ABI layer, which defines all data structures shared between linux Fuse
>   framework and Fuse daemons.
> - API layer, defines the interfaces for Fuse daemons to implement a
>   userspace file system.
> - Transport layer, which supports both the Linux Fuse device and
>   virtio-fs protocol.
> - VFS/pseudo_fs, an abstraction layer to support multiple file systems
>   by a single virtio-fs device.
> - A sample passthrough file system implementation, which passes through
>   files from daemons to clients.
> 

So at a high level, is this the equivalent of libfuse written in Rust?

Vivek

> I'm wondering if fuse-backend-rs is a proper project to work on, and
> maybe virtiofsd-rs could be switched to use it as well in the future.
> 
> Thanks,
> Eryu
> 
> [1] https://github.com/cloud-hypervisor/fuse-backend-rs
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-06  9:16         ` Amir Goldstein
  2021-01-06  9:27           ` Amir Goldstein
  2021-01-06 13:40           ` Miklos Szeredi
@ 2021-01-08 15:55           ` Vivek Goyal
  2021-01-11 15:48           ` Dr. David Alan Gilbert
  3 siblings, 0 replies; 26+ messages in thread
From: Vivek Goyal @ 2021-01-08 15:55 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, fuse-devel, Max Reitz, virtio-fs-list

On Wed, Jan 06, 2021 at 11:16:28AM +0200, Amir Goldstein wrote:
> On Wed, Jan 6, 2021 at 10:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, Jan 6, 2021 at 5:29 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Mon, Jan 4, 2021 at 8:57 PM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > >
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > Hi,
> > > > > >   On virtio-fs we're hitting a problem with NFS, where
> > > > > > unlinking a file in a directory and then rmdir'ing that
> > > > > > directory fails complaining about the directory not being empty.
> > > > > >
> > > > > > The problem here is that if a file has an open fd, NFS doesn't
> > > > > > actually delete the file on unlink, it just renames it to
> > > > > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > > > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > > > > >
> > > > > > Question:
> > > > > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > > > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > > > > before the rmdir is processed?
> > > > > >      (In virtiofs we've been processing requests, in parallel, and
> > > > > > have sent forgets down a separate queue to keep them out of the way).
> > > > > >
> > > > > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > > > > client to have finished it;  do we need a synchronous forget here?
> > > > >
> > > > > Even if we introduce a synchronous forget, will that really fix the
> > > > > issue. For example, this could also happen if file has been unlinked
> > > > > but it is still open and directory is being removed.
> > > > >
> > > > > fd = open(foo/bar.txt)
> > > > > unlink foo/bar.txt
> > > > > rmdir foo
> > > > > close(fd).
> > > > >
> > > > > In this case, final forget should go after fd has been closed. Its
> > > > > not a forget race.
> > > > >
> > > > > I wrote a test case for this and it works on regular file systems.
> > > > >
> > > > > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> > > > >
> > > > > I suspect it will fail on nfs because I am assuming that temporary
> > > > > file will be there till final close(fd) happens. If that's the
> > > > > case this is a NFS specific issue because its behavior is different
> > > > > from other file systems.
> > > >
> > > > That's true; but that's NFS just being NFS; in our case we're keeping
> > > > an fd open even though the guest has been smart enough not to; so we're
> > > > causing the NFS oddity when it wouldn't normally happen.
> > > >
> > >
> > > Are you sure that you really need this oddity?
> > >
> > > My sense from looking at virtiofsd is that the open O_PATH fd
> > > in InodeData for non-directories is overkill and even the need
> > > for an open fd for all directories is questionable.
> > >
> > > If you store a FileHandle (name_to_handle_at(2)) instead of an open fd
> > > for non-directories, you won't be keeping a reference on the underlying inode
> > > so no unlink issue.
> > >
> > > open_by_handle_at(2) is very cheap for non-directory when underlying inode
> > > is cached and as cheap as it can get even when inode is not in cache, so no
> > > performance penalty is expected.
> >
> > You are perfectly right that using file handles would solve a number
> > of issues, one being too many open file descriptors.
> >
> > The issue with open_by_handle_at(2) is that it needs
> > CAP_DAC_READ_SEARCH in the initial user namespace. That currently
> > makes it impossible to use in containers and such.
> 
> Is that a problem for virtiofsd? does it also run inside a container??

Yes, there have been cases where virtiofsd is running inside a container.
For that reason Stefan introduced patches so that the daemon does not set up
all the namespaces itself; they will be set up by the container manager.
And the container manager wants to give minimum privilege to the virtiofsd
container (the same capabilities as any other standard container) by
default.

Vivek

> 
> Please note that NFS doesn't do "silly rename" for directories,
> so mitigation is mostly needed for non-dir.
> 
> An alternative method, if the daemon is not capable, is to store the parent
> dirfd in addition to the filehandle and implement open_child_by_handle_at(int
> parent_fd, ...):
> - readdir(parent_fd)
> - search a match for d_ino
> - name_to_handle_at() and verify match to stored filehandle
> 
> This is essentially what open_by_handle_at(2) does under the covers
> with a "connectable" non-dir filehandle after having resolved the
> parent file handle part. And "connectable" file handles are used by nfsd
> to enforce "subtree_check" to make sure that file wasn't moved outside
> obtainable path after initial lookup.
> 
> >
> > Not sure if there has been proposals for making open_by_handle_at(2)
> > usable in user namespaces.
> 
> I don't remember seeing any.
> 
> > The difficulty is in verifying that an
> > open file would have been obtainable by path lookup by the given user.
> 
> I think it can be allowed for user with CAP_DAC_READ_SEARCH in
> userns for FS_USERNS_MOUNT mounted in that userns.
> 
> Thanks,
> Amir.
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-06  9:16         ` Amir Goldstein
                             ` (2 preceding siblings ...)
  2021-01-08 15:55           ` Vivek Goyal
@ 2021-01-11 15:48           ` Dr. David Alan Gilbert
  3 siblings, 0 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-11 15:48 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, fuse-devel, Max Reitz, virtio-fs-list, Vivek Goyal

* Amir Goldstein (amir73il@gmail.com) wrote:
> On Wed, Jan 6, 2021 at 10:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, Jan 6, 2021 at 5:29 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Mon, Jan 4, 2021 at 8:57 PM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > >
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Mon, Jan 04, 2021 at 04:00:13PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > Hi,
> > > > > >   On virtio-fs we're hitting a problem with NFS, where
> > > > > > unlinking a file in a directory and then rmdir'ing that
> > > > > > directory fails complaining about the directory not being empty.
> > > > > >
> > > > > > The problem here is that if a file has an open fd, NFS doesn't
> > > > > > actually delete the file on unlink, it just renames it to
> > > > > > a hidden file (e.g. .nfs*******).  That open file is there because
> > > > > > the 'FORGET' hasn't completed yet by the time the rmdir is issued.
> > > > > >
> > > > > > Question:
> > > > > >   a) In the FUSE protocol, are requests assumed to complete in order;
> > > > > > i.e.  unlink, forget, rmdir   is it required that 'forget' completes
> > > > > > before the rmdir is processed?
> > > > > >      (In virtiofs we've been processing requests, in parallel, and
> > > > > > have sent forgets down a separate queue to keep them out of the way).
> > > > > >
> > > > > >   b) 'forget' doesn't send a reply - so the kernel can't wait for the
> > > > > > client to have finished it;  do we need a synchronous forget here?
> > > > >
> > > > > Even if we introduce a synchronous forget, will that really fix the
> > > > > issue. For example, this could also happen if file has been unlinked
> > > > > but it is still open and directory is being removed.
> > > > >
> > > > > fd = open(foo/bar.txt)
> > > > > unlink foo/bar.txt
> > > > > rmdir foo
> > > > > close(fd).
> > > > >
> > > > > In this case, final forget should go after fd has been closed. Its
> > > > > not a forget race.
> > > > >
> > > > > I wrote a test case for this and it works on regular file systems.
> > > > >
> > > > > https://github.com/rhvgoyal/misc/blob/master/virtiofs-tests/rmdir.c
> > > > >
> > > > > I suspect it will fail on nfs because I am assuming that temporary
> > > > > file will be there till final close(fd) happens. If that's the
> > > > > case this is a NFS specific issue because its behavior is different
> > > > > from other file systems.
> > > >
> > > > That's true; but that's NFS just being NFS; in our case we're keeping
> > > > an fd open even though the guest has been smart enough not to; so we're
> > > > causing the NFS oddity when it wouldn't normally happen.
> > > >
> > >
> > > Are you sure that you really need this oddity?
> > >
> > > My sense from looking at virtiofsd is that the open O_PATH fd
> > > in InodeData for non-directories is overkill and even the need
> > > for an open fd for all directories is questionable.
> > >
> > > If you store a FileHandle (name_to_handle_at(2)) instead of an open fd
> > > for non-directories, you won't be keeping a reference on the underlying inode
> > > so no unlink issue.
> > >
> > > open_by_handle_at(2) is very cheap for non-directory when underlying inode
> > > is cached and as cheap as it can get even when inode is not in cache, so no
> > > performance penalty is expected.
> >
> > You are perfectly right that using file handles would solve a number
> > of issues, one being too many open file descriptors.
> >
> > The issue with open_by_handle_at(2) is that it needs
> > CAP_DAC_READ_SEARCH in the initial user namespace. That currently
> > makes it impossible to use in containers and such.
> 
> Is that a problem for virtiofsd? does it also run inside a container??

Yes it does; certainly for Kata it runs inside a container, and I think
it does in some other setups.


> Please note that NFS doesn't do "silly rename" for directories,
> so mitigation is mostly needed for non-dir.
> 
> An alternative method, if the daemon is not capable, is to store the parent
> dirfd in addition to the filehandle and implement open_child_by_handle_at(int
> parent_fd, ...):
> - readdir(parent_fd)
> - search a match for d_ino
> - name_to_handle_at() and verify match to stored filehandle
> 
> This is essentially what open_by_handle_at(2) does under the covers
> with a "connectable" non-dir filehandle after having resolved the
> parent file handle part. And "connectable" file handles are used by nfsd
> to enforce "subtree_check" to make sure that file wasn't moved outside
> obtainable path after initial lookup.
> 
> >
> > Not sure if there has been proposals for making open_by_handle_at(2)
> > usable in user namespaces.
> 
> I don't remember seeing any.
> 
> > The difficulty is in verifying that an
> > open file would have been obtainable by path lookup by the given user.
> 
> I think it can be allowed for user with CAP_DAC_READ_SEARCH in
> userns for FS_USERNS_MOUNT mounted in that userns.

It worries me a lot that open_by_handle_at is just too powerful; unless
you're giving the user/container its own filesystem, the guest kernel
can basically open any file.

Dave

> Thanks,
> Amir.
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Virtio-fs] [fuse-devel] 'FORGET' ordering semantics (vs unlink & NFS)
  2021-01-08 15:26                     ` Vivek Goyal
@ 2021-01-15 10:20                       ` Peng Tao
  0 siblings, 0 replies; 26+ messages in thread
From: Peng Tao @ 2021-01-15 10:20 UTC (permalink / raw)
  To: Vivek Goyal, Eryu Guan
  Cc: Miklos Szeredi, fuse-devel, Amir Goldstein, Max Reitz,
	virtio-fs-list, gerry, bergwolf



On 2021/1/8 23:26, Vivek Goyal wrote:
> On Fri, Jan 08, 2021 at 12:12:52PM +0800, Eryu Guan wrote:
> 
> [..]
>>> Does anyone have other thoughts about how to reduce fragmentation in
>>> implementations?
>>
>> There's a fuse-backend-rs[1] project hosted on cloud-hypervisor; it is
>> a library to communicate with the Linux FUSE clients, which includes:
>>
>> - ABI layer, which defines all data structures shared between linux Fuse
>>    framework and Fuse daemons.
>> - API layer, defines the interfaces for Fuse daemons to implement a
>>    userspace file system.
>> - Transport layer, which supports both the Linux Fuse device and
>>    virtio-fs protocol.
>> - VFS/pseudo_fs, an abstraction layer to support multiple file systems
>>    by a single virtio-fs device.
>> - A sample passthrough file system implementation, which passes through
>>    files from daemons to clients.
>>
> 
> So at a high level, is this the equivalent of libfuse written in Rust?
Yes, they are similar in concept, though fuse-backend-rs only provides
interfaces like the libfuse low-level APIs for filesystem
implementations. The FUSE session handling part still lives outside [1], but
we are planning to move it into fuse-backend-rs as well.

Cheers,
Tao

[1] https://github.com/dragonflyoss/image-service/blob/master/utils/src/fuse.rs
-- 
Into something rich and strange.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread

Thread overview: 26+ messages
2021-01-04 16:00 [Virtio-fs] 'FORGET' ordering semantics (vs unlink & NFS) Dr. David Alan Gilbert
2021-01-04 18:45 ` Vivek Goyal
2021-01-04 18:56   ` Dr. David Alan Gilbert
2021-01-04 19:04     ` Vivek Goyal
2021-01-04 19:16       ` Vivek Goyal
2021-01-05 11:24     ` [Virtio-fs] [fuse-devel] " Miklos Szeredi
2021-01-05 15:42       ` Vivek Goyal
2021-01-06  4:29     ` Amir Goldstein
2021-01-06  8:01       ` Miklos Szeredi
2021-01-06  9:16         ` Amir Goldstein
2021-01-06  9:27           ` Amir Goldstein
2021-01-06 13:40           ` Miklos Szeredi
2021-01-06 16:57             ` Vivek Goyal
2021-01-07  8:44               ` Miklos Szeredi
2021-01-07 10:42                 ` Amir Goldstein
2021-01-07 20:10                   ` Dr. David Alan Gilbert
2021-01-08  4:12                   ` Eryu Guan
2021-01-08  9:08                     ` Amir Goldstein
2021-01-08  9:25                       ` Liu, Jiang
2021-01-08 10:18                       ` Eryu Guan
2021-01-08 15:26                     ` Vivek Goyal
2021-01-15 10:20                       ` Peng Tao
2021-01-08 15:55           ` Vivek Goyal
2021-01-11 15:48           ` Dr. David Alan Gilbert
2021-01-05 10:11 ` Nikolaus Rath
2021-01-05 12:28   ` Dr. David Alan Gilbert
