linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 00/10 v7] NFSD: Pin to vfsmount for nfsd exports cache
@ 2015-07-11 12:46 Kinglong Mee
From: Kinglong Mee @ 2015-07-11 12:46 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: NeilBrown, Trond Myklebust, kinglongmee-Re5JQEeQqe8AvxtiuMwx3w

If there are mount points (not exported for NFS) under the pseudo
root, then after a client has looked up those entries under the root,
nobody can unmount those mount points until the export cache expires.

# cat /etc/exports
/nfs/xfs        *(rw,insecure,no_subtree_check,no_root_squash)
/nfs/pnfs       *(rw,insecure,no_subtree_check,no_root_squash)
# ll /nfs/
total 0
drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
drwxr-xr-x. 2 root root  6 Apr 20 22:01 xfs
# mount /dev/sde /nfs/test
# df
Filesystem                      1K-blocks    Used Available Use% Mounted on
......
/dev/sdd                          1038336   32944   1005392   4% /nfs/pnfs
/dev/sdc                         10475520   32928  10442592   1% /nfs/xfs
/dev/sde                           999320    1284    929224   1% /nfs/test
# mount -t nfs 127.0.0.1:/nfs/ /mnt
# ll /mnt/*/
/mnt/pnfs/:
total 0
-rw-r--r--. 1 root root 0 Apr 21 22:23 attr
drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp

/mnt/xfs/:
total 0
# umount /nfs/test/
umount: /nfs/test/: target is busy
        (In some cases useful info about processes that
         use the device is found by lsof(8) or fuser(1).)

This happens because nfsd's export cache holds a reference to the
path (here /nfs/test/), so it cannot be unmounted.

I don't think that is what users expect; they want to unmount
/nfs/test/.  Bruce thinks users should also be able to unmount
/nfs/pnfs/ and /nfs/xfs.

This patch set makes the nfsd export cache pin to the vfsmount
instead of using mntget, so users can now unmount any exported
mount point.

v3,
1. New helpers path_get_pin/path_put_unpin for path pinning.
2. Use kzalloc for allocating memory.

v4, thanks to Al Viro's comments on the logic of fs_pin:
1. Add a completion so pin_kill waits until the reference count drops to zero.
2. Add a work_struct so pin_kill can decrease the reference indirectly.
3. Free svc_export/svc_expkey in pin_kill, not in svc_export_put/svc_expkey_put.
4. Make svc_export_put/svc_expkey_put go through the pin_kill logic.

v5,
kill the fs_pin while holding a reference to the vfsmnt.

v6,
1. Revert the change of v5.
2. New helper legitimize_mntget() so the nfsd exports/expkey caches
   can get a vfsmount from an fs_pin.
3. Clean up some code in sunrpc's cache.
4. Switch cache_head in cache_detail from a single-linked list to
   a list_head.
5. New validate/invalidate functions for handling reference
   increase/decrease (the nfsd exports/expkey caches use them to
   grab a reference to the mnt).
6. Delete the cache_head directly from the cache_detail in pin_kill.

v7,
implement self-contained reference increase/decrease for the nfsd
exports/expkey caches (a minimal sketch of the pattern follows below).

When the reference count of a cache_head increases past 1 (> 1), grab
a reference to the mnt once; when it decreases back to 1 (== 1), drop
the reference to the mnt.

So after that,
when ref > 1, the user cannot unmount the filesystem (-EBUSY);
when ref == 1, the cache entry is referenced only by the nfsd cache
itself, with no other references, so the user can try to unmount:
1. Before MNT_UMOUNT is set (protected by mount_lock), while the nfsd
   cache entry is referenced (ref > 1 via legitimize_mntget), umount
   fails with -EBUSY.
2. After MNT_UMOUNT is set, if the nfsd cache entry is referenced
   (ref == 2), legitimize_mntget fails, the entry is set
   CACHE_NEGATIVE, and the reference is dropped, going back to 1.
   pin_kill can then delete the entry and the umount succeeds.
3. If nothing references the nfsd cache entry while unmounting,
   pin_kill can delete the entry and the umount succeeds.

Kinglong Mee (10):
  fs_pin: Initialize value for fs_pin explicitly
  fs_pin: Export functions for specific filesystem
  path: New helpers path_get_pin/path_put_unpin for path pin
  fs: New helper legitimize_mntget() for getting a legitimized mnt
  sunrpc: Store cache_detail in seq_file's private directly
  sunrpc/nfsd: Remove redundant code by exporting seq_operations functions
  sunrpc: Switch to using list_head instead of single list
  sunrpc: New helper cache_delete_entry for deleting cache_head directly
  sunrpc: Support get_ref/put_ref for reference change in cache_head
  nfsd: Allow user to unmount filesystems that nfsd exports are based on

 fs/fs_pin.c                  |   4 +
 fs/namei.c                   |  26 +++++
 fs/namespace.c               |  19 ++++
 fs/nfsd/export.c             | 259 ++++++++++++++++++++++++++++---------------
 fs/nfsd/export.h             |  22 +++-
 include/linux/fs_pin.h       |   6 +
 include/linux/mount.h        |   1 +
 include/linux/path.h         |   4 +
 include/linux/sunrpc/cache.h |  25 ++++-
 net/sunrpc/cache.c           | 149 ++++++++++++++++---------
 10 files changed, 368 insertions(+), 147 deletions(-)

-- 
2.4.3


* [PATCH 01/10 v7] fs_pin: Initialize value for fs_pin explicitly
From: Kinglong Mee @ 2015-07-11 12:47 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: NeilBrown, Trond Myklebust, kinglongmee-Re5JQEeQqe8AvxtiuMwx3w

Without explicit initialization, the 'done' field of an fs_pin on
the stack may contain a garbage value.

v3, v4, v5, v6, v7
Add include guard macros to the header file.

Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 include/linux/fs_pin.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h
index 3886b3b..0dde7b7 100644
--- a/include/linux/fs_pin.h
+++ b/include/linux/fs_pin.h
@@ -1,3 +1,6 @@
+#ifndef _LINUX_FS_PIN_H
+#define _LINUX_FS_PIN_H
+
 #include <linux/wait.h>
 
 struct fs_pin {
@@ -16,9 +19,12 @@ static inline void init_fs_pin(struct fs_pin *p, void (*kill)(struct fs_pin *))
 	INIT_HLIST_NODE(&p->s_list);
 	INIT_HLIST_NODE(&p->m_list);
 	p->kill = kill;
+	p->done = 0;
 }
 
 void pin_remove(struct fs_pin *);
 void pin_insert_group(struct fs_pin *, struct vfsmount *, struct hlist_head *);
 void pin_insert(struct fs_pin *, struct vfsmount *);
 void pin_kill(struct fs_pin *);
+
+#endif
-- 
2.4.3


* [PATCH 02/10 v7] fs_pin: Export functions for specific filesystem
From: Kinglong Mee @ 2015-07-11 12:47 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: NeilBrown, Trond Myklebust, kinglongmee-Re5JQEeQqe8AvxtiuMwx3w

Export these functions for other code that wants to pin to a
vfsmount, e.g. nfsd's export cache.

v4, v5, v6, v7
Add the export of pin_kill.

Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 fs/fs_pin.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index 611b540..a1a4eb2 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin)
 	wake_up_locked(&pin->wait);
 	spin_unlock_irq(&pin->wait.lock);
 }
+EXPORT_SYMBOL(pin_remove);
 
 void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
 {
@@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
 	hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
 	spin_unlock(&pin_lock);
 }
+EXPORT_SYMBOL(pin_insert_group);
 
 void pin_insert(struct fs_pin *pin, struct vfsmount *m)
 {
 	pin_insert_group(pin, m, &m->mnt_sb->s_pins);
 }
+EXPORT_SYMBOL(pin_insert);
 
 void pin_kill(struct fs_pin *p)
 {
@@ -72,6 +75,7 @@ void pin_kill(struct fs_pin *p)
 	}
 	rcu_read_unlock();
 }
+EXPORT_SYMBOL(pin_kill);
 
 void mnt_pin_kill(struct mount *m)
 {
-- 
2.4.3


* [PATCH 03/10 v7] path: New helpers path_get_pin/path_put_unpin for path pin
From: Kinglong Mee @ 2015-07-11 12:48 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: NeilBrown, Trond Myklebust, kinglongmee-Re5JQEeQqe8AvxtiuMwx3w

Two helpers for filesystems that pin to a vfsmnt instead of taking
a reference with mntget.

v4, v5, v6, v7 same as v2.
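
A rough usage sketch (the object and callback names here are made up
for illustration):

	struct my_obj {
		struct path	path;
		struct fs_pin	pin;
	};

	static void my_pin_kill(struct fs_pin *pin)
	{
		struct my_obj *obj = container_of(pin, struct my_obj, pin);

		/* drops the dentry reference and removes the pin */
		path_put_unpin(&obj->path, &obj->pin);
		kfree(obj);
	}

	/* instead of path_get() (which would mntget): */
	init_fs_pin(&obj->pin, my_pin_kill);
	obj->path = *src_path;
	path_get_pin(&obj->path, &obj->pin);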

Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 fs/namei.c           | 26 ++++++++++++++++++++++++++
 include/linux/path.h |  4 ++++
 2 files changed, 30 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c1..08f2cfc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -492,6 +492,32 @@ void path_put(const struct path *path)
 }
 EXPORT_SYMBOL(path_put);
 
+/**
+ * path_get_pin - get a reference to a path's dentry
+ *                and pin to path's vfsmnt
+ * @path: path to get the reference to
+ * @p: the fs_pin pin to vfsmnt
+ */
+void path_get_pin(struct path *path, struct fs_pin *p)
+{
+	dget(path->dentry);
+	pin_insert_group(p, path->mnt, NULL);
+}
+EXPORT_SYMBOL(path_get_pin);
+
+/**
+ * path_put_unpin - put a reference to a path's dentry
+ *                  and remove pin to path's vfsmnt
+ * @path: path to put the reference to
+ * @p: the fs_pin removed from vfsmnt
+ */
+void path_put_unpin(struct path *path, struct fs_pin *p)
+{
+	dput(path->dentry);
+	pin_remove(p);
+}
+EXPORT_SYMBOL(path_put_unpin);
+
 #define EMBEDDED_LEVELS 2
 struct nameidata {
 	struct path	path;
diff --git a/include/linux/path.h b/include/linux/path.h
index d137218..34599fb 100644
--- a/include/linux/path.h
+++ b/include/linux/path.h
@@ -3,6 +3,7 @@
 
 struct dentry;
 struct vfsmount;
+struct fs_pin;
 
 struct path {
 	struct vfsmount *mnt;
@@ -12,6 +13,9 @@ struct path {
 extern void path_get(const struct path *);
 extern void path_put(const struct path *);
 
+extern void path_get_pin(struct path *, struct fs_pin *);
+extern void path_put_unpin(struct path *, struct fs_pin *);
+
 static inline int path_equal(const struct path *path1, const struct path *path2)
 {
 	return path1->mnt == path2->mnt && path1->dentry == path2->dentry;
-- 
2.4.3


* [PATCH 04/10 v7] fs: New helper legitimize_mntget() for getting a legitimized mnt
From: Kinglong Mee @ 2015-07-11 12:48 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: NeilBrown, Trond Myklebust, kinglongmee-Re5JQEeQqe8AvxtiuMwx3w

New helper legitimize_mntget() takes a reference to a mnt unless
MNT_SYNC_UMOUNT, MNT_UMOUNT or MNT_DOOMED is set, in which case it
returns NULL.

v7, same as v6
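
A typical call site then looks roughly like this (a sketch; the
CACHE_NEGATIVE handling mirrors what the last patch of this series
does for the export cache):

	struct vfsmount *mnt;

	mnt = legitimize_mntget(exp->ex_path.mnt);
	if (mnt == NULL) {
		/* umount already under way: poison the cache entry */
		set_bit(CACHE_NEGATIVE, &exp->h.flags);
		return;
	}
	/* ... the mount is safe to use here ... */
	mntput(mnt);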

Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 fs/namespace.c        | 19 +++++++++++++++++++
 include/linux/mount.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index c7cb8a5..8709f61 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1153,6 +1153,25 @@ struct vfsmount *mntget(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL(mntget);
 
+struct vfsmount *legitimize_mntget(struct vfsmount *vfsmnt)
+{
+	struct mount *mnt;
+
+	if (vfsmnt == NULL)
+		return NULL;
+
+	read_seqlock_excl(&mount_lock);
+	mnt = real_mount(vfsmnt);
+	if (vfsmnt->mnt_flags & (MNT_SYNC_UMOUNT | MNT_UMOUNT | MNT_DOOMED))
+		vfsmnt = NULL;
+	else
+		mnt_add_count(mnt, 1);
+	read_sequnlock_excl(&mount_lock);
+
+	return vfsmnt;
+}
+EXPORT_SYMBOL(legitimize_mntget);
+
 struct vfsmount *mnt_clone_internal(struct path *path)
 {
 	struct mount *p;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c..8ae9dc0 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -79,6 +79,7 @@ extern void mnt_drop_write(struct vfsmount *mnt);
 extern void mnt_drop_write_file(struct file *file);
 extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
+extern struct vfsmount *legitimize_mntget(struct vfsmount *vfsmnt);
 extern struct vfsmount *mnt_clone_internal(struct path *path);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
 
-- 
2.4.3


* [PATCH 05/10 v7] sunrpc: Store cache_detail in seq_file's private, directly
From: Kinglong Mee @ 2015-07-11 12:49 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs, linux-fsdevel
  Cc: NeilBrown, Trond Myklebust, kinglongmee

Cleanup.

Just store the cache_detail directly in the seq_file's private
field; the separately allocated handle wrapper is redundant.

v7, same as v6.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
---
 net/sunrpc/cache.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 2928aff..edec603 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -1270,18 +1270,13 @@ EXPORT_SYMBOL_GPL(qword_get);
  * get a header, then pass each real item in the cache
  */
 
-struct handle {
-	struct cache_detail *cd;
-};
-
 static void *c_start(struct seq_file *m, loff_t *pos)
 	__acquires(cd->hash_lock)
 {
 	loff_t n = *pos;
 	unsigned int hash, entry;
 	struct cache_head *ch;
-	struct cache_detail *cd = ((struct handle*)m->private)->cd;
-
+	struct cache_detail *cd = m->private;
 
 	read_lock(&cd->hash_lock);
 	if (!n--)
@@ -1308,7 +1303,7 @@ static void *c_next(struct seq_file *m, void *p, loff_t *pos)
 {
 	struct cache_head *ch = p;
 	int hash = (*pos >> 32);
-	struct cache_detail *cd = ((struct handle*)m->private)->cd;
+	struct cache_detail *cd = m->private;
 
 	if (p == SEQ_START_TOKEN)
 		hash = 0;
@@ -1334,14 +1329,14 @@ static void *c_next(struct seq_file *m, void *p, loff_t *pos)
 static void c_stop(struct seq_file *m, void *p)
 	__releases(cd->hash_lock)
 {
-	struct cache_detail *cd = ((struct handle*)m->private)->cd;
+	struct cache_detail *cd = m->private;
 	read_unlock(&cd->hash_lock);
 }
 
 static int c_show(struct seq_file *m, void *p)
 {
 	struct cache_head *cp = p;
-	struct cache_detail *cd = ((struct handle*)m->private)->cd;
+	struct cache_detail *cd = m->private;
 
 	if (p == SEQ_START_TOKEN)
 		return cd->cache_show(m, cd, NULL);
@@ -1373,24 +1368,27 @@ static const struct seq_operations cache_content_op = {
 static int content_open(struct inode *inode, struct file *file,
 			struct cache_detail *cd)
 {
-	struct handle *han;
+	struct seq_file *seq;
+	int err;
 
 	if (!cd || !try_module_get(cd->owner))
 		return -EACCES;
-	han = __seq_open_private(file, &cache_content_op, sizeof(*han));
-	if (han == NULL) {
+
+	err = seq_open(file, &cache_content_op);
+	if (err) {
 		module_put(cd->owner);
-		return -ENOMEM;
+		return err;
 	}
 
-	han->cd = cd;
+	seq = file->private_data;
+	seq->private = cd;
 	return 0;
 }
 
 static int content_release(struct inode *inode, struct file *file,
 		struct cache_detail *cd)
 {
-	int ret = seq_release_private(inode, file);
+	int ret = seq_release(inode, file);
 	module_put(cd->owner);
 	return ret;
 }
-- 
2.4.3



* [PATCH 06/10 v7] sunrpc/nfsd: Remove redundant code by exporting seq_operations functions
From: Kinglong Mee @ 2015-07-11 12:49 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs, linux-fsdevel
  Cc: NeilBrown, Trond Myklebust, kinglongmee

Nfsd implements a set of seq_operations functions that duplicate
sunrpc's cache code.  Just export sunrpc's versions and remove
nfsd's redundant copies.

v7, same as v6
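
Any user of the exported helpers must store the cache_detail in
seq_file->private, as the previous patch arranged for the cache
content file; roughly ("my_seq_ops" is a placeholder name):

	struct seq_file *seq;
	int err;

	err = seq_open(file, &my_seq_ops);	/* .start = cache_seq_start, etc. */
	if (err)
		return err;

	seq = file->private_data;
	seq->private = cd;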

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
---
 fs/nfsd/export.c             | 73 ++------------------------------------------
 include/linux/sunrpc/cache.h |  5 +++
 net/sunrpc/cache.c           | 15 +++++----
 3 files changed, 17 insertions(+), 76 deletions(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index f79521a..b4d84b5 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -1075,73 +1075,6 @@ exp_pseudoroot(struct svc_rqst *rqstp, struct svc_fh *fhp)
 	return rv;
 }
 
-/* Iterator */
-
-static void *e_start(struct seq_file *m, loff_t *pos)
-	__acquires(((struct cache_detail *)m->private)->hash_lock)
-{
-	loff_t n = *pos;
-	unsigned hash, export;
-	struct cache_head *ch;
-	struct cache_detail *cd = m->private;
-	struct cache_head **export_table = cd->hash_table;
-
-	read_lock(&cd->hash_lock);
-	if (!n--)
-		return SEQ_START_TOKEN;
-	hash = n >> 32;
-	export = n & ((1LL<<32) - 1);
-
-	
-	for (ch=export_table[hash]; ch; ch=ch->next)
-		if (!export--)
-			return ch;
-	n &= ~((1LL<<32) - 1);
-	do {
-		hash++;
-		n += 1LL<<32;
-	} while(hash < EXPORT_HASHMAX && export_table[hash]==NULL);
-	if (hash >= EXPORT_HASHMAX)
-		return NULL;
-	*pos = n+1;
-	return export_table[hash];
-}
-
-static void *e_next(struct seq_file *m, void *p, loff_t *pos)
-{
-	struct cache_head *ch = p;
-	int hash = (*pos >> 32);
-	struct cache_detail *cd = m->private;
-	struct cache_head **export_table = cd->hash_table;
-
-	if (p == SEQ_START_TOKEN)
-		hash = 0;
-	else if (ch->next == NULL) {
-		hash++;
-		*pos += 1LL<<32;
-	} else {
-		++*pos;
-		return ch->next;
-	}
-	*pos &= ~((1LL<<32) - 1);
-	while (hash < EXPORT_HASHMAX && export_table[hash] == NULL) {
-		hash++;
-		*pos += 1LL<<32;
-	}
-	if (hash >= EXPORT_HASHMAX)
-		return NULL;
-	++*pos;
-	return export_table[hash];
-}
-
-static void e_stop(struct seq_file *m, void *p)
-	__releases(((struct cache_detail *)m->private)->hash_lock)
-{
-	struct cache_detail *cd = m->private;
-
-	read_unlock(&cd->hash_lock);
-}
-
 static struct flags {
 	int flag;
 	char *name[2];
@@ -1270,9 +1203,9 @@ static int e_show(struct seq_file *m, void *p)
 }
 
 const struct seq_operations nfs_exports_op = {
-	.start	= e_start,
-	.next	= e_next,
-	.stop	= e_stop,
+	.start	= cache_seq_start,
+	.next	= cache_seq_next,
+	.stop	= cache_seq_stop,
 	.show	= e_show,
 };
 
diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 437ddb6..04ee5a2 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -224,6 +224,11 @@ extern int sunrpc_cache_register_pipefs(struct dentry *parent, const char *,
 					umode_t, struct cache_detail *);
 extern void sunrpc_cache_unregister_pipefs(struct cache_detail *);
 
+/* Must store cache_detail in seq_file->private if using next three functions */
+extern void *cache_seq_start(struct seq_file *file, loff_t *pos);
+extern void *cache_seq_next(struct seq_file *file, void *p, loff_t *pos);
+extern void cache_seq_stop(struct seq_file *file, void *p);
+
 extern void qword_add(char **bpp, int *lp, char *str);
 extern void qword_addhex(char **bpp, int *lp, char *buf, int blen);
 extern int qword_get(char **bpp, char *dest, int bufsize);
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index edec603..673c2fa 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -1270,7 +1270,7 @@ EXPORT_SYMBOL_GPL(qword_get);
  * get a header, then pass each real item in the cache
  */
 
-static void *c_start(struct seq_file *m, loff_t *pos)
+void *cache_seq_start(struct seq_file *m, loff_t *pos)
 	__acquires(cd->hash_lock)
 {
 	loff_t n = *pos;
@@ -1298,8 +1298,9 @@ static void *c_start(struct seq_file *m, loff_t *pos)
 	*pos = n+1;
 	return cd->hash_table[hash];
 }
+EXPORT_SYMBOL_GPL(cache_seq_start);
 
-static void *c_next(struct seq_file *m, void *p, loff_t *pos)
+void *cache_seq_next(struct seq_file *m, void *p, loff_t *pos)
 {
 	struct cache_head *ch = p;
 	int hash = (*pos >> 32);
@@ -1325,13 +1326,15 @@ static void *c_next(struct seq_file *m, void *p, loff_t *pos)
 	++*pos;
 	return cd->hash_table[hash];
 }
+EXPORT_SYMBOL_GPL(cache_seq_next);
 
-static void c_stop(struct seq_file *m, void *p)
+void cache_seq_stop(struct seq_file *m, void *p)
 	__releases(cd->hash_lock)
 {
 	struct cache_detail *cd = m->private;
 	read_unlock(&cd->hash_lock);
 }
+EXPORT_SYMBOL_GPL(cache_seq_stop);
 
 static int c_show(struct seq_file *m, void *p)
 {
@@ -1359,9 +1362,9 @@ static int c_show(struct seq_file *m, void *p)
 }
 
 static const struct seq_operations cache_content_op = {
-	.start	= c_start,
-	.next	= c_next,
-	.stop	= c_stop,
+	.start	= cache_seq_start,
+	.next	= cache_seq_next,
+	.stop	= cache_seq_stop,
 	.show	= c_show,
 };
 
-- 
2.4.3



* [PATCH 07/10 v7] sunrpc: Switch to using list_head instead of single list
From: Kinglong Mee @ 2015-07-11 12:50 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: NeilBrown, Trond Myklebust, kinglongmee-Re5JQEeQqe8AvxtiuMwx3w

Switch cache_head in cache_detail to a list_head; this makes it
possible to remove a cache_head entry directly from its cache_detail.

v7, same as v6.

Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 include/linux/sunrpc/cache.h |  4 +--
 net/sunrpc/cache.c           | 74 ++++++++++++++++++++++++--------------------
 2 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 04ee5a2..ecc0ff6 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -46,7 +46,7 @@
  * 
  */
 struct cache_head {
-	struct cache_head * next;
+	struct list_head	cache_list;
 	time_t		expiry_time;	/* After time time, don't use the data */
 	time_t		last_refresh;   /* If CACHE_PENDING, this is when upcall 
 					 * was sent, else this is when update was received
@@ -73,7 +73,7 @@ struct cache_detail_pipefs {
 struct cache_detail {
 	struct module *		owner;
 	int			hash_size;
-	struct cache_head **	hash_table;
+	struct list_head *	hash_table;
 	rwlock_t		hash_lock;
 
 	atomic_t		inuse; /* active user-space update or lookup */
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 673c2fa..ad2155c 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -44,7 +44,7 @@ static void cache_revisit_request(struct cache_head *item);
 static void cache_init(struct cache_head *h)
 {
 	time_t now = seconds_since_boot();
-	h->next = NULL;
+	INIT_LIST_HEAD(&h->cache_list);
 	h->flags = 0;
 	kref_init(&h->ref);
 	h->expiry_time = now + CACHE_NEW_EXPIRY;
@@ -54,15 +54,16 @@ static void cache_init(struct cache_head *h)
 struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
 				       struct cache_head *key, int hash)
 {
-	struct cache_head **head,  **hp;
 	struct cache_head *new = NULL, *freeme = NULL;
+	struct cache_head *tmp;
+	struct list_head *head, *pos, *tpos;
 
 	head = &detail->hash_table[hash];
 
 	read_lock(&detail->hash_lock);
 
-	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
-		struct cache_head *tmp = *hp;
+	list_for_each_safe(pos, tpos, head) {
+		tmp = list_entry(pos, struct cache_head, cache_list);
 		if (detail->match(tmp, key)) {
 			if (cache_is_expired(detail, tmp))
 				/* This entry is expired, we will discard it. */
@@ -88,12 +89,11 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
 	write_lock(&detail->hash_lock);
 
 	/* check if entry appeared while we slept */
-	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
-		struct cache_head *tmp = *hp;
+	list_for_each_safe(pos, tpos, head) {
+		tmp = list_entry(pos, struct cache_head, cache_list);
 		if (detail->match(tmp, key)) {
 			if (cache_is_expired(detail, tmp)) {
-				*hp = tmp->next;
-				tmp->next = NULL;
+				list_del_init(&tmp->cache_list);
 				detail->entries --;
 				freeme = tmp;
 				break;
@@ -104,8 +104,8 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
 			return tmp;
 		}
 	}
-	new->next = *head;
-	*head = new;
+
+	list_add(&new->cache_list, head);
 	detail->entries++;
 	cache_get(new);
 	write_unlock(&detail->hash_lock);
@@ -143,7 +143,6 @@ struct cache_head *sunrpc_cache_update(struct cache_detail *detail,
 	 * If 'old' is not VALID, we update it directly,
 	 * otherwise we need to replace it
 	 */
-	struct cache_head **head;
 	struct cache_head *tmp;
 
 	if (!test_bit(CACHE_VALID, &old->flags)) {
@@ -168,15 +167,13 @@ struct cache_head *sunrpc_cache_update(struct cache_detail *detail,
 	}
 	cache_init(tmp);
 	detail->init(tmp, old);
-	head = &detail->hash_table[hash];
 
 	write_lock(&detail->hash_lock);
 	if (test_bit(CACHE_NEGATIVE, &new->flags))
 		set_bit(CACHE_NEGATIVE, &tmp->flags);
 	else
 		detail->update(tmp, new);
-	tmp->next = *head;
-	*head = tmp;
+	list_add(&tmp->cache_list, &detail->hash_table[hash]);
 	detail->entries++;
 	cache_get(tmp);
 	cache_fresh_locked(tmp, new->expiry_time);
@@ -416,42 +413,44 @@ static int cache_clean(void)
 	/* find a non-empty bucket in the table */
 	while (current_detail &&
 	       current_index < current_detail->hash_size &&
-	       current_detail->hash_table[current_index] == NULL)
+	       list_empty(&current_detail->hash_table[current_index]))
 		current_index++;
 
 	/* find a cleanable entry in the bucket and clean it, or set to next bucket */
 
 	if (current_detail && current_index < current_detail->hash_size) {
-		struct cache_head *ch, **cp;
+		struct cache_head *ch = NULL, *putme = NULL;
+		struct list_head *head, *pos, *tpos;
 		struct cache_detail *d;
 
 		write_lock(&current_detail->hash_lock);
 
 		/* Ok, now to clean this strand */
 
-		cp = & current_detail->hash_table[current_index];
-		for (ch = *cp ; ch ; cp = & ch->next, ch = *cp) {
+		head = &current_detail->hash_table[current_index];
+		list_for_each_safe(pos, tpos, head) {
+			ch = list_entry(pos, struct cache_head, cache_list);
 			if (current_detail->nextcheck > ch->expiry_time)
 				current_detail->nextcheck = ch->expiry_time+1;
 			if (!cache_is_expired(current_detail, ch))
 				continue;
 
-			*cp = ch->next;
-			ch->next = NULL;
+			list_del_init(pos);
 			current_detail->entries--;
+			putme = ch;
 			rv = 1;
 			break;
 		}
 
 		write_unlock(&current_detail->hash_lock);
 		d = current_detail;
-		if (!ch)
+		if (!putme)
 			current_index ++;
 		spin_unlock(&cache_list_lock);
-		if (ch) {
-			set_bit(CACHE_CLEANED, &ch->flags);
-			cache_fresh_unlocked(ch, d);
-			cache_put(ch, d);
+		if (putme) {
+			set_bit(CACHE_CLEANED, &putme->flags);
+			cache_fresh_unlocked(putme, d);
+			cache_put(putme, d);
 		}
 	} else
 		spin_unlock(&cache_list_lock);
@@ -1277,6 +1276,7 @@ void *cache_seq_start(struct seq_file *m, loff_t *pos)
 	unsigned int hash, entry;
 	struct cache_head *ch;
 	struct cache_detail *cd = m->private;
+	struct list_head *ptr, *tptr;
 
 	read_lock(&cd->hash_lock);
 	if (!n--)
@@ -1284,19 +1284,22 @@ void *cache_seq_start(struct seq_file *m, loff_t *pos)
 	hash = n >> 32;
 	entry = n & ((1LL<<32) - 1);
 
-	for (ch=cd->hash_table[hash]; ch; ch=ch->next)
+	list_for_each_safe(ptr, tptr, &cd->hash_table[hash]) {
+		ch = list_entry(ptr, struct cache_head, cache_list);
 		if (!entry--)
 			return ch;
+	}
 	n &= ~((1LL<<32) - 1);
 	do {
 		hash++;
 		n += 1LL<<32;
 	} while(hash < cd->hash_size &&
-		cd->hash_table[hash]==NULL);
+		list_empty(&cd->hash_table[hash]));
 	if (hash >= cd->hash_size)
 		return NULL;
 	*pos = n+1;
-	return cd->hash_table[hash];
+	return list_first_entry_or_null(&cd->hash_table[hash],
+					struct cache_head, cache_list);
 }
 EXPORT_SYMBOL_GPL(cache_seq_start);
 
@@ -1308,23 +1311,24 @@ void *cache_seq_next(struct seq_file *m, void *p, loff_t *pos)
 
 	if (p == SEQ_START_TOKEN)
 		hash = 0;
-	else if (ch->next == NULL) {
+	else if (list_is_last(&ch->cache_list, &cd->hash_table[hash])) {
 		hash++;
 		*pos += 1LL<<32;
 	} else {
 		++*pos;
-		return ch->next;
+		return list_next_entry(ch, cache_list);
 	}
 	*pos &= ~((1LL<<32) - 1);
 	while (hash < cd->hash_size &&
-	       cd->hash_table[hash] == NULL) {
+	       list_empty(&cd->hash_table[hash])) {
 		hash++;
 		*pos += 1LL<<32;
 	}
 	if (hash >= cd->hash_size)
 		return NULL;
 	++*pos;
-	return cd->hash_table[hash];
+	return list_first_entry_or_null(&cd->hash_table[hash],
+					struct cache_head, cache_list);
 }
 EXPORT_SYMBOL_GPL(cache_seq_next);
 
@@ -1666,17 +1670,21 @@ EXPORT_SYMBOL_GPL(cache_unregister_net);
 struct cache_detail *cache_create_net(struct cache_detail *tmpl, struct net *net)
 {
 	struct cache_detail *cd;
+	int i;
 
 	cd = kmemdup(tmpl, sizeof(struct cache_detail), GFP_KERNEL);
 	if (cd == NULL)
 		return ERR_PTR(-ENOMEM);
 
-	cd->hash_table = kzalloc(cd->hash_size * sizeof(struct cache_head *),
+	cd->hash_table = kzalloc(cd->hash_size * sizeof(struct list_head),
 				 GFP_KERNEL);
 	if (cd->hash_table == NULL) {
 		kfree(cd);
 		return ERR_PTR(-ENOMEM);
 	}
+
+	for (i = 0; i < cd->hash_size; i++)
+		INIT_LIST_HEAD(&cd->hash_table[i]);
 	cd->net = net;
 	return cd;
 }
-- 
2.4.3


* [PATCH 08/10 v7] sunrpc: New helper cache_delete_entry for deleting cache_head directly
From: Kinglong Mee @ 2015-07-11 12:51 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs, linux-fsdevel
  Cc: NeilBrown, Trond Myklebust, kinglongmee

A new helper, cache_delete_entry(), deletes a cache_head from its
cache_detail directly.

It will be used by pin_kill, so the cache_detail must be checked to
still be valid before the entry is deleted.

Because pin_kill happens rarely, the performance impact of that check
is acceptable.

v7, same as v6.
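
For context, the final patch of this series calls it from a work item
so that pin_kill() can retire an entry; condensed from that patch:

	static void expkey_close_work(struct work_struct *work)
	{
		struct svc_expkey *key =
			container_of(work, struct svc_expkey, ek_work);

		/* unhashes the entry and drops the hash table's reference */
		cache_delete_entry(key->cd, &key->h);
	}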

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
---
 include/linux/sunrpc/cache.h |  1 +
 net/sunrpc/cache.c           | 30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index ecc0ff6..5a4b921 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -210,6 +210,7 @@ extern int cache_check(struct cache_detail *detail,
 		       struct cache_head *h, struct cache_req *rqstp);
 extern void cache_flush(void);
 extern void cache_purge(struct cache_detail *detail);
+extern void cache_delete_entry(struct cache_detail *cd, struct cache_head *h);
 #define NEVER (0x7FFFFFFF)
 extern void __init cache_initialize(void);
 extern int cache_register_net(struct cache_detail *cd, struct net *net);
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index ad2155c..8a27483 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -458,6 +458,36 @@ static int cache_clean(void)
 	return rv;
 }
 
+void cache_delete_entry(struct cache_detail *detail, struct cache_head *h)
+{
+	struct cache_detail *tmp;
+
+	if (!detail || !h)
+		return;
+
+	spin_lock(&cache_list_lock);
+	list_for_each_entry(tmp, &cache_list, others) {
+		if (tmp == detail)
+			goto found;
+	}
+	spin_unlock(&cache_list_lock);
+	printk(KERN_WARNING "%s: Deleted cache detail %p\n", __func__, detail);
+	return ;
+
+found:
+	write_lock(&detail->hash_lock);
+
+	list_del_init(&h->cache_list);
+	detail->entries--;
+	set_bit(CACHE_CLEANED, &h->flags);
+
+	write_unlock(&detail->hash_lock);
+	spin_unlock(&cache_list_lock);
+
+	cache_put(h, detail);
+}
+EXPORT_SYMBOL_GPL(cache_delete_entry);
+
 /*
  * We want to regularly clean the cache, so we need to schedule some work ...
  */
-- 
2.4.3



* [PATCH 09/10 v7] sunrpc: Support get_ref/put_ref for reference change in cache_head
From: Kinglong Mee @ 2015-07-11 12:51 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: NeilBrown, Trond Myklebust, kinglongmee-Re5JQEeQqe8AvxtiuMwx3w

Add get_ref/put_ref hooks to cache_head so a cache can intercept
reference count changes (increase/decrease).

v7, new one
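
A cache that wants custom behaviour installs the hooks from its init
callback (the next patch does this for exports/expkeys); entries that
leave both hooks NULL keep the plain kref_get/kref_put behaviour:

	/* e.g. in expkey_init() / svc_export_init(): */
	cnew->get_ref = expkey_get_ref;	/* grab the mnt on the 1 -> 2 transition */
	cnew->put_ref = expkey_put_ref;	/* drop the mnt on the 2 -> 1 transition */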

Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 include/linux/sunrpc/cache.h | 15 +++++++++++++--
 net/sunrpc/cache.c           |  2 ++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 5a4b921..77bc623 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -52,6 +52,9 @@ struct cache_head {
 					 * was sent, else this is when update was received
 					 */
 	struct kref	ref;
+	void		(*get_ref)(struct cache_head *);
+	void		(*put_ref)(struct cache_head *);
+
 	unsigned long	flags;
 };
 #define	CACHE_VALID	0	/* Entry contains valid data */
@@ -187,7 +190,11 @@ extern void cache_clean_deferred(void *owner);
 
 static inline struct cache_head  *cache_get(struct cache_head *h)
 {
-	kref_get(&h->ref);
+	if (h->get_ref) {
+		h->get_ref(h);
+	} else
+		kref_get(&h->ref);
+
 	return h;
 }
 
@@ -197,7 +204,11 @@ static inline void cache_put(struct cache_head *h, struct cache_detail *cd)
 	if (atomic_read(&h->ref.refcount) <= 2 &&
 	    h->expiry_time < cd->nextcheck)
 		cd->nextcheck = h->expiry_time;
-	kref_put(&h->ref, cd->cache_put);
+
+	if (h->put_ref) {
+		h->put_ref(h);
+	} else
+		kref_put(&h->ref, cd->cache_put);
 }
 
 static inline int cache_is_expired(struct cache_detail *detail, struct cache_head *h)
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 8a27483..1ca88b0 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -49,6 +49,8 @@ static void cache_init(struct cache_head *h)
 	kref_init(&h->ref);
 	h->expiry_time = now + CACHE_NEW_EXPIRY;
 	h->last_refresh = now;
+	h->get_ref = NULL;
+	h->put_ref = NULL;
 }
 
 struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
-- 
2.4.3


* [PATCH 10/10 v7] nfsd: Allow user to unmount filesystems that nfsd exports are based on
From: Kinglong Mee @ 2015-07-11 12:52 UTC (permalink / raw)
  To: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: NeilBrown, Trond Myklebust, kinglongmee-Re5JQEeQqe8AvxtiuMwx3w

If there are mount points (not exported for NFS) under the pseudo
root, then after a client has looked up those entries under the root,
nobody can unmount those mount points until the export cache expires.

# cat /etc/exports
/nfs/xfs        *(rw,insecure,no_subtree_check,no_root_squash)
/nfs/pnfs       *(rw,insecure,no_subtree_check,no_root_squash)
# ll /nfs/
total 0
drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
drwxr-xr-x. 2 root root  6 Apr 20 22:01 xfs
# mount /dev/sde /nfs/test
# df
Filesystem                      1K-blocks    Used Available Use% Mounted on
......
/dev/sdd                          1038336   32944   1005392   4% /nfs/pnfs
/dev/sdc                         10475520   32928  10442592   1% /nfs/xfs
/dev/sde                           999320    1284    929224   1% /nfs/test
# mount -t nfs 127.0.0.1:/nfs/ /mnt
# ll /mnt/*/
/mnt/pnfs/:
total 0
-rw-r--r--. 1 root root 0 Apr 21 22:23 attr
drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp

/mnt/xfs/:
total 0
# umount /nfs/test/
umount: /nfs/test/: target is busy
        (In some cases useful info about processes that
        use the device is found by lsof(8) or fuser(1).)

This happens because nfsd's export cache holds a reference to the
path (here /nfs/test/), so it cannot be unmounted.

I don't think that is what users expect; they want to unmount
/nfs/test/.  Bruce thinks users should also be able to unmount
/nfs/pnfs/ and /nfs/xfs.

Also, use kzalloc for all memory allocations instead of kmalloc.
Thanks to Al Viro's comments on the logic of fs_pin.

v3,
1. Use path_get_pin/path_put_unpin for path pinning.
2. Use kzalloc for memory allocation.

v4,
1. Add a completion so pin_kill waits until the reference count drops to zero.
2. Add a work_struct so pin_kill can decrease the reference indirectly.
3. Free svc_export/svc_expkey in pin_kill, not in svc_export_put/svc_expkey_put.
4. Make svc_export_put/svc_expkey_put go through the pin_kill logic.

v5, same as v4.

v6,
1. Pin the vfsmnt to the mount point first; when the reference count
   increases (== 2), grab a reference to the vfsmnt with mntget.
   When it decreases (== 1), drop the reference to the vfsmnt,
   leaving only the pin.
2. Delete the cache_head directly from the cache_detail.

v7,
implement self-contained reference increase/decrease for the nfsd
exports/expkey caches.

When the reference count of a cache_head increases past 1 (> 1), grab
a reference to the mnt once; when it decreases back to 1 (== 1), drop
the reference to the mnt.

So after that,
when ref > 1, the user cannot unmount the filesystem (-EBUSY);
when ref == 1, the cache entry is referenced only by the nfsd cache
itself, with no other references, so the user can try to unmount:
1. Before MNT_UMOUNT is set (protected by mount_lock), while the nfsd
   cache entry is referenced (ref > 1 via legitimize_mntget), umount
   fails with -EBUSY.
2. After MNT_UMOUNT is set, if the nfsd cache entry is referenced
   (ref == 2), legitimize_mntget fails, the entry is set
   CACHE_NEGATIVE, and the reference is dropped, going back to 1.
   pin_kill can then delete the entry and the umount succeeds.
3. If nothing references the nfsd cache entry while unmounting,
   pin_kill can delete the entry and the umount succeeds.

Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 fs/nfsd/export.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/nfsd/export.h |  22 ++++++-
 2 files changed, 189 insertions(+), 19 deletions(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index b4d84b5..075bcc4 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -37,15 +37,23 @@
 #define	EXPKEY_HASHMAX		(1 << EXPKEY_HASHBITS)
 #define	EXPKEY_HASHMASK		(EXPKEY_HASHMAX -1)
 
+static void expkey_destroy(struct svc_expkey *key)
+{
+	auth_domain_put(key->ek_client);
+	kfree_rcu(key, rcu_head);
+}
+
 static void expkey_put(struct kref *ref)
 {
 	struct svc_expkey *key = container_of(ref, struct svc_expkey, h.ref);
 
 	if (test_bit(CACHE_VALID, &key->h.flags) &&
-	    !test_bit(CACHE_NEGATIVE, &key->h.flags))
-		path_put(&key->ek_path);
-	auth_domain_put(key->ek_client);
-	kfree(key);
+	    !test_bit(CACHE_NEGATIVE, &key->h.flags)) {
+		rcu_read_lock();
+		complete(&key->ek_done);
+		pin_kill(&key->ek_pin);
+	} else
+		expkey_destroy(key);
 }
 
 static void expkey_request(struct cache_detail *cd,
@@ -83,7 +91,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
 		return -EINVAL;
 	mesg[mlen-1] = 0;
 
-	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	err = -ENOMEM;
 	if (!buf)
 		goto out;
@@ -119,6 +127,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
 	if (key.h.expiry_time == 0)
 		goto out;
 
+	key.cd = cd;
 	key.ek_client = dom;	
 	key.ek_fsidtype = fsidtype;
 	memcpy(key.ek_fsid, buf, len);
@@ -199,6 +208,47 @@ static inline int expkey_match (struct cache_head *a, struct cache_head *b)
 	return 1;
 }
 
+static void expkey_get_ref(struct cache_head *h)
+{
+	struct svc_expkey *key = container_of(h, struct svc_expkey, h);
+
+	mutex_lock(&key->ref_mutex);
+	kref_get(&h->ref);
+
+	if (!test_bit(CACHE_VALID, &key->h.flags) ||
+	    test_bit(CACHE_NEGATIVE, &key->h.flags))
+		goto out;
+
+	if (atomic_read(&h->ref.refcount) == 2) {
+		if (legitimize_mntget(key->ek_path.mnt) == NULL) {
+			printk(KERN_WARNING "%s: Get mnt for %pd2 failed!\n",
+				__func__, key->ek_path.dentry);
+			set_bit(CACHE_NEGATIVE, &h->flags);
+		} else
+			key->ek_mnt_ref = true;
+	}
+out:
+	mutex_unlock(&key->ref_mutex);
+}
+
+static void expkey_put_ref(struct cache_head *h)
+{
+	struct svc_expkey *key = container_of(h, struct svc_expkey, h);
+
+	mutex_lock(&key->ref_mutex);
+	if (key->ek_mnt_ref && (atomic_read(&h->ref.refcount) == 2)) {
+		mntput(key->ek_path.mnt);
+		key->ek_mnt_ref = false;
+	}
+
+	if (unlikely(!atomic_dec_and_test(&h->ref.refcount))) {
+		mutex_unlock(&key->ref_mutex);
+		return ;
+	}
+
+	expkey_put(&h->ref);
+}
+
 static inline void expkey_init(struct cache_head *cnew,
 				   struct cache_head *citem)
 {
@@ -210,6 +260,28 @@ static inline void expkey_init(struct cache_head *cnew,
 	new->ek_fsidtype = item->ek_fsidtype;
 
 	memcpy(new->ek_fsid, item->ek_fsid, sizeof(new->ek_fsid));
+	new->cd = item->cd;
+
+	cnew->get_ref = expkey_get_ref;
+	cnew->put_ref = expkey_put_ref;
+}
+
+static void expkey_pin_kill(struct fs_pin *pin)
+{
+	struct svc_expkey *key = container_of(pin, struct svc_expkey, ek_pin);
+
+	if (!completion_done(&key->ek_done)) {
+		schedule_work(&key->ek_work);
+		wait_for_completion(&key->ek_done);
+	}
+	path_put_unpin(&key->ek_path, &key->ek_pin);
+	expkey_destroy(key);
+}
+
+static void expkey_close_work(struct work_struct *work)
+{
+	struct svc_expkey *key = container_of(work, struct svc_expkey, ek_work);
+	cache_delete_entry(key->cd, &key->h);
 }
 
 static inline void expkey_update(struct cache_head *cnew,
@@ -218,16 +290,20 @@ static inline void expkey_update(struct cache_head *cnew,
 	struct svc_expkey *new = container_of(cnew, struct svc_expkey, h);
 	struct svc_expkey *item = container_of(citem, struct svc_expkey, h);
 
+	init_fs_pin(&new->ek_pin, expkey_pin_kill);
 	new->ek_path = item->ek_path;
-	path_get(&item->ek_path);
+	path_get_pin(&new->ek_path, &new->ek_pin);
 }
 
 static struct cache_head *expkey_alloc(void)
 {
-	struct svc_expkey *i = kmalloc(sizeof(*i), GFP_KERNEL);
-	if (i)
+	struct svc_expkey *i = kzalloc(sizeof(*i), GFP_KERNEL);
+	if (i) {
+		INIT_WORK(&i->ek_work, expkey_close_work);
+		init_completion(&i->ek_done);
+		mutex_init(&i->ref_mutex);
 		return &i->h;
-	else
+	} else
 		return NULL;
 }
 
@@ -306,14 +382,21 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
 	fsloc->locations = NULL;
 }
 
-static void svc_export_put(struct kref *ref)
+static void svc_export_destroy(struct svc_export *exp)
 {
-	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
-	path_put(&exp->ex_path);
 	auth_domain_put(exp->ex_client);
 	nfsd4_fslocs_free(&exp->ex_fslocs);
 	kfree(exp->ex_uuid);
-	kfree(exp);
+	kfree_rcu(exp, rcu_head);
+}
+
+static void svc_export_put(struct kref *ref)
+{
+	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
+
+	rcu_read_lock();
+	complete(&exp->ex_done);
+	pin_kill(&exp->ex_pin);
 }
 
 static void svc_export_request(struct cache_detail *cd,
@@ -520,7 +603,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 		return -EINVAL;
 	mesg[mlen-1] = 0;
 
-	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!buf)
 		return -ENOMEM;
 
@@ -694,21 +777,83 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
 		path_equal(&orig->ex_path, &new->ex_path);
 }
 
+static void export_get_ref(struct cache_head *h)
+{
+	struct svc_export *exp = container_of(h, struct svc_export, h);
+
+	mutex_lock(&exp->ref_mutex);
+	kref_get(&h->ref);
+
+	if (test_bit(CACHE_NEGATIVE, &h->flags))
+		goto out;
+
+	if (atomic_read(&h->ref.refcount) == 2) {
+		if (legitimize_mntget(exp->ex_path.mnt) == NULL) {
+			printk(KERN_WARNING "%s: Get mnt for %pd2 failed!\n",
+				__func__, exp->ex_path.dentry);
+			set_bit(CACHE_NEGATIVE, &h->flags);
+		} else
+			exp->ex_mnt_ref = true;
+	}
+out:
+	mutex_unlock(&exp->ref_mutex);
+}
+
+static void export_put_ref(struct cache_head *h)
+{
+	struct svc_export *exp = container_of(h, struct svc_export, h);
+
+	mutex_lock(&exp->ref_mutex);
+	if (exp->ex_mnt_ref && (atomic_read(&h->ref.refcount) == 2)) {
+		mntput(exp->ex_path.mnt);
+		exp->ex_mnt_ref = false;
+	}
+
+	if (unlikely(!atomic_dec_and_test(&h->ref.refcount))) {
+		mutex_unlock(&exp->ref_mutex);
+		return ;
+	}
+
+	svc_export_put(&h->ref);
+}
+
+static void export_pin_kill(struct fs_pin *pin)
+{
+	struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
+
+	if (!completion_done(&exp->ex_done)) {
+		schedule_work(&exp->ex_work);
+		wait_for_completion(&exp->ex_done);
+	}
+	path_put_unpin(&exp->ex_path, &exp->ex_pin);
+	svc_export_destroy(exp);
+}
+
+static void export_close_work(struct work_struct *work)
+{
+	struct svc_export *exp = container_of(work, struct svc_export, ex_work);
+	cache_delete_entry(exp->cd, &exp->h);
+}
+
 static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
 {
 	struct svc_export *new = container_of(cnew, struct svc_export, h);
 	struct svc_export *item = container_of(citem, struct svc_export, h);
 
+	init_fs_pin(&new->ex_pin, export_pin_kill);
 	kref_get(&item->ex_client->ref);
 	new->ex_client = item->ex_client;
 	new->ex_path = item->ex_path;
-	path_get(&item->ex_path);
+	path_get_pin(&new->ex_path, &new->ex_pin);
 	new->ex_fslocs.locations = NULL;
 	new->ex_fslocs.locations_count = 0;
 	new->ex_fslocs.migrated = 0;
 	new->ex_layout_type = 0;
 	new->ex_uuid = NULL;
 	new->cd = item->cd;
+
+	cnew->get_ref = export_get_ref;
+	cnew->put_ref = export_put_ref;
 }
 
 static void export_update(struct cache_head *cnew, struct cache_head *citem)
@@ -740,10 +886,13 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
 
 static struct cache_head *svc_export_alloc(void)
 {
-	struct svc_export *i = kmalloc(sizeof(*i), GFP_KERNEL);
-	if (i)
+	struct svc_export *i = kzalloc(sizeof(*i), GFP_KERNEL);
+	if (i) {
+		INIT_WORK(&i->ex_work, export_close_work);
+		init_completion(&i->ex_done);
+		mutex_init(&i->ref_mutex);
 		return &i->h;
-	else
+	} else
 		return NULL;
 }
 
@@ -809,6 +958,7 @@ exp_find_key(struct cache_detail *cd, struct auth_domain *clp, int fsid_type,
 	if (!clp)
 		return ERR_PTR(-ENOENT);
 
+	key.cd = cd;
 	key.ek_client = clp;
 	key.ek_fsidtype = fsid_type;
 	memcpy(key.ek_fsid, fsidv, key_len(fsid_type));
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index 1f52bfc..aa4e47c 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -4,6 +4,7 @@
 #ifndef NFSD_EXPORT_H
 #define NFSD_EXPORT_H
 
+#include <linux/fs_pin.h>
 #include <linux/sunrpc/cache.h>
 #include <uapi/linux/nfsd/export.h>
 
@@ -46,6 +47,8 @@ struct exp_flavor_info {
 
 struct svc_export {
 	struct cache_head	h;
+	struct cache_detail	*cd;
+
 	struct auth_domain *	ex_client;
 	int			ex_flags;
 	struct path		ex_path;
@@ -58,7 +61,15 @@ struct svc_export {
 	struct exp_flavor_info	ex_flavors[MAX_SECINFO_LIST];
 	enum pnfs_layouttype	ex_layout_type;
 	struct nfsd4_deviceid_map *ex_devid_map;
-	struct cache_detail	*cd;
+
+	struct fs_pin		ex_pin;
+	bool			ex_mnt_ref;
+	struct rcu_head		rcu_head;
+	struct mutex		ref_mutex;
+
+	/* For cache_put and fs umounting window */
+	struct completion	ex_done;
+	struct work_struct	ex_work;
 };
 
 /* an "export key" (expkey) maps a filehandlefragement to an
@@ -67,12 +78,21 @@ struct svc_export {
  */
 struct svc_expkey {
 	struct cache_head	h;
+	struct cache_detail	*cd;
 
 	struct auth_domain *	ek_client;
 	int			ek_fsidtype;
 	u32			ek_fsid[6];
 
 	struct path		ek_path;
+	struct fs_pin		ek_pin;
+	bool			ek_mnt_ref;
+	struct rcu_head		rcu_head;
+	struct mutex		ref_mutex;
+
+	/* For cache_put and fs umounting window */
+	struct completion	ek_done;
+	struct work_struct	ek_work;
 };
 
 #define EX_ISSYNC(exp)		(!((exp)->ex_flags & NFSEXP_ASYNC))
-- 
2.4.3


* Re: [PATCH 07/10 v7] sunrpc: Switch to using list_head instead of single list
From: Christoph Hellwig @ 2015-07-11 12:54 UTC (permalink / raw)
  To: Kinglong Mee
  Cc: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, NeilBrown, Trond Myklebust

> +	struct list_head *	hash_table;

For hash lists it might be useful to use a hlist_head structure, as that
avoids bloating the actual hash table while still allowing easy removal
of an entry.

> -	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
> -		struct cache_head *tmp = *hp;
> +	list_for_each_safe(pos, tpos, head) {
> +		tmp = list_entry(pos, struct cache_head, cache_list);

Please use the _entry iteration macros that consolidate these two.
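
That is, roughly:

	struct cache_head *tmp, *next;

	list_for_each_entry_safe(tmp, next, head, cache_list) {
		if (detail->match(tmp, key)) {
			/* ... as before ... */
		}
	}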


* Re: [PATCH 07/10 v7] sunrpc: Switch to using list_head instead of single list
From: NeilBrown @ 2015-07-13  1:30 UTC (permalink / raw)
  To: Kinglong Mee
  Cc: Al Viro, J. Bruce Fields, linux-nfs, linux-fsdevel, Trond Myklebust

On Sat, 11 Jul 2015 20:50:26 +0800 Kinglong Mee <kinglongmee@gmail.com>
wrote:

> Switch cache_head in cache_detail to a list_head; this makes it
> possible to remove a cache_head entry directly from its cache_detail.
> 
> v7, same as v6.
> 
> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
> ---
>  include/linux/sunrpc/cache.h |  4 +--
>  net/sunrpc/cache.c           | 74 ++++++++++++++++++++++++--------------------
>  2 files changed, 43 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
> index 04ee5a2..ecc0ff6 100644
> --- a/include/linux/sunrpc/cache.h
> +++ b/include/linux/sunrpc/cache.h
> @@ -46,7 +46,7 @@
>   * 
>   */
>  struct cache_head {
> -	struct cache_head * next;
> +	struct list_head	cache_list;
>  	time_t		expiry_time;	/* After time time, don't use the data */
>  	time_t		last_refresh;   /* If CACHE_PENDING, this is when upcall 
>  					 * was sent, else this is when update was received
> @@ -73,7 +73,7 @@ struct cache_detail_pipefs {
>  struct cache_detail {
>  	struct module *		owner;
>  	int			hash_size;
> -	struct cache_head **	hash_table;
> +	struct list_head *	hash_table;
>  	rwlock_t		hash_lock;

Given that these lists are chains in a hash table, it would make more
sense to use hlist_head rather than list_head.  They are designed for
exactly that purpose - and are smaller.

Actually, I'd really like to see hlist_bl_head used - so there was one
bit-spin-lock per chain, and use RCU for protecting searches.
However that would have to be left for later - no point delaying this
patch set for some minor performance gains.

But as you need to change this, I think it would be best to change to
hlist_head.
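
Something like the following (an untested sketch, just to illustrate):

	struct cache_head {
		struct hlist_node	cache_list;
		/* ... other fields unchanged ... */
	};

	/* in struct cache_detail: */
	struct hlist_head *	hash_table;

	/* iteration in lookup/clean: */
	struct cache_head *tmp;
	struct hlist_node *next;

	hlist_for_each_entry_safe(tmp, next, &detail->hash_table[hash],
				  cache_list) {
		if (detail->match(tmp, key)) {
			/* ... */
		}
	}

and unhashing stays O(1) with hlist_del_init(&tmp->cache_list).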

By the way, I see lots of nice clean-ups in this series.  Thanks!

NeilBrown


>  
>  	atomic_t		inuse; /* active user-space update or lookup */
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index 673c2fa..ad2155c 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -44,7 +44,7 @@ static void cache_revisit_request(struct cache_head *item);
>  static void cache_init(struct cache_head *h)
>  {
>  	time_t now = seconds_since_boot();
> -	h->next = NULL;
> +	INIT_LIST_HEAD(&h->cache_list);
>  	h->flags = 0;
>  	kref_init(&h->ref);
>  	h->expiry_time = now + CACHE_NEW_EXPIRY;
> @@ -54,15 +54,16 @@ static void cache_init(struct cache_head *h)
>  struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
>  				       struct cache_head *key, int hash)
>  {
> -	struct cache_head **head,  **hp;
>  	struct cache_head *new = NULL, *freeme = NULL;
> +	struct cache_head *tmp;
> +	struct list_head *head, *pos, *tpos;
>  
>  	head = &detail->hash_table[hash];
>  
>  	read_lock(&detail->hash_lock);
>  
> -	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
> -		struct cache_head *tmp = *hp;
> +	list_for_each_safe(pos, tpos, head) {
> +		tmp = list_entry(pos, struct cache_head, cache_list);
>  		if (detail->match(tmp, key)) {
>  			if (cache_is_expired(detail, tmp))
>  				/* This entry is expired, we will discard it. */
> @@ -88,12 +89,11 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
>  	write_lock(&detail->hash_lock);
>  
>  	/* check if entry appeared while we slept */
> -	for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
> -		struct cache_head *tmp = *hp;
> +	list_for_each_safe(pos, tpos, head) {
> +		tmp = list_entry(pos, struct cache_head, cache_list);
>  		if (detail->match(tmp, key)) {
>  			if (cache_is_expired(detail, tmp)) {
> -				*hp = tmp->next;
> -				tmp->next = NULL;
> +				list_del_init(&tmp->cache_list);
>  				detail->entries --;
>  				freeme = tmp;
>  				break;
> @@ -104,8 +104,8 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
>  			return tmp;
>  		}
>  	}
> -	new->next = *head;
> -	*head = new;
> +
> +	list_add(&new->cache_list, head);
>  	detail->entries++;
>  	cache_get(new);
>  	write_unlock(&detail->hash_lock);
> @@ -143,7 +143,6 @@ struct cache_head *sunrpc_cache_update(struct cache_detail *detail,
>  	 * If 'old' is not VALID, we update it directly,
>  	 * otherwise we need to replace it
>  	 */
> -	struct cache_head **head;
>  	struct cache_head *tmp;
>  
>  	if (!test_bit(CACHE_VALID, &old->flags)) {
> @@ -168,15 +167,13 @@ struct cache_head *sunrpc_cache_update(struct cache_detail *detail,
>  	}
>  	cache_init(tmp);
>  	detail->init(tmp, old);
> -	head = &detail->hash_table[hash];
>  
>  	write_lock(&detail->hash_lock);
>  	if (test_bit(CACHE_NEGATIVE, &new->flags))
>  		set_bit(CACHE_NEGATIVE, &tmp->flags);
>  	else
>  		detail->update(tmp, new);
> -	tmp->next = *head;
> -	*head = tmp;
> +	list_add(&tmp->cache_list, &detail->hash_table[hash]);
>  	detail->entries++;
>  	cache_get(tmp);
>  	cache_fresh_locked(tmp, new->expiry_time);
> @@ -416,42 +413,44 @@ static int cache_clean(void)
>  	/* find a non-empty bucket in the table */
>  	while (current_detail &&
>  	       current_index < current_detail->hash_size &&
> -	       current_detail->hash_table[current_index] == NULL)
> +	       list_empty(&current_detail->hash_table[current_index]))
>  		current_index++;
>  
>  	/* find a cleanable entry in the bucket and clean it, or set to next bucket */
>  
>  	if (current_detail && current_index < current_detail->hash_size) {
> -		struct cache_head *ch, **cp;
> +		struct cache_head *ch = NULL, *putme = NULL;
> +		struct list_head *head, *pos, *tpos;
>  		struct cache_detail *d;
>  
>  		write_lock(&current_detail->hash_lock);
>  
>  		/* Ok, now to clean this strand */
>  
> -		cp = & current_detail->hash_table[current_index];
> -		for (ch = *cp ; ch ; cp = & ch->next, ch = *cp) {
> +		head = &current_detail->hash_table[current_index];
> +		list_for_each_safe(pos, tpos, head) {
> +			ch = list_entry(pos, struct cache_head, cache_list);
>  			if (current_detail->nextcheck > ch->expiry_time)
>  				current_detail->nextcheck = ch->expiry_time+1;
>  			if (!cache_is_expired(current_detail, ch))
>  				continue;
>  
> -			*cp = ch->next;
> -			ch->next = NULL;
> +			list_del_init(pos);
>  			current_detail->entries--;
> +			putme = ch;
>  			rv = 1;
>  			break;
>  		}
>  
>  		write_unlock(&current_detail->hash_lock);
>  		d = current_detail;
> -		if (!ch)
> +		if (!putme)
>  			current_index ++;
>  		spin_unlock(&cache_list_lock);
> -		if (ch) {
> -			set_bit(CACHE_CLEANED, &ch->flags);
> -			cache_fresh_unlocked(ch, d);
> -			cache_put(ch, d);
> +		if (putme) {
> +			set_bit(CACHE_CLEANED, &putme->flags);
> +			cache_fresh_unlocked(putme, d);
> +			cache_put(putme, d);
>  		}
>  	} else
>  		spin_unlock(&cache_list_lock);
> @@ -1277,6 +1276,7 @@ void *cache_seq_start(struct seq_file *m, loff_t *pos)
>  	unsigned int hash, entry;
>  	struct cache_head *ch;
>  	struct cache_detail *cd = m->private;
> +	struct list_head *ptr, *tptr;
>  
>  	read_lock(&cd->hash_lock);
>  	if (!n--)
> @@ -1284,19 +1284,22 @@ void *cache_seq_start(struct seq_file *m, loff_t *pos)
>  	hash = n >> 32;
>  	entry = n & ((1LL<<32) - 1);
>  
> -	for (ch=cd->hash_table[hash]; ch; ch=ch->next)
> +	list_for_each_safe(ptr, tptr, &cd->hash_table[hash]) {
> +		ch = list_entry(ptr, struct cache_head, cache_list);
>  		if (!entry--)
>  			return ch;
> +	}
>  	n &= ~((1LL<<32) - 1);
>  	do {
>  		hash++;
>  		n += 1LL<<32;
>  	} while(hash < cd->hash_size &&
> -		cd->hash_table[hash]==NULL);
> +		list_empty(&cd->hash_table[hash]));
>  	if (hash >= cd->hash_size)
>  		return NULL;
>  	*pos = n+1;
> -	return cd->hash_table[hash];
> +	return list_first_entry_or_null(&cd->hash_table[hash],
> +					struct cache_head, cache_list);
>  }
>  EXPORT_SYMBOL_GPL(cache_seq_start);
>  
> @@ -1308,23 +1311,24 @@ void *cache_seq_next(struct seq_file *m, void *p, loff_t *pos)
>  
>  	if (p == SEQ_START_TOKEN)
>  		hash = 0;
> -	else if (ch->next == NULL) {
> +	else if (list_is_last(&ch->cache_list, &cd->hash_table[hash])) {
>  		hash++;
>  		*pos += 1LL<<32;
>  	} else {
>  		++*pos;
> -		return ch->next;
> +		return list_next_entry(ch, cache_list);
>  	}
>  	*pos &= ~((1LL<<32) - 1);
>  	while (hash < cd->hash_size &&
> -	       cd->hash_table[hash] == NULL) {
> +	       list_empty(&cd->hash_table[hash])) {
>  		hash++;
>  		*pos += 1LL<<32;
>  	}
>  	if (hash >= cd->hash_size)
>  		return NULL;
>  	++*pos;
> -	return cd->hash_table[hash];
> +	return list_first_entry_or_null(&cd->hash_table[hash],
> +					struct cache_head, cache_list);
>  }
>  EXPORT_SYMBOL_GPL(cache_seq_next);
>  
> @@ -1666,17 +1670,21 @@ EXPORT_SYMBOL_GPL(cache_unregister_net);
>  struct cache_detail *cache_create_net(struct cache_detail *tmpl, struct net *net)
>  {
>  	struct cache_detail *cd;
> +	int i;
>  
>  	cd = kmemdup(tmpl, sizeof(struct cache_detail), GFP_KERNEL);
>  	if (cd == NULL)
>  		return ERR_PTR(-ENOMEM);
>  
> -	cd->hash_table = kzalloc(cd->hash_size * sizeof(struct cache_head *),
> +	cd->hash_table = kzalloc(cd->hash_size * sizeof(struct list_head),
>  				 GFP_KERNEL);
>  	if (cd->hash_table == NULL) {
>  		kfree(cd);
>  		return ERR_PTR(-ENOMEM);
>  	}
> +
> +	for (i = 0; i < cd->hash_size; i++)
> +		INIT_LIST_HEAD(&cd->hash_table[i]);
>  	cd->net = net;
>  	return cd;
>  }



* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
       [not found]     ` <55A111A8.2040701-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2015-07-13  3:39       ` NeilBrown
  2015-07-13  4:02         ` Al Viro
                           ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: NeilBrown @ 2015-07-13  3:39 UTC (permalink / raw)
  To: Kinglong Mee
  Cc: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Sat, 11 Jul 2015 20:52:56 +0800 Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:

> If there are some mount points (not exported for nfs) under the pseudo root,
> then after a client operates on those entries under the root, those mount
> points *cannot* be unmounted until the export cache expires.
> 
> /nfs/xfs        *(rw,insecure,no_subtree_check,no_root_squash)
> /nfs/pnfs       *(rw,insecure,no_subtree_check,no_root_squash)
> total 0
> drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
> drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
> drwxr-xr-x. 2 root root  6 Apr 20 22:01 xfs
> Filesystem                      1K-blocks    Used Available Use% Mounted on
> ......
> /dev/sdd                          1038336   32944   1005392   4% /nfs/pnfs
> /dev/sdc                         10475520   32928  10442592   1% /nfs/xfs
> /dev/sde                           999320    1284    929224   1% /nfs/test
> /mnt/pnfs/:
> total 0
> -rw-r--r--. 1 root root 0 Apr 21 22:23 attr
> drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp
> 
> /mnt/xfs/:
> total 0
> umount: /nfs/test/: target is busy
>         (In some cases useful info about processes that
>         use the device is found by lsof(8) or fuser(1).)
> 
> It's caused by nfsd's export cache holding a reference to
> the path (here /nfs/test/), so it can't be unmounted.
> 
> I don't think that's what users expect; they want to umount /nfs/test/.
> Bruce thinks users can also umount /nfs/pnfs/ and /nfs/xfs.
> 
> Also, use kzalloc for all memory allocation instead of kmalloc.
> Thanks to Al Viro for his comments on the logic of fs_pin.
> 
> v3,
> 1. using path_get_pin/path_put_unpin for path pin
> 2. using kzalloc for memory allocating
> 
> v4,
> 1. add a completion for pin_kill waiting the reference is decreased to zero.
> 2. add a work_struct for pin_kill decreases the reference indirectly.
> 3. free svc_export/svc_expkey in pin_kill, not svc_export_put/svc_expkey_put.
> 4. svc_export_put/svc_expkey_put go though pin_kill logic.
> 
> v5, same as v4.
> 
> v6,
> 1. Pin vfsmnt to the mount point at first; when the reference increases (==2),
>    grab a reference to the vfsmnt by mntget. When it decreases (==1),
>    drop the reference to the vfsmnt, leaving the pin.
> 2. Delete cache_head directly from cache_detail.
> 
> v7, 
> implement self reference increase and decrease for nfsd exports/expkey 
> 
> When the reference count of a cache_head increases (>1), grab a reference
> of the mnt once, and when it decreases to 1 (==1), drop the reference of the mnt.
> 
> So after that,
> when ref > 1, the user cannot umount the filesystem (-EBUSY);
> when ref == 1, the cache entry is referenced only by the nfsd cache,
> with no other reference. So the user can try umount:
> 1. before MNT_UMOUNT is set (protected by mount_lock), if the nfsd cache is
>    referenced (ref > 1, legitimize_mntget), umount fails with -EBUSY.
> 2. after MNT_UMOUNT is set, if the nfsd cache is referenced (ref == 2),
>    legitimize_mntget will fail; the cache is set to CACHE_NEGATIVE,
>    and the reference is dropped, falling back to 1.
>    So pin_kill can delete the cache and the umount succeeds.
> 3. when umounting with no reference to the nfsd cache,
>    pin_kill can delete the cache and the umount succeeds.
> 
> Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Wow.... this is turning out to be a lot more complex than I imagined at
first (isn't that always the way!).

There is a lot of good stuff here, but I think we can probably make it
simpler and so even better.

I particularly don't like the get_ref/put_ref pointers in cache_head.
They make cache_head a lot bigger than it was before, and they are only
needed for two specific caches.  And then they are the same for every element
in the cache.

I also don't like the ref_mutex ... or I don't like where it is used...
or something.  I definitely don't think we need one per cached entry.
Maybe one per cache.

I can certainly see that the "first" time we get a reference to a cache
item that holds a vfsmnt pointer, we need to "legitimize" that - or
fail.  But I don't think that has to happen inside the cache.c
machinery.

How about this:
 - add a new cache flag "CACHE_ACTIVE" (for example) which the cache
   owner can set whenever it likes.  When cache_put finds that CACHE_ACTIVE
   is set when refcount is <= 2, it calls a new cache_detail method: cache_deactivate.
 - cache_deactivate takes a mutex (yes, we do need one, don't we)
   and if CACHE_ACTIVE is still set and refcount is still <= 2,
   it drops the reference on the vfsmnt and clears CACHE_ACTIVE.
   This actually needs to be something like:
    if (test_and_clear_bit(CACHE_ACTIVE,...)) {
        if (atomic_read(..refcnt) > 2) {
             set_bit(CACHE_ACTIVE);
             mutex_unlock()
             return

   so that if other code gets a reference and tests CACHE_ACTIVE, it
   won't suddenly become inactive.  Might need a memory barrier in there...
   no, test_and_clear implies a memory barrier.

We only need to make changes to svc_export and svc_expkey - right?
So that would be:
 Change svc_export_lookup and svc_expkey_lookup so they look something
 like:

  svc_XX_lookup(struct cache_detail *cd, struct svc_XXX *item)
  {
      struct cache_head *ch;
      int hash = svc_XXX_hash(item);

      ch = sunrpc_cache_lookup(cd, &item->h, hash);
      if (!ch)
           return NULL;
      item = container_of(ch, struct svc_XXX, h);
      if (!test_bit(CACHE_VALID, &ch->flags) ||
          test_bit(CACHE_NEGATIVE, &ch->flags) ||
          test_bit(CACHE_ACTIVE, &ch->flags))
            return item;

      mutex_lock(&svc_XXX_mutex);
      if (!test_bit(CACHE_ACTIVE, &ch->flags)) {
              if (legitimize_mntget(...) == NULL) {
                      XXX_put(item);
                      item = NULL;
              } else
                      set_bit(CACHE_ACTIVE, &ch->flags);
      }
      mutex_unlock(&svc_XXX_mutex);
      return item;
 }

Then the new 'cache_deactivate' function is something like:

  svc_XXX_deactivate(struct cache_detail *cd, struct cache_head *ch)
  {
       struct svc_XXX *item = container_of(ch, struct svc_XXX, h);

       mutex_lock(&svc_XXX_mutex);
       if (test_and_clear_bit(CACHE_ACTIVE, &ch->flags)) {
              if (atomic_read(&ch->ref.refcount) > 2)
                   /* Race with get_ref - do nothing */
                   set_bit(CACHE_ACTIVE, &ch->flags);
              else
                   mntput(....mnt);
       }
       mutex_unlock(&svc_XXX_mutex);
  }


cache_put would have:

    if (test_bit(CACHE_ACTIVE, &h->flags) &&
        cd->cache_deactivate &&
        atomic_read(&h->ref.refcount) <= 2)
           cd->cache_deactivate(cd, h);

but there is still a race.  If: (T1 and T2 are threads)
   T1: cache_put finds refcount is 2 and CACHE_ACTIVE is set and calls ->cache_deactiveate
   T2: cache_get increments the refcount to 3
   T1: cache_deactivate clears CACHE_ACTIVE and find refcount is 3
   T2: now calls cache_put, which sees CACHE_ACTIVE is clear so refcount becomes 2
   T1: sets CACHE_ACTIVE again and continues.  refcount becomes 1.

So now refcount is 1 and the item is still active.

We can fix this by making cache_put loop:
    while (test_bit(CACHE_ACTIVE, &h->flags) &&
          cd->cache_deactivate &&
          (smp_rmb(), 1) &&
          atomic_read(&h->ref.refcount) <= 2)
           cd->cache_deactivate(cd, h);

This should ensure that refcount never gets to 1 with the
item still active (i.e. with a ref count on the mnt).


The work item and completion are a bit unfortunate too.

I guess the problem here is that pin_kill() can run while there are
some inactive references to the cache item.  There can then be a race
over who will use path_put_unpin to put the dentry.

Could we fix that by having expXXX_pin_kill() use kref_get_unless_zero()
on the cache item.
If that succeeds, then path_put_unpin hasn't been called and it won't be.
So expXXX_pin_kill can call it and then set CACHE_NEGATIVE.
If it fails, then it has already been called and nothing else need be done.
Almost.
If kref_get_unless_zero() fails, pin_remove() may not have been called
yet, but it will be soon.  We might need to wait.
It would be nice if pin_kill() would check ->done again after calling p->kill.
e.g.

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index 611b5408f6ec..c2ef5c9d4c0d 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
 		spin_unlock_irq(&p->wait.lock);
 		rcu_read_unlock();
 		p->kill(p);
-		return;
+		if (p->done > 0)
+			return;
+		spin_lock_irq(&p->wait.lock);
 	}
 	if (p->done > 0) {
 		spin_unlock_irq(&p->wait.lock);

I think that would close the last gap, without needing extra work
items and completion in the nfsd code.

Al: would you be OK with that change to pin_kill?

Thanks,
NeilBrown


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  3:39       ` NeilBrown
@ 2015-07-13  4:02         ` Al Viro
       [not found]           ` <20150713040258.GM17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-07-13  4:20         ` NeilBrown
  2015-07-15 21:07         ` J. Bruce Fields
  2 siblings, 1 reply; 40+ messages in thread
From: Al Viro @ 2015-07-13  4:02 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs, linux-fsdevel, Trond Myklebust

On Mon, Jul 13, 2015 at 01:39:34PM +1000, NeilBrown wrote:
> It would be nice if pin_kill() would check ->done again after calling p->kill.
> e.g.
> 
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b5408f6ec..c2ef5c9d4c0d 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
>  		spin_unlock_irq(&p->wait.lock);
>  		rcu_read_unlock();
>  		p->kill(p);
> -		return;
> +		if (p->done > 0)
> +			return;
> +		spin_lock_irq(&p->wait.lock);
>  	}
>  	if (p->done > 0) {
>  		spin_unlock_irq(&p->wait.lock);
> 
> I think that would close the last gap, without needing extra work
> items and completion in the nfsd code.
> 
> Al: would you be OK with that change to pin_kill?

Hell, no.  Intended use is to have ->kill() free the damn thing, period.
This code is very careful about how it waits in the "found it before
->kill() has removed all pointers leading to that object" case.  No go.
This change would break all existing users, with arseloads of extra
complications for no good reason whatsoever.

And frankly, I'm still not convinced that fs_pin is a good match for the
problem here - I'm not saying it's impossible to produce an fs_pin-based
solution (and I hadn't reviewed the latest iteration yet), but attempts so
far didn't look particularly promising.


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  3:39       ` NeilBrown
  2015-07-13  4:02         ` Al Viro
@ 2015-07-13  4:20         ` NeilBrown
  2015-07-13  4:45           ` Al Viro
  2015-07-15 21:07         ` J. Bruce Fields
  2 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-13  4:20 UTC (permalink / raw)
  To: Kinglong Mee
  Cc: Al Viro, J. Bruce Fields, linux-nfs, linux-fsdevel, Trond Myklebust

On Mon, 13 Jul 2015 13:39:34 +1000 NeilBrown <neilb@suse.com> wrote:


> The work item and completion are a bit unfortunate too.
> 
> I guess the problem here is that pin_kill() can run while there are
> some inactive references to the cache item.  There can then be a race
> over who will use path_put_unpin to put the dentry.
> 
> Could we fix that by having expXXX_pin_kill() use kref_get_unless_zero()
> on the cache item.
> If that succeeds, then path_put_unpin hasn't been called and it won't be.
> So expXXX_pin_kill can call it and then set CACHE_NEGATIVE.
> If it fails, then it has already been called and nothing else need be done.
> Almost.
> If kref_get_unless_zero() fails, pin_remove() may not have been called
> yet, but it will be soon.  We might need to wait.
> It would be nice if pin_kill() would check ->done again after calling p->kill.
> e.g.
> 
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b5408f6ec..c2ef5c9d4c0d 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
>  		spin_unlock_irq(&p->wait.lock);
>  		rcu_read_unlock();
>  		p->kill(p);
> -		return;
> +		if (p->done > 0)
> +			return;
> +		spin_lock_irq(&p->wait.lock);
>  	}
>  	if (p->done > 0) {
>  		spin_unlock_irq(&p->wait.lock);
> 
> I think that would close the last gap, without needing extra work
> items and completion in the nfsd code.
> 
> Al: would you be OK with that change to pin_kill?

Actually, with that change to pin_kill, this side of things becomes
really easy.
All expXXX_pin_kill needs to do is call your new cache_delete_entry.
If that doesn't cause the entry to be put, then something else has a
temporary reference which will be put soon.  In any case, pin_kill()
will wait long enough, but not indefinitely.
No need for kref_get_unless_zero() or any of that.

NeilBrown


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  4:20         ` NeilBrown
@ 2015-07-13  4:45           ` Al Viro
       [not found]             ` <20150713044553.GN17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2015-07-24  2:05             ` NeilBrown
  0 siblings, 2 replies; 40+ messages in thread
From: Al Viro @ 2015-07-13  4:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:

> Actually, with that change to pin_kill, this side of things becomes
> really easy.
> All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> If that doesn't cause the entry to be put, then something else has a
> temporary reference which will be put soon.  In any case, pin_kill()
> will wait long enough, but not indefinitely.
> No need for kref_get_unless_zero() or any of that.

No.  You are seriously misunderstanding what ->kill() is for and what the
existing instances are doing.  Again, there is no promise whatsoever that
the object containing fs_pin instance will *survive* past ->kill().
At all.

RTFS, please.  What is sorely missing in this recurring patchset is a clear
description of lifetime rules and ordering (who waits for whom and how long).
For all the objects involved.


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
       [not found]           ` <20150713040258.GM17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-07-13  5:19             ` NeilBrown
  2015-07-13  6:02               ` Al Viro
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-13  5:19 UTC (permalink / raw)
  To: Al Viro
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, 13 Jul 2015 05:02:58 +0100 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
wrote:

> On Mon, Jul 13, 2015 at 01:39:34PM +1000, NeilBrown wrote:
> > It would be nice if pin_kill() would check ->done again after calling p->kill.
> > e.g.
> > 
> > diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> > index 611b5408f6ec..c2ef5c9d4c0d 100644
> > --- a/fs/fs_pin.c
> > +++ b/fs/fs_pin.c
> > @@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
> >  		spin_unlock_irq(&p->wait.lock);
> >  		rcu_read_unlock();
> >  		p->kill(p);
> > -		return;
> > +		if (p->done > 0)
> > +			return;
> > +		spin_lock_irq(&p->wait.lock);
> >  	}
> >  	if (p->done > 0) {
> >  		spin_unlock_irq(&p->wait.lock);
> > 
> > I think that would close the last gap, without needing extra work
> > items and completion in the nfsd code.
> > 
> > Al: would you be OK with that change to pin_kill?
> 
> Hell, no.  Intended use is to have ->kill() free the damn thing, period.

Is it not possible to revise that intention?
The apparent purpose of pin_kill() is to wait until the thing is freed,
or to trigger that freeing itself.  Why not do both: trigger then wait?


> This code is very careful about how it waits in the "found it before
> ->kill() has removed all pointers leading to that object" case.  No go.
> This change would break all existing users, with arseloads of extra
> complications for no good reason whatsoever.

Given that all current uses have ->kill() call pin_remove, and as
pin_remove sets ->done to 1, and as the patch makes no change to
behaviour when ->kill() completes with ->done set to 1, I don't see how
it can break anything.
'rcu' ensures that it is still safe to examine p->done, and it will be
'1'.

Thanks,
NeilBrown


> 
> And frankly, I'm still not convinced that fs_pin is a good match for the
> problem here - I'm not saying it's impossible to produce an fs_pin-based
> solution (and I hadn't reviewed the latest iteration yet), but attempts so
> far didn't look particularly promising.



* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
       [not found]             ` <20150713044553.GN17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-07-13  5:21               ` NeilBrown
  2015-07-13  6:02                 ` NeilBrown
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-13  5:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
wrote:

> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
> 
> > Actually, with that change to pin_kill, this side of things becomes
> > really easy.
> > All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> > If that doesn't cause the entry to be put, then something else has a
> > temporary reference which will be put soon.  In any case, pin_kill()
> > will wait long enough, but not indefinitely.
> > No need for kref_get_unless_zero() or any of that.
> 
> No.  You are seriously misunderstanding what ->kill() is for and what the
> existing instances are doing.  Again, there is no promise whatsoever that
> the object containing fs_pin instance will *survive* past ->kill().
> At all.

Ah... I missed that rcu_read_unlock happened before ->kill.  Sorry
about that.

It still seems like the waiting that pin_kill does is exactly what we
need.

I'll think about it some more.

Thanks,
NeilBrown


> 
> RTFS, please.  What is sorely missing in this recurring patchset is a clear
> description of lifetime rules and ordering (who waits for whom and how long).
> For all the objects involved.



* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  5:19             ` NeilBrown
@ 2015-07-13  6:02               ` Al Viro
  0 siblings, 0 replies; 40+ messages in thread
From: Al Viro @ 2015-07-13  6:02 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, Jul 13, 2015 at 03:19:10PM +1000, NeilBrown wrote:
> > Hell, no.  Intended use is to have ->kill() free the damn thing, period.
> 
> It is not possible to revise that intention?
> The apparent purpose of pin_kill() is to wait until the thing is freed,
> or to trigger that freeing itself.  Why not do both: trigger then wait?

Huh?  The first to come calls ->kill(); anybody coming between that and
pin_remove() done by ->kill() waits until pin_remove() is called and
buggers off.
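
(Condensed for reference, the pin_kill() flow being described looked
roughly like this in fs/fs_pin.c at the time; an illustration only, not
the verbatim source:)

	void pin_kill(struct fs_pin *p)
	{
		wait_queue_t wait;

		/* called with rcu_read_lock() held */
		init_wait(&wait);
		spin_lock_irq(&p->wait.lock);
		if (likely(!p->done)) {
			p->done = -1;		/* first caller takes over */
			spin_unlock_irq(&p->wait.lock);
			rcu_read_unlock();
			p->kill(p);		/* must pin_remove() and free p */
			return;
		}
		if (p->done > 0) {		/* pin_remove() already ran */
			spin_unlock_irq(&p->wait.lock);
			rcu_read_unlock();
			return;
		}
		/* someone else is in ->kill(): sleep until pin_remove() */
		__add_wait_queue(&p->wait, &wait);
		/* ... */
	}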

> Given that all current uses have ->kill() call pin_remove, and as
> pin_remove sets ->done to 1, and as the patch makes no change to
> behaviour when ->kill() completes with ->done set to 1, I don't see how
> it can break anything.
> 'rcu' ensures that it is still save to examine p->done, and it will be
> '1'.

RCU ensures no such thing.  Where's the rcu_read_lock() opening the
RCU-critical area you need?


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  5:21               ` NeilBrown
@ 2015-07-13  6:02                 ` NeilBrown
  2015-07-13  6:08                   ` Al Viro
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-13  6:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, 13 Jul 2015 15:21:33 +1000 NeilBrown <neilb-IBi9RG/b67k@public.gmane.org> wrote:

> On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
> wrote:
> 
> > On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
> > 
> > > Actually, with that change to pin_kill, this side of things becomes
> > > really easy.
> > > All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> > > If that doesn't cause the entry to be put, then something else has a
> > > temporary reference which will be put soon.  In any case, pin_kill()
> > > will wait long enough, but not indefinitely.
> > > No need for kref_get_unless_zero() or any of that.
> > 
> > No.  You are seriously misunderstanding what ->kill() is for and what the
> > existing instances are doing.  Again, there is no promise whatsoever that
> > the object containing fs_pin instance will *survive* past ->kill().
> > At all.
> 
> Ah... I missed that rcu_read_unlock happened before ->kill.  Sorry
> about that.
> 
> It still seems like the waiting that pin_kill does is exactly what we
> need.
> 
> I'll think about it some more.
> 

Ok....

A key issue is that the code currently assumes that the only way a
'pin' is removed is by the pinned thing calling pin_kill().

The problem is that we want the pinning thing to be able to remove
itself.

I think that means we need a variant of pin_remove() which reports if
pin->done was 0 or -1.
If it was 0, then ->kill hasn't been called, and it won't be.  So the
caller is free to clean up how it likes (providing RCU is used for
freeing).
If it was -1, then ->kill has been called and is expected to clean up -
the caller should assume that has already happened.


So path_put_unpin() needs to call pin_remove_and_test() (or whatever it
is called) and return the value.
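
Such a variant might look like this (a hypothetical sketch;
pin_remove_and_test() is only a proposed name, and the unhash and wakeup
steps mirror what pin_remove() already does):

	/* returns the old ->done: 0 means ->kill() had not run and now
	 * never will for this pin; -1 means ->kill() is in flight */
	static int pin_remove_and_test(struct fs_pin *pin)
	{
		int old;

		spin_lock(&pin_lock);
		hlist_del_init(&pin->m_list);
		hlist_del_init(&pin->s_list);
		spin_unlock(&pin_lock);

		spin_lock_irq(&pin->wait.lock);
		old = pin->done;
		pin->done = 1;
		wake_up_locked(&pin->wait);
		spin_unlock_irq(&pin->wait.lock);
		return old;
	}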

Then expXXX_put() calls path_put_unpin and only calls kfree_rcu() if
->done was previously 0.

expXXX_pin_kill() calls cache_delete_entry, and then calls pin_kill()
This recursive call to pin_kill() will wait until expXXX_put() has
called pin_remove_and_test() and then returns.
At this point there are no references to the cache entry except the one
that expXXX_pin_kill() holds.  So it can call kfree_rcu().


Would that work?
Are you happy with pin_remove() returning a status?

Thanks,
NeilBrown


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  6:02                 ` NeilBrown
@ 2015-07-13  6:08                   ` Al Viro
       [not found]                     ` <20150713060802.GP17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Al Viro @ 2015-07-13  6:08 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, Jul 13, 2015 at 04:02:43PM +1000, NeilBrown wrote:

> I think that means we need a variant of pin_remove() which reports if
> pin->done was 0 or -1.
> If it was 0, then ->kill hasn't been called, and it won't be.  So the
> caller is free to clean up how it likes (providing RCU is used for
> freeing).

Grr...  What will umount(2) wait for if it comes during your cleanup?
You are putting them in the wrong order - pin_remove() is "I'm *DONE*
killing that sucker, nothing to wait for", not "move along, don't wait
for me, I've taken over killing it off".


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
       [not found]                     ` <20150713060802.GP17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-07-13  6:32                       ` NeilBrown
  2015-07-13  6:43                         ` Al Viro
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-13  6:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, 13 Jul 2015 07:08:02 +0100 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
wrote:

> On Mon, Jul 13, 2015 at 04:02:43PM +1000, NeilBrown wrote:
> 
> > I think that means we need a variant of pin_remove() which reports if
> > pin->done was 0 or -1.
> > If it was 0, then ->kill hasn't been called, and it won't be.  So the
> > caller is free to clean up how it likes (providing RCU is used for
> > freeing).
> 
> Grr...  What will umount(2) wait for if it comes during your cleanup?
> You are putting them in the wrong order - pin_remove() is "I'm *DONE*
> killing that sucker, nothing to wait for", not "move along, don't wait
> for me, I've taken over killing it off".

pin_remove() disconnects the pinning thing (sunrpc cache entry in this
case) from the pinned thing (vfsmnt in this case).
After it has run the pinned thing can do whatever it likes without any
reference to the pinning thing, and the pinning thing just needs to wait
an RCU grace period, and then can do whatever it likes.

The "cleanup" is, in this case, just a call to rcu_kfree().  There is
no need for umount(2) to wait for it.


Certainly any state that the pinning structure has that relates to the
pinned structure must be cleaned up before calling pin_remove, so for
example dput() must be called on path.dentry *before* pin_remove is
called on path.mnt.  But other parts of the pinning structure can be
handled as its owner chooses.
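
Concretely, the put side would presumably look like this (sketch, using
the field names from the series):

	dput(key->ek_path.dentry);	/* drop everything that refers to
					 * the pinned mount first */
	pin_remove(&key->ek_pin);	/* now umount stops waiting */
	kfree_rcu(key, rcu_head);	/* pinning side frees after RCU */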

NeilBrown


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  6:32                       ` NeilBrown
@ 2015-07-13  6:43                         ` Al Viro
  2015-07-15  3:49                           ` NeilBrown
  0 siblings, 1 reply; 40+ messages in thread
From: Al Viro @ 2015-07-13  6:43 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, Jul 13, 2015 at 04:32:01PM +1000, NeilBrown wrote:
> pin_remove() disconnects the pinning thing (sunrpc cache entry in this
> case) from the pinned thing (vfsmnt in this case).
> After it has run the pinned thing can do whatever it likes without any
> reference to the pinning thing, and the pinning thing just needs to wait
> an RCU grace period, and then can do whatever it likes.
> 
> The "cleanup" is, in this case, just a call to rcu_kfree().  There is
> no need for umount(2) to wait for it.
> 
> 
> Certainly any state that the pinning structure has that relates to the
> pinned structure must be cleaned up before calling pin_remove, so for
> example dput() must be called on path.dentry *before* pin_remove is
> called on path.mnt.  But other parts of the pinning structure can be
> handled as its owner chooses.

Then what's the difference between that and what's getting done in ->kill()
triggered by cleanup_mnt()?

In any case, you have two possibilities - explicit unexport triggering that
dput(), etc. and umount(2) triggering the same.  Whoever comes second gets
to wait until it's done.  So why not make the point where we commit to
unexporting the sucker the place where we do pin_kill()?  And have ->kill()
of that thing prevent appearance of new entries, then wait for them to run
down.  Which is precisely the same thing you want to happen on umount...


* Re: [PATCH 07/10 v7] sunrpc: Switch to using list_head instead single list
  2015-07-13  1:30     ` NeilBrown
@ 2015-07-13  8:27       ` Kinglong Mee
  0 siblings, 0 replies; 40+ messages in thread
From: Kinglong Mee @ 2015-07-13  8:27 UTC (permalink / raw)
  To: NeilBrown
  Cc: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust,
	kinglongmee-Re5JQEeQqe8AvxtiuMwx3w


On 7/13/2015 09:30, NeilBrown wrote:
> On Sat, 11 Jul 2015 20:50:26 +0800 Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> wrote:
> 
>> Switch to using list_head for cache_head in cache_detail;
>> this makes it possible to remove a cache_head entry directly from cache_detail.
>>
>> v7, same as v6.
>>
>> Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> ---
>>  include/linux/sunrpc/cache.h |  4 +--
>>  net/sunrpc/cache.c           | 74 ++++++++++++++++++++++++--------------------
>>  2 files changed, 43 insertions(+), 35 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
>> index 04ee5a2..ecc0ff6 100644
>> --- a/include/linux/sunrpc/cache.h
>> +++ b/include/linux/sunrpc/cache.h
>> @@ -46,7 +46,7 @@
>>   * 
>>   */
>>  struct cache_head {
>> -	struct cache_head * next;
>> +	struct list_head	cache_list;
>>  	time_t		expiry_time;	/* After time time, don't use the data */
>>  	time_t		last_refresh;   /* If CACHE_PENDING, this is when upcall 
>>  					 * was sent, else this is when update was received
>> @@ -73,7 +73,7 @@ struct cache_detail_pipefs {
>>  struct cache_detail {
>>  	struct module *		owner;
>>  	int			hash_size;
>> -	struct cache_head **	hash_table;
>> +	struct list_head *	hash_table;
>>  	rwlock_t		hash_lock;
> 
> Given that these lists are chains in a hash table, it would make more
> sense to use hlist_head rather than list_head.  They are designed for
> exactly that purpose - and are smaller.
> 
> Actually, I'd really like to see hlist_bl_head used - so there was one
> bit-spin-lock per chain, and use RCU for protecting searches.
> However that would have to be left for later - no point delaying this
> patch set for some minor performance gains.
> 
> But as you need to change this, I think it would be best to change to
> hlist_head.

Got it.
I will update and test it.

thanks,
Kinglong Mee

> 
> By the way, I see lots of nice clean-ups in this series.  Thanks!
> 
> NeilBrown


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  6:43                         ` Al Viro
@ 2015-07-15  3:49                           ` NeilBrown
  2015-07-15  4:57                             ` Al Viro
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-15  3:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs, linux-fsdevel, Trond Myklebust

On Mon, 13 Jul 2015 07:43:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Mon, Jul 13, 2015 at 04:32:01PM +1000, NeilBrown wrote:
> > pin_remove() disconnects the pinning thing (sunrpc cache entry in this
> > case) from the pinned thing (vfsmnt in this case).
> > After it has run the pinned thing can do whatever it likes without any
> > reference to the pinning thing, and the pinning thing just needs to wait
> > an RCU grace period, and then can do whatever it likes.
> > 
> > The "cleanup" is, in this case, just a call to rcu_kfree().  There is
> > no need for umount(2) to wait for it.
> > 
> > 
> > Certainly any state that the pinning structure has that relates to the
> > pinned structure must be cleaned up before calling pin_remove, so for
> > example dput() must be called on path.dentry *before* pin_remove is
> > called on path.mnt.  But other parts of the pinning structure can be
> > handled as its owner chooses.
> 
> Then what's the difference between that and what's getting done in ->kill()
> triggered by cleanup_mnt()?

Uhm... probably nothing.  I'm not sure what you are getting at.
I just need to do it at a different time to cleanup_mnt(), but also to
be aware that doing it might race with cleanup_mnt().


> 
> In any case, you have two possibilities - explicit unexport triggering that
> dput(), etc. and umount(2) triggering the same.  Whoever comes second gets
> to wait until it's done.  So why not make the point where we commit to
> unexporting the sucker the place where we do pin_kill()?  And have ->kill()
> of that thing prevent appearance of new entries, then wait for them to run
> down.  Which is precisely the same thing you want to happen on umount...

The "wait for them to run down" part is the sticking point.  We don't
have any easy way to wait for there to be no more references, so I'd
really like to use the waiting that pin_kill() already does.

I want the ->kill function to just unhash the cache entry, and then
wait for pin_delete() to be called.

The final 'put' on the cache entry calls dput on the dentry and then
pin_remove().

The ->kill function can wait for that to happen by calling pin_kill().

I guess there is no real need for a return value from pin_remove().

So
  static void expkey_pin_kill(struct fs_pin *pin)
  {
     struct svc_expkey *key = container_of(pin, ....);

     cache_delete_entry(key->cd, &key->h);
     pin_kill(&key->ek_pin); /* recursive call will wait for
                              * pin_delete() to be called */
  }

and static void expkey_put(struct kref *ref)
    {
        struct svc_expkey *key = container_of(ref, ....); 

        auth_domain_put(key->ek_client);
        if (test_bit(CACHE_VALID, &key->h.flags) &&
            !test_bit(CACHE_NEGATIVE, &key->h.flags))
               path_put_unpin(&key->ek_path, &key->ek_pin);
        kfree_rcu(key, rcu_head);
    }

We ensure that no new references are taken by svc_expkey_lookup()
calling legitimize_mntget() and returning NULL if that fails.
It should probably call cache_delete_entry() when that happens just to
be on the safe side.
cache_delete_entry() must check if the object is still in the hash
table before deleting it.
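
A rough sketch of that lookup-side check (assuming the series'
legitimize_mntget() helper and the field names used elsewhere in this
thread; cache_delete_entry() is the new function under discussion):

	static struct svc_expkey *
	svc_expkey_lookup(struct cache_detail *cd, struct svc_expkey *item)
	{
		struct cache_head *ch;
		int hash = svc_expkey_hash(item);

		ch = sunrpc_cache_lookup(cd, &item->h, hash);
		if (!ch)
			return NULL;
		item = container_of(ch, struct svc_expkey, h);
		if (test_bit(CACHE_VALID, &ch->flags) &&
		    !test_bit(CACHE_NEGATIVE, &ch->flags) &&
		    !legitimize_mntget(item->ek_path.mnt)) {
			/* mount is on its way out: make sure the
			 * entry dies too, then fail the lookup */
			cache_delete_entry(cd, ch);
			cache_put(ch, cd);
			return NULL;
		}
		return item;
	}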

So I think it can work nicely without any changes to the fs_pin code.

Can you see any further problems?

Thanks,
NeilBrown



* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-15  3:49                           ` NeilBrown
@ 2015-07-15  4:57                             ` Al Viro
  2015-07-15  6:51                               ` NeilBrown
  0 siblings, 1 reply; 40+ messages in thread
From: Al Viro @ 2015-07-15  4:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Wed, Jul 15, 2015 at 01:49:48PM +1000, NeilBrown wrote:

> > In any case, you have two possibilities - explicit unexport triggering that
> > dput(), etc. and umount(2) triggering the same.  Whoever comes second gets
> > to wait until it's done.  So why not make the point where we commit to
> > unexporting the sucker the place where we do pin_kill()?  And have ->kill()
> > of that thing prevent appearance of new entries, then wait for them to run
> > down.  Which is precisely the same thing you want to happen on umount...
> 
> The "wait for them to run down" part is the sticking point.  We don't
> have any easy way to wait for there to be no more references, so I'd
> really like to use the waiting that pin_kill() already does.

> I want the ->kill function to just unhash the cache entry, and then
> wait for pin_delete() to be called.

pin_remove, presumably?

> The final 'put' on the cache entry calls dput on the dentry and then
> pin_remove().
> 
> The ->kill function can wait for that to happen by calling pin_kill().

> I guess there is no real need for a return value from pin_remove().
> 
> So
>   static void expkey_pin_kill(struct fs_pin *pin)
>   {
>      struct svc_expkey *key = container_of(pin, ....);
> 
>      cache_delete_entry(key->cd, &key->h);


... and another CPU drops the last reference, freeing the sucker.

>      pin_kill(&key->ek_pin); /* recursive call will wait for
>                               * pin_delete() to be called */

... oopsing on the spot.


* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-15  4:57                             ` Al Viro
@ 2015-07-15  6:51                               ` NeilBrown
  0 siblings, 0 replies; 40+ messages in thread
From: NeilBrown @ 2015-07-15  6:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs, linux-fsdevel, Trond Myklebust

On Wed, 15 Jul 2015 05:57:56 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Wed, Jul 15, 2015 at 01:49:48PM +1000, NeilBrown wrote:
> 
> > > In any case, you have two possibilities - explicit unexport triggering that
> > > dput(), etc. and umount(2) triggering the same.  Whoever comes second gets
> > > to wait until it's done.  So why not make the point where we commit to
> > > unexporting the sucker the place where we do pin_kill()?  And have ->kill()
> > > of that thing prevent appearance of new entries, then wait for them to run
> > > down.  Which is precisely the same thing you want to happen on umount...
> > 
> > The "wait for them to run down" part is the sticking point.  We don't
> > have any easy way to wait for there to be no more references, so I'd
> > really like to use the waiting that pin_kill() already does.
> 
> > I want the ->kill function to just unhash the cache entry, and then
> > wait for pin_delete() to be called.
> 
> pin_remove, presumably?

Presumably, yes.

> 
> > The final 'put' on the cache entry calls dput on the dentry and then
> > pin_remove().
> > 
> > The ->kill function can wait for that to happen by calling pin_kill().
> 
> > I guess there is no real need for a return value from pin_remove().
> > 
> > So
> >   static void expkey_pin_kill(struct fs_pin *pin)
> >   {
> >      struct svc_expkey *key = container_of(pin, ....);
> > 
> >      cache_delete_entry(key->cd, &key->h);
> 
> 
> ... and another CPU drops the last reference, freeing the sucker.

*That's* why I wanted pin_remove to return something.

Why didn't you like that option again?

To recap:
pin_remove does:

  tmp = pin->done;
  pin->done = 1;
  ...
  return tmp;

path_put_unpin() returns the return value of pin_remove()

expkey_put() does:
   auth_domain_put(key->ek_client);
   if (test some bits)
       if (path_put_unpin() < 0)
          /* pin_kill has taken over */
          return;
   kfree_rcu(key, rcu_head);
}

and expkey_pin_kill() does:

   cache_delete_entry(key->cd, &key->h);
   pin_kill(&key->ek_pin);
   kfree_rcu(key, rcu_head);
}


Thanks,
NeilBrown



* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  3:39       ` NeilBrown
  2015-07-13  4:02         ` Al Viro
  2015-07-13  4:20         ` NeilBrown
@ 2015-07-15 21:07         ` J. Bruce Fields
       [not found]           ` <20150715210756.GE21669-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
  2 siblings, 1 reply; 40+ messages in thread
From: J. Bruce Fields @ 2015-07-15 21:07 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, Al Viro, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, Jul 13, 2015 at 01:39:34PM +1000, NeilBrown wrote:
> On Sat, 11 Jul 2015 20:52:56 +0800 Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> wrote:
> 
> > If there are some mount points (not exported for nfs) under the pseudo root,
> > then after a client operates on those entries under the root, those mount
> > points *cannot* be unmounted until the export cache expires.
> > 
> > /nfs/xfs        *(rw,insecure,no_subtree_check,no_root_squash)
> > /nfs/pnfs       *(rw,insecure,no_subtree_check,no_root_squash)
> > total 0
> > drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
> > drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
> > drwxr-xr-x. 2 root root  6 Apr 20 22:01 xfs
> > Filesystem                      1K-blocks    Used Available Use% Mounted on
> > ......
> > /dev/sdd                          1038336   32944   1005392   4% /nfs/pnfs
> > /dev/sdc                         10475520   32928  10442592   1% /nfs/xfs
> > /dev/sde                           999320    1284    929224   1% /nfs/test
> > /mnt/pnfs/:
> > total 0
> > -rw-r--r--. 1 root root 0 Apr 21 22:23 attr
> > drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp
> > 
> > /mnt/xfs/:
> > total 0
> > umount: /nfs/test/: target is busy
> >         (In some cases useful info about processes that
> >         use the device is found by lsof(8) or fuser(1).)
> > 
> > It's caused by nfsd's export cache holding a reference to
> > the path (here /nfs/test/), so it can't be unmounted.
> > 
> > I don't think that's what users expect; they want to umount /nfs/test/.
> > Bruce thinks users can also umount /nfs/pnfs/ and /nfs/xfs.
> > 
> > Also, use kzalloc for all memory allocation instead of kmalloc.
> > Thanks to Al Viro for his comments on the logic of fs_pin.
> > 
> > v3,
> > 1. using path_get_pin/path_put_unpin for path pin
> > 2. using kzalloc for memory allocating
> > 
> > v4,
> > 1. add a completion for pin_kill waiting the reference is decreased to zero.
> > 2. add a work_struct for pin_kill decreases the reference indirectly.
> > 3. free svc_export/svc_expkey in pin_kill, not svc_export_put/svc_expkey_put.
> > 4. svc_export_put/svc_expkey_put go though pin_kill logic.
> > 
> > v5, same as v4.
> > 
> > v6,
> > 1. Pin vfsmnt to the mount point at first; when the reference increases (==2),
> >    grab a reference to the vfsmnt by mntget. When it decreases (==1),
> >    drop the reference to the vfsmnt, leaving the pin.
> > 2. Delete cache_head directly from cache_detail.
> > 
> > v7, 
> > implement self reference increase and decrease for nfsd exports/expkey 
> > 
> > When the reference count of a cache_head increases (>1), grab a reference
> > of the mnt once, and when it decreases to 1 (==1), drop the reference of the mnt.
> > 
> > So after that,
> > when ref > 1, the user cannot umount the filesystem (-EBUSY);
> > when ref == 1, the cache entry is referenced only by the nfsd cache,
> > with no other reference. So the user can try umount:
> > 1. before MNT_UMOUNT is set (protected by mount_lock), if the nfsd cache is
> >    referenced (ref > 1, legitimize_mntget), umount fails with -EBUSY.
> > 2. after MNT_UMOUNT is set, if the nfsd cache is referenced (ref == 2),
> >    legitimize_mntget will fail; the cache is set to CACHE_NEGATIVE,
> >    and the reference is dropped, falling back to 1.
> >    So pin_kill can delete the cache and the umount succeeds.
> > 3. when umounting with no reference to the nfsd cache,
> >    pin_kill can delete the cache and the umount succeeds.
> > 
> > Signed-off-by: Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> 
> Wow.... this is turning out to be a lot more complex than I imagined at
> first (isn't that always the way!).
> 
> There is a lot of good stuff here, but I think we can probably make it
> simpler and so even better.

I'm still not convinced that the expkey should have a dentry reference
in the key in the first place.  Fixing that would fix the immediate
problem.

(Though it might still be useful to have a way to do stuff on umount of
an exported filesystem.)

--b.

> 
> I particularly don't like the get_ref/put_ref pointers in cache_head.
> They make cache_head a lot bigger than it was before, and they are only
> needed for two specific caches.  And then they are the same for every element
> in the cache.
> 
> I also don't like the ref_mutex ... or I don't like where it is used...
> or something.  I definitely don't think we need one per cached entry.
> Maybe one per cache.
> 
> I can certainly see that the "first" time we get a reference to a cache
> item that holds a vfsmnt pointer, we need to "legitimize" that - or
> fail.  But I don't think that has to happen inside the cache.c
> machinery.
> 
> How about this:
>  - add a new cache flag "CACHE_ACTIVE" (for example) which the cache
>    owner can set whenever it likes.  When cache_put finds that CACHE_ACTIVE
>    is set when refcount is <= 2, it calls a new cache_detail method: cache_deactivate.
>  - cache_deactivate takes a mutex (yes, we do need one, don't we)
>    and if CACHE_ACTIVE is still set and refcount is still <= 2,
>    it drops the reference on the vfsmnt and clears CACHE_ACTIVE.
>    This actually needs to be something like:
>     if (test_and_clear_bit(CACHE_ACTIVE,...)) {
>         if (atomic_read(..refcnt) > 2) {
>              set_bit(CACHE_ACTIVE);
>              mutex_unlock()
>              return
> 
>    so that if other code gets a reference and tests CACHE_ACTIVE, it
>    won't suddenly become inactive.  Might need a memory barrier in there...
>    no, test_and_clear implies a memory barrier.
> 
> We only need to make changes to svc_export and svc_expkey - right?
> So that would be:
>  Change svc_export_lookup and svc_expkey_lookup so they look something
>  like:
> 
>   svc_XX_lookup(struct cache_detail *cd, struct svc_XXX *item)
>   {
>       struct cache_head *ch;
>       int hash = svc_XXX_hash(item);
> 
>       ch = sunrpc_cache_lookup(cd, &item->h, hash);
>       if (!ch)
>            return NULL;
>       item = container_of(ch, struct svc_XXX, h);
>       if (!test_bit(CACHE_VALID, &ch->flags) ||
>           test_bit(CACHE_NEGATIVE, &ch->flags) ||
>           test_bit(CACHE_ACTIVE, &ch->flags))
>             return item;
> 
>       mutex_lock(&svc_XXX_mutex);
>       if (!test_bit(CACHE_ACTIVE, &ch->flags)) {
>               if (legitimize_mntget(...) == NULL) {
>                       XXX_put(item);
>                       item = NULL;
>               } else
>                       set_bit(CACHE_ACTIVE, &ch->flags);
>       }
>       mutex_unlock(&svc_XXX_mutex);
>       return item;
>  }
> 
> Then the new 'cache_deactivate' function is something like:
> 
>   svc_XXX_deactivate(struct cache_detail *cd, struct cache_head *ch)
>   {
>        struct svc_XXX *item = container_of(ch, struct svc_XXX, h);
> 
>        mutex_lock(&svc_XXX_mutex);
>        if (test_and_clear_bit(CACHE_ACTIVE, &ch->flags)) {
>               if (atomic_read(&ch->ref.refcount) > 2)
>                    /* Race with get_ref - do nothing */
>                    set_bit(CACHE_ACTIVE, &ch->flags);
>               else
>                    mntput(....mnt);
>        }
>        mutex_unlock(&svc_XXX_mutex);
>   }
> 
> 
> cache_put would have:
> 
>     if (test_bit(CACHE_ACTIVE, &h->flags) &&
>         cd->cache_deactivate &&
>         atomic_read(&h->ref.refcount) <= 2)
>            cd->cache_deactivate(cd, h);
> 
> but there is still a race.  If: (T1 and T2 are threads)
>    T1: cache_put finds refcount is 2 and CACHE_ACTIVE is set and calls ->cache_deactiveate
>    T2: cache_get increments the refcount to 3
>    T1: cache_deactivate clears CACHE_ACTIVE and find refcount is 3
>    T2: now calls cache_put, which sees CACHE_ACTIVE is clear so refcount becomes 2
>    T1: sets CACHE_ACTIVE again and continues.  refcount becomes 1.
> 
> So now refcount is 1 and the item is still active.
> 
> We can fix this by making cache_put loop:
>     while (test_bit(CACHE_ACTIVE, &h->flags) &&
>           cd->cache_deactivate &&
>           (smp_rmb(), 1) &&
>           atomic_read(&h->ref.refcount) <= 2)
>            cd->cache_deactivate(cd, h);
> 
> This should ensure that refcount never gets to 1 with the
> item still active (i.e. with a ref count on the mnt).
> 
> 
> The work item and completion are a bit unfortunate too.
> 
> I guess the problem here is that pin_kill() can run while there are
> some inactive references to the cache item.  There can then be a race
> over who will use path_put_unpin to put the dentry.
> 
> Could we fix that by having expXXX_pin_kill() use kref_get_unless_zero()
> on the cache item.
> If that succeeds, then path_put_unpin hasn't been called and it won't be.
> So expXXX_pin_kill can call it and then set CACHE_NEGATIVE.
> If it fails, then it has already been called and nothing else need be done.
> Almost.
> If kref_get_unless_zero() fails, pin_remove() may not have been called
> yet, but it will be soon.  We might need to wait.
> It would be nice if pin_kill() would check ->done again after calling p->kill.
> e.g.
> 
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b5408f6ec..c2ef5c9d4c0d 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
>  		spin_unlock_irq(&p->wait.lock);
>  		rcu_read_unlock();
>  		p->kill(p);
> -		return;
> +		if (p->done > 0)
> +			return;
> +		spin_lock_irq(&p->wait.lock);
>  	}
>  	if (p->done > 0) {
>  		spin_unlock_irq(&p->wait.lock);
> 
> I think that would close the last gap, without needing extra work
> items and completion in the nfsd code.
> 
> Al: would you be OK with that change to pin_kill?
> 
> Thanks,
> NeilBrown

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
       [not found]           ` <20150715210756.GE21669-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2015-07-15 23:40             ` NeilBrown
  2015-07-16 20:51               ` J. Bruce Fields
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-15 23:40 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Kinglong Mee, Al Viro, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Wed, 15 Jul 2015 17:07:56 -0400 "J. Bruce Fields"
<bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:

> > Wow.... this is turning out to be a lot more complex than I imagined at
> > first (isn't that always the way!).
> > 
> > There is a lot of good stuff here, but I think we can probably make it
> > simpler and so even better.
> 
> I'm still not convinced that the expkey should have a dentry reference
> in the key in the first place.  Fixing that would fix the immediate
> problem.

???  If we removed the dentry, how would you export a subdirectory of a
filesystem?

NeilBrown


> 
> (Though it might still be useful to have a way to do stuff on umount of
> an exported filesystem.)
> 
> --b.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-15 23:40             ` NeilBrown
@ 2015-07-16 20:51               ` J. Bruce Fields
       [not found]                 ` <20150716205148.GC10673-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: J. Bruce Fields @ 2015-07-16 20:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, Al Viro, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Thu, Jul 16, 2015 at 09:40:46AM +1000, NeilBrown wrote:
> On Wed, 15 Jul 2015 17:07:56 -0400 "J. Bruce Fields"
> <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:
> 
> > > Wow.... this is turning out to be a lot more complex than I imagined at
> > > first (isn't that always the way!).
> > > 
> > > There is a lot of good stuff here, but I think we can probably make it
> > > simpler and so even better.
> > 
> > I'm still not convinced that the expkey

(Sorry, I meant an entry in the export cache, not the expkey cache.)

> should have a dentry reference
> > in the key in the first place.  Fixing that would fix the immediate
> > problem.
> 
> ???  If we removed the dentry, how would you export a subdirectory of a
> filesystem?

I've been wondering if the export cache should really be keyed on the
string representation of the path instead of the struct path.  That's
what the userspace interface uses.

There's a related bug: if there are mountpoints at both /a and /a/b,
then thanks to the lookup-underneath-mountpoint behavior of the server,
an NFSv3 client looking that up will end up going underneath the first
mountpoint and doing an export cache lookup for

	(vfsmnt, dentry) == (/, /a/b)

When the server gets a response that starts with "/a/b", it interprets
that as applying to the path (/a, /a/b), so doesn't recognize it as
resolving the query about (/, /a/b).

Well, at least I assume that's why I see "ls" hang if I run "ls
/mnt/a/b" on the client.  And there may be some better fix, but I always
figured the root (hah) problem here was due to indexing the cache on
struct path while the upcall interface uses the full path string.

--b.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
       [not found]                 ` <20150716205148.GC10673-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2015-07-21 21:58                   ` NeilBrown
  2015-07-22 15:08                     ` J. Bruce Fields
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-21 21:58 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Kinglong Mee, Al Viro, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Thu, 16 Jul 2015 16:51:48 -0400 "J. Bruce Fields"
<bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:

> On Thu, Jul 16, 2015 at 09:40:46AM +1000, NeilBrown wrote:
> > On Wed, 15 Jul 2015 17:07:56 -0400 "J. Bruce Fields"
> > <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:
> > 
> > > > Wow.... this is turning out to be a lot more complex than I imagined at
> > > > first (isn't that always the way!).
> > > > 
> > > > There is a lot of good stuff here, but I think we can probably make it
> > > > simpler and so even better.
> > > 
> > > I'm still not convinced that the expkey
> 
> (Sorry, I meant an entry in the export cache, not the expkey cache.)

They are very closely related.  An incoming filehandle has its 'fs'
identifier mapped through the expkey cache to get an "export point".
Then the "export point" plus "client identifier" are mapped through the
export cache to get export options.

So the "export point" thing in the expkey cache should really be the
same as the thing in the export cache.
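
In code, the mapping is roughly the following (a schematic sketch only -
real argument lists abbreviated with "...", error handling elided):

	/* filehandle -> expkey -> export, schematically */
	ek = exp_find_key(clp, fsid_type, fsidv, ...);	/* fsid -> mnt+dentry */
	if (IS_ERR(ek))
		return ERR_CAST(ek);
	exp = exp_get_by_name(clp, &ek->ek_path, ...);	/* + client -> options */
	cache_put(&ek->h, ...);				/* expkey held only here */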

> 
> > should have a dentry reference
> > > in the key in the first place.  Fixing that would fix the immediate
> > > problem.
> > 
> > ???  If we removed the dentry, how would you export a subdirectory of a
> > filesystem?
> 
> I've been wondering if the export cache should really be keyed on the
> string representation of the path instead of the struct path.  That's
> what the userspace interface uses.

That makes sense for handling updates to the cache from user-space.
I'm not sure it is so good for handling mapping from file-handle to
export flags.

> 
> There's a related bug: if there are mountpoints at both /a and /a/b,

Just to make sure I'm clear on what you are saying, this would be
achieved by, e.g.

 mkdir -p /a/b
 mount /dev/sdX /a/b
 mount /dev/sdY /a

so the mount of /dev/sdX is completely unreachable from '/'.

Correct?

> then thanks to the lookup-underneath-mountpoint behavior of the server,
> an NFSv3 client looking that up will end up going underneath the first
> mountpoint and doing an export cache lookup for
> 
> 	(vfsmnt, dentry) == (/, /a/b)

Maybe this step is where the bug is. "/a/b" is not really a valid name.
Should d_path() check for paths that have been mounted over, and attach
"(unreachable)" to the end of the path, similar to "(deleted)".

sys_getcwd() can give you "(unreachable)" when in a filesystem that has
e.g. been lazy-unmounted.  Maybe we want something similar for a
mounted-over filesystem???
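
For instance, something like this already shows the getcwd behaviour
(illustration only; it uses the raw syscall because libc may reject the
non-absolute result):

	#include <stdio.h>
	#include <unistd.h>
	#include <limits.h>
	#include <sys/syscall.h>

	char buf[PATH_MAX];
	/* after a lazy umount of the filesystem holding the cwd */
	if (syscall(SYS_getcwd, buf, sizeof(buf)) > 0)
		printf("%s\n", buf);	/* e.g. "(unreachable)/a/b/c" */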


> 
> When the server gets a response that starts with "/a/b", it interprets
> that as applying to the path (/a, /a/b), so doesn't recognize it as
> resolving the query about (/, /a/b).
> 
> Well, at least I assume that's why I see "ls" hang if I run "ls
> /mnt/a/b" on the client.  And there may be some better fix, but I always
> figured the root (hah) problem here was due to indexing the cache on
> struct path while the upcall interface uses the full path string.
> 
Sounds like a very odd corner case - how did you stumble on to it?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-21 21:58                   ` NeilBrown
@ 2015-07-22 15:08                     ` J. Bruce Fields
       [not found]                       ` <20150722150840.GH22718-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: J. Bruce Fields @ 2015-07-22 15:08 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, Al Viro, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Wed, Jul 22, 2015 at 07:58:24AM +1000, NeilBrown wrote:
> On Thu, 16 Jul 2015 16:51:48 -0400 "J. Bruce Fields"
> <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:
> 
> > On Thu, Jul 16, 2015 at 09:40:46AM +1000, NeilBrown wrote:
> > > On Wed, 15 Jul 2015 17:07:56 -0400 "J. Bruce Fields"
> > > <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:
> > > 
> > > > > Wow.... this is turning out to be a lot more complex than I imagined at
> > > > > first (isn't that always the way!).
> > > > > 
> > > > > There is a lot of good stuff here, but I think we can probably make it
> > > > > simpler and so even better.
> > > > 
> > > > I'm still not convinced that the expkey
> > 
> > (Sorry, I meant an entry in the export cache, not the expkey cache.)
> 
> They are very closely related.  An incoming filehandle has its 'fs'
> identifier mapped through the expkey cache to get an "export point".
> Then the "export point" plus "client identifier" are mapped through the
> export cache to get export options.
> 
> So the "export point" thing in the expkey cache should really be the
> same as the thing in the export cache.
> 
> > 
> > > should have a dentry reference
> > > > in the key in the first place.  Fixing that would fix the immediate
> > > > problem.
> > > 
> > > ???  If we removed the dentry, how would you export a subdirectory of a
> > > filesystem?
> > 
> > I've been wondering if the export cache should really be keyed on the
> > string representation of the path instead of the struct path.  That's
> > what the userspace interface uses.
> 
> That makes sense for handling updates to the cache from user-space.
> I'm not sure it is so good for handling mapping from file-handle to
> export flags.
> 
> > 
> > There's a related bug: if there are mountpoints at both /a and /a/b,
> 
> Just to make sure I'm clear on what you are saying, this would be
> achieved by, e.g.
> 
>  mkdir -p /a/b
>  mount /dev/sdX /a/b
>  mount /dev/sdY /a
> 
> so the mount of /dev/sdX is completely unreachable from '/'.
> 
> Correct?

Actually my reproducer used something like:

	server# mount --bind /a /a
	server# mount --bind /a/b /a/b

then a v3 mount of / and "ls /mnt/a/b" from the client.

> > then thanks to the lookup-underneath-mountpoint behavior of the server,
> > an NFSv3 client looking that up will end up going underneath the first
> > mountpoint and doing an export cache lookup for
> > 
> > 	(vfsmnt, dentry) == (/, /a/b)
> 
> Maybe this step is where the bug is. "/a/b" is not really a valid name.
> Should d_path() check for paths that have been mounted over, and attach
> "(unreachable)" to the end of the path, similar to "(deleted)".
> 
> sys_getcwd() can give you "(unreachable)" when in a filesystem that has
> e.g. been lazy-unmounted.  Maybe we want something similar for a
> mounted-over filesystem???
> 
> 
> > 
> > When the server gets a response that starts with "/a/b", it interprets
> > that as applying to the path (/a, /a/b), so doesn't recognize it as
> > resolving the query about (/, /a/b).
> > 
> > Well, at least I assume that's why I see "ls" hang if I run "ls
> > /mnt/a/b" on the client.  And there may be some better fix, but I always
> > figured the root (hah) problem here was due to indexing the cache on
> > struct path while the upcall interface uses the full path string.
> > 
> Sounds like a very odd corner case - how did you stumble on to it?

I dug through my old email, but that may be lost in the mists of
time....

My memory is that I ran across a similar hang while testing some mountd
changes, but couldn't reproduce it.  (It might have involved a change to
the exports?)  So I came up with this case by inspection.

I've had this nagging todo to work out if there are other interesting
consequences of the fact that the cache is internally keyed on one thing
and appears to mountd to be keyed on another.  (And that there's a
complicated many<->many relationship between those two things.)  But I
haven't gotten to it.  Could be all unlikely corner cases, for all I
know.

--b.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: export table lookup: was [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
       [not found]                       ` <20150722150840.GH22718-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2015-07-23 23:46                         ` NeilBrown
  2015-07-24 19:48                           ` J. Bruce Fields
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-23 23:46 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Kinglong Mee, Al Viro, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Wed, 22 Jul 2015 11:08:40 -0400 "J. Bruce Fields"
<bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:


> I've had this nagging todo to work out if there are other interesting
> consequences of the fact that the cache is internally keyed on one thing
> and appears to mountd to be keyed on another.  (And that there's a
> complicated many<->many relationship between those two things.)  But I
> haven't gotten to it.  Could be all unlikely corner cases, for all I
> know.

Even corner cases are worth resolving - and you got me interested now:-)

I think the distinction between pathnames and mnt+dentry is not quite
the important one.
I think mnt+dentry is the primary object - it is what a filehandle maps
to and what a pathname maps to.

The problem is that some mnt+dentry pairs do not have matching path
names.  If nfsd gets hold of one of these pairs, it shouldn't try
asking mountd about it because there is no way to ask the question, and
even if there was it isn't clear there is any way for mountd to answer.

I think that nfsd should assume that any such mountpoint is not
exported.

So something vaguely like:

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c8ea15e73a5..a0651872ae8e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2943,6 +2943,12 @@ restart:
 		if (error)
 			break;
 
+		if (unlikely(d_mountpoint(dentry))) {
+			struct mount *mounted = __lookup_mnt(vfsmnt, dentry);
+			if (mounted)
+				prepend(&bptr, &blen, "//(unreachable)",15);
+		}
+
 		dentry = parent;
 	}
 	if (!(seq & 1))

Would mean that if I

# cd /tmp/a/b/c
# mount --bind /etc /tmp/a
# /bin/pwd

I get 

/tmp//(unreachable)/a/b/c

which could be checked for by nfsd to decide that there is no point asking user-space.
I'm not at all certain that this is a good interface (or that the code isn't racy) - it is just
a proof-of-concept.

We should probably place the (unreachable) at the front rather than in the middle.

Does that seem like a reasonable approach from your understanding of the problem?

Thanks,
NeilBrown

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-13  4:45           ` Al Viro
       [not found]             ` <20150713044553.GN17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2015-07-24  2:05             ` NeilBrown
  2015-07-27  2:28               ` Kinglong Mee
  1 sibling, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-24  2:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Kinglong Mee, J. Bruce Fields, linux-nfs, linux-fsdevel, Trond Myklebust

On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
> 
> > Actually, with that change to pin_kill, this side of things becomes
> > really easy.
> > All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> > If that doesn't cause the entry to be put, then something else has a
> > temporary reference which will be put soon.  In any case, pin_kill()
> > will wait long enough, but not indefinitely.
> > No need for kref_get_unless_zero() or any of that.
> 
> No.  You are seriously misunderstanding what ->kill() is for and what the
> existing instances are doing.  Again, there is no promise whatsoever that
> the object containing fs_pin instance will *survive* past ->kill().
> At all.
> 
> RTFS, please.  What is sorely missing in this recurring patchset is a clear
> description of lifetime rules and ordering (who waits for whom and how long).
> For all the objects involved.

Good point.  Let me try.

Entries in the sunrpc 'cache' each contain some 'key' fields and some
'content' fields.

The key fields are set by the .init() method when the entry is
created, which can happen in a call to sunrpc_cache_lookup() or to
sunrpc_cache_update().

The content fields are set by the .update() method when a value is
provided for the cache entry.  This happens in sunrpc_cache_update();

A cache entry can be not-valid, negative, or valid.
It starts non-valid when sunrpc_cache_lookup() fails to find the search
key and so creates a new entry (and sets up the key with .init).
It then transitions to either negative or valid.
This can happen through sunrpc_cache_update() or through an error when
instigating an up-call, in which case it goes to negative.
Once it is negative or valid, it stays that way until it is released.
If sunrpc_cache_update is called on an entry that is not not-valid,
then a new entry is created and the old one is marked as expired.
A cache search will find the new one before the old.
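
Schematically (nothing new, just the same transitions in one place):

	lookup miss -> not-valid  (key set by .init)
	not-valid --sunrpc_cache_update()--> valid   (content set by .update)
	not-valid --error on up-call-------> negative
	valid/negative --further update----> new entry created,
	                                     old entry marked expired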

The vfsmount object is involved in two separate caches.
It is part of the content of svc_expkey and part of the key of
svc_export.

An svc_expkey entry is only ever held transiently.  It is held while an
update is being processed, and it is held briefly while mapping a
filehandle to a mnt+dentry.
Firstly part of the filehandle is used to access the svc_expkey cache
to get the vfsmnt.  Then that vfsmnt plus the client identification is
looked up in the svc_export cache to find the export options.  Then the
svc_expkey cache entry is released.

So it is only held during a lookup of another cache.  This can take an
arbitrarily long time as the lookup can go to rpc.mountd in user-space.


The svc_export cache entry can be held for the duration of a single NFS
request.  It is stored in the 'struct svc_fh' file handle structure
which is released at the end of handling the request.

The vfsmnt and dentry are only "used" to validate the filehandle and
then while that filehandle is still active.


To avoid having unmount hang while nfsd is performing an upcall to
mountd, we need to legitimize the vfsmnt in the svc_expkey.  If that
fails, exp_find_key() can fail and we would never perform the lookup on
svc_export.

If it succeeds, then the legitimacy can be handed over to the svc_export
cache entry, which could then continue to own it, or could hand it on
to the svc_fh.

The latter is *probably* cleanest.
i.e. an svc_fh should always own a reference to exp->ex_path.mnt, and
fh_put must put it.

exp_find_key needs to legitimize ek->ek_path.mnt, so a successful
return from exp_find implies an active reference to ->ex_path.mnt.
If exp_find fails, it needs to mntput(ek->ek_path.mnt).
All callers of exp_find need to mntput(exp->ex_path.mnt) when they
decide not to use the exp, and must otherwise store it in an svc_fh.
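
A sketch of the release side (hypothetical code, not the current fh_put -
just to show where the mnt reference would be dropped):

	/* The svc_fh owns one active reference to ->ex_path.mnt
	 * whenever ->fh_export is set, handed over from exp_find_key().
	 */
	if (fhp->fh_export) {
		mntput(fhp->fh_export->ex_path.mnt);	/* drop legitimized ref */
		exp_put(fhp->fh_export);
		fhp->fh_export = NULL;
	}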

With this, pin_kill() should only need to wait for  exp_find_key() to
discover that it cannot legitimize the mount, or for expkey_path() to
replace the key via sunrpc_cache_update(), or maybe for cache_clean()
to discard an old entry.

Hopefully that makes it all clear.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: export table lookup: was [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-23 23:46                         ` export table lookup: was " NeilBrown
@ 2015-07-24 19:48                           ` J. Bruce Fields
  2015-07-25  0:40                             ` NeilBrown
  0 siblings, 1 reply; 40+ messages in thread
From: J. Bruce Fields @ 2015-07-24 19:48 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kinglong Mee, Al Viro, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Fri, Jul 24, 2015 at 09:46:57AM +1000, NeilBrown wrote:
> On Wed, 22 Jul 2015 11:08:40 -0400 "J. Bruce Fields"
> <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote:
> 
> 
> > I've had this nagging todo to work out if there are other interesting
> > consequences of the fact that the cache is internally keyed on one thing
> > and appears to mountd to be keyed on another.  (And that there's a
> > complicated many<->many relationship between those two things.)  But I
> > haven't gotten to it.  Could be all unlikely corner cases, for all I
> > know.
> 
> Even corner cases are worth resolving - and you got me interested now:-)
> 
> I think the distinction between pathnames and mnt+dentry is not quite
> the important one.
> I think mnt+dentry is the primary object - it is what a filehandle maps
> to and what a pathname maps to.
> 
> The problem is that some mnt+dentry pairs do not have matching path
> names.  If nfsd gets hold of one of these pairs, it shouldn't try
> asking mountd about it because there is no way to ask the question, and
> even if there was it isn't clear there is any way for mountd to answer.
> 
> I think that nfsd should assume that any such mountpoint is not
> exported.
> 
> So something vaguely like:
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c8ea15e73a5..a0651872ae8e 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2943,6 +2943,12 @@ restart:
>  		if (error)
>  			break;
>  
> +		if (unlikely(d_mountpoint(dentry))) {
> +			struct mount *mounted = __lookup_mnt(vfsmnt, dentry);
> +			if (mounted)
> +				prepend(&bptr, &blen, "//(unreachable)",15);
> +		}
> +
>  		dentry = parent;
>  	}
>  	if (!(seq & 1))
> 
> Would mean that if I
> 
> # cd /tmp/a/b/c
> # mount --bind /etc /tmp/a
> # /bin/pwd
> 
> I get 
> 
> /tmp//(unreachable)/a/b/c
> 
> which could be checked for by nfsd to decide that there is no point asking user-space.
> I'm not at all certain that this is a good interface (or that the code isn't racy) - it is just
> a proof-of-concept.
> 
> We should probably place the (unreachable) at the front rather than in the middle.
> 
> Does that seem like a reasonable approach from your understanding of the problem?

So something like that could give us a way to prevent asking mountd
about mounts that it can't see.

Except when things change: it's possible a mount that would pass this
test at the time we create the request is no longer there by the time we get
mountd's reply.

You can tell people not to do that.  It still bugs me to have the
possibility of an unanswerable request.

--b.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: export table lookup: was [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-24 19:48                           ` J. Bruce Fields
@ 2015-07-25  0:40                             ` NeilBrown
  0 siblings, 0 replies; 40+ messages in thread
From: NeilBrown @ 2015-07-25  0:40 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Kinglong Mee, Al Viro, linux-nfs, linux-fsdevel, Trond Myklebust

On Fri, 24 Jul 2015 15:48:17 -0400 "J. Bruce Fields"
<bfields@fieldses.org> wrote:

> On Fri, Jul 24, 2015 at 09:46:57AM +1000, NeilBrown wrote:

> > Does that seem like a reasonable approach from your understanding of the problem?
> 
> So something like that could give us a way to prevent asking mountd
> about mounts that it can't see.
> 
> Except when things change: it's possible a mount that would pass this
> test at the time we create the request is no longer there by the time we get
> mountd's reply.
> 
> You can tell people not to do that.  It still bugs me to have the
> possibility of an unanswerable request.

I can see three general ways to cope with the fact that things could
change between creating a request and receiving the reply:
 - lock to prevent the change
 - refcounts to provide a stable reference to the thing of interest
 - detect change and retry.

These correspond roughly to spinlock, kref, and seqlock.
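
(The detect-and-retry shape is the familiar seqlock read side - shown only
as an illustration of the pattern, not a proposal to put a seqlock_t here;
"some_lock" and "compute_answer" are hypothetical:)

	unsigned seq;
	do {
		seq = read_seqbegin(&some_lock);
		answer = compute_answer();	/* may race with updates */
	} while (read_seqretry(&some_lock, seq));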

Trying to prevent changes in the filesystem over an upcall-and-reply is
out of the question.

A refcount could be implemented as a file descriptor. i.e. when nfsd
finds a mountpoint at the end of a 'lookup', it creates a
file descriptor for the target object and passes that to mountd.  mountd
does what it does and sends the reply back with the same file
descriptor.
I think that is needlessly complex.

detect-change-and-retry is, I think, the best.  The cache already has a
retry mechanism.  It can often detect a change implicitly if it gets
told about some filesystem object that it doesn't really care about.
The only weakness is that it can't currently detect if its question can
no longer be answered...

I agree that an unanswerable request seems ugly.  But sometimes the
best way to handle races is to let ugly things happen temporarily.

I should probably double-check that the cache will retry the upcall in
a reasonable time frame - I only have a vague recollection of how that
works...

NeilBrown


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-24  2:05             ` NeilBrown
@ 2015-07-27  2:28               ` Kinglong Mee
       [not found]                 ` <55B59764.1020506-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 40+ messages in thread
From: Kinglong Mee @ 2015-07-27  2:28 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: J. Bruce Fields, linux-nfs, linux-fsdevel, Trond Myklebust, kinglongmee

On 7/24/2015 10:05, NeilBrown wrote:
> On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
> wrote:
> 
>> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
>>
>>> Actually, with that change to pin_kill, this side of things becomes
>>> really easy.
>>> All expXXX_pin_kill needs to do is call your new cache_delete_entry.
>>> If that doesn't cause the entry to be put, then something else has a
>>> temporary reference which will be put soon.  In any case, pin_kill()
>>> will wait long enough, but not indefinitely.
>>> No need for kref_get_unless_zero() or any of that.
>>
>> No.  You are seriously misunderstanding what ->kill() is for and what the
>> existing instances are doing.  Again, there is no promise whatsoever that
>> the object containing the fs_pin instance will *survive* past ->kill().
>> At all.
>>
>> RTFS, please.  What is sorely missing in this recurring patchset is a clear
>> description of lifetime rules and ordering (who waits for whom and how long).
>> For all the objects involved.
> 
> Good point.  Let me try.
> 
> Entries in the sunrpc 'cache' each contain some 'key' fields and some
> 'content' fields.
> 
> The key fields are set by the .init() method when the entry is
> created, which can happen in a call to sunrpc_cache_lookup() or to
> sunrpc_cache_update().
> 
> The content fields are set by the .update() method when a value is
> provided for the cache entry.  This happens in sunrpc_cache_update();
> 
> A cache entry can be not-valid, negative, or valid.
> It starts non-valid when sunrpc_cache_lookup() fails to find the search
> key and so creates a new entry (and sets up the key with .init).
> It then transitions to either negative or valid.
> This can happen through sunrpc_cache_update() or through an error when
> instigating an up-call, in which case it goes to negative.
> Once it is negative or valid, it stays that way until it is released.
> If sunrpc_cache_update is called on an entry that is not not-valid,
> then a new entry is created and the old one is marked as expired.
> A cache search will find the new one before the old.
> 
> The vfsmount object is involved in two separate caches.
> It is part of the content of svc_expkey and part of the key of
> svc_export.
> 
> An svc_expkey entry is only ever held transiently.  It is held while an
> update is being processed, and it is held briefly while mapping a
> filehandle to a mnt+dentry.
> Firstly part of the filehandle is used to access the svc_expkey cache
> to get the vfsmnt.  Then that vfsmnt plus the client identification is
> looked up in the svc_export cache to find the export options.  Then the
> svc_expkey cache entry is released.
> 
> So it is only held during a lookup of another cache.  This can take an
> arbitrarily long time as the lookup can go to rpc.mountd in user-space.
> 
> 
> The svc_export cache entry can be held for the duration of a single NFS
> request.  It is stored in the 'struct svc_fh' file handle structure
> which is released at the end of handling the request.
> 
> The vfsmnt and dentry are only "used" to validate the filehandle and
> then while that filehandle is still active.
> 
> 
> To avoid having unmount hang while nfsd is performing an upcall to
> mountd, we need to legitimize the vfsmnt in the svc_expkey.  If that
> fails, exp_find_key() can fail and we would never perform the lookup on
> svc_export.
> 
> If it succeeds, then the legitimacy can be handed over to the svc_export
> cache entry, which could then continue to own it, or could hand it on
> to the svc_fh.
> 
> The latter is *probably* cleanest.
> i.e. an svc_fh should always own a reference to exp->ex_path.mnt, and
> fh_put must put it.

I don't agree with adding a new field (e.g., fh_vfsmnt) to svc_fh.

With it, should nfsd always use fh_vfsmnt, and never use exp->ex_path.mnt
outside of export.c/export.h?

If we choose fh_vfsmnt, a lot of code needs to be updated, especially
function signatures.  If we keep exp->ex_path.mnt, the new field
fh_vfsmnt seems redundant.

Thanks for your work.

It suggests a new method,

1. There is only one outlet from each cache: exp_find_key() for expkey,
   exp_get_by_name() for export.
2. Any mapping from fsid or filehandle to an export goes through the
   corresponding function.
3. exp_get()/exp_put() increase/decrease the reference count of the export.

Like fh_vfsmnt (though not the same), call legitimize_mntget() in the
only outlet functions exp_find_key()/exp_get_by_name(); on failure
return STALE, otherwise any valid expkey/export obtained from the cache
has been validated (holds a reference to the vfsmnt).

Add mntget() in exp_get() and mntput() in exp_put(), because the exports
passed to exp_get()/exp_put() are returned from exp_find_key()/exp_get_by_name().
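
Roughly, the helpers would become (a sketch only; assumes exp->cd is the
export's cache_detail):

	static inline void exp_get(struct svc_export *exp)
	{
		mntget(exp->ex_path.mnt);	/* every export ref pins the mount */
		cache_get(&exp->h);
	}

	static inline void exp_put(struct svc_export *exp)
	{
		mntput(exp->ex_path.mnt);
		cache_put(&exp->h, exp->cd);
	}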

> 
> exp_find_key needs to legitimize ek->ek_path.mnt, so a successful
> return from exp_find implies an active reference to ->ex_path.mnt.
> If exp_find fails, it needs to mntput(ek->ek_path.mnt).

Yes, it's great.

> All callers of exp_find need to mntput(exp->ex_path.mnt) when they
> decide not to use the exp, and must otherwise store it in an svc_fh.
> 
> With this, pin_kill() should only need to wait for  exp_find_key() to
> discover that it cannot legitimize the mount, or for expkey_path() to
> replace the key via sunrpc_cache_update(), or maybe for cache_clean()
> to discard an old entry.
> 
> Hopefully that makes it all clear.

Yes, thanks again.

With my method, for the expkey cache,
1. At first, an fsid is passed to exp_find_key, which looks up a cache
   entry in svc_expkey_lookup; on success, ekey->ek_path is pinned to the mount.
2. Then call legitimize_mntget to get a reference to the vfsmnt
   before returning from exp_find_key.
3. Any caller of exp_find_key that gets a valid cache entry must put the vfsmnt.

for the export cache,
1. At first, a path (returned from exp_find_key) with a validated vfsmnt
   is passed to exp_get_by_name; on success, exp->ex_path is pinned to the mount.
2. Then call legitimize_mntget to get a reference to the vfsmnt
   before returning from exp_get_by_name.
3. Any caller of exp_get_by_name that gets a valid cache entry must put
   the vfsmnt via exp_put();
4. Anyone using the exp returned from exp_get_by_name must call exp_get(),
   which will increase the reference count of the vfsmnt.

So,
a. After getting the reference in step 2, any umount of the filesystem gets -EBUSY.
b. After all the references from step 4 are put, or before the reference is
   taken in step 2, any umount of the filesystem calls pin_kill, deletes the
   cache entry directly, and unpins the vfsmount.
c. Between steps 1 and 2, we hold a reference to the exp/key cache entry but
   not yet a validated vfsmnt.  As you said, umount of the filesystem only
   waits for exp_find_key/exp_get_by_name to put the cache reference when
   legitimize_mntget fails.

A new update of this patch set will be pushed later.

thanks,
Kinglong Mee

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
       [not found]                 ` <55B59764.1020506-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2015-07-27  2:51                   ` NeilBrown
  2015-07-27  3:17                     ` Kinglong Mee
  0 siblings, 1 reply; 40+ messages in thread
From: NeilBrown @ 2015-07-27  2:51 UTC (permalink / raw)
  To: Kinglong Mee
  Cc: Al Viro, J. Bruce Fields, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust

On Mon, 27 Jul 2015 10:28:52 +0800 Kinglong Mee <kinglongmee-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:

> On 7/24/2015 10:05, NeilBrown wrote:
> > On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
> > wrote:
> > 
> >> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
> >>
> >>> Actually, with that change to pin_kill, this side of things becomes
> >>> really easy.
> >>> All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> >>> If that doesn't cause the entry to be put, then something else has a
> >>> temporary reference which will be put soon.  In any case, pin_kill()
> >>> will wait long enough, but not indefinitely.
> >>> No need for kref_get_unless_zero() or any of that.
> >>
> >> No.  You are seriously misunderstanding what ->kill() is for and what the
> >> existing instances are doing.  Again, there is no promise whatsoever that
> >> the object containing the fs_pin instance will *survive* past ->kill().
> >> At all.
> >>
> >> RTFS, please.  What is sorely missing in this recurring patchset is a clear
> >> description of lifetime rules and ordering (who waits for whom and how long).
> >> For all the objects involved.
> > 
> > Good point.  Let me try.
> > 
> > Entries in the sunrpc 'cache' each contain some 'key' fields and some
> > 'content' fields.
> > 
> > The key fields are set by the .init() method when the entry is
> > created, which can happen in a call to sunrpc_cache_lookup() or to
> > sunrpc_cache_update().
> > 
> > The content fields are set by the .update() method when a value is
> > provided for the cache entry.  This happens in sunrpc_cache_update();
> > 
> > A cache entry can be not-valid, negative, or valid.
> > It starts non-valid when sunrpc_cache_lookup() fails to find the search
> > key and so creates a new entry (and sets up the key with .init).
> > It then transitions to either negative or valid.
> > This can happen through sunrpc_cache_update() or through an error when
> > instigating an up-call, in which case it goes to negative.
> > Once it is negative or valid, it stays that way until it is released.
> > If sunrpc_cache_update is called on an entry that is not not-valid,
> > then a new entry is created and the old one is marked as expired.
> > A cache search will find the new one before the old.
> > 
> > The vfsmount object is involved in two separate caches.
> > It is part of the content of svc_expkey and part of the key of
> > svc_export.
> > 
> > An svc_expkey entry is only ever held transiently.  It is held while an
> > update is being processed, and it is held briefly while mapping a
> > filehandle to a mnt+dentry.
> > Firstly part of the filehandle is used to access the svc_expkey cache
> > to get the vfsmnt.  Then that vfsmnt plus the client identification is
> > looked up in the svc_export cache to find the export options.  Then the
> > svc_expkey cache entry is released.
> > 
> > So it is only held during a lookup of another cache.  This can take an
> > arbitrarily long time as the lookup can go to rpc.mountd in user-space.
> > 
> > 
> > The svc_export cache entry can be held for the duration of a single NFS
> > request.  It is stored in the 'struct svc_fh' file handle structure
> > which is released at the end of handling the request.
> > 
> > The vfsmnt and dentry are only "used" to validate the filehandle and
> > then while that filehandle is still active.
> > 
> > 
> > To avoid having unmount hang while nfsd is performing an upcall to
> > mountd, we need to legitimize the vfsmnt in the svc_expkey.  If that
> > fails, exp_find_key() can fail and we would never perform the lookup on
> > svc_export.
> > 
> > If it succeeds, then the legitimacy can be handed over to the svc_export
> > cache entry, which could then continue to own it, or could hand it on
> > to the svc_fh.
> > 
> > The latter is *probably* cleanest.
> > i.e. an svc_fh should always own a reference to exp->ex_path.mnt, and
> > fh_put must put it.
> 
> I don't agree with adding a new field (e.g., fh_vfsmnt) to svc_fh.

I wasn't suggesting that a new field be added to svc_fh.
Just that if svc_fh->fh_export was not NULL, then the svc_fh "owned" a
reference to svc_fh->fh_export->ex_path.mnt which it had to mntput()
when it released ->fh_export.

So fh_put would need to change, but not much else.

It isn't the only way to handle those references - it just seemed the
neatest as I was writing the description.  Something else might work
better in the code.

> 
> With it, should nfsd always use fh_vfsmnt, and never use exp->ex_path.mnt
> outside of export.c/export.h?
> 
> If we choose fh_vfsmnt, a lot of code needs to be updated, especially
> function signatures.  If we keep exp->ex_path.mnt, the new field
> fh_vfsmnt seems redundant.
> 
> Thanks for your work.
> 
> It suggests a new method,
> 
> 1. There is only one outlet from each cache: exp_find_key() for expkey,
>    exp_get_by_name() for export.
> 2. Any mapping from fsid or filehandle to an export goes through the
>    corresponding function.
> 3. exp_get()/exp_put() increase/decrease the reference count of the export.
> 
> Like fh_vfsmnt (though not the same), call legitimize_mntget() in the
> only outlet functions exp_find_key()/exp_get_by_name(); on failure
> return STALE, otherwise any valid expkey/export obtained from the cache
> has been validated (holds a reference to the vfsmnt).
> 
> Add mntget() in exp_get() and mntput() in exp_put(), because the exports
> passed to exp_get()/exp_put() are returned from exp_find_key()/exp_get_by_name().
> 
> > 
> > exp_find_key needs to legitimize ek->ek_path.mnt, so a successful
> > return from exp_find implies an active reference to ->ex_path.mnt.
> > If exp_find fails, it needs to mntput(ek->ek_path.mnt).
> 
> Yes, it's great.
> 
> > All callers of exp_find need to mntput(exp->ex_path.mnt) when they
> > decide not to use the exp, and must otherwise store it in an svc_fh.
> > 
> > With this, pin_kill() should only need to wait for  exp_find_key() to
> > discover that it cannot legitimize the mount, or for expkey_path() to
> > replace the key via sunrpc_cache_update(), or maybe for cache_clean()
> > to discard an old entry.
> > 
> > Hopefully that makes it all clear.
> 
> Yes, thanks again.
> 
> With my method, for the expkey cache,
> 1. At first, an fsid is passed to exp_find_key, which looks up a cache
>    entry in svc_expkey_lookup; on success, ekey->ek_path is pinned to the mount.
> 2. Then call legitimize_mntget to get a reference to the vfsmnt
>    before returning from exp_find_key.
> 3. Any caller of exp_find_key that gets a valid cache entry must put the vfsmnt.
> 
> for the export cache,
> 1. At first, a path (returned from exp_find_key) with a validated vfsmnt
>    is passed to exp_get_by_name; on success, exp->ex_path is pinned to the mount.
> 2. Then call legitimize_mntget to get a reference to the vfsmnt
>    before returning from exp_get_by_name.

I don't see any point in calling legitimize_mntget here.  exp_find_key
already did the 'legitimize' bit so there is no need to do it again.

> 3. Any caller of exp_get_by_name that gets a valid cache entry must put
>    the vfsmnt via exp_put();
> 4. Anyone using the exp returned from exp_get_by_name must call exp_get(),
>    which will increase the reference count of the vfsmnt.
> 
> So,
> a. After getting the reference in step 2, any umount of the filesystem gets -EBUSY.
> b. After all the references from step 4 are put, or before the reference is
>    taken in step 2, any umount of the filesystem calls pin_kill, deletes the
>    cache entry directly, and unpins the vfsmount.
> c. Between steps 1 and 2, we hold a reference to the exp/key cache entry but
>    not yet a validated vfsmnt.  As you said, umount of the filesystem only
>    waits for exp_find_key/exp_get_by_name to put the cache reference when
>    legitimize_mntget fails.
> 
> A new update of this patch set will be pushed later.

I look forward to it.  Thanks,

NeilBrown


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on
  2015-07-27  2:51                   ` NeilBrown
@ 2015-07-27  3:17                     ` Kinglong Mee
  0 siblings, 0 replies; 40+ messages in thread
From: Kinglong Mee @ 2015-07-27  3:17 UTC (permalink / raw)
  To: NeilBrown
  Cc: Al Viro, J. Bruce Fields, linux-nfs, linux-fsdevel,
	Trond Myklebust, kinglongmee


On 7/27/2015 10:51, NeilBrown wrote:
> On Mon, 27 Jul 2015 10:28:52 +0800 Kinglong Mee <kinglongmee@gmail.com>
> wrote:
> 
>> On 7/24/2015 10:05, NeilBrown wrote:
>>> On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
>>> wrote:
>>>
>>>> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
>>>>
>>>>> Actually, with that change to pin_kill, this side of things becomes
>>>>> really easy.
>>>>> All expXXX_pin_kill needs to do is call your new cache_delete_entry.
>>>>> If that doesn't cause the entry to be put, then something else has a
>>>>> temporary reference which will be put soon.  In any case, pin_kill()
>>>>> will wait long enough, but not indefinitely.
>>>>> No need for kref_get_unless_zero() or any of that.
>>>>
>>>> No.  You are seriously misunderstanding what ->kill() is for and what the
>>>> existing instances are doing.  Again, there is no promise whatsoever that
>>>> the object containing the fs_pin instance will *survive* past ->kill().
>>>> At all.
>>>>
>>>> RTFS, please.  What is sorely missing in this recurring patchset is a clear
>>>> description of lifetime rules and ordering (who waits for whom and how long).
>>>> For all the objects involved.
>>>
>>> Good point.  Let me try.
>>>
>>> Entries in the sunrpc 'cache' each contain some 'key' fields and some
>>> 'content' fields.
>>>
>>> The key fields are set by the .init() method when the entry is
>>> created, which can happen in a call to sunrpc_cache_lookup() or to
>>> sunrpc_cache_update().
>>>
>>> The content fields are set by the .update() method when a value is
>>> provided for the cache entry.  This happens in sunrpc_cache_update();
>>>
>>> A cache entry can be not-valid, negative, or valid.
>>> It starts non-valid when sunrpc_cache_lookup() fails to find the search
>>> key and so creates a new entry (and sets up the key with .init).
>>> It then transitions to either negative or valid.
>>> This can happen through sunrpc_cache_update() or through an error when
>>> instigating an up-call, in which case it goes to negative.
>>> Once it is negative or valid, it stays that way until it is released.
>>> If sunrpc_cache_update is called on an entry that is not not-valid,
>>> then a new entry is created and the old one is marked as expired.
>>> A cache search will find the new one before the old.
>>>
>>> The vfsmount object is involved in two separate caches.
>>> It is part of the content of svc_expkey and part of the key of
>>> svc_export.
>>>
>>> An svc_expkey entry is only ever held transiently.  It is held while an
>>> update is being processed, and it is held briefly while mapping a
>>> filehandle to a mnt+dentry.
>>> Firstly part of the filehandle is used to access the svc_expkey cache
>>> to get the vfsmnt.  Then that vfsmnt plus the client identification is
>>> looked up in the svc_export cache to find the export options.  Then the
>>> svc_expkey cache entry is released.
>>>
>>> So it is only held during a lookup of another cache.  This can take an
>>> arbitrarily long time as the lookup can go to rpc.mountd in user-space.
>>>
>>>
>>> The svc_export cache entry can be held for the duration of a single NFS
>>> request.  It is stored in the 'struct svc_fh' file handle structure
>>> which is released at the end of handling the request.
>>>
>>> The vfsmnt and dentry are only "used" to validate the filehandle and
>>> then while that filehandle is still active.
>>>
>>>
>>> To avoid having unmount hang while nfsd is performing an upcall to
>>> mountd, we need to legitimize the vfsmnt in the svc_expkey.  If that
>>> fails, exp_find_key() can fail and we would never perform the lookup on
>>> svc_export.
>>>
>>> If it succeeds, then the legitimacy can be handed over to the svc_export
>>> cache entry, which could then continue to own it, or could hand it on
>>> to the svc_fh.
>>>
>>> The latter is *probably* cleanest.
>>> i.e. an svc_fh should always own a reference to exp->ex_path.mnt, and
>>> fh_put must put it.
>>
>> I don't agree with adding a new field (e.g., fh_vfsmnt) to svc_fh.
> 
> I wasn't suggesting that a new field be added to svc_fh.
> Just that if svc_fh->fh_export was not NULL, then the svc_fh "owned" a
> reference to svc_fh->fh_export->ex_path.mnt which it had to mntput()
> when it released ->fh_export.
> 
> So fh_put would need to change, but not much else.
> 
> It isn't the only way to handle those references - it just seemed the
> neatest as I was writing the description.  Something else might work
> better in the code.

Got it, thanks for your comments.

> 
>>
>> With it, should nfsd always use fh_vfsmnt, and never use exp->ex_path.mnt
>> outside of export.c/export.h?
>>
>> If we choose fh_vfsmnt, a lot of code needs to be updated, especially
>> function signatures.  If we keep exp->ex_path.mnt, the new field
>> fh_vfsmnt seems redundant.
>>
>> Thanks for your work.
>>
>> It suggests a new method,
>>
>> 1. There is only one outlet from each cache: exp_find_key() for expkey,
>>    exp_get_by_name() for export.
>> 2. Any mapping from fsid or filehandle to an export goes through the
>>    corresponding function.
>> 3. exp_get()/exp_put() increase/decrease the reference count of the export.
>>
>> Like fh_vfsmnt (though not the same), call legitimize_mntget() in the
>> only outlet functions exp_find_key()/exp_get_by_name(); on failure
>> return STALE, otherwise any valid expkey/export obtained from the cache
>> has been validated (holds a reference to the vfsmnt).
>>
>> Add mntget() in exp_get() and mntput() in exp_put(), because the exports
>> passed to exp_get()/exp_put() are returned from exp_find_key()/exp_get_by_name().
>>
>>>
>>> exp_find_key needs to legitimize ek->ek_path.mnt, so a successful
>>> return from exp_find implies an active reference to ->ex_path.mnt.
>>> If exp_find fails, it needs to mntput(ek->ek_path.mnt).
>>
>> Yes, it's great.
>>
>>> All callers of exp_find need to mntput(exp->ex_path.mnt) when they
>>> decide not to use the exp, and must otherwise store it in an svc_fh.
>>>
>>> With this, pin_kill() should only need to wait for  exp_find_key() to
>>> discover that it cannot legitimize the mount, or for expkey_path() to
>>> replace the key via sunrpc_cache_update(), or maybe for cache_clean()
>>> to discard an old entry.
>>>
>>> Hopefully that makes it all clear.
>>
>> Yes, thanks again.
>>
>> With my method, for the expkey cache,
>> 1. At first, an fsid is passed to exp_find_key, which looks up a cache
>>    entry in svc_expkey_lookup; on success, ekey->ek_path is pinned to the mount.
>> 2. Then call legitimize_mntget to get a reference to the vfsmnt
>>    before returning from exp_find_key.
>> 3. Any caller of exp_find_key that gets a valid cache entry must put the vfsmnt.
>>
>> for the export cache,
>> 1. At first, a path (returned from exp_find_key) with a validated vfsmnt
>>    is passed to exp_get_by_name; on success, exp->ex_path is pinned to the mount.
>> 2. Then call legitimize_mntget to get a reference to the vfsmnt
>>    before returning from exp_get_by_name.
> 
> I don't see any point in calling legitimize_mntget here.  exp_find_key
> already did the 'legitimize' bit so there is no need to do it again.

I just think they are two separate logical steps.

But, can an export cache entry contain a different vfsmnt from the one
in the expkey?

thanks,
Kinglong Mee

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2015-07-27  3:17 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-11 12:46 [PATCH 00/10 v7] NFSD: Pin to vfsmount for nfsd exports cache Kinglong Mee
2015-07-11 12:49 ` [PATCH 05/10 v7] sunrpc: Store cache_detail in seq_file's private, directly Kinglong Mee
2015-07-11 12:49 ` [PATCH 06/10 v7] sunrpc/nfsd: Remove redundant code by exports seq_operations functions Kinglong Mee
     [not found] ` <55A11010.6050005-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-07-11 12:47   ` [PATCH 01/10 v7] fs_pin: Initialize value for fs_pin explicitly Kinglong Mee
2015-07-11 12:47   ` [PATCH 02/10 v7] fs_pin: Export functions for specific filesystem Kinglong Mee
2015-07-11 12:48   ` [PATCH 03/10 v7] path: New helpers path_get_pin/path_put_unpin for path pin Kinglong Mee
2015-07-11 12:48   ` [PATCH 04/10 v7] fs: New helper legitimize_mntget() for getting a legitimize mnt Kinglong Mee
2015-07-11 12:50   ` [PATCH 07/10 v7] sunrpc: Switch to using list_head instead single list Kinglong Mee
     [not found]     ` <55A11112.8080502-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-07-11 12:54       ` Christoph Hellwig
2015-07-13  1:30     ` NeilBrown
2015-07-13  8:27       ` Kinglong Mee
2015-07-11 12:51   ` [PATCH 09/10 v7] sunrpc: Support get_ref/put_ref for reference change in cache_head Kinglong Mee
2015-07-11 12:52   ` [PATCH 10/10 v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on Kinglong Mee
     [not found]     ` <55A111A8.2040701-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-07-13  3:39       ` NeilBrown
2015-07-13  4:02         ` Al Viro
     [not found]           ` <20150713040258.GM17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-07-13  5:19             ` NeilBrown
2015-07-13  6:02               ` Al Viro
2015-07-13  4:20         ` NeilBrown
2015-07-13  4:45           ` Al Viro
     [not found]             ` <20150713044553.GN17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-07-13  5:21               ` NeilBrown
2015-07-13  6:02                 ` NeilBrown
2015-07-13  6:08                   ` Al Viro
     [not found]                     ` <20150713060802.GP17109-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2015-07-13  6:32                       ` NeilBrown
2015-07-13  6:43                         ` Al Viro
2015-07-15  3:49                           ` NeilBrown
2015-07-15  4:57                             ` Al Viro
2015-07-15  6:51                               ` NeilBrown
2015-07-24  2:05             ` NeilBrown
2015-07-27  2:28               ` Kinglong Mee
     [not found]                 ` <55B59764.1020506-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-07-27  2:51                   ` NeilBrown
2015-07-27  3:17                     ` Kinglong Mee
2015-07-15 21:07         ` J. Bruce Fields
     [not found]           ` <20150715210756.GE21669-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-07-15 23:40             ` NeilBrown
2015-07-16 20:51               ` J. Bruce Fields
     [not found]                 ` <20150716205148.GC10673-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-07-21 21:58                   ` NeilBrown
2015-07-22 15:08                     ` J. Bruce Fields
     [not found]                       ` <20150722150840.GH22718-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2015-07-23 23:46                         ` export table lookup: was " NeilBrown
2015-07-24 19:48                           ` J. Bruce Fields
2015-07-25  0:40                             ` NeilBrown
2015-07-11 12:51 ` [PATCH 08/10 v7] sunrpc: New helper cache_delete_entry for deleting cache_head directly Kinglong Mee
