[PATCH -V3] Generic name to handle and open by handle syscalls

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH -V3] Generic name to handle and open by handle syscalls
@ 2010-04-22 18:15 Aneesh Kumar K.V
  2010-04-22 18:15 ` [PATCH -V3 1/5] exportfs: Return the minimum required handle size Aneesh Kumar K.V
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2010-04-22 18:15 UTC (permalink / raw)
  To: hch, viro, adilger, corbet; +Cc: linux-fsdevel, sfrench

Hi,

The below set of patches implement open by handle support using exportfs
operations. This allows user space application to map a file name to file 
handle and later open the file using handle. This should be usable
for userspace NFS [1] and 9P server [2]. XFS already support this with the ioctls
XFS_IOC_PATH_TO_HANDLE and XFS_IOC_OPEN_BY_HANDLE.

[1] http://nfs-ganesha.sourceforge.net/
[2] http://lists.gnu.org/archive/html/qemu-devel/2010-03/msg01087.html

TODO:
I guess we would need to optimize how we get the vfsmount for the filesystem
uuid specified. Searching the file system list and task name space may be a big
overhead for each open by handle call.

Chages from V2:
a) Support system wide unique handle.

Changes from v1:
a) handle size is now specified in bytes
b) returns -EOVERFLOW if the handle size is small
c) dropped open_handle syscall and added open_by_handle_at syscall
   open_by_handle_at takes mount_fd as the directory fd of the mount point
   containing the file
e) handle will only be unique in a given file system. So for an NFS server
   exporting multiple file system, NFS server will have to internally track the
   mount point to which a file handle belongs to. We should be able to do it much
   easily than expecting kernel to give a system wide unique file handle. System
   wide unique file handle would need much larger changes to the exportfs or VFS
   interface and I was not sure whether we really need to do that in the kernel or
   in the user space
f) open_handle_at now only check for DAC_OVERRIDE capability

Example program:
----------------
#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>

struct uuid {
	char uuid[16];
};
struct file_handle {
        int handle_size;
        int handle_type;
	struct uuid fsid;
        void *handle;
};

static int name_to_handle(const char *name, struct file_handle  *fh)
{
	return syscall(338, name, fh);
}

static int open_by_handle(struct file_handle *fh,  int flags)
{
	return syscall(339, fh, flags);
}

int main(int argc, char *argv[])
{
        int ret;
        int fd, dirfd;
        char buf[100];
        struct file_handle fh;
        fh.handle_size = 0;
again:
	if (fh.handle_size)
        	fh.handle = malloc(fh.handle_size);
        fh.handle_type = 0;
        errno  = 0;
        ret = name_to_handle(argv[1], &fh);
        if (ret) {
                perror("Error:");
		printf("Found the handle size needed to be %d\n", fh.handle_size);
		printf("Trying again..\n");
		goto again;
                exit(1);
        }
        fd = open_by_handle(&fh, O_RDONLY);
        if (fd <= 0 ) {
                perror("Error:");
                exit(1);
        }
        memset(buf, 0 , 100);
        while (read(fd, buf, 100) > 0) {
                printf("%s", buf);
                memset(buf, 0 , 100);
        }
        return 0;
}

-aneesh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH -V3 1/5] exportfs: Return the minimum required handle size
  2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
@ 2010-04-22 18:15 ` Aneesh Kumar K.V
  2010-04-22 18:15 ` [PATCH -V3 2/5] vfs: Add name to file handle conversion support Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2010-04-22 18:15 UTC (permalink / raw)
  To: hch, viro, adilger, corbet; +Cc: linux-fsdevel, sfrench, Aneesh Kumar K.V

The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/btrfs/export.c             |    8 ++++++--
 fs/exportfs/expfs.c           |    9 +++++++--
 fs/fat/inode.c                |    4 +++-
 fs/fuse/inode.c               |    4 +++-
 fs/gfs2/export.c              |    8 ++++++--
 fs/isofs/export.c             |    8 ++++++--
 fs/ocfs2/export.c             |    8 +++++++-
 fs/reiserfs/inode.c           |    7 ++++++-
 fs/udf/namei.c                |    7 ++++++-
 fs/xfs/linux-2.6/xfs_export.c |    4 +++-
 mm/shmem.c                    |    4 +++-
 11 files changed, 56 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 951ef09..5f8ee5a 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -21,9 +21,13 @@ static int btrfs_encode_fh(struct dentry *dentry, u32 *fh, int *max_len,
 	int len = *max_len;
 	int type;
 
-	if ((len < BTRFS_FID_SIZE_NON_CONNECTABLE) ||
-	    (connectable && len < BTRFS_FID_SIZE_CONNECTABLE))
+	if (connectable && (len < BTRFS_FID_SIZE_CONNECTABLE)) {
+		*max_len = BTRFS_FID_SIZE_CONNECTABLE;
 		return 255;
+	} else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) {
+		*max_len = BTRFS_FID_SIZE_NON_CONNECTABLE;
+		return 255;
+	}
 
 	len  = BTRFS_FID_SIZE_NON_CONNECTABLE;
 	type = FILEID_BTRFS_WITHOUT_PARENT;
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index e9e1759..cfee0f0 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -319,9 +319,14 @@ static int export_encode_fh(struct dentry *dentry, struct fid *fid,
 	struct inode * inode = dentry->d_inode;
 	int len = *max_len;
 	int type = FILEID_INO32_GEN;
-	
-	if (len < 2 || (connectable && len < 4))
+
+	if (connectable && (len < 4)) {
+		*max_len = 4;
+		return 255;
+	} else if (len < 2) {
+		*max_len = 2;
 		return 255;
+	}
 
 	len = 2;
 	fid->i32.ino = inode->i_ino;
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 0ce143b..6f83bc7 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -738,8 +738,10 @@ fat_encode_fh(struct dentry *de, __u32 *fh, int *lenp, int connectable)
 	struct inode *inode =  de->d_inode;
 	u32 ipos_h, ipos_m, ipos_l;
 
-	if (len < 5)
+	if (len < 5) {
+		*lenp = 5;
 		return 255; /* no room */
+	}
 
 	ipos_h = MSDOS_I(inode)->i_pos >> 8;
 	ipos_m = (MSDOS_I(inode)->i_pos & 0xf0) << 24;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ec14d19..beaea69 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -638,8 +638,10 @@ static int fuse_encode_fh(struct dentry *dentry, u32 *fh, int *max_len,
 	u64 nodeid;
 	u32 generation;
 
-	if (*max_len < len)
+	if (*max_len < len) {
+		*max_len = len;
 		return  255;
+	}
 
 	nodeid = get_fuse_inode(inode)->nodeid;
 	generation = inode->i_generation;
diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index c22c211..d022236 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -36,9 +36,13 @@ static int gfs2_encode_fh(struct dentry *dentry, __u32 *p, int *len,
 	struct super_block *sb = inode->i_sb;
 	struct gfs2_inode *ip = GFS2_I(inode);
 
-	if (*len < GFS2_SMALL_FH_SIZE ||
-	    (connectable && *len < GFS2_LARGE_FH_SIZE))
+	if (connectable && (*len < GFS2_LARGE_FH_SIZE)) {
+		*len = GFS2_LARGE_FH_SIZE;
 		return 255;
+	} else if (*len < GFS2_SMALL_FH_SIZE) {
+		*len = GFS2_SMALL_FH_SIZE;
+		return 255;
+	}
 
 	fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32);
 	fh[1] = cpu_to_be32(ip->i_no_formal_ino & 0xFFFFFFFF);
diff --git a/fs/isofs/export.c b/fs/isofs/export.c
index ed752cb..dd4687f 100644
--- a/fs/isofs/export.c
+++ b/fs/isofs/export.c
@@ -124,9 +124,13 @@ isofs_export_encode_fh(struct dentry *dentry,
 	 * offset of the inode and the upper 16 bits of fh32[1] to
 	 * hold the offset of the parent.
 	 */
-
-	if (len < 3 || (connectable && len < 5))
+	if (connectable && (len < 5)) {
+		*max_len = 5;
+		return 255;
+	} else if (len < 3) {
+		*max_len = 3;
 		return 255;
+	}
 
 	len = 3;
 	fh32[0] = ei->i_iget5_block;
diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c
index 19ad145..250a347 100644
--- a/fs/ocfs2/export.c
+++ b/fs/ocfs2/export.c
@@ -201,8 +201,14 @@ static int ocfs2_encode_fh(struct dentry *dentry, u32 *fh_in, int *max_len,
 		   dentry->d_name.len, dentry->d_name.name,
 		   fh, len, connectable);
 
-	if (len < 3 || (connectable && len < 6)) {
+	if (connectable && (len < 6)) {
 		mlog(ML_ERROR, "fh buffer is too small for encoding\n");
+		*max_len = 6;
+		type = 255;
+		goto bail;
+	} else if (len < 3) {
+		mlog(ML_ERROR, "fh buffer is too small for encoding\n");
+		*max_len = 3;
 		type = 255;
 		goto bail;
 	}
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index dc2c65e..5fff1e2 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -1588,8 +1588,13 @@ int reiserfs_encode_fh(struct dentry *dentry, __u32 * data, int *lenp,
 	struct inode *inode = dentry->d_inode;
 	int maxlen = *lenp;
 
-	if (maxlen < 3)
+	if (need_parent && (maxlen < 5)) {
+		*lenp = 5;
 		return 255;
+	} else if (maxlen < 3) {
+		*lenp = 3;
+		return 255;
+	}
 
 	data[0] = inode->i_ino;
 	data[1] = le32_to_cpu(INODE_PKEY(inode)->k_dir_id);
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index 7581602..37ce713 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1360,8 +1360,13 @@ static int udf_encode_fh(struct dentry *de, __u32 *fh, int *lenp,
 	struct fid *fid = (struct fid *)fh;
 	int type = FILEID_UDF_WITHOUT_PARENT;
 
-	if (len < 3 || (connectable && len < 5))
+	if (connectable && (len < 5)) {
+		*lenp = 5;
+		return 255;
+	} else if (len < 3) {
+		*lenp = 3;
 		return 255;
+	}
 
 	*lenp = 3;
 	fid->udf.block = location.logicalBlockNum;
diff --git a/fs/xfs/linux-2.6/xfs_export.c b/fs/xfs/linux-2.6/xfs_export.c
index 846b75a..82c0553 100644
--- a/fs/xfs/linux-2.6/xfs_export.c
+++ b/fs/xfs/linux-2.6/xfs_export.c
@@ -81,8 +81,10 @@ xfs_fs_encode_fh(
 	 * seven combinations work.  The real answer is "don't use v2".
 	 */
 	len = xfs_fileid_length(fileid_type);
-	if (*max_len < len)
+	if (*max_len < len) {
+		*max_len = len
 		return 255;
+	}
 	*max_len = len;
 
 	switch (fileid_type) {
diff --git a/mm/shmem.c b/mm/shmem.c
index eef4ebe..bbeda1c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2125,8 +2125,10 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
 {
 	struct inode *inode = dentry->d_inode;
 
-	if (*len < 3)
+	if (*len < 3) {
+		*len = 3;
 		return 255;
+	}
 
 	if (hlist_unhashed(&inode->i_hash)) {
 		/* Unfortunately insert_inode_hash is not idempotent,
-- 
1.7.0.4.360.g11766c


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH -V3 2/5] vfs: Add name to file handle conversion support
  2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
  2010-04-22 18:15 ` [PATCH -V3 1/5] exportfs: Return the minimum required handle size Aneesh Kumar K.V
@ 2010-04-22 18:15 ` Aneesh Kumar K.V
  2010-04-22 18:15 ` [PATCH -V3 3/5] vfs: Add open by file handle support Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2010-04-22 18:15 UTC (permalink / raw)
  To: hch, viro, adilger, corbet; +Cc: linux-fsdevel, sfrench, Aneesh Kumar K.V

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/open.c          |   95 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |   17 +++++++++
 2 files changed, 112 insertions(+), 0 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 74e5cd9..2d9a92b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -30,6 +30,8 @@
 #include <linux/falloc.h>
 #include <linux/fs_struct.h>
 #include <linux/ima.h>
+#include <linux/exportfs.h>
+#include <linux/mnt_namespace.h>
 
 #include "internal.h"
 
@@ -1206,3 +1208,96 @@ int nonseekable_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(nonseekable_open);
+
+/* limit the handle size to some value */
+#define MAX_HANDLE_SZ 4096
+static int do_sys_name_to_handle(const char __user *name,
+				struct file_handle *fh)
+{
+	int retval;
+	int handle_size;
+	struct path path;
+	struct inode *inode;
+	void *handle = NULL;
+	struct super_block *sb;
+	struct uuid *this_fs_id;
+
+	if (fh->handle_size > MAX_HANDLE_SZ)
+		return -EINVAL;
+
+	retval = user_lpath(name, &path);
+	if (retval)
+		return retval;
+
+	inode = path.dentry->d_inode;
+	/*
+	 * name to handle conversion only done for regular files
+	 * directories and symbolic links
+	 */
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode) &&
+		!S_ISLNK(inode->i_mode)) {
+		retval =  -EINVAL;
+		goto err_out;
+	}
+	if (fh->handle_size > 0) {
+		handle = kmalloc(fh->handle_size, GFP_KERNEL);
+		if (!handle) {
+			retval = -ENOMEM;
+			goto err_out;
+		}
+
+		handle_size = fh->handle_size;
+	} else
+		handle_size = 0;
+	/* we ask for a non connected handle */
+	retval = exportfs_encode_fh(path.dentry, (struct fid *)handle,
+				&handle_size,  0);
+	/* convert handle size to bytes */
+	handle_size *= sizeof(u32);
+	fh->handle_type = retval;
+	if (handle_size <= fh->handle_size) {
+		if (copy_to_user(fh->f_handle, handle,
+					handle_size))
+			retval = -EFAULT;
+		else
+			retval = 0;
+	} else
+		retval = -EOVERFLOW;
+	fh->handle_size = handle_size;
+	sb = path.mnt->mnt_sb;
+	if (sb->s_op->get_fsid) {
+		this_fs_id = sb->s_op->get_fsid(sb);
+		memcpy(fh->fsid.uuid, this_fs_id->uuid, sizeof(fh->fsid.uuid));
+	} else
+		memset(fh->fsid.uuid, 0, sizeof(fh->fsid.uuid));
+	kfree(handle);
+
+err_out:
+	path_put(&path);
+	return retval;
+}
+
+SYSCALL_DEFINE2(name_to_handle, const char __user *, name,
+		struct file_handle __user *, handle)
+{
+	long ret;
+	struct file_handle f_handle;
+	if (copy_from_user(&f_handle, handle, sizeof(struct file_handle))) {
+		ret = -EFAULT;
+		goto err_out;
+	}
+	ret = do_sys_name_to_handle(name, &f_handle);
+	if (copy_to_user(&handle->handle_type,
+			&f_handle.handle_type, sizeof(f_handle.handle_type)) ||
+		copy_to_user(&handle->handle_size,
+			&f_handle.handle_size, sizeof(f_handle.handle_size)) ||
+		copy_to_user(handle->fsid.uuid,
+			f_handle.fsid.uuid, sizeof(f_handle.fsid.uuid))) {
+
+		ret = -EFAULT;
+	}
+err_out:
+	/* avoid REGPARM breakage on x86: */
+	asmlinkage_protect(2, ret, name, handle);
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..f7b9557 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -948,6 +948,20 @@ struct file {
 	unsigned long f_mnt_write_state;
 #endif
 };
+
+struct uuid {
+	char uuid[16];
+};
+
+struct file_handle {
+	int handle_size;
+	int handle_type;
+	/* File system identifier */
+	struct uuid fsid;
+	/* file identifier */
+	void *f_handle;
+};
+
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
 #define file_list_unlock() spin_unlock(&files_lock);
@@ -1580,6 +1594,7 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+	struct uuid *(*get_fsid)(struct super_block *);
 };
 
 /*
@@ -2328,6 +2343,8 @@ extern struct super_block *get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern struct super_block *user_get_super(dev_t);
 extern void drop_super(struct super_block *sb);
+extern struct super_block *fs_get_sb(struct uuid *fsid);
+
 
 extern int dcache_dir_open(struct inode *, struct file *);
 extern int dcache_dir_close(struct inode *, struct file *);
-- 
1.7.0.4.360.g11766c


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH -V3 3/5] vfs: Add open by file handle support
  2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
  2010-04-22 18:15 ` [PATCH -V3 1/5] exportfs: Return the minimum required handle size Aneesh Kumar K.V
  2010-04-22 18:15 ` [PATCH -V3 2/5] vfs: Add name to file handle conversion support Aneesh Kumar K.V
@ 2010-04-22 18:15 ` Aneesh Kumar K.V
  2010-04-22 19:22   ` Andreas Dilger
  2010-04-22 18:15 ` [PATCH -V3 4/5] x86: Add new syscalls for x86_32 Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Aneesh Kumar K.V @ 2010-04-22 18:15 UTC (permalink / raw)
  To: hch, viro, adilger, corbet; +Cc: linux-fsdevel, sfrench, Aneesh Kumar K.V

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/filesystems.c              |   32 ++++++++-
 fs/namei.c                    |   24 -------
 fs/namespace.c                |   38 ++++++++++
 fs/open.c                     |  154 +++++++++++++++++++++++++++++++++++++++++
 fs/pnode.c                    |    2 +-
 include/linux/mnt_namespace.h |    2 +
 include/linux/namei.h         |   24 +++++++
 7 files changed, 250 insertions(+), 26 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 68ba492..743d36e 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -281,5 +281,35 @@ struct file_system_type *get_fs_type(const char *name)
 	}
 	return fs;
 }
-
 EXPORT_SYMBOL(get_fs_type);
+
+struct super_block *fs_get_sb(struct uuid *fsid)
+{
+	struct uuid *this_fsid;
+	struct file_system_type *fs_type;
+	struct super_block *sb, *found_sb = NULL;
+
+	read_lock(&file_systems_lock);
+	fs_type = file_systems;
+	while (fs_type) {
+		spin_lock(&sb_lock);
+		list_for_each_entry(sb, &fs_type->fs_supers, s_instances) {
+			if (!sb->s_op->get_fsid)
+				continue;
+			this_fsid = sb->s_op->get_fsid(sb);
+			if (!memcmp(fsid->uuid, this_fsid->uuid,
+					sizeof(this_fsid->uuid))) {
+				/* found the matching super_block */
+				atomic_inc(&sb->s_active);
+				found_sb = sb;
+				spin_unlock(&sb_lock);
+				goto out;
+			}
+		}
+		spin_unlock(&sb_lock);
+		fs_type = fs_type->next;
+	}
+out:
+	read_unlock(&file_systems_lock);
+	return found_sb;
+}
diff --git a/fs/namei.c b/fs/namei.c
index a7dce91..a18711e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1521,30 +1521,6 @@ out_unlock:
 	return may_open(&nd->path, 0, open_flag & ~O_TRUNC);
 }
 
-/*
- * Note that while the flag value (low two bits) for sys_open means:
- *	00 - read-only
- *	01 - write-only
- *	10 - read-write
- *	11 - special
- * it is changed into
- *	00 - no permissions needed
- *	01 - read-permission
- *	10 - write-permission
- *	11 - read-write
- * for the internal routines (ie open_namei()/follow_link() etc)
- * This is more logical, and also allows the 00 "no perm needed"
- * to be used for symlinks (where the permissions are checked
- * later).
- *
-*/
-static inline int open_to_namei_flags(int flag)
-{
-	if ((flag+1) & O_ACCMODE)
-		flag++;
-	return flag;
-}
-
 static int open_will_truncate(int flag, struct inode *inode)
 {
 	/*
diff --git a/fs/namespace.c b/fs/namespace.c
index 8174c8a..6168526 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2364,3 +2364,41 @@ void put_mnt_ns(struct mnt_namespace *ns)
 	kfree(ns);
 }
 EXPORT_SYMBOL(put_mnt_ns);
+
+/*
+ * Get any vfsmount mapping the superblock in the
+ * task namespace
+ */
+struct vfsmount *fs_get_vfsmount(struct task_struct *task,
+				struct super_block *sb)
+{
+	struct nsproxy *nsp;
+	struct list_head *mount_list;
+	struct mnt_namespace *ns = NULL;
+	struct vfsmount *mnt, *sb_mnt = NULL;
+
+	rcu_read_lock();
+	nsp = task_nsproxy(task);
+	if (nsp) {
+		ns = nsp->mnt_ns;
+		if (ns)
+			get_mnt_ns(ns);
+	}
+	rcu_read_unlock();
+	if (!ns)
+		return NULL;
+	down_read(&namespace_sem);
+	list_for_each(mount_list, &ns->list) {
+		mnt = list_entry(mount_list, struct vfsmount, mnt_list);
+		if (mnt->mnt_sb == sb) {
+			/* found the matching super block */
+			sb_mnt = mnt;
+			mntget(sb_mnt);
+			break;
+		}
+	}
+	up_read(&namespace_sem);
+
+	put_mnt_ns(ns);
+	return sb_mnt;
+}
diff --git a/fs/open.c b/fs/open.c
index 2d9a92b..4e8a8f4 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1301,3 +1301,157 @@ err_out:
 	asmlinkage_protect(2, ret, name, handle);
 	return ret;
 }
+
+static int vfs_dentry_acceptable(void *context, struct dentry *dentry)
+{
+	return 1;
+}
+
+static struct dentry *handle_to_dentry(struct vfsmount *mnt,
+				struct file_handle *fh)
+{
+	int retval = 0;
+	int handle_size;
+	void *handle = NULL;
+	struct dentry *dentry;
+
+	if ((fh->handle_size > MAX_HANDLE_SZ) ||
+		(fh->handle_size <= 0)) {
+		retval = -EINVAL;
+		goto err_out;
+	}
+	handle = kmalloc(fh->handle_size, GFP_KERNEL);
+	if (!handle) {
+		retval =  -ENOMEM;
+		goto err_out;
+	}
+	if (copy_from_user(handle, fh->f_handle, fh->handle_size)) {
+		retval = -EFAULT;
+		goto err_out;
+	}
+	/* change the handle size to multiple of sizeof(u32) */
+	handle_size = fh->handle_size >> 2;
+	dentry = exportfs_decode_fh(mnt, (struct fid *)handle,
+					handle_size, fh->handle_type,
+					vfs_dentry_acceptable, NULL);
+	kfree(handle);
+	return dentry;
+
+err_out:
+	kfree(handle);
+	return ERR_PTR(retval);
+}
+
+long do_sys_open_by_handle(struct file_handle *fh, int flags)
+{
+	int fd;
+	int retval = 0;
+	int d_flags  = flags;
+	struct file *filp;
+	struct vfsmount *mnt;
+	struct inode *inode;
+	struct dentry *dentry;
+	struct super_block *sb;
+
+	if (!capable(CAP_DAC_OVERRIDE))
+		return -EPERM;
+
+	sb = fs_get_sb(&fh->fsid);
+	if (!sb)
+		return -ESTALE;
+	/*
+	 * Find the vfsmount for this superblock in the
+	 * current namespace
+	 */
+	mnt = fs_get_vfsmount(current, sb);
+	if (!mnt) {
+		deactivate_super(sb);
+		return -ESTALE;
+	}
+
+	dentry = handle_to_dentry(mnt, fh);
+	if (IS_ERR(dentry)) {
+		mntput(mnt);
+		deactivate_super(sb);
+		return PTR_ERR(dentry);
+	}
+
+	inode = dentry->d_inode;
+	/* Restrict open_by_handle to directories & regular files. */
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))) {
+		retval = -EINVAL;
+		goto err_out;
+	}
+
+	flags  = open_to_namei_flags(flags);
+	/* O_TRUNC implies we need access checks for write permissions */
+	if (flags & O_TRUNC)
+		flags |= MAY_WRITE;
+
+	if ((!(flags & O_APPEND) || (flags & O_TRUNC)) &&
+		(flags & FMODE_WRITE) && IS_APPEND(inode)) {
+		retval = -EPERM;
+		goto err_out;
+	}
+
+	if ((flags & FMODE_WRITE) && IS_IMMUTABLE(inode)) {
+		retval = -EACCES;
+		goto err_out;
+	}
+
+	/* Can't write directories. */
+	if (S_ISDIR(inode->i_mode) && (flags & FMODE_WRITE)) {
+		retval = -EISDIR;
+		goto err_out;
+	}
+
+	fd = get_unused_fd();
+	if (fd < 0) {
+		retval = fd;
+		goto err_out;
+	}
+
+	filp = dentry_open(dentry, mntget(mnt),
+			d_flags, current_cred());
+	if (IS_ERR(filp)) {
+		put_unused_fd(fd);
+		retval =  PTR_ERR(filp);
+		goto err_out;
+	}
+
+	if (inode->i_mode & S_IFREG) {
+		filp->f_flags |= O_NOATIME;
+		filp->f_mode |= FMODE_NOCMTIME;
+	}
+	fsnotify_open(filp->f_path.dentry);
+	fd_install(fd, filp);
+	mntput(mnt);
+	deactivate_super(sb);
+	return fd;
+
+err_out:
+	mntput(mnt);
+	deactivate_super(sb);
+	dput(dentry);
+	return retval;
+}
+
+SYSCALL_DEFINE2(open_by_handle_at, struct file_handle __user *, handle,
+		int, flags)
+{
+	long ret;
+	struct file_handle f_handle;
+
+	if (force_o_largefile())
+		flags |= O_LARGEFILE;
+
+	if (copy_from_user(&f_handle, handle, sizeof(struct file_handle))) {
+		ret = -EFAULT;
+		goto err_out;
+	}
+	ret = do_sys_open_by_handle(&f_handle, flags);
+err_out:
+	/* avoid REGPARM breakage on x86: */
+	asmlinkage_protect(2, ret, handle, flags);
+	return ret;
+}
diff --git a/fs/pnode.c b/fs/pnode.c
index 5cc564a..9f6d12d 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -6,9 +6,9 @@
  *	Author : Ram Pai (linuxram@us.ibm.com)
  *
  */
+#include <linux/fs.h>
 #include <linux/mnt_namespace.h>
 #include <linux/mount.h>
-#include <linux/fs.h>
 #include "internal.h"
 #include "pnode.h"
 
diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h
index 0b89efc..d363ecc 100644
--- a/include/linux/mnt_namespace.h
+++ b/include/linux/mnt_namespace.h
@@ -36,6 +36,8 @@ extern const struct seq_operations mounts_op;
 extern const struct seq_operations mountinfo_op;
 extern const struct seq_operations mountstats_op;
 extern int mnt_had_events(struct proc_mounts *);
+extern struct vfsmount *fs_get_vfsmount(struct task_struct *task,
+					struct super_block *sb);
 
 #endif
 #endif
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 05b441d..a853aa0 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -4,6 +4,7 @@
 #include <linux/dcache.h>
 #include <linux/linkage.h>
 #include <linux/path.h>
+#include <asm-generic/fcntl.h>
 
 struct vfsmount;
 
@@ -96,4 +97,27 @@ static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
 	((char *) name)[min(len, maxlen)] = '\0';
 }
 
+/*
+ * Note that while the flag value (low two bits) for sys_open means:
+ *	00 - read-only
+ *	01 - write-only
+ *	10 - read-write
+ *	11 - special
+ * it is changed into
+ *	00 - no permissions needed
+ *	01 - read-permission
+ *	10 - write-permission
+ *	11 - read-write
+ * for the internal routines (ie open_namei()/follow_link() etc)
+ * This is more logical, and also allows the 00 "no perm needed"
+ * to be used for symlinks (where the permissions are checked
+ * later).
+ *
+*/
+static inline int open_to_namei_flags(int flag)
+{
+	if ((flag+1) & O_ACCMODE)
+		flag++;
+	return flag;
+}
 #endif /* _LINUX_NAMEI_H */
-- 
1.7.0.4.360.g11766c


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH -V3 4/5] x86: Add new syscalls for x86_32
  2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2010-04-22 18:15 ` [PATCH -V3 3/5] vfs: Add open by file handle support Aneesh Kumar K.V
@ 2010-04-22 18:15 ` Aneesh Kumar K.V
  2010-04-22 18:15 ` [PATCH -V3 5/5] ext4: Add get_fsid callback Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2010-04-22 18:15 UTC (permalink / raw)
  To: hch, viro, adilger, corbet; +Cc: linux-fsdevel, sfrench, Aneesh Kumar K.V

This patch adds sys_name_to_handle and sys_open_by_handle
syscalls to x86_32

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/unistd_32.h   |    4 +++-
 arch/x86/kernel/syscall_table_32.S |    2 ++
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index beb9b5f..4d736d0 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,12 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
+#define __NR_name_to_handle	338
+#define __NR_open_by_handle_at  339
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 340
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 8b37293..db06312 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,5 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
 	.long sys_recvmmsg
+	.long sys_name_to_handle
+	.long sys_open_by_handle_at	/* 339 */
-- 
1.7.0.4.360.g11766c


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH -V3 5/5] ext4: Add get_fsid callback
  2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2010-04-22 18:15 ` [PATCH -V3 4/5] x86: Add new syscalls for x86_32 Aneesh Kumar K.V
@ 2010-04-22 18:15 ` Aneesh Kumar K.V
  2010-04-22 19:07 ` [PATCH -V3] Generic name to handle and open by handle syscalls Andreas Dilger
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Aneesh Kumar K.V @ 2010-04-22 18:15 UTC (permalink / raw)
  To: hch, viro, adilger, corbet; +Cc: linux-fsdevel, sfrench, Aneesh Kumar K.V

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/ext4/super.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e14d22c..b485f0a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1049,6 +1049,14 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
 	return try_to_free_buffers(page);
 }
 
+static struct uuid *ext4_get_fsid(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_super_block *es = sbi->s_es;
+
+	return (struct uuid *)es->s_uuid;
+}
+
 #ifdef CONFIG_QUOTA
 #define QTYPE2NAME(t) ((t) == USRQUOTA ? "user" : "group")
 #define QTYPE2MOPT(on, t) ((t) == USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA))
@@ -1109,6 +1117,7 @@ static const struct super_operations ext4_sops = {
 	.quota_write	= ext4_quota_write,
 #endif
 	.bdev_try_to_free_page = bdev_try_to_free_page,
+	.get_fsid	= ext4_get_fsid,
 };
 
 static const struct super_operations ext4_nojournal_sops = {
@@ -1128,6 +1137,7 @@ static const struct super_operations ext4_nojournal_sops = {
 	.quota_write	= ext4_quota_write,
 #endif
 	.bdev_try_to_free_page = bdev_try_to_free_page,
+	.get_fsid	= ext4_get_fsid,
 };
 
 static const struct export_operations ext4_export_ops = {
-- 
1.7.0.4.360.g11766c


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2010-04-22 18:15 ` [PATCH -V3 5/5] ext4: Add get_fsid callback Aneesh Kumar K.V
@ 2010-04-22 19:07 ` Andreas Dilger
  2010-04-22 22:49 ` Serge E. Hallyn
  2010-04-23 13:23 ` Theodore Tso
  7 siblings, 0 replies; 23+ messages in thread
From: Andreas Dilger @ 2010-04-22 19:07 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: hch, viro, corbet, linux-fsdevel, sfrench

On 2010-04-22, at 12:15, Aneesh Kumar K.V wrote:
> The below set of patches implement open by handle support using exportfs
> operations. This allows user space application to map a file name to file 
> handle and later open the file using handle.
> 
> TODO:
> I guess we would need to optimize how we get the vfsmount for the filesystem
> uuid specified. Searching the file system list and task name space may be a big overhead for each open by handle call.

I agree this is a desirable feature, but I don't think it is a requirement before this code could be landed.  The initial adoption of this syscall will likely take some time, and it will work fine on systems until there are hundreds of filesystems/namespaces.

It definitely makes sense to hash the UUIDs of the filesystems at mount time so that they can be looked up more efficiently.  It also makes sense to hang a vfsmnt list off the superblock to make that lookup more efficient, instead of having to search through all of the vfsmnts on the system.

> Chages from V2:
> a) Support system wide unique handle.

Thanks for adding this in.  I think it is an important addition to the API.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3 3/5] vfs: Add open by file handle support
  2010-04-22 18:15 ` [PATCH -V3 3/5] vfs: Add open by file handle support Aneesh Kumar K.V
@ 2010-04-22 19:22   ` Andreas Dilger
  2010-04-23 11:40     ` Aneesh Kumar K. V
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Dilger @ 2010-04-22 19:22 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: hch, viro, corbet, linux-fsdevel, sfrench

On 2010-04-22, at 12:15, Aneesh Kumar K.V wrote:
> +long do_sys_open_by_handle(struct file_handle *fh, int flags)
> +{

Note this is inconsistent with the patch description that says that the syscall is open_by_handle_at().  While the syscall is still using the "_at()" suffix, it is no longer passing any directory.

> +	sb = fs_get_sb(&fh->fsid);
> +	if (!sb)
> +		return -ESTALE;

I was going to say that it would be nice to allow a fallback to the "_at" directory of the filesystem, but since the handle is specific to the exact filesystem it was created on anyway (inode number+generation, not even an rsync would be enough) the only way the handle is useful on another node is if the filesystem was cloned at the block device level (or is a shared filesystem identical on all nodes) so the UUID should be identical too.

> +	 * Find the vfsmount for this superblock in the
> +	 * current namespace
> +	 */
> +	mnt = fs_get_vfsmount(current, sb);
> +	if (!mnt) {
> +		deactivate_super(sb);
> +		return -ESTALE;
> +	}
> +
> +	dentry = handle_to_dentry(mnt, fh);
> +	if (IS_ERR(dentry)) {
> +		mntput(mnt);
> +		deactivate_super(sb);
> +		return PTR_ERR(dentry);
> +	}

Why is this code doing "manual" cleanup instead of setting retval and goto out_sb: or out_mnt: type labels that only do partial cleanup?

> +	fsnotify_open(filp->f_path.dentry);
> +	fd_install(fd, filp);
> +	mntput(mnt);
> +	deactivate_super(sb);
> +	return fd;

This could set retval = fd and goto out_mnt: (skipping dput(dentry)) in conjunction with the change proposed below.

> +err_out:
> +	mntput(mnt);
> +	deactivate_super(sb);
> +	dput(dentry);
> +	return retval;
> +}

I think it makes sense to do the dput(dentry) first, since it will be holding a reference on the sb and vfsmnt, and it also allows code simplification:

out_err:
       dput(dentry);
out_mnt:
       mntput(mnt);
out_sb:
       deactivate_super(sb);

       return retval;
}

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2010-04-22 19:07 ` [PATCH -V3] Generic name to handle and open by handle syscalls Andreas Dilger
@ 2010-04-22 22:49 ` Serge E. Hallyn
  2010-04-23 11:45   ` Aneesh Kumar K. V
  2010-04-23 13:23 ` Theodore Tso
  7 siblings, 1 reply; 23+ messages in thread
From: Serge E. Hallyn @ 2010-04-22 22:49 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: hch, viro, adilger, corbet, linux-fsdevel, sfrench

Quoting Aneesh Kumar K.V (aneesh.kumar@linux.vnet.ibm.com):
> Hi,
> 
> The below set of patches implement open by handle support using exportfs
> operations. This allows user space application to map a file name to file 
> handle and later open the file using handle. This should be usable
> for userspace NFS [1] and 9P server [2]. XFS already support this with the ioctls
> XFS_IOC_PATH_TO_HANDLE and XFS_IOC_OPEN_BY_HANDLE.
> 
> [1] http://nfs-ganesha.sourceforge.net/
> [2] http://lists.gnu.org/archive/html/qemu-devel/2010-03/msg01087.html
> 
> TODO:
> I guess we would need to optimize how we get the vfsmount for the filesystem
> uuid specified. Searching the file system list and task name space may be a big
> overhead for each open by handle call.
> 
> Chages from V2:
> a) Support system wide unique handle.
> 
> Changes from v1:
> a) handle size is now specified in bytes
> b) returns -EOVERFLOW if the handle size is small
> c) dropped open_handle syscall and added open_by_handle_at syscall
>    open_by_handle_at takes mount_fd as the directory fd of the mount point
>    containing the file
> e) handle will only be unique in a given file system. So for an NFS server
>    exporting multiple file system, NFS server will have to internally track the
>    mount point to which a file handle belongs to. We should be able to do it much
>    easily than expecting kernel to give a system wide unique file handle. System
>    wide unique file handle would need much larger changes to the exportfs or VFS
>    interface and I was not sure whether we really need to do that in the kernel or
>    in the user space
> f) open_handle_at now only check for DAC_OVERRIDE capability

Hi Aneesh,

it still scares me a bit as I expect to see some privileged userspace
daemon accept (forged) handles from unprivileged users and pass back
open fds, but adding some random token to prevent that would require
storing it, so I think you're doing what you reasonably can by
requiring DAC_OVERRIDE.

nsproxy bit looks fine, and i saw no problems in the rest of the set.
As Andreas was saying I don't see why you're calling it _at, but
meanwhile

Acked-by: Serge Hallyn <serue@us.ibm.com>

to the set.

-serge

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3 3/5] vfs: Add open by file handle support
  2010-04-22 19:22   ` Andreas Dilger
@ 2010-04-23 11:40     ` Aneesh Kumar K. V
  0 siblings, 0 replies; 23+ messages in thread
From: Aneesh Kumar K. V @ 2010-04-23 11:40 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: hch, viro, corbet, linux-fsdevel, sfrench

On Thu, 22 Apr 2010 13:22:32 -0600, Andreas Dilger <adilger@sun.com> wrote:
> On 2010-04-22, at 12:15, Aneesh Kumar K.V wrote:
> > +long do_sys_open_by_handle(struct file_handle *fh, int flags)
> > +{
> 
> Note this is inconsistent with the patch description that says that
> the syscall is open_by_handle_at().  While the syscall is still using
> the "_at()" suffix, it is no longer passing any directory.

Fixed. 

> 
> > +	sb = fs_get_sb(&fh->fsid);
> > +	if (!sb)
> > +		return -ESTALE;
> 
> I was going to say that it would be nice to allow a fallback to the "_at" directory of the filesystem, but since the handle is specific to the exact filesystem it was created on anyway (inode number+generation, not even an rsync would be enough) the only way the handle is useful on another node is if the filesystem was cloned at the block device level (or is a shared filesystem identical on all nodes) so the UUID should be identical too.
> 
> > +	 * Find the vfsmount for this superblock in the
> > +	 * current namespace
> > +	 */
> > +	mnt = fs_get_vfsmount(current, sb);
> > +	if (!mnt) {
> > +		deactivate_super(sb);
> > +		return -ESTALE;
> > +	}
> > +
> > +	dentry = handle_to_dentry(mnt, fh);
> > +	if (IS_ERR(dentry)) {
> > +		mntput(mnt);
> > +		deactivate_super(sb);
> > +		return PTR_ERR(dentry);
> > +	}
> 
> Why is this code doing "manual" cleanup instead of setting retval and
> goto out_sb: or out_mnt: type labels that only do partial cleanup?

Fixed

> 
> > +	fsnotify_open(filp->f_path.dentry);
> > +	fd_install(fd, filp);
> > +	mntput(mnt);
> > +	deactivate_super(sb);
> > +	return fd;
> 
> This could set retval = fd and goto out_mnt: (skipping dput(dentry)) in conjunction with the change proposed below.
> 
> > +err_out:
> > +	mntput(mnt);
> > +	deactivate_super(sb);
> > +	dput(dentry);
> > +	return retval;
> > +}
> 
> I think it makes sense to do the dput(dentry) first, since it will be holding a reference on the sb and vfsmnt, and it also allows code simplification:
> 
> out_err:
>        dput(dentry);
> out_mnt:
>        mntput(mnt);
> out_sb:
>        deactivate_super(sb);
> 
>        return retval;
> }
>

Fixed

I sent a V4 version of the patch with compat syscall support.

-aneesh 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-22 22:49 ` Serge E. Hallyn
@ 2010-04-23 11:45   ` Aneesh Kumar K. V
  2010-04-23 13:49     ` Serge E. Hallyn
  0 siblings, 1 reply; 23+ messages in thread
From: Aneesh Kumar K. V @ 2010-04-23 11:45 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: hch, viro, adilger, corbet, linux-fsdevel, sfrench

On Thu, 22 Apr 2010 17:49:58 -0500, "Serge E. Hallyn" <serue@us.ibm.com> wrote:
> Quoting Aneesh Kumar K.V (aneesh.kumar@linux.vnet.ibm.com):
> > Hi,
> > 
> > The below set of patches implement open by handle support using exportfs
> > operations. This allows user space application to map a file name to file 
> > handle and later open the file using handle. This should be usable
> > for userspace NFS [1] and 9P server [2]. XFS already support this with the ioctls
> > XFS_IOC_PATH_TO_HANDLE and XFS_IOC_OPEN_BY_HANDLE.
> > 
> > [1] http://nfs-ganesha.sourceforge.net/
> > [2] http://lists.gnu.org/archive/html/qemu-devel/2010-03/msg01087.html
> > 
> > TODO:
> > I guess we would need to optimize how we get the vfsmount for the filesystem
> > uuid specified. Searching the file system list and task name space may be a big
> > overhead for each open by handle call.
> > 
> > Chages from V2:
> > a) Support system wide unique handle.
> > 
> > Changes from v1:
> > a) handle size is now specified in bytes
> > b) returns -EOVERFLOW if the handle size is small
> > c) dropped open_handle syscall and added open_by_handle_at syscall
> >    open_by_handle_at takes mount_fd as the directory fd of the mount point
> >    containing the file
> > e) handle will only be unique in a given file system. So for an NFS server
> >    exporting multiple file system, NFS server will have to internally track the
> >    mount point to which a file handle belongs to. We should be able to do it much
> >    easily than expecting kernel to give a system wide unique file handle. System
> >    wide unique file handle would need much larger changes to the exportfs or VFS
> >    interface and I was not sure whether we really need to do that in the kernel or
> >    in the user space
> > f) open_handle_at now only check for DAC_OVERRIDE capability
> 
> Hi Aneesh,
> 
> it still scares me a bit as I expect to see some privileged userspace
> daemon accept (forged) handles from unprivileged users and pass back
> open fds, but adding some random token to prevent that would require
> storing it, so I think you're doing what you reasonably can by
> requiring DAC_OVERRIDE.

We still do access permission check using the file permission
modes. What we don't do is to check the execute (x) bit on the
directory. But the idea of adding DAC_OVERRIDE is to make sure
that we allow only privileged user to use the open by handle interface.


> 
> nsproxy bit looks fine, and i saw no problems in the rest of the set.
> As Andreas was saying I don't see why you're calling it _at, but
> meanwhile
> 
> Acked-by: Serge Hallyn <serue@us.ibm.com>


I missed adding Acked-by:  in V4 iteration. But will add in V5
iteration. Can i add the same to all the patches ? Or do you want me to
add only to the nsproxy bits.

> 
> to the set.
> 
> -serge

-aneesh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2010-04-22 22:49 ` Serge E. Hallyn
@ 2010-04-23 13:23 ` Theodore Tso
  2010-04-24  0:19   ` Andreas Dilger
  7 siblings, 1 reply; 23+ messages in thread
From: Theodore Tso @ 2010-04-23 13:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: hch, viro, adilger, corbet, linux-fsdevel, sfrench

On Apr 22, 2010, at 2:15 PM, Aneesh Kumar K.V wrote:
> 
> TODO:
> I guess we would need to optimize how we get the vfsmount for the filesystem
> uuid specified. Searching the file system list and task name space may be a big
> overhead for each open by handle call.

Something to consider is whether anything bad happens if there are multiple filesystems mounted with the same UUID.  I can think of two ways this could happen.   One is when we make a read-only LVM snapshot of a filesystem, and then mount it to back up a stable snapshot.  This might happen if the sysadmin is trying to backup a SQL database, for example; the database gets frozen, we take a snapshot, and then we unfreeze the database and mount the snapshot.   Now suppose we try to open-by-handle the mysql database --- should the system return the a file from the r/o frozen snapshot, or from the r/w file system?

The second case is that there are people, especially people playing games with virtualization, who copy file system images around, and then don't bother to use "tune2fs -U random" afterwards to make sure each file system gets its own unique UUID.   I even have some vague memories that some distribution was thinking about copying an image-copy of a filesystem from a Live CD and then using resize2fs to re-inflate the fs image as a way of doing a quick-and-dirty install.  So there might be a real risk that you might have multiple file systems that have diverged significantly from each other, but which have the same UUID.

Something we might do is to add a check and refuse mounting file systems with duplicate UUID's, and changing the LVM snapshot code to do run some kind of hook after a snapshot which runs a "tune2fs -U random" on the snapshot.   For r/o LVM snapshots, we could also put in a hack that if there are two file systems mounted, one r/o and one r/w, we return the r/w file system.  That doesn't handle the case of a r/w mounted LVM snapshot,  as well as other cases where we might have duplicate fs UUID's, though....

-- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-23 11:45   ` Aneesh Kumar K. V
@ 2010-04-23 13:49     ` Serge E. Hallyn
  0 siblings, 0 replies; 23+ messages in thread
From: Serge E. Hallyn @ 2010-04-23 13:49 UTC (permalink / raw)
  To: Aneesh Kumar K. V; +Cc: hch, viro, adilger, corbet, linux-fsdevel, sfrench

Quoting Aneesh Kumar K. V (aneesh.kumar@linux.vnet.ibm.com):
> On Thu, 22 Apr 2010 17:49:58 -0500, "Serge E. Hallyn" <serue@us.ibm.com> wrote:
> > Quoting Aneesh Kumar K.V (aneesh.kumar@linux.vnet.ibm.com):
> > > Hi,
> > > 
> > > The below set of patches implement open by handle support using exportfs
> > > operations. This allows user space application to map a file name to file 
> > > handle and later open the file using handle. This should be usable
> > > for userspace NFS [1] and 9P server [2]. XFS already support this with the ioctls
> > > XFS_IOC_PATH_TO_HANDLE and XFS_IOC_OPEN_BY_HANDLE.
> > > 
> > > [1] http://nfs-ganesha.sourceforge.net/
> > > [2] http://lists.gnu.org/archive/html/qemu-devel/2010-03/msg01087.html
> > > 
> > > TODO:
> > > I guess we would need to optimize how we get the vfsmount for the filesystem
> > > uuid specified. Searching the file system list and task name space may be a big
> > > overhead for each open by handle call.
> > > 
> > > Chages from V2:
> > > a) Support system wide unique handle.
> > > 
> > > Changes from v1:
> > > a) handle size is now specified in bytes
> > > b) returns -EOVERFLOW if the handle size is small
> > > c) dropped open_handle syscall and added open_by_handle_at syscall
> > >    open_by_handle_at takes mount_fd as the directory fd of the mount point
> > >    containing the file
> > > e) handle will only be unique in a given file system. So for an NFS server
> > >    exporting multiple file system, NFS server will have to internally track the
> > >    mount point to which a file handle belongs to. We should be able to do it much
> > >    easily than expecting kernel to give a system wide unique file handle. System
> > >    wide unique file handle would need much larger changes to the exportfs or VFS
> > >    interface and I was not sure whether we really need to do that in the kernel or
> > >    in the user space
> > > f) open_handle_at now only check for DAC_OVERRIDE capability
> > 
> > Hi Aneesh,
> > 
> > it still scares me a bit as I expect to see some privileged userspace
> > daemon accept (forged) handles from unprivileged users and pass back
> > open fds, but adding some random token to prevent that would require
> > storing it, so I think you're doing what you reasonably can by
> > requiring DAC_OVERRIDE.
> 
> We still do access permission check using the file permission
> modes.

Hmm?  Just to be clear - you do that check at name_to_handle(), though
of course that doesn't help prevent forgeries.  At open_by_handle you
do dentry_open(), but since you require CAP_DAC_OVERRIDE anyway, we of
course know that the task doing open_by_handle() will have all the perms
it needs to open the file, and more :)

Which, alas, is why this will be less ideal for checkpoint/restart than
I had hoped.  Certainly a privileged task can restart an unprivileged
one, but it then has to be more careful about the authenticity of the
contents of the checkpoint file.  For the re-opening of checkpointed
files to be re-authorized according to the permissions of the
checkpointed task, the unprivileged user has to be able to do restart
himself.  Hmm, suppose we could add a re-authorize phase before
returning to userspace where we double-check...

> What we don't do is to check the execute (x) bit on the
> directory. But the idea of adding DAC_OVERRIDE is to make sure
> that we allow only privileged user to use the open by handle interface.
> 
> 
> > 
> > nsproxy bit looks fine, and i saw no problems in the rest of the set.
> > As Andreas was saying I don't see why you're calling it _at, but
> > meanwhile
> > 
> > Acked-by: Serge Hallyn <serue@us.ibm.com>
> 
> 
> I missed adding Acked-by:  in V4 iteration. But will add in V5
> iteration. Can i add the same to all the patches ? Or do you want me to
> add only to the nsproxy bits.

All the patches.

thanks,
-serge

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-23 13:23 ` Theodore Tso
@ 2010-04-24  0:19   ` Andreas Dilger
  2010-04-24  1:08     ` Neil Brown
  2010-04-25 18:07     ` Aneesh Kumar K. V
  0 siblings, 2 replies; 23+ messages in thread
From: Andreas Dilger @ 2010-04-24  0:19 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Aneesh Kumar K.V, hch, viro, corbet, linux-fsdevel, sfrench

On 2010-04-23, at 07:23, Theodore Tso wrote:
> 
> Something to consider is whether anything bad happens if there are multiple filesystems mounted with the same UUID.  I can think of two ways this could happen.   One is when we make a read-only LVM snapshot of a filesystem, and then mount it to back up a stable snapshot.  This might happen if the sysadmin is trying to backup a SQL database, for example; the database gets frozen, we take a snapshot, and then we unfreeze the database and mount the snapshot.   Now suppose we try to open-by-handle the mysql database --- should the system return the a file from the r/o frozen snapshot, or from the r/w file system?

I'd say from the r/w LV in virtually all cases.  We are safe from totally egregious errors, because the inode+generation will prevent totally incorrect files from being returned, but newer/older versions of the same file/director may be found.

> Something we might do is to add a check and refuse mounting file systems with duplicate UUID's, and changing the LVM snapshot code to do run some kind of hook after a snapshot which runs a "tune2fs -U random" on the snapshot.   For r/o LVM snapshots, we could also put in a hack that if there are two file systems mounted, one r/o and one r/w, we return the r/w file system.

I think this may break things if we change the UUID when a snapshot is created, because we don't know what userspace might be using the UUID for.  That said, I totally agree that returning the r/w LV makes sense.  The LVM code itself understands which LV is the primary and which is the snapshot, so it likely means that the "lookup the UUID" code might need to be smarter.

Probably the simplest thing is if a new filesystem is mounted, but a second filesystem with the same UUID is mounted that it is skipped.  If we keep the UUID list in FIFO order, that should be sufficient to ensure that the "primary" version is returned first.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-24  0:19   ` Andreas Dilger
@ 2010-04-24  1:08     ` Neil Brown
  2010-04-25 18:21       ` Aneesh Kumar K. V
  2010-04-26  9:56       ` Christoph Hellwig
  2010-04-25 18:07     ` Aneesh Kumar K. V
  1 sibling, 2 replies; 23+ messages in thread
From: Neil Brown @ 2010-04-24  1:08 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Tso, Aneesh Kumar K.V, hch, viro, corbet, linux-fsdevel,
	sfrench

On Fri, 23 Apr 2010 18:19:59 -0600
Andreas Dilger <adilger@sun.com> wrote:

> On 2010-04-23, at 07:23, Theodore Tso wrote:
> > 
> > Something to consider is whether anything bad happens if there are multiple filesystems mounted with the same UUID.  I can think of two ways this could happen.   One is when we make a read-only LVM snapshot of a filesystem, and then mount it to back up a stable snapshot.  This might happen if the sysadmin is trying to backup a SQL database, for example; the database gets frozen, we take a snapshot, and then we unfreeze the database and mount the snapshot.   Now suppose we try to open-by-handle the mysql database --- should the system return the a file from the r/o frozen snapshot, or from the r/w file system?
> 
> I'd say from the r/w LV in virtually all cases.  We are safe from totally egregious errors, because the inode+generation will prevent totally incorrect files from being returned, but newer/older versions of the same file/director may be found.
> 
> > Something we might do is to add a check and refuse mounting file systems with duplicate UUID's, and changing the LVM snapshot code to do run some kind of hook after a snapshot which runs a "tune2fs -U random" on the snapshot.   For r/o LVM snapshots, we could also put in a hack that if there are two file systems mounted, one r/o and one r/w, we return the r/w file system.
> 
> I think this may break things if we change the UUID when a snapshot is created, because we don't know what userspace might be using the UUID for.  That said, I totally agree that returning the r/w LV makes sense.  The LVM code itself understands which LV is the primary and which is the snapshot, so it likely means that the "lookup the UUID" code might need to be smarter.
> 
> Probably the simplest thing is if a new filesystem is mounted, but a second filesystem with the same UUID is mounted that it is skipped.  If we keep the UUID list in FIFO order, that should be sufficient to ensure that the "primary" version is returned first.
> 

I really think this sounds too much like 'policy'.  It is not a trivially
obvious algorithm for selecting the 'right' filesystem.  It depends on the
order things have happened, which might be right for the case that you are
thinking of, but might be wrong for some other case.

I haven't been following the conversation closely so I might have missed
something, but why don't we leave the mapping from handle->filesystem up to
userspace and just do the "filesystem+handle -> file" part in the kernel?
(i.e. just what nfsd does).

>From the kernel's perspective, the only unique identifier for a file system
is a (sometimes fictitious or arbitrary) device number.  Using anything else
(except maybe a mount point) in a kernel interface just seems wrong.

Maybe map the filesystem part of the handle from UUID (or whatever) to devno
in userspace, then pass the devno+file-part-of-handle to the kernel to
perform, the final mapping.

NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-24  0:19   ` Andreas Dilger
  2010-04-24  1:08     ` Neil Brown
@ 2010-04-25 18:07     ` Aneesh Kumar K. V
  1 sibling, 0 replies; 23+ messages in thread
From: Aneesh Kumar K. V @ 2010-04-25 18:07 UTC (permalink / raw)
  To: Andreas Dilger, Theodore Tso; +Cc: hch, viro, corbet, linux-fsdevel, sfrench

On Fri, 23 Apr 2010 18:19:59 -0600, Andreas Dilger <adilger@sun.com> wrote:
> On 2010-04-23, at 07:23, Theodore Tso wrote:
> > 
> > Something to consider is whether anything bad happens if there are multiple filesystems mounted with the same UUID.  I can think of two ways this could happen.   One is when we make a read-only LVM snapshot of a filesystem, and then mount it to back up a stable snapshot.  This might happen if the sysadmin is trying to backup a SQL database, for example; the database gets frozen, we take a snapshot, and then we unfreeze the database and mount the snapshot.   Now suppose we try to open-by-handle the mysql database --- should the system return the a file from the r/o frozen snapshot, or from the r/w file system?
> 
> I'd say from the r/w LV in virtually all cases.  We are safe from totally egregious errors, because the inode+generation will prevent totally incorrect files from being returned, but newer/older versions of the same file/director may be found.
> 
> > Something we might do is to add a check and refuse mounting file systems with duplicate UUID's, and changing the LVM snapshot code to do run some kind of hook after a snapshot which runs a "tune2fs -U random" on the snapshot.   For r/o LVM snapshots, we could also put in a hack that if there are two file systems mounted, one r/o and one r/w, we return the r/w file system.
> 
> I think this may break things if we change the UUID when a snapshot is created, because we don't know what userspace might be using the UUID for.  That said, I totally agree that returning the r/w LV makes sense.  The LVM code itself understands which LV is the primary and which is the snapshot, so it likely means that the "lookup the UUID" code might need to be smarter.
> 
> Probably the simplest thing is if a new filesystem is mounted, but a second filesystem with the same UUID is mounted that it is skipped.  If we keep the UUID list in FIFO order, that should be sufficient to ensure that the "primary" version is returned first.
> 

I was planning to make sure that when we add the UUID <->
super_block/vfsmount mapping into a hash table during mount we
don't add the second mapping with same uuid. This make sure the hash
table contain only one UUID to vfsmount mapping. And when we do an
open-by-handle if vfsmount passed is not same as the one found in the
mapping we fail the open-by-handle call.


-aneesh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-24  1:08     ` Neil Brown
@ 2010-04-25 18:21       ` Aneesh Kumar K. V
  2010-04-26  9:56       ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Aneesh Kumar K. V @ 2010-04-25 18:21 UTC (permalink / raw)
  To: Neil Brown, Andreas Dilger
  Cc: Theodore Tso, hch, viro, corbet, linux-fsdevel, sfrench

On Sat, 24 Apr 2010 11:08:12 +1000, Neil Brown <neilb@suse.de> wrote:
> On Fri, 23 Apr 2010 18:19:59 -0600
> Andreas Dilger <adilger@sun.com> wrote:
> 
> > On 2010-04-23, at 07:23, Theodore Tso wrote:
> > > 
> > > Something to consider is whether anything bad happens if there are multiple filesystems mounted with the same UUID.  I can think of two ways this could happen.   One is when we make a read-only LVM snapshot of a filesystem, and then mount it to back up a stable snapshot.  This might happen if the sysadmin is trying to backup a SQL database, for example; the database gets frozen, we take a snapshot, and then we unfreeze the database and mount the snapshot.   Now suppose we try to open-by-handle the mysql database --- should the system return the a file from the r/o frozen snapshot, or from the r/w file system?
> > 
> > I'd say from the r/w LV in virtually all cases.  We are safe from totally egregious errors, because the inode+generation will prevent totally incorrect files from being returned, but newer/older versions of the same file/director may be found.
> > 
> > > Something we might do is to add a check and refuse mounting file systems with duplicate UUID's, and changing the LVM snapshot code to do run some kind of hook after a snapshot which runs a "tune2fs -U random" on the snapshot.   For r/o LVM snapshots, we could also put in a hack that if there are two file systems mounted, one r/o and one r/w, we return the r/w file system.
> > 
> > I think this may break things if we change the UUID when a snapshot is created, because we don't know what userspace might be using the UUID for.  That said, I totally agree that returning the r/w LV makes sense.  The LVM code itself understands which LV is the primary and which is the snapshot, so it likely means that the "lookup the UUID" code might need to be smarter.
> > 
> > Probably the simplest thing is if a new filesystem is mounted, but a second filesystem with the same UUID is mounted that it is skipped.  If we keep the UUID list in FIFO order, that should be sufficient to ensure that the "primary" version is returned first.
> > 
> 
> I really think this sounds too much like 'policy'.  It is not a trivially
> obvious algorithm for selecting the 'right' filesystem.  It depends on the
> order things have happened, which might be right for the case that you are
> thinking of, but might be wrong for some other case.
> 
> I haven't been following the conversation closely so I might have missed
> something, but why don't we leave the mapping from handle->filesystem up to
> userspace and just do the "filesystem+handle -> file" part in the kernel?
> (i.e. just what nfsd does).
> 
> From the kernel's perspective, the only unique identifier for a file system
> is a (sometimes fictitious or arbitrary) device number.  Using anything else
> (except maybe a mount point) in a kernel interface just seems wrong.
> 
> Maybe map the filesystem part of the handle from UUID (or whatever) to devno
> in userspace, then pass the devno+file-part-of-handle to the kernel to
> perform, the final mapping.
> 

My earlier patchset[1] more or less did that. It actually took a mountdir
fd and a file system unique handle to identify the inode in the file
system. The idea was to let the userspace NFS server track the mount
points in the exported path and use the right mountdir fd when it does
a open-by-handle syscall. NFSD can add extra information to the handle
returned from the syscall to indicate which mountdir fd should be used
with the open-by-handle request.


[1] http://article.gmane.org/gmane.linux.file-systems/38909
    Message-id:1268932144-14105-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com

-aneesh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-24  1:08     ` Neil Brown
  2010-04-25 18:21       ` Aneesh Kumar K. V
@ 2010-04-26  9:56       ` Christoph Hellwig
  2010-04-26 10:16         ` Neil Brown
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2010-04-26  9:56 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andreas Dilger, Theodore Tso, Aneesh Kumar K.V, hch, viro,
	corbet, linux-fsdevel, sfrench

On Sat, Apr 24, 2010 at 11:08:12AM +1000, Neil Brown wrote:
> Maybe map the filesystem part of the handle from UUID (or whatever) to devno
> in userspace, then pass the devno+file-part-of-handle to the kernel to
> perform, the final mapping.

The device number is not a useful kernel interface at all.  Getting a
uuid really is easy in kernelspace as it's available in the superblock
for every reasonable fs.  What's more difficult is finding the right
vfsmount instance of a superblock to use - not just due to read only
but also things like no* per-vfsmount flags.

If you look at libhandle in xfsprogs which wraps the existing xfs handle
ioctls for use in application you'll see such a hash table to map to
open file descriptors per filesystems due to the limits of the ioctl
interface.  Doing the uuid lookup in kernelspace sounds much saner as
we have a list of mounts there anyway.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-26  9:56       ` Christoph Hellwig
@ 2010-04-26 10:16         ` Neil Brown
  2010-04-26 10:28           ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Neil Brown @ 2010-04-26 10:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Dilger, Theodore Tso, Aneesh Kumar K.V, viro, corbet,
	linux-fsdevel, sfrench

On Mon, 26 Apr 2010 05:56:58 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Sat, Apr 24, 2010 at 11:08:12AM +1000, Neil Brown wrote:
> > Maybe map the filesystem part of the handle from UUID (or whatever) to devno
> > in userspace, then pass the devno+file-part-of-handle to the kernel to
> > perform, the final mapping.
> 
> The device number is not a useful kernel interface at all.  Getting a
> uuid really is easy in kernelspace as it's available in the superblock
> for every reasonable fs.  What's more difficult is finding the right
> vfsmount instance of a superblock to use - not just due to read only
> but also things like no* per-vfsmount flags.
> 
> If you look at libhandle in xfsprogs which wraps the existing xfs handle
> ioctls for use in application you'll see such a hash table to map to
> open file descriptors per filesystems due to the limits of the ioctl
> interface.  Doing the uuid lookup in kernelspace sounds much saner as
> we have a list of mounts there anyway.

'device number' is useful in that it is guaranteed to be unique,
and fairly trivial to simply generate a unique one in the kernel with no real
cost for any lack of stability.
Don't think of it as a device number - think of it as a system-wide
filesystem-descriptor.  You mount a filesystem and then use lstat to find
out what fsd was allocated and use that.  Sometimes it will be based on
the major/minor of a related device, other times it will not.
It is already the key that is used for accessing the per-filesystem bdi via
 /sys/class/bdi/XX:XX/... for setting e.g. read_ahead_kb and {min,max}_ratio.

uuids - as Ted as shown - are not necessarily unique (despite the name),
so using them to perform a lookup will either be imperfect (not good for a
kernel interface) or will impose extra uniqueness which would break current
configurations.

Surely the imperfection of the mapping is best handled in user-space, with
policy such as "prefer writeable snapshot" etc.

NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-26 10:16         ` Neil Brown
@ 2010-04-26 10:28           ` Christoph Hellwig
  2010-04-26 11:16             ` Neil Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2010-04-26 10:28 UTC (permalink / raw)
  To: Neil Brown
  Cc: Christoph Hellwig, Andreas Dilger, Theodore Tso,
	Aneesh Kumar K.V, viro, corbet, linux-fsdevel, sfrench

On Mon, Apr 26, 2010 at 08:16:33PM +1000, Neil Brown wrote:
> uuids - as Ted as shown - are not necessarily unique (despite the name),
> so using them to perform a lookup will either be imperfect (not good for a
> kernel interface) or will impose extra uniqueness which would break current
> configurations.

UUID can easily be made unique.  XFS for example refuses to mount
filesystems with duplicate UUIDs.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-26 10:28           ` Christoph Hellwig
@ 2010-04-26 11:16             ` Neil Brown
  2010-04-26 14:53               ` Theodore Tso
  0 siblings, 1 reply; 23+ messages in thread
From: Neil Brown @ 2010-04-26 11:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Dilger, Theodore Tso, Aneesh Kumar K.V, viro, corbet,
	linux-fsdevel, sfrench

On Mon, 26 Apr 2010 06:28:44 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Apr 26, 2010 at 08:16:33PM +1000, Neil Brown wrote:
> > uuids - as Ted as shown - are not necessarily unique (despite the name),
> > so using them to perform a lookup will either be imperfect (not good for a
> > kernel interface) or will impose extra uniqueness which would break current
> > configurations.
> 
> UUID can easily be made unique.  XFS for example refuses to mount
> filesystems with duplicate UUIDs.
> 

In the email from Ted that I referred to he identified two cases where
UUIDs are not unique.  Are you suggesting that we should cause those
use-cases - which work today - to fail?

NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-26 11:16             ` Neil Brown
@ 2010-04-26 14:53               ` Theodore Tso
  2010-04-26 14:56                 ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Tso @ 2010-04-26 14:53 UTC (permalink / raw)
  To: Neil Brown
  Cc: Christoph Hellwig, Andreas Dilger, Aneesh Kumar K.V, viro,
	corbet, linux-fsdevel, sfrench


On Apr 26, 2010, at 7:16 AM, Neil Brown wrote:

> On Mon, 26 Apr 2010 06:28:44 -0400
> Christoph Hellwig <hch@infradead.org> wrote:
> 
>> On Mon, Apr 26, 2010 at 08:16:33PM +1000, Neil Brown wrote:
>>> uuids - as Ted as shown - are not necessarily unique (despite the name),
>>> so using them to perform a lookup will either be imperfect (not good for a
>>> kernel interface) or will impose extra uniqueness which would break current
>>> configurations.
>> 
>> UUID can easily be made unique.  XFS for example refuses to mount
>> filesystems with duplicate UUIDs.
>> 
> 
> In the email from Ted that I referred to he identified two cases where
> UUIDs are not unique.  Are you suggesting that we should cause those
> use-cases - which work today - to fail?

So XFS won't allow you to make a read-only snapshot of an XFS file system, and then mount it somewhere else?

Really?

-- Ted


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -V3] Generic name to handle and open by handle syscalls
  2010-04-26 14:53               ` Theodore Tso
@ 2010-04-26 14:56                 ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2010-04-26 14:56 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Neil Brown, Christoph Hellwig, Andreas Dilger, Aneesh Kumar K.V,
	viro, corbet, linux-fsdevel, sfrench

On Mon, Apr 26, 2010 at 10:53:49AM -0400, Theodore Tso wrote:
> So XFS won't allow you to make a read-only snapshot of an XFS file system, and then mount it somewhere else?

You'll have to mount it with the nouuid option so that it doesn't end up
in the global uuid hash table.  Remember XFS on IRIX had these open by
handle and saner nfs integration things for a long time already.


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2010-04-26 14:56 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-22 18:15 [PATCH -V3] Generic name to handle and open by handle syscalls Aneesh Kumar K.V
2010-04-22 18:15 ` [PATCH -V3 1/5] exportfs: Return the minimum required handle size Aneesh Kumar K.V
2010-04-22 18:15 ` [PATCH -V3 2/5] vfs: Add name to file handle conversion support Aneesh Kumar K.V
2010-04-22 18:15 ` [PATCH -V3 3/5] vfs: Add open by file handle support Aneesh Kumar K.V
2010-04-22 19:22   ` Andreas Dilger
2010-04-23 11:40     ` Aneesh Kumar K. V
2010-04-22 18:15 ` [PATCH -V3 4/5] x86: Add new syscalls for x86_32 Aneesh Kumar K.V
2010-04-22 18:15 ` [PATCH -V3 5/5] ext4: Add get_fsid callback Aneesh Kumar K.V
2010-04-22 19:07 ` [PATCH -V3] Generic name to handle and open by handle syscalls Andreas Dilger
2010-04-22 22:49 ` Serge E. Hallyn
2010-04-23 11:45   ` Aneesh Kumar K. V
2010-04-23 13:49     ` Serge E. Hallyn
2010-04-23 13:23 ` Theodore Tso
2010-04-24  0:19   ` Andreas Dilger
2010-04-24  1:08     ` Neil Brown
2010-04-25 18:21       ` Aneesh Kumar K. V
2010-04-26  9:56       ` Christoph Hellwig
2010-04-26 10:16         ` Neil Brown
2010-04-26 10:28           ` Christoph Hellwig
2010-04-26 11:16             ` Neil Brown
2010-04-26 14:53               ` Theodore Tso
2010-04-26 14:56                 ` Christoph Hellwig
2010-04-25 18:07     ` Aneesh Kumar K. V

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.