* [RFC] big fat transaction ioctl
@ 2009-11-10 20:12 Sage Weil
  2009-11-10 20:44 ` Andrey Kuzmin
  2009-11-11 14:54 ` Chris Mason
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2009-11-10 20:12 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

This is an alternative approach to atomic user transactions for btrfs.  
The old start/end ioctls suffer from some basic limitations, namely

 - We can't properly reserve space ahead of time to avoid ENOSPC part 
way through the transaction, and
 - The process may die (seg fault, SIGKILL) part way through the 
transaction.  Currently when that happens the partial transaction will 
commit.

This patch implements an ioctl that lets the application completely 
specify the entire transaction in a single syscall.  If the process gets 
killed or seg faults part way through, the entire transaction will still 
complete.

The goal is to atomically commit updates to multiple files, xattrs, and
directories.  But this is still a file system: we don't get rollback if 
things go wrong.  Instead, do what we can up front to make sure things 
will work out.  And if things do go wrong, optionally prevent a partial 
result from reaching the disk.
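To make the "entire transaction in a single syscall" idea concrete, here is a
minimal user-space sketch of how an application might fill in the op vector
for one small update (open, write, close, set an xattr).  The struct layouts
and op/flag values are copied from the ioctl.h hunk of this patch; the short
names (ut_op, ut, mk_op, build_update), the xattr name and the file mode are
assumptions made up for illustration.  It only builds the request and does
not issue the ioctl.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the definitions proposed in fs/btrfs/ioctl.h by this
 * patch, repeated here only so the sketch compiles on its own. */
#define UT_OP_OPEN      1   /* path, flags, mode, fd slot */
#define UT_OP_CLOSE     2   /* fd slot */
#define UT_OP_PWRITE    3   /* fd slot, data, length, offset */
#define UT_OP_SETXATTR  9   /* path, name, data, len, flags */
#define UT_FAIL_ON_NE   (1 << 1)
#define UT_FAIL_ON_LT   (1 << 3)

struct ut_op {
    uint64_t op;
    int64_t  args[5];
    int64_t  rval;      /* in: value to compare against; out: result */
    uint64_t flags;     /* which comparison counts as failure */
};

struct ut {
    uint64_t num_ops;
    uint64_t ops_ptr;
    uint64_t num_fds;
    uint64_t data_bytes, metadata_ops;
    uint64_t flags;
    uint64_t ops_completed;
};

/* Pointers travel through the signed 64-bit args[] slots. */
#define PTR(p) ((int64_t)(intptr_t)(p))

static struct ut_op mk_op(uint64_t op, int64_t a0, int64_t a1, int64_t a2,
                          int64_t a3, int64_t a4, int64_t rval,
                          uint64_t flags)
{
    struct ut_op o = { .op = op, .args = { a0, a1, a2, a3, a4 },
                       .rval = rval, .flags = flags };
    return o;
}

/* Hypothetical helper: describe "write buf to path, then tag it". */
static void build_update(struct ut *ut, struct ut_op ops[4],
                         const char *path, const char *buf, size_t len)
{
    static const char tag[] = "1";

    /* open into fd slot 0; a negative return means failure */
    ops[0] = mk_op(UT_OP_OPEN, PTR(path), O_WRONLY | O_CREAT, 0644, 0, 0,
                   0, UT_FAIL_ON_LT);
    /* write through slot 0; anything but a full write is a failure */
    ops[1] = mk_op(UT_OP_PWRITE, 0, PTR(buf), (int64_t)len, 0, 0,
                   (int64_t)len, UT_FAIL_ON_NE);
    ops[2] = mk_op(UT_OP_CLOSE, 0, 0, 0, 0, 0, 0, UT_FAIL_ON_NE);
    ops[3] = mk_op(UT_OP_SETXATTR, PTR(path), PTR("user.version"),
                   PTR(tag), 1, 0, 0, UT_FAIL_ON_NE);

    memset(ut, 0, sizeof(*ut));
    ut->num_ops = 4;
    ut->ops_ptr = (uint64_t)(uintptr_t)ops;
    ut->num_fds = 1;            /* only slot 0 is used */
    ut->data_bytes = len;
    ut->metadata_ops = 4;       /* the posted code reserves 5 * num_ops */
}
```

On a kernel carrying this RFC, the filled-in struct ut would then be handed
to the new ioctl in one call; nothing in the sketch depends on that step.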

A few things:

 - The implementation just exports the sys_* calls it needs (a popular 
move, no doubt :).  I've looked at using the corresponding vfs_* 
functions instead, and keeping a table of struct file *'s instead of 
fd's to avoid these exports, but this requires a large amount of 
duplication of semi-boilerplate path lookup, security_path_* hooks, and 
similar code from fs/namei.c and elsewhere.  If we want to go that 
route, there are some advantages, the main one being that we can verify 
that every dentry/inode we operate on belongs to the same fs.  But the 
code will be more complex... I'm not sure if I should pursue that just 
yet.

 - The application gets to define what constitutes a failure for each 
individual op based on its return value.

 - If the transaction fails, the process can instruct the fs to wedge 
itself so that a partial result does not commit.  This isn't a particularly 
elegant approach, but a wedged fs may be preferable to a partial 
transaction commit.  (Alternatively, a failure could branch/jump to 
another point in the transaction op vector to do some cleanup and/or an 
explicit WEDGE op to accomplish the same thing?)

 - This still uses the existing ioctl start transaction call.  Depending on 
how Josef's ENOSPC journal_info stuff works out, I should be able to avoid 
the current global open_ioctl_trans counter for a cleaner interaction with 
the btrfs transaction code.

 - The data space reservation is still missing.  I need a way to 
find which space_info will be used, and pin it for the duration 
of the entire transaction.

 - The metadata reservation is a worst-case bound.  It could be less 
conservative, but currently each op is pulled out of the user address 
space individually so we'd either need two passes, a big kmalloc, or 
further trust the app to get the value right.  (Same goes for the data 
size, actually, although that's easier to get correct.)
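
The per-op failure rule mentioned in the list above (the application picks
which comparison against rval counts as a failure) can be sketched as a
single predicate.  The flag values match the ioctl.h hunk of this patch; the
function name ut_op_failed and the shortened macro names are illustrative
only, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit values as the BTRFS_IOC_UT_OP_FLAG_FAIL_ON_* defines. */
#define FAIL_ON_NE   (1 << 1)
#define FAIL_ON_EQ   (1 << 2)
#define FAIL_ON_LT   (1 << 3)
#define FAIL_ON_GT   (1 << 4)
#define FAIL_ON_LTE  (1 << 5)
#define FAIL_ON_GTE  (1 << 6)

/*
 * Return true if an op whose syscall returned 'ret' should abort the
 * transaction, given the caller-supplied comparison value 'rval'.
 * Several flags may be set at once; any matching one is enough.
 */
static bool ut_op_failed(uint64_t flags, int64_t ret, int64_t rval)
{
    if ((flags & FAIL_ON_NE) && ret != rval)
        return true;
    if ((flags & FAIL_ON_EQ) && ret == rval)
        return true;
    if ((flags & FAIL_ON_LT) && ret < rval)
        return true;
    if ((flags & FAIL_ON_GT) && ret > rval)
        return true;
    if ((flags & FAIL_ON_LTE) && ret <= rval)
        return true;
    if ((flags & FAIL_ON_GTE) && ret >= rval)
        return true;
    return false;   /* no flag set, or no comparison matched */
}
```

For example, an open would typically use FAIL_ON_LT with rval 0 (any
negative errno fails), while a pwrite would use FAIL_ON_NE with rval set to
the requested length, so that a short write also counts as a failure.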

Thoughts on this?

Thanks-
sage


Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/ioctl.c |  187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.h |   49 ++++++++++++++
 fs/namei.c       |    3 +
 fs/open.c        |    2 +
 fs/read_write.c  |    2 +
 fs/xattr.c       |    2 +
 6 files changed, 245 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 136c5ed..4269616 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -37,6 +37,7 @@
 #include <linux/compat.h>
 #include <linux/bit_spinlock.h>
 #include <linux/security.h>
+#include <linux/syscalls.h>
 #include <linux/xattr.h>
 #include <linux/vmalloc.h>
 #include "compat.h"
@@ -1303,6 +1304,190 @@ long btrfs_ioctl_trans_end(struct file *file)
 	return 0;
 }
 
+/*
+ * return number of successfully complete ops via @ops_completed
+ * (where success/failure is defined by the _FAIL_* flags).
+ */
+static long do_usertrans(struct btrfs_root *root,
+			 struct btrfs_ioctl_usertrans *ut,
+			 u64 *ops_completed)
+{
+	int i;
+	int *fds;
+	int err;
+	struct file *file;
+	struct btrfs_ioctl_usertrans_op *ops = (void *)ut->ops_ptr;
+	int fd1, fd2;
+
+	fds = kcalloc(ut->num_fds, sizeof(int), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	for (i = 0; i < ut->num_ops; i++) {
+		struct btrfs_ioctl_usertrans_op op;
+		int ret;
+
+		err = -EFAULT;
+		if (copy_from_user(&op, &ops[i], sizeof(op)))
+			goto out;
+
+		/* lookup fd args? */
+		err = -EINVAL;
+		switch (op.op) {
+		case BTRFS_IOC_UT_OP_CLONERANGE:
+			if (op.args[1] < 0 || op.args[1] >= ut->num_fds)
+				goto out;
+			fd2 = fds[op.args[1]];
+			/* fall through to look up the first fd */
+		case BTRFS_IOC_UT_OP_CLOSE:
+		case BTRFS_IOC_UT_OP_PWRITE:
+			if (op.args[0] < 0 || op.args[0] >= ut->num_fds)
+				goto out;
+			fd1 = fds[op.args[0]];
+		}
+
+		/* do op */
+		switch (op.op) {
+		case BTRFS_IOC_UT_OP_OPEN:
+			ret = -EINVAL;
+			if (op.args[3] < 0 || op.args[3] >= ut->num_fds)
+				goto out;
+			ret = sys_open((const char __user *)op.args[0],
+				       op.args[1], op.args[2]);
+			fds[op.args[3]] = ret;
+			break;
+		case BTRFS_IOC_UT_OP_CLOSE:
+			ret = sys_close(fd1);
+			break;
+		case BTRFS_IOC_UT_OP_PWRITE:
+			ret = sys_pwrite64(fd1, (const char __user *)op.args[1],
+					   op.args[2], op.args[3]);
+			break;
+		case BTRFS_IOC_UT_OP_UNLINK:
+			ret = sys_unlink((const char __user *)op.args[0]);
+			break;
+		case BTRFS_IOC_UT_OP_MKDIR:
+			ret = sys_mkdir((const char __user *)op.args[0],
+				op.args[1]);
+			break;
+		case BTRFS_IOC_UT_OP_RMDIR:
+			ret = sys_rmdir((const char __user *)op.args[0]);
+			break;
+		case BTRFS_IOC_UT_OP_TRUNCATE:
+			ret = sys_truncate((const char __user *)op.args[0],
+					   op.args[1]);
+			break;
+		case BTRFS_IOC_UT_OP_SETXATTR:
+			ret = sys_setxattr((char __user *)op.args[0],
+					   (char __user *)op.args[1],
+					   (void __user *)op.args[2],
+					   op.args[3], op.args[4]);
+			break;
+		case BTRFS_IOC_UT_OP_REMOVEXATTR:
+			ret = sys_removexattr((char __user *)op.args[0],
+					      (char __user *)op.args[1]);
+			break;
+		case BTRFS_IOC_UT_OP_CLONERANGE:
+			ret = -EBADF;
+			file = fget(fd1);
+			if (file) {
+				ret = btrfs_ioctl_clone(file, fd2,
+							op.args[2], op.args[3],
+							op.args[4]);
+				fput(file);
+			}
+			break;
+		}
+		pr_debug(" ut %d/%d op %d args %llx %llx %llx %llx %llx = %d\n",
+			 i, (int)ut->num_ops, (int)op.op, op.args[0],
+			 op.args[1], op.args[2], op.args[3], op.args[4], ret);
+
+		put_user(ret, &ops[i].rval);
+
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE) &&
+		    ret != op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ) &&
+		    ret == op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT) &&
+		    ret < op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT) &&
+		    ret > op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE) &&
+		    ret <= op.rval)
+			goto out;
+		if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE) &&
+		    ret >= op.rval)
+			goto out;
+	}
+	err = 0;
+out:
+	*ops_completed = i;
+	kfree(fds);
+	return err;
+}
+
+long btrfs_ioctl_usertrans(struct file *file, void __user *arg)
+{
+	struct btrfs_root *root = BTRFS_I(fdentry(file)->d_inode)->root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_ioctl_usertrans ut, *orig_ut = arg;
+	u64 ops_completed = 0;
+	int ret;
+
+	ret = -EPERM;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	ret = -EFAULT;
+	if (copy_from_user(&ut, orig_ut, sizeof(ut)))
+		goto out;
+
+	ret = mnt_want_write(file->f_path.mnt);
+	if (ret)
+		goto out;
+
+	ret = btrfs_reserve_metadata_space(root, 5*ut.num_ops);
+	if (ret)
+		goto out_drop_write;
+
+	mutex_lock(&root->fs_info->trans_mutex);
+	root->fs_info->open_ioctl_trans++;
+	mutex_unlock(&root->fs_info->trans_mutex);
+
+	ret = -ENOMEM;
+	trans = btrfs_start_ioctl_transaction(root, 0);
+	if (!trans)
+		goto out_drop;
+
+	ret = do_usertrans(root, &ut, &ops_completed);
+	put_user(ops_completed, &orig_ut->ops_completed);
+
+	if (ret < 0 && (ut.flags & BTRFS_IOC_UT_FLAG_WEDGEONFAIL))
+		pr_err("btrfs: usertrans failed, wedging to avoid partial "
+		       "commit\n");
+	else
+		btrfs_end_transaction(trans, root);
+
+out_drop:
+	mutex_lock(&root->fs_info->trans_mutex);
+	root->fs_info->open_ioctl_trans--;
+	mutex_unlock(&root->fs_info->trans_mutex);
+
+	btrfs_unreserve_metadata_space(root, 5*ut.num_ops);
+out_drop_write:
+	mnt_drop_write(file->f_path.mnt);
+out:
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -1343,6 +1528,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case BTRFS_IOC_SYNC:
 		btrfs_sync_fs(file->f_dentry->d_sb, 1);
 		return 0;
+	case BTRFS_IOC_USERTRANS:
+		return btrfs_ioctl_usertrans(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index bc49914..f94e293 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -67,4 +67,53 @@ struct btrfs_ioctl_clone_range_args {
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, \
 				struct btrfs_ioctl_vol_args)
+
+/* usertrans ops */
+/* the 'fd' values are _indices_ into a temporary fd table, see num_fds below */
+#define BTRFS_IOC_UT_OP_OPEN         1  /* path, flags, mode, fd */
+#define BTRFS_IOC_UT_OP_CLOSE        2  /* fd */
+#define BTRFS_IOC_UT_OP_PWRITE       3  /* fd, data, length, offset */
+#define BTRFS_IOC_UT_OP_UNLINK       4  /* path */
+#define BTRFS_IOC_UT_OP_LINK         5  /* oldpath, newpath */
+#define BTRFS_IOC_UT_OP_MKDIR        6  /* path, mode */
+#define BTRFS_IOC_UT_OP_RMDIR        7  /* path */
+#define BTRFS_IOC_UT_OP_TRUNCATE     8  /* path, size */
+#define BTRFS_IOC_UT_OP_SETXATTR     9  /* path, name, data, len */
+#define BTRFS_IOC_UT_OP_REMOVEXATTR 10  /* path, name */
+#define BTRFS_IOC_UT_OP_CLONERANGE  11  /* dst fd, src fd, off, len, dst off */
+
+/* define what 'failure' entails for each op based on return value */
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE    (1<< 1)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ    (1<< 2)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT    (1<< 3)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT    (1<< 4)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE   (1<< 5)
+#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE   (1<< 6)
+
+struct btrfs_ioctl_usertrans_op {
+	__u64 op;
+	__s64 args[5];
+	__s64 rval;
+	__u64 flags;
+};
+
+/*
+ * If an op fails and we cannot complete the transaction, we may want
+ * to lock up the file system (requiring a reboot) to prevent a
+ * partial result from committing.
+ */
+#define BTRFS_IOC_UT_FLAG_WEDGEONFAIL (1<<13)
+
+struct btrfs_ioctl_usertrans {
+	__u64 num_ops;                  /* in: # ops */
+	__u64 ops_ptr;                  /* in: usertrans_op array */
+	__u64 num_fds;	                /* in: size of fd table (max fd + 1) */
+	__u64 data_bytes, metadata_ops; /* in: for space reservation */
+	__u64 flags;                    /* in: flags */
+	__u64 ops_completed;            /* out: # ops completed */
+};
+
+#define BTRFS_IOC_USERTRANS  _IOW(BTRFS_IOCTL_MAGIC, 16,	\
+				  struct btrfs_ioctl_usertrans)
+
 #endif
diff --git a/fs/namei.c b/fs/namei.c
index d11f404..4d53225 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2148,6 +2148,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
 {
 	return sys_mkdirat(AT_FDCWD, pathname, mode);
 }
+EXPORT_SYMBOL(sys_mkdir);
 
 /*
  * We try to drop the dentry early: we should have
@@ -2262,6 +2263,7 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
 {
 	return do_rmdir(AT_FDCWD, pathname);
 }
+EXPORT_SYMBOL(sys_rmdir);
 
 int vfs_unlink(struct inode *dir, struct dentry *dentry)
 {
@@ -2369,6 +2371,7 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
 {
 	return do_unlinkat(AT_FDCWD, pathname);
 }
+EXPORT_SYMBOL(sys_unlink);
 
 int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
 {
diff --git a/fs/open.c b/fs/open.c
index 4f01e06..15eddfc 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -294,6 +294,7 @@ SYSCALL_DEFINE2(truncate, const char __user *, path, long, length)
 {
 	return do_sys_truncate(path, length);
 }
+EXPORT_SYMBOL(sys_truncate);
 
 static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 {
@@ -1062,6 +1063,7 @@ SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
 	asmlinkage_protect(3, ret, filename, flags, mode);
 	return ret;
 }
+EXPORT_SYMBOL(sys_open);
 
 SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
 		int, mode)
diff --git a/fs/read_write.c b/fs/read_write.c
index 3ac2898..75e9f60 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -453,6 +453,8 @@ SYSCALL_DEFINE(pwrite64)(unsigned int fd, const char __user *buf,
 
 	return ret;
 }
+EXPORT_SYMBOL(sys_pwrite64);
+
 #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
 asmlinkage long SyS_pwrite64(long fd, long buf, long count, loff_t pos)
 {
diff --git a/fs/xattr.c b/fs/xattr.c
index 6d4f6d3..488c889 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -294,6 +294,7 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
 	path_put(&path);
 	return error;
 }
+EXPORT_SYMBOL(sys_setxattr);
 
 SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
 		const char __user *, name, const void __user *, value,
@@ -523,6 +524,7 @@ SYSCALL_DEFINE2(removexattr, const char __user *, pathname,
 	path_put(&path);
 	return error;
 }
+EXPORT_SYMBOL(sys_removexattr);
 
 SYSCALL_DEFINE2(lremovexattr, const char __user *, pathname,
 		const char __user *, name)
-- 
1.5.6.5
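
The ioctl.h comment in the patch says the 'fd' arguments are indices into a
temporary fd table of num_fds entries.  A user-space model of that
indirection might look like the sketch below: an OPEN op stores the real
descriptor in a slot, and later ops name the slot rather than the
descriptor.  The type and function names are hypothetical, and initialising
empty slots to -1 is a choice made here (the posted kernel code zero-fills
the table).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct fd_table {
    int *fds;           /* real descriptors, indexed by slot */
    uint64_t num_fds;   /* number of slots (max slot + 1) */
};

static int fd_table_init(struct fd_table *t, uint64_t num_fds)
{
    uint64_t i;

    /* allocate at least one element so a zero-slot table is still valid */
    t->fds = calloc(num_fds ? num_fds : 1, sizeof(int));
    if (!t->fds)
        return -ENOMEM;
    t->num_fds = num_fds;
    for (i = 0; i < num_fds; i++)
        t->fds[i] = -1; /* mark every slot as "nothing opened yet" */
    return 0;
}

/* Record the descriptor returned by an OPEN op in the given slot. */
static int fd_table_set(struct fd_table *t, int64_t slot, int fd)
{
    if (slot < 0 || (uint64_t)slot >= t->num_fds)
        return -EINVAL;
    t->fds[slot] = fd;
    return 0;
}

/* Translate a slot index from a later op back into a descriptor. */
static int fd_table_get(const struct fd_table *t, int64_t slot, int *fd)
{
    if (slot < 0 || (uint64_t)slot >= t->num_fds)
        return -EINVAL;
    *fd = t->fds[slot];
    return 0;
}

static void fd_table_free(struct fd_table *t)
{
    free(t->fds);
    t->fds = NULL;
    t->num_fds = 0;
}
```

The point of the indirection is that the application cannot know the real
descriptor numbers when it builds the op vector, since the opens only happen
inside the ioctl; slot indices are known up front.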



* Re: [RFC] big fat transaction ioctl
  2009-11-10 20:12 [RFC] big fat transaction ioctl Sage Weil
@ 2009-11-10 20:44 ` Andrey Kuzmin
  2009-11-10 22:13   ` Sage Weil
  2009-11-11 14:54 ` Chris Mason
  1 sibling, 1 reply; 13+ messages in thread
From: Andrey Kuzmin @ 2009-11-10 20:44 UTC (permalink / raw)
  To: Sage Weil; +Cc: linux-btrfs

On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> Hi all,
>
> This is an alternative approach to atomic user transactions for btrfs.
> The old start/end ioctls suffer from some basic limitations, namely
>
>  - We can't properly reserve space ahead of time to avoid ENOSPC part
> way through the transaction, and
>  - The process may die (seg fault, SIGKILL) part way through the
> transaction.  Currently when that happens the partial transaction will
> commit.
>
> This patch implements an ioctl that lets the application completely
> specify the entire transaction in a single syscall.  If the process gets
> killed or seg faults part way through, the entire transaction will still
> complete.
>
> The goal is to atomically commit updates to multiple files, xattrs,
> directories.  But this is still a file system: we don't get rollback if
> things go wrong.  Instead, do what we can up front to make sure things
> will work out.  And if things do go wrong, optionally prevent a partial
> result from reaching the disk.

Why not snapshot respective root (doesn't work if transaction spans
multiple file-systems, but this doesn't look like a real-world
limitation), run txn against that snapshot and rollback on failure
instead? Snapshots are writable, cheap, and this looks like a real
transaction abort mechanism.

Regards,
Andrey
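
The snapshot-based alternative suggested in this reply can be pictured as
the control flow below: take a snapshot, run every step against it, and
throw the snapshot away if any step fails.  This is only a sketch of that
ordering.  The callback table, the in-memory demo context and all names are
hypothetical, no btrfs call is made, and what should happen to the snapshot
on success is left open, as it is in the reply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks; a real version would wrap snapshot create/destroy. */
struct snap_txn_ops {
    int (*create_snapshot)(void *ctx);
    int (*apply)(void *ctx, size_t step);   /* run one step on the snapshot */
    int (*discard_snapshot)(void *ctx);     /* the "rollback" */
};

/* Run nsteps steps against a fresh snapshot; discard it on first failure. */
static int run_on_snapshot(const struct snap_txn_ops *ops, void *ctx,
                           size_t nsteps)
{
    size_t i;
    int ret = ops->create_snapshot(ctx);

    if (ret)
        return ret;
    for (i = 0; i < nsteps; i++) {
        ret = ops->apply(ctx, i);
        if (ret) {
            ops->discard_snapshot(ctx);
            return ret;
        }
    }
    return 0;   /* snapshot now holds the complete result */
}

/* In-memory stand-in used only to exercise the control flow. */
struct demo_ctx {
    int live_snapshots;
    int applied;
    size_t fail_at;     /* step index that should fail */
};

static int demo_create(void *p)
{
    ((struct demo_ctx *)p)->live_snapshots++;
    return 0;
}

static int demo_apply(void *p, size_t step)
{
    struct demo_ctx *c = p;

    if (step == c->fail_at)
        return -1;
    c->applied++;
    return 0;
}

static int demo_discard(void *p)
{
    ((struct demo_ctx *)p)->live_snapshots--;
    return 0;
}

static const struct snap_txn_ops demo_ops = {
    demo_create, demo_apply, demo_discard
};
```

Unlike the wedge-on-failure flag in the RFC, a failure here leaves the
original root untouched, at the cost of needing a separate step to make a
successful snapshot the live tree.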

>
> A few things:
>
> =A0- The implementation just exports the sys_* calls it needs (a popu=
lar
> move, no doubt :). =A0I've looked at using the corresponding vfs_*
> instructions instead, and keeping a table of struct file *'s instead =
of
> fd's to avoid these exports, but this requires a large amount of
> duplication of semi-boilerplate path lookup, security_path_* hooks, a=
nd
> similar code from fs/namei.c and elsewhere. =A0If we want to go that
> route, there are some advantages, the main one being that we can veri=
fy
> that every dentry/inode we operate on belongs to the same fs. =A0But =
the
> code will be more complex... I'm not sure if I should pursue that jus=
t
> yet.
>
> =A0- The application gets to define what defines a failure for each
> individual op based on its return value.
>
> =A0- If the transaction fails, the process can instruct the fs to wed=
ge
> itself so that a partial result does not commit. =A0This isn't a part=
icuarly
> elegant approach, but a wedged fs may be preferable to a partial
> transaction commit. =A0(Alternatively, a failure could branch/jump to
> another point in the transaction op vector to do some cleanup and/or =
an
> explicit WEDGE op to accomplish the same thing?)
>
> - This still uses the existing ioctl start transaction call. =A0Depen=
ding on
> how Josef's ENOSPC journal_info stuff works out, I should be able to =
avoid
> the current global open_ioctl_trans counter for a cleaner interaction=
 with
> the btrfs transaction code.
>
> - The data space reservation is still missing. =A0I need a way to
> find which space_info will be used, and pin it for the duration
> of the entire transaction.
>
> - The metadata reservation is a worst case bound. =A0It could be less
> conservative, but currently each op is pulled out of the user address
> space individually so we'd either need two passes, a big kmalloc, or
> further trust the app to get the value right. =A0(Same goes for the d=
ata
> size, actually, although that's easier to get correct.)
>
> Thoughts on this?
>
> Thanks-
> sage
>
>
> Signed-off-by: Sage Weil <sage@newdream.net>
> ---
> =A0fs/btrfs/ioctl.c | =A0187 ++++++++++++++++++++++++++++++++++++++++=
++++++++++++++
> =A0fs/btrfs/ioctl.h | =A0 49 ++++++++++++++
> =A0fs/namei.c =A0 =A0 =A0 | =A0 =A03 +
> =A0fs/open.c =A0 =A0 =A0 =A0| =A0 =A02 +
> =A0fs/read_write.c =A0| =A0 =A02 +
> =A0fs/xattr.c =A0 =A0 =A0 | =A0 =A02 +
> =A06 files changed, 245 insertions(+), 0 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 136c5ed..4269616 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -37,6 +37,7 @@
> =A0#include <linux/compat.h>
> =A0#include <linux/bit_spinlock.h>
> =A0#include <linux/security.h>
> +#include <linux/syscalls.h>
> =A0#include <linux/xattr.h>
> =A0#include <linux/vmalloc.h>
> =A0#include "compat.h"
> @@ -1303,6 +1304,190 @@ long btrfs_ioctl_trans_end(struct file *file)
> =A0 =A0 =A0 =A0return 0;
> =A0}
>
> +/*
> + * return number of successfully complete ops via @ops_completed
> + * (where success/failure is defined by the _FAIL_* flags).
> + */
> +static long do_usertrans(struct btrfs_root *root,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0struct btrfs_ioctl_u=
sertrans *ut,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0u64 *ops_completed)
> +{
> + =A0 =A0 =A0 int i;
> + =A0 =A0 =A0 int *fds;
> + =A0 =A0 =A0 int err;
> + =A0 =A0 =A0 struct file *file;
> + =A0 =A0 =A0 struct btrfs_ioctl_usertrans_op *ops =3D (void *)ut->op=
s_ptr;
> + =A0 =A0 =A0 int fd1, fd2;
> +
> + =A0 =A0 =A0 fds =3D kcalloc(sizeof(int), ut->num_fds, GFP_KERNEL);
> + =A0 =A0 =A0 if (!fds)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return -ENOMEM;
> +
> + =A0 =A0 =A0 for (i =3D 0; i < ut->num_ops; i++) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 struct btrfs_ioctl_usertrans_op op;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 int ret;
> +
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 err =3D -EFAULT;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (copy_from_user(&op, &ops[i], sizeof=
(op)))
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> +
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* lookup fd args? */
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 err =3D -EINVAL;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 switch (op.op) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_CLONERANGE:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (op.args[1] < 0 || o=
p.args[1] >=3D ut->num_fds)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto ou=
t;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 fd2 =3D fds[1];
> +
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_CLOSE:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_PWRITE:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (op.args[0] < 0 || o=
p.args[0] >=3D ut->num_fds)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto ou=
t;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 fd1 =3D fds[0];
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 }
> +
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 /* do op */
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 switch (op.op) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_OPEN:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D -EINVAL;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (op.args[3] < 0 || o=
p.args[3] >=3D ut->num_fds)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto ou=
t;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_open((const=
 char __user *)op.args[0],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0op.args[1], op.args[2]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 fds[op.args[3]] =3D ret=
;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_CLOSE:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_close(fd1);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_PWRITE:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_pwrite64(fd=
1, (const char __user *)op.args[1],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0op.args[2], op.args[3]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_UNLINK:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_unlink((con=
st char __user *)op.args[0]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_MKDIR:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_mkdir((cons=
t char __user *)op.args[0],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 op.args=
[1]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_RMDIR:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_rmdir((cons=
t char __user *)op.args[0]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_TRUNCATE:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_truncate((c=
onst char __user *)op.args[0],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0op.args[1]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_SETXATTR:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_setxattr((c=
har __user *)op.args[0],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0(char __user *)op.args[1],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0(void __user *)op.args[2],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0op.args[3], op.args[4]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_REMOVEXATTR:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D sys_removexattr=
((char __user *)op.args[0],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =A0 (char __user *)op.args[1]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 case BTRFS_IOC_UT_OP_CLONERANGE:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D -EBADF;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 file =3D fget(fd1);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (file) {
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D=
 btrfs_ioctl_clone(file, fd2,
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 op.args[2], op.args[3],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 op.args[4]);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 fput(fi=
le);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 }
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 break;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 }
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 pr_debug(" ut %d/%d op %d args %llx %ll=
x %llx %llx %llx =3D %d\n",
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0i, (int)ut->num_ops,=
 (int)op.op, op.args[0],
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0op.args[1], op.args[=
2], op.args[3], op.args[4], ret);
> +
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 put_user(ret, &ops[i].rval);
> +
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FA=
IL_ON_NE) &&
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret !=3D op.rval)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FA=
IL_ON_EQ) &&
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret =3D=3D op.rval)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FA=
IL_ON_LT) &&
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret < op.rval)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FA=
IL_ON_GT) &&
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret > op.rval)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FA=
IL_ON_LTE) &&
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret <=3D op.rval)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FA=
IL_ON_GTE) &&
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ret >=3D op.rval)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> + =A0 =A0 =A0 }
> + =A0 =A0 =A0 err =3D 0;
> +out:
> + =A0 =A0 =A0 *ops_completed =3D i;
> + =A0 =A0 =A0 kfree(fds);
> + =A0 =A0 =A0 return err;
> +}
> +
> +long btrfs_ioctl_usertrans(struct file *file, void __user *arg)
> +{
> + =A0 =A0 =A0 struct btrfs_root *root =3D BTRFS_I(fdentry(file)->d_in=
ode)->root;
> + =A0 =A0 =A0 struct btrfs_trans_handle *trans;
> + =A0 =A0 =A0 struct btrfs_ioctl_usertrans ut, *orig_ut =3D arg;
> + =A0 =A0 =A0 u64 ops_completed =3D 0;
> + =A0 =A0 =A0 int ret;
> +
> + =A0 =A0 =A0 ret =3D -EPERM;
> + =A0 =A0 =A0 if (!capable(CAP_SYS_ADMIN))
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> +
> + =A0 =A0 =A0 ret =3D -EFAULT;
> + =A0 =A0 =A0 if (copy_from_user(&ut, orig_ut, sizeof(ut)))
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> +
> + =A0 =A0 =A0 ret =3D mnt_want_write(file->f_path.mnt);
> + =A0 =A0 =A0 if (ret)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
> +
> + =A0 =A0 =A0 ret =3D btrfs_reserve_metadata_space(root, 5*ut.num_ops=
);
> + =A0 =A0 =A0 if (ret)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out_drop_write;
> +
> + =A0 =A0 =A0 mutex_lock(&root->fs_info->trans_mutex);
> + =A0 =A0 =A0 root->fs_info->open_ioctl_trans++;
> + =A0 =A0 =A0 mutex_unlock(&root->fs_info->trans_mutex);
> +
> + =A0 =A0 =A0 ret =3D -ENOMEM;
> + =A0 =A0 =A0 trans =3D btrfs_start_ioctl_transaction(root, 0);
> + =A0 =A0 =A0 if (!trans)
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out_drop;
> +
> + =A0 =A0 =A0 ret =3D do_usertrans(root, &ut, &ops_completed);
> + =A0 =A0 =A0 put_user(ops_completed, &orig_ut->ops_completed);
> +
> + =A0 =A0 =A0 if (ret < 0 && (ut.flags & BTRFS_IOC_UT_FLAG_WEDGEONFAI=
L))
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 pr_err("btrfs: usertrans failed, wedgin=
g to avoid partial "
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0" commit\n");
> + =A0 =A0 =A0 else
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 btrfs_end_transaction(trans, root);
> +
> +out_drop:
> + =A0 =A0 =A0 mutex_lock(&root->fs_info->trans_mutex);
> + =A0 =A0 =A0 root->fs_info->open_ioctl_trans--;
> + =A0 =A0 =A0 mutex_unlock(&root->fs_info->trans_mutex);
> +
> + =A0 =A0 =A0 btrfs_unreserve_metadata_space(root, 5*ut.num_ops);
> +out_drop_write:
> + =A0 =A0 =A0 mnt_drop_write(file->f_path.mnt);
> +out:
> + =A0 =A0 =A0 return ret;
> +}
> +
> =A0long btrfs_ioctl(struct file *file, unsigned int
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmd, unsigned long arg)
> =A0{
> @@ -1343,6 +1528,8 @@ long btrfs_ioctl(struct file *file, unsigned in=
t
> =A0 =A0 =A0 =A0case BTRFS_IOC_SYNC:
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0btrfs_sync_fs(file->f_dentry->d_sb, 1)=
;
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return 0;
> + =A0 =A0 =A0 case BTRFS_IOC_USERTRANS:
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return btrfs_ioctl_usertrans(file, argp=
);
> =A0 =A0 =A0 =A0}
>
> =A0 =A0 =A0 =A0return -ENOTTY;
> diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
> index bc49914..f94e293 100644
> --- a/fs/btrfs/ioctl.h
> +++ b/fs/btrfs/ioctl.h
> @@ -67,4 +67,53 @@ struct btrfs_ioctl_clone_range_args {
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 s=
truct btrfs_ioctl_vol_args)
> =A0#define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, \
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0struct=
 btrfs_ioctl_vol_args)
> +
> +/* usertrans ops */
> +/* the 'fd' values are _indices_ into a temporary fd table, see num_=
fds below */
> +#define BTRFS_IOC_UT_OP_OPEN =A0 =A0 =A0 =A0 1 =A0/* path, flags, mo=
de, fd */
> +#define BTRFS_IOC_UT_OP_CLOSE =A0 =A0 =A0 =A02 =A0/* fd */
> +#define BTRFS_IOC_UT_OP_PWRITE =A0 =A0 =A0 3 =A0/* fd, data, length,=
 offset */
> +#define BTRFS_IOC_UT_OP_UNLINK =A0 =A0 =A0 4 =A0/* path */
> +#define BTRFS_IOC_UT_OP_LINK =A0 =A0 =A0 =A0 5 =A0/* oldpath, newpat=
h */
> +#define BTRFS_IOC_UT_OP_MKDIR =A0 =A0 =A0 =A06 =A0/* path, mode */
> +#define BTRFS_IOC_UT_OP_RMDIR =A0 =A0 =A0 =A07 =A0/* path */
> +#define BTRFS_IOC_UT_OP_TRUNCATE =A0 =A0 8 =A0/* path, size */
> +#define BTRFS_IOC_UT_OP_SETXATTR =A0 =A0 9 =A0/* path, name, data, l=
en */
> +#define BTRFS_IOC_UT_OP_REMOVEXATTR 10 =A0/* path, name */
> +#define BTRFS_IOC_UT_OP_CLONERANGE =A011 =A0/* dst fd, src fd, off, =
len, dst off */
> +
> +/* define what 'failure' entails for each op based on return value *=
/
> +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE =A0 =A0(1<< 1)
> +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ =A0 =A0(1<< 2)
> +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT =A0 =A0(1<< 3)
> +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT =A0 =A0(1<< 4)
> +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE =A0 (1<< 5)
> +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE =A0 (1<< 6)
> +
> +struct btrfs_ioctl_usertrans_op {
> + =A0 =A0 =A0 __u64 op;
> + =A0 =A0 =A0 __s64 args[5];
> + =A0 =A0 =A0 __s64 rval;
> + =A0 =A0 =A0 __u64 flags;
> +};
> +
> +/*
> + * If an op fails and we cannot complete the transaction, we may wan=
t
> + * to lock up the file system (requiring a reboot) to prevent a
> + * partial result from committing.
> + */
> +#define BTRFS_IOC_UT_FLAG_WEDGEONFAIL (1<<13)
> +
> +struct btrfs_ioctl_usertrans {
> + =A0 =A0 =A0 __u64 num_ops; =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/* in=
: # ops */
> + =A0 =A0 =A0 __u64 ops_ptr; =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/* in=
: usertrans_op array */
> + =A0 =A0 =A0 __u64 num_fds; =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/* in=
: size of fd table (max fd + 1) */
> + =A0 =A0 =A0 __u64 data_bytes, metadata_ops; /* in: for space reserv=
ation */
> + =A0 =A0 =A0 __u64 flags; =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/* =
in: flags */
> + =A0 =A0 =A0 __u64 ops_completed; =A0 =A0 =A0 =A0 =A0 =A0/* out: # o=
ps completed */
> +};
> +
> +#define BTRFS_IOC_USERTRANS =A0_IOW(BTRFS_IOCTL_MAGIC, 16, =A0 =A0 =A0=
 \
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 str=
uct btrfs_ioctl_usertrans)
> +
> =A0#endif
> diff --git a/fs/namei.c b/fs/namei.c
> index d11f404..4d53225 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2148,6 +2148,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
>  {
>        return sys_mkdirat(AT_FDCWD, pathname, mode);
>  }
> +EXPORT_SYMBOL(sys_mkdir);
>
>  /*
>  * We try to drop the dentry early: we should have
> @@ -2262,6 +2263,7 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
>  {
>        return do_rmdir(AT_FDCWD, pathname);
>  }
> +EXPORT_SYMBOL(sys_rmdir);
>
>  int vfs_unlink(struct inode *dir, struct dentry *dentry)
>  {
> @@ -2369,6 +2371,7 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
>  {
>        return do_unlinkat(AT_FDCWD, pathname);
>  }
> +EXPORT_SYMBOL(sys_unlink);
>
>  int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
>  {
> diff --git a/fs/open.c b/fs/open.c
> index 4f01e06..15eddfc 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -294,6 +294,7 @@ SYSCALL_DEFINE2(truncate, const char __user *, path, long, length)
>  {
>        return do_sys_truncate(path, length);
>  }
> +EXPORT_SYMBOL(sys_truncate);
>
>  static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
>  {
> @@ -1062,6 +1063,7 @@ SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
>        asmlinkage_protect(3, ret, filename, flags, mode);
>        return ret;
>  }
> +EXPORT_SYMBOL(sys_open);
>
>  SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
>                int, mode)
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 3ac2898..75e9f60 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -453,6 +453,8 @@ SYSCALL_DEFINE(pwrite64)(unsigned int fd, const char __user *buf,
>
>        return ret;
>  }
> +EXPORT_SYMBOL(sys_pwrite64);
> +
>  #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
>  asmlinkage long SyS_pwrite64(long fd, long buf, long count, loff_t pos)
>  {
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 6d4f6d3..488c889 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -294,6 +294,7 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
>        path_put(&path);
>        return error;
>  }
> +EXPORT_SYMBOL(sys_setxattr);
>
>  SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
>                const char __user *, name, const void __user *, value,
> @@ -523,6 +524,7 @@ SYSCALL_DEFINE2(removexattr, const char __user *, pathname,
>        path_put(&path);
>        return error;
>  }
> +EXPORT_SYMBOL(sys_removexattr);
>
>  SYSCALL_DEFINE2(lremovexattr, const char __user *, pathname,
>                const char __user *, name)
> --
> 1.5.6.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [RFC] big fat transaction ioctl
  2009-11-10 20:44 ` Andrey Kuzmin
@ 2009-11-10 22:13   ` Sage Weil
  2009-11-11  0:49     ` Jeremy Fitzhardinge
  2009-11-11 15:03     ` Chris Mason
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2009-11-10 22:13 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: linux-btrfs


On Tue, 10 Nov 2009, Andrey Kuzmin wrote:

> On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> > Hi all,
> >
> > This is an alternative approach to atomic user transactions for btrfs.
> > The old start/end ioctls suffer from some basic limitations, namely
> >
> >  - We can't properly reserve space ahead of time to avoid ENOSPC part
> > way through the transaction, and
> >  - The process may die (seg fault, SIGKILL) part way through the
> > transaction.  Currently when that happens the partial transaction will
> > commit.
> >
> > This patch implements an ioctl that lets the application completely
> > specify the entire transaction in a single syscall.  If the process gets
> > killed or seg faults part way through, the entire transaction will still
> > complete.
> >
> > The goal is to atomically commit updates to multiple files, xattrs,
> > directories.  But this is still a file system: we don't get rollback if
> > things go wrong.  Instead, do what we can up front to make sure things
> > will work out.  And if things do go wrong, optionally prevent a partial
> > result from reaching the disk.
> 
> Why not snapshot respective root (doesn't work if transaction spans
> multiple file-systems, but this doesn't look like a real-world
> limitation), run txn against that snapshot and rollback on failure
> instead? Snapshots are writable, cheap, and this looks like a real
> transaction abort mechanism.

Good question.  :)

I hadn't looked into this before, but I think the snapshots could be used 
to achieve both atomicity and rollback.  If userspace uses an rw mutex to 
quiesce writes, it can make sure all transactions complete before creating 
a snapshot (commit).  The problem with this currently is the create 
snapshot ioctl is relatively slow... it calls commit_transaction, which 
blocks until everything reaches disk.  I think to perform well this 
approach would need a hook to start a commit and then return as soon as it 
can guarantee that any subsequent operation's start_transaction can't join 
in that commit.

This may be a better way to go about this, though.  Does that sound 
reasonable, Chris?

sage


> 
> Regards,
> Andrey
> 
> >
> > A few things:
> >
> >  - The implementation just exports the sys_* calls it needs (a popular
> > move, no doubt :).  I've looked at using the corresponding vfs_*
> > functions instead, and keeping a table of struct file *'s instead of
> > fd's to avoid these exports, but this requires a large amount of
> > duplication of semi-boilerplate path lookup, security_path_* hooks, and
> > similar code from fs/namei.c and elsewhere.  If we want to go that
> > route, there are some advantages, the main one being that we can verify
> > that every dentry/inode we operate on belongs to the same fs.  But the
> > code will be more complex... I'm not sure if I should pursue that just
> > yet.
> >
> >  - The application gets to define what defines a failure for each
> > individual op based on its return value.
> >
> >  - If the transaction fails, the process can instruct the fs to wedge
> > itself so that a partial result does not commit.  This isn't a particularly
> > elegant approach, but a wedged fs may be preferable to a partial
> > transaction commit.  (Alternatively, a failure could branch/jump to
> > another point in the transaction op vector to do some cleanup and/or an
> > explicit WEDGE op to accomplish the same thing?)
> >
> > - This still uses the existing ioctl start transaction call.  Depending on
> > how Josef's ENOSPC journal_info stuff works out, I should be able to avoid
> > the current global open_ioctl_trans counter for a cleaner interaction with
> > the btrfs transaction code.
> >
> > - The data space reservation is still missing.  I need a way to
> > find which space_info will be used, and pin it for the duration
> > of the entire transaction.
> >
> > - The metadata reservation is a worst case bound.  It could be less
> > conservative, but currently each op is pulled out of the user address
> > space individually so we'd either need two passes, a big kmalloc, or
> > further trust the app to get the value right.  (Same goes for the data
> > size, actually, although that's easier to get correct.)
> >
> > Thoughts on this?
> >
> > Thanks-
> > sage
> >
> >
> > Signed-off-by: Sage Weil <sage@newdream.net>
> > ---
> >  fs/btrfs/ioctl.c |  187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/btrfs/ioctl.h |   49 ++++++++++++++
> >  fs/namei.c       |    3 +
> >  fs/open.c        |    2 +
> >  fs/read_write.c  |    2 +
> >  fs/xattr.c       |    2 +
> >  6 files changed, 245 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 136c5ed..4269616 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/compat.h>
> >  #include <linux/bit_spinlock.h>
> >  #include <linux/security.h>
> > +#include <linux/syscalls.h>
> >  #include <linux/xattr.h>
> >  #include <linux/vmalloc.h>
> >  #include "compat.h"
> > @@ -1303,6 +1304,190 @@ long btrfs_ioctl_trans_end(struct file *file)
> >        return 0;
> >  }
> >
> > +/*
> > + * return number of successfully complete ops via @ops_completed
> > + * (where success/failure is defined by the _FAIL_* flags).
> > + */
> > +static long do_usertrans(struct btrfs_root *root,
> > +                        struct btrfs_ioctl_usertrans *ut,
> > +                        u64 *ops_completed)
> > +{
> > +       int i;
> > +       int *fds;
> > +       int err;
> > +       struct file *file;
> > +       struct btrfs_ioctl_usertrans_op *ops = (void *)ut->ops_ptr;
> > +       int fd1, fd2;
> > +
> > +       fds = kcalloc(sizeof(int), ut->num_fds, GFP_KERNEL);
> > +       if (!fds)
> > +               return -ENOMEM;
> > +
> > +       for (i = 0; i < ut->num_ops; i++) {
> > +               struct btrfs_ioctl_usertrans_op op;
> > +               int ret;
> > +
> > +               err = -EFAULT;
> > +               if (copy_from_user(&op, &ops[i], sizeof(op)))
> > +                       goto out;
> > +
> > +               /* lookup fd args? */
> > +               err = -EINVAL;
> > +               switch (op.op) {
> > +               case BTRFS_IOC_UT_OP_CLONERANGE:
> > +                       if (op.args[1] < 0 || op.args[1] >= ut->num_fds)
> > +                               goto out;
> > +                       fd2 = fds[1];
> > +
> > +               case BTRFS_IOC_UT_OP_CLOSE:
> > +               case BTRFS_IOC_UT_OP_PWRITE:
> > +                       if (op.args[0] < 0 || op.args[0] >= ut->num_fds)
> > +                               goto out;
> > +                       fd1 = fds[0];
> > +               }
> > +
> > +               /* do op */
> > +               switch (op.op) {
> > +               case BTRFS_IOC_UT_OP_OPEN:
> > +                       ret = -EINVAL;
> > +                       if (op.args[3] < 0 || op.args[3] >= ut->num_fds)
> > +                               goto out;
> > +                       ret = sys_open((const char __user *)op.args[0],
> > +                                      op.args[1], op.args[2]);
> > +                       fds[op.args[3]] = ret;
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_CLOSE:
> > +                       ret = sys_close(fd1);
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_PWRITE:
> > +                       ret = sys_pwrite64(fd1, (const char __user *)op.args[1],
> > +                                          op.args[2], op.args[3]);
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_UNLINK:
> > +                       ret = sys_unlink((const char __user *)op.args[0]);
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_MKDIR:
> > +                       ret = sys_mkdir((const char __user *)op.args[0],
> > +                               op.args[1]);
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_RMDIR:
> > +                       ret = sys_rmdir((const char __user *)op.args[0]);
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_TRUNCATE:
> > +                       ret = sys_truncate((const char __user *)op.args[0],
> > +                                          op.args[1]);
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_SETXATTR:
> > +                       ret = sys_setxattr((char __user *)op.args[0],
> > +                                          (char __user *)op.args[1],
> > +                                          (void __user *)op.args[2],
> > +                                          op.args[3], op.args[4]);
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_REMOVEXATTR:
> > +                       ret = sys_removexattr((char __user *)op.args[0],
> > +                                             (char __user *)op.args[1]);
> > +                       break;
> > +               case BTRFS_IOC_UT_OP_CLONERANGE:
> > +                       ret = -EBADF;
> > +                       file = fget(fd1);
> > +                       if (file) {
> > +                               ret = btrfs_ioctl_clone(file, fd2,
> > +                                                       op.args[2], op.args[3],
> > +                                                       op.args[4]);
> > +                               fput(file);
> > +                       }
> > +                       break;
> > +               }
> > +               pr_debug(" ut %d/%d op %d args %llx %llx %llx %llx %llx = %d\n",
> > +                        i, (int)ut->num_ops, (int)op.op, op.args[0],
> > +                        op.args[1], op.args[2], op.args[3], op.args[4], ret);
> > +
> > +               put_user(ret, &ops[i].rval);
> > +
> > +               if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE) &&
> > +                   ret != op.rval)
> > +                       goto out;
> > +               if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ) &&
> > +                   ret == op.rval)
> > +                       goto out;
> > +               if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT) &&
> > +                   ret < op.rval)
> > +                       goto out;
> > +               if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT) &&
> > +                   ret > op.rval)
> > +                       goto out;
> > +               if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE) &&
> > +                   ret <= op.rval)
> > +                       goto out;
> > +               if ((op.flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE) &&
> > +                   ret >= op.rval)
> > +                       goto out;
> > +       }
> > +       err = 0;
> > +out:
> > +       *ops_completed = i;
> > +       kfree(fds);
> > +       return err;
> > +}
> > +
> > +long btrfs_ioctl_usertrans(struct file *file, void __user *arg)
> > +{
> > +       struct btrfs_root *root = BTRFS_I(fdentry(file)->d_inode)->root;
> > +       struct btrfs_trans_handle *trans;
> > +       struct btrfs_ioctl_usertrans ut, *orig_ut = arg;
> > +       u64 ops_completed = 0;
> > +       int ret;
> > +
> > +       ret = -EPERM;
> > +       if (!capable(CAP_SYS_ADMIN))
> > +               goto out;
> > +
> > +       ret = -EFAULT;
> > +       if (copy_from_user(&ut, orig_ut, sizeof(ut)))
> > +               goto out;
> > +
> > +       ret = mnt_want_write(file->f_path.mnt);
> > +       if (ret)
> > +               goto out;
> > +
> > +       ret = btrfs_reserve_metadata_space(root, 5*ut.num_ops);
> > +       if (ret)
> > +               goto out_drop_write;
> > +
> > +       mutex_lock(&root->fs_info->trans_mutex);
> > +       root->fs_info->open_ioctl_trans++;
> > +       mutex_unlock(&root->fs_info->trans_mutex);
> > +
> > +       ret = -ENOMEM;
> > +       trans = btrfs_start_ioctl_transaction(root, 0);
> > +       if (!trans)
> > +               goto out_drop;
> > +
> > +       ret = do_usertrans(root, &ut, &ops_completed);
> > +       put_user(ops_completed, &orig_ut->ops_completed);
> > +
> > +       if (ret < 0 && (ut.flags & BTRFS_IOC_UT_FLAG_WEDGEONFAIL))
> > +               pr_err("btrfs: usertrans failed, wedging to avoid partial "
> > +                      " commit\n");
> > +       else
> > +               btrfs_end_transaction(trans, root);
> > +
> > +out_drop:
> > +       mutex_lock(&root->fs_info->trans_mutex);
> > +       root->fs_info->open_ioctl_trans--;
> > +       mutex_unlock(&root->fs_info->trans_mutex);
> > +
> > +       btrfs_unreserve_metadata_space(root, 5*ut.num_ops);
> > +out_drop_write:
> > +       mnt_drop_write(file->f_path.mnt);
> > +out:
> > +       return ret;
> > +}
> > +
> >  long btrfs_ioctl(struct file *file, unsigned int
> >                cmd, unsigned long arg)
> >  {
> > @@ -1343,6 +1528,8 @@ long btrfs_ioctl(struct file *file, unsigned int
> >        case BTRFS_IOC_SYNC:
> >                btrfs_sync_fs(file->f_dentry->d_sb, 1);
> >                return 0;
> > +       case BTRFS_IOC_USERTRANS:
> > +               return btrfs_ioctl_usertrans(file, argp);
> >        }
> >
> >        return -ENOTTY;
> > diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
> > index bc49914..f94e293 100644
> > --- a/fs/btrfs/ioctl.h
> > +++ b/fs/btrfs/ioctl.h
> > @@ -67,4 +67,53 @@ struct btrfs_ioctl_clone_range_args {
> >                                   struct btrfs_ioctl_vol_args)
> >  #define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, \
> >                                struct btrfs_ioctl_vol_args)
> > +
> > +/* usertrans ops */
> > +/* the 'fd' values are _indices_ into a temporary fd table, see num_fds below */
> > +#define BTRFS_IOC_UT_OP_OPEN         1  /* path, flags, mode, fd */
> > +#define BTRFS_IOC_UT_OP_CLOSE        2  /* fd */
> > +#define BTRFS_IOC_UT_OP_PWRITE       3  /* fd, data, length, offset */
> > +#define BTRFS_IOC_UT_OP_UNLINK       4  /* path */
> > +#define BTRFS_IOC_UT_OP_LINK         5  /* oldpath, newpath */
> > +#define BTRFS_IOC_UT_OP_MKDIR        6  /* path, mode */
> > +#define BTRFS_IOC_UT_OP_RMDIR        7  /* path */
> > +#define BTRFS_IOC_UT_OP_TRUNCATE     8  /* path, size */
> > +#define BTRFS_IOC_UT_OP_SETXATTR     9  /* path, name, data, len */
> > +#define BTRFS_IOC_UT_OP_REMOVEXATTR 10  /* path, name */
> > +#define BTRFS_IOC_UT_OP_CLONERANGE  11  /* dst fd, src fd, off, len, dst off */
> > +
> > +/* define what 'failure' entails for each op based on return value */
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE    (1<< 1)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ    (1<< 2)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT    (1<< 3)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT    (1<< 4)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE   (1<< 5)
> > +#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE   (1<< 6)
> > +
> > +struct btrfs_ioctl_usertrans_op {
> > +       __u64 op;
> > +       __s64 args[5];
> > +       __s64 rval;
> > +       __u64 flags;
> > +};
> > +
> > +/*
> > + * If an op fails and we cannot complete the transaction, we may want
> > + * to lock up the file system (requiring a reboot) to prevent a
> > + * partial result from committing.
> > + */
> > +#define BTRFS_IOC_UT_FLAG_WEDGEONFAIL (1<<13)
> > +
> > +struct btrfs_ioctl_usertrans {
> > +       __u64 num_ops;                  /* in: # ops */
> > +       __u64 ops_ptr;                  /* in: usertrans_op array */
> > +       __u64 num_fds;                  /* in: size of fd table (max fd + 1) */
> > +       __u64 data_bytes, metadata_ops; /* in: for space reservation */
> > +       __u64 flags;                    /* in: flags */
> > +       __u64 ops_completed;            /* out: # ops completed */
> > +};
> > +
> > +#define BTRFS_IOC_USERTRANS  _IOW(BTRFS_IOCTL_MAGIC, 16,       \
> > +                                 struct btrfs_ioctl_usertrans)
> > +
> >  #endif
> > diff --git a/fs/namei.c b/fs/namei.c
> > index d11f404..4d53225 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2148,6 +2148,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
> >  {
> >        return sys_mkdirat(AT_FDCWD, pathname, mode);
> >  }
> > +EXPORT_SYMBOL(sys_mkdir);
> >
> >  /*
> >  * We try to drop the dentry early: we should have
> > @@ -2262,6 +2263,7 @@ SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
> >  {
> >        return do_rmdir(AT_FDCWD, pathname);
> >  }
> > +EXPORT_SYMBOL(sys_rmdir);
> >
> >  int vfs_unlink(struct inode *dir, struct dentry *dentry)
> >  {
> > @@ -2369,6 +2371,7 @@ SYSCALL_DEFINE1(unlink, const char __user *, pathname)
> >  {
> >        return do_unlinkat(AT_FDCWD, pathname);
> >  }
> > +EXPORT_SYMBOL(sys_unlink);
> >
> >  int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
> >  {
> > diff --git a/fs/open.c b/fs/open.c
> > index 4f01e06..15eddfc 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -294,6 +294,7 @@ SYSCALL_DEFINE2(truncate, const char __user *, path, long, length)
> >  {
> >        return do_sys_truncate(path, length);
> >  }
> > +EXPORT_SYMBOL(sys_truncate);
> >
> >  static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
> >  {
> > @@ -1062,6 +1063,7 @@ SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
> >        asmlinkage_protect(3, ret, filename, flags, mode);
> >        return ret;
> >  }
> > +EXPORT_SYMBOL(sys_open);
> >
> >  SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
> >                int, mode)
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 3ac2898..75e9f60 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -453,6 +453,8 @@ SYSCALL_DEFINE(pwrite64)(unsigned int fd, const char __user *buf,
> >
> >        return ret;
> >  }
> > +EXPORT_SYMBOL(sys_pwrite64);
> > +
> >  #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
> >  asmlinkage long SyS_pwrite64(long fd, long buf, long count, loff_t pos)
> >  {
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 6d4f6d3..488c889 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -294,6 +294,7 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
> >        path_put(&path);
> >        return error;
> >  }
> > +EXPORT_SYMBOL(sys_setxattr);
> >
> >  SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
> >                const char __user *, name, const void __user *, value,
> > @@ -523,6 +524,7 @@ SYSCALL_DEFINE2(removexattr, const char __user *, pathname,
> >        path_put(&path);
> >        return error;
> >  }
> > +EXPORT_SYMBOL(sys_removexattr);
> >
> >  SYSCALL_DEFINE2(lremovexattr, const char __user *, pathname,
> >                const char __user *, name)
> > --
> > 1.5.6.5
> >
> 
> 
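
To make the op vector concrete, here is a hedged userspace sketch that fills
in two ops (open into fd slot 0, then pwrite through that slot) and mirrors
the FAIL_ON_* comparisons from do_usertrans().  The constants and the struct
layout are copied from the RFC patch above, with uint64_t/int64_t standing in
for __u64/__s64.  struct ut_op, ut_op_failed() and ut_build_write() are
hypothetical names, and nothing here issues the ioctl itself, since the
interface exists only in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values copied from the RFC patch; not from any released kernel header. */
#define BTRFS_IOC_UT_OP_OPEN			1 /* path, flags, mode, fd */
#define BTRFS_IOC_UT_OP_PWRITE			3 /* fd, data, length, offset */
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE		(1 << 1)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ		(1 << 2)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT		(1 << 3)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT		(1 << 4)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE	(1 << 5)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE	(1 << 6)

struct ut_op {			/* mirrors struct btrfs_ioctl_usertrans_op */
	uint64_t op;
	int64_t args[5];
	int64_t rval;
	uint64_t flags;
};

/* The same comparisons the patch applies after each op has run. */
static int ut_op_failed(uint64_t flags, int64_t ret, int64_t rval)
{
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE) && ret != rval)
		return 1;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ) && ret == rval)
		return 1;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT) && ret < rval)
		return 1;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT) && ret > rval)
		return 1;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE) && ret <= rval)
		return 1;
	if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE) && ret >= rval)
		return 1;
	return 0;
}

/* Fill ops[0..1]: open path into fd slot 0, then write buf at offset 0. */
static void ut_build_write(struct ut_op ops[2], const char *path,
			   int open_flags, int mode,
			   const void *buf, size_t len)
{
	assert(ops && path && buf);
	memset(ops, 0, 2 * sizeof(*ops));

	ops[0].op = BTRFS_IOC_UT_OP_OPEN;
	ops[0].args[0] = (int64_t)(intptr_t)path;
	ops[0].args[1] = open_flags;
	ops[0].args[2] = mode;
	ops[0].args[3] = 0;		/* index into the temporary fd table */
	ops[0].rval = 0;
	ops[0].flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT;	/* open() < 0 fails */

	ops[1].op = BTRFS_IOC_UT_OP_PWRITE;
	ops[1].args[0] = 0;		/* same fd table slot */
	ops[1].args[1] = (int64_t)(intptr_t)buf;
	ops[1].args[2] = (int64_t)len;
	ops[1].args[3] = 0;		/* file offset */
	ops[1].rval = (int64_t)len;
	ops[1].flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE;	/* short write fails */
}
```

The rval/flags pair is what lets the application decide per op what counts
as failure: here a negative return from open or anything other than a full
write stops the transaction.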


* Re: [RFC] big fat transaction ioctl
  2009-11-10 22:13   ` Sage Weil
@ 2009-11-11  0:49     ` Jeremy Fitzhardinge
  2009-11-11  5:15       ` Sage Weil
  2009-11-11 15:03     ` Chris Mason
  1 sibling, 1 reply; 13+ messages in thread
From: Jeremy Fitzhardinge @ 2009-11-11  0:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Andrey Kuzmin, linux-btrfs

On 11/10/09 14:13, Sage Weil wrote:
> On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>
>   
>> On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
>>     
>>> Hi all,
>>>
>>> This is an alternative approach to atomic user transactions for btrfs.
>>> The old start/end ioctls suffer from some basic limitations, namely
>>>
>>>  - We can't properly reserve space ahead of time to avoid ENOSPC part
>>> way through the transaction, and
>>>  - The process may die (seg fault, SIGKILL) part way through the
>>> transaction.  Currently when that happens the partial transaction will
>>> commit.
>>>
>>> This patch implements an ioctl that lets the application completely
>>> specify the entire transaction in a single syscall.  If the process gets
>>> killed or seg faults part way through, the entire transaction will still
>>> complete.
>>>
>>> The goal is to atomically commit updates to multiple files, xattrs,
>>> directories.  But this is still a file system: we don't get rollback if
>>> things go wrong.  Instead, do what we can up front to make sure things
>>> will work out.  And if things do go wrong, optionally prevent a partial
>>> result from reaching the disk.
>>>       
>> Why not snapshot respective root (doesn't work if transaction spans
>> multiple file-systems, but this doesn't look like a real-world
>> limitation), run txn against that snapshot and rollback on failure
>> instead? Snapshots are writable, cheap, and this looks like a real
>> transaction abort mechanism.
>>     
> Good question.  :)
>
> I hadn't looked into this before, but I think the snapshots could be used 
> to achieve both atomicity and rollback.  If userspace uses an rw mutex to 
> quiesce writes, it can make sure all transactions complete before creating 
> a snapshot (commit).  The problem with this currently is the create 
> snapshot ioctl is relatively slow... it calls commit_transaction, which 
> blocks until everything reaches disk.  I think to perform well this 
> approach would need a hook to start a commit and then return as soon as it 
> can guarantee that any subsequent operation's start_transaction can't join 
> in that commit.
>
> This may be a better way to go about this, though.  Does that sound 
> reasonable, Chris?
>   

If snapshots only capture what's currently physically on disk, then it
means that the transactions will be fairly heavyweight in requiring
everything to be physically synced.  That may be what some apps want
anyway, but I can certainly imagine apps wanting transaction semantics
without having fsync-level durability requirements.

    J


* Re: [RFC] big fat transaction ioctl
  2009-11-11  0:49     ` Jeremy Fitzhardinge
@ 2009-11-11  5:15       ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2009-11-11  5:15 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Andrey Kuzmin, linux-btrfs

On Tue, 10 Nov 2009, Jeremy Fitzhardinge wrote:

> On 11/10/09 14:13, Sage Weil wrote:
> > On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
> >
> >   
> >> On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> >>     
> >>> Hi all,
> >>>
> >>> This is an alternative approach to atomic user transactions for btrfs.
> >>> The old start/end ioctls suffer from some basic limitations, namely
> >>>
> >>>  - We can't properly reserve space ahead of time to avoid ENOSPC part
> >>> way through the transaction, and
> >>>  - The process may die (seg fault, SIGKILL) part way through the
> >>> transaction.  Currently when that happens the partial transaction will
> >>> commit.
> >>>
> >>> This patch implements an ioctl that lets the application completely
> >>> specify the entire transaction in a single syscall.  If the process gets
> >>> killed or seg faults part way through, the entire transaction will still
> >>> complete.
> >>>
> >>> The goal is to atomically commit updates to multiple files, xattrs,
> >>> directories.  But this is still a file system: we don't get rollback if
> >>> things go wrong.  Instead, do what we can up front to make sure things
> >>> will work out.  And if things do go wrong, optionally prevent a partial
> >>> result from reaching the disk.
> >>>       
> >> Why not snapshot respective root (doesn't work if transaction spans
> >> multiple file-systems, but this doesn't look like a real-world
> >> limitation), run txn against that snapshot and rollback on failure
> >> instead? Snapshots are writable, cheap, and this looks like a real
> >> transaction abort mechanism.
> >>     
> > Good question.  :)
> >
> > I hadn't looked into this before, but I think the snapshots could be used 
> > to achieve both atomicity and rollback.  If userspace uses an rw mutex to 
> > quiesce writes, it can make sure all transactions complete before creating 
> > a snapshot (commit).  The problem with this currently is the create 
> > snapshot ioctl is relatively slow... it calls commit_transaction, which 
> > blocks until everything reaches disk.  I think to perform well this 
> > approach would need a hook to start a commit and then return as soon as it 
> > can guarantee that any subsequent operation's start_transaction can't join 
> > in that commit.
> >
> > This may be a better way to go about this, though.  Does that sound 
> > reasonable, Chris?
> >   
> 
> If snapshots only capture what's currently physically on disk, then it
> means that the transactions will be fairly heavyweight in requiring
> everything to be physically synced.  That may be what some apps want
> anyway, but I can certainly imagine apps wanting transaction semantics
> without having fsync-level durability requirements.

Just to be clear, the transactions I'm talking about _only_ control the 
way operations are grouped when they commit to disk.  They do not in any 
way affect the view of the file system that another concurrently running 
process might see.  It's the application's responsibility to deal with 
the sort of transaction concurrency you're talking about.

sage


* Re: [RFC] big fat transaction ioctl
  2009-11-10 20:12 [RFC] big fat transaction ioctl Sage Weil
  2009-11-10 20:44 ` Andrey Kuzmin
@ 2009-11-11 14:54 ` Chris Mason
  2009-11-11 18:22   ` Zach Brown
  1 sibling, 1 reply; 13+ messages in thread
From: Chris Mason @ 2009-11-11 14:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: linux-btrfs, hch

On Tue, Nov 10, 2009 at 12:12:14PM -0800, Sage Weil wrote:
> Hi all,
> 
> This is an alternative approach to atomic user transactions for btrfs.  
> The old start/end ioctls suffer from some basic limitations, namely
> 
>  - We can't properly reserve space ahead of time to avoid ENOSPC part 
> way through the transaction, and
>  - The process may die (seg fault, SIGKILL) part way through the 
> transaction.  Currently when that happens the partial transaction will 
> commit.

I like this much more than providing a journal start/stop to userland.
If we can get Christoph to ack the exports we can work on the interface
in general.

-chris


* Re: [RFC] big fat transaction ioctl
  2009-11-10 22:13   ` Sage Weil
  2009-11-11  0:49     ` Jeremy Fitzhardinge
@ 2009-11-11 15:03     ` Chris Mason
  2009-11-11 15:41       ` Andrey Kuzmin
  2009-11-11 17:19       ` Sage Weil
  1 sibling, 2 replies; 13+ messages in thread
From: Chris Mason @ 2009-11-11 15:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: Andrey Kuzmin, linux-btrfs

On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
> On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>
> > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> > > Hi all,
> > >
> > > This is an alternative approach to atomic user transactions for btrfs.
> > > The old start/end ioctls suffer from some basic limitations, namely
> > >
> > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
> > > way through the transaction, and
> > >  - The process may die (seg fault, SIGKILL) part way through the
> > > transaction.  Currently when that happens the partial transaction will
> > > commit.
> > >
> > > This patch implements an ioctl that lets the application completely
> > > specify the entire transaction in a single syscall.  If the process gets
> > > killed or seg faults part way through, the entire transaction will still
> > > complete.
> > >
> > > The goal is to atomically commit updates to multiple files, xattrs,
> > > directories.  But this is still a file system: we don't get rollback if
> > > things go wrong.  Instead, do what we can up front to make sure things
> > > will work out.  And if things do go wrong, optionally prevent a partial
> > > result from reaching the disk.
> >
> > Why not snapshot respective root (doesn't work if transaction spans
> > multiple file-systems, but this doesn't look like a real-world
> > limitation), run txn against that snapshot and rollback on failure
> > instead? Snapshots are writable, cheap, and this looks like a real
> > transaction abort mechanism.
>
> Good question.  :)
>
> I hadn't looked into this before, but I think the snapshots could be =
used=20
> to achieve both atomicity and rollback.  If userspace uses an rw mute=
x to=20
> quiesce writes, it can make sure all transactions complete before cre=
ating=20
> a snapshot (commit).  The problem with this currently is the create=20
> snapshot ioctl is relatively slow... it calls commit_transaction, whi=
ch=20
> blocks until everything reaches disk.  I think to perform well this=20
> approach would need a hook to start a commit and then return as soon =
as it=20
> can guarantee than any subsequent operation's start_transaction can't=
 join=20
> in that commit.
>=20
> This may be a better way to go about this, though.  Does that sound=20
> reasonable, Chris?

Yes, we could do this, but I don't think it will perform very well
compared to your multi-operation ioctl.  It really does depend on how
often you need to do atomic ops (my guess is very).

Honestly you'll get better performance with a simple write-ahead log
from userland:

step1: write redo log somewhere in the FS, with enough information to
bring all the objects you're about to touch to a consistent state.
step2: fsync the log
step3: do your operations
step4: append a record to the undo log that invalidates the last log
op, or just truncate it to zero.
step5: fsync the log.

The big advantage of the log is that you won't be tied to btrfs, but
it's two fsyncs where the big transaction framework does none.  This
should allow you to turn on the fast fsync log again, but I think the
multi-operation ioctl would do that as well.
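
The five steps above could look roughly like this from userland.  This is
only a minimal sketch under stated assumptions, not code from btrfs or
Ceph: the names wal_begin() and wal_end() and the single-record log layout
are hypothetical.  Step 3 (the actual operations) happens in the caller,
between the two calls.

```c
/* Hypothetical sketch; wal_begin()/wal_end() are illustrative names. */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Steps 1-2: write the redo record and fsync it.
 * Returns the open log fd, or -1 on error.  Assumes any record left
 * by an earlier crash has already been replayed, since O_TRUNC
 * discards it.
 */
int wal_begin(const char *log_path, const void *rec, size_t len)
{
	const char *p = rec;
	int fd = open(log_path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	assert(rec != NULL || len == 0);

	while (len > 0) {			/* step 1: write redo record */
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	if (fsync(fd) < 0) {			/* step 2: fsync the log */
		close(fd);
		return -1;
	}
	return fd;
}

/* Steps 4-5: invalidate the record by truncating, then fsync again. */
int wal_end(int fd)
{
	int ret = 0;

	if (ftruncate(fd, 0) < 0 || fsync(fd) < 0)
		ret = -1;
	if (close(fd) < 0)
		ret = -1;
	return ret;
}
```

The sketch follows the steps exactly as listed.  It does not fsync the
directory holding a newly created log file, and it does not address
whether the operations from step 3 are durable before the truncate in
step 4 reaches disk; both depend on the filesystem and would need care in
a real implementation.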

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] big fat transaction ioctl
  2009-11-11 15:03     ` Chris Mason
@ 2009-11-11 15:41       ` Andrey Kuzmin
  2009-11-11 15:55         ` Chris Mason
  2009-11-11 17:19       ` Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Andrey Kuzmin @ 2009-11-11 15:41 UTC (permalink / raw)
  To: Chris Mason, Sage Weil, Andrey Kuzmin, linux-btrfs

On Wed, Nov 11, 2009 at 6:03 PM, Chris Mason <chris.mason@oracle.com> wrote:
> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
>> On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>>
>> > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
>> > > Hi all,
>> > >
>> > > This is an alternative approach to atomic user transactions for btrfs.
>> > > The old start/end ioctls suffer from some basic limitations, namely
>> > >
>> > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
>> > > way through the transaction, and
>> > >  - The process may die (seg fault, SIGKILL) part way through the
>> > > transaction.  Currently when that happens the partial transaction will
>> > > commit.
>> > >
>> > > This patch implements an ioctl that lets the application completely
>> > > specify the entire transaction in a single syscall.  If the process gets
>> > > killed or seg faults part way through, the entire transaction will still
>> > > complete.
>> > >
>> > > The goal is to atomically commit updates to multiple files, xattrs,
>> > > directories.  But this is still a file system: we don't get rollback if
>> > > things go wrong.  Instead, do what we can up front to make sure things
>> > > will work out.  And if things do go wrong, optionally prevent a partial
>> > > result from reaching the disk.
>> >
>> > Why not snapshot respective root (doesn't work if transaction spans
>> > multiple file-systems, but this doesn't look like a real-world
>> > limitation), run txn against that snapshot and rollback on failure
>> > instead? Snapshots are writable, cheap, and this looks like a real
>> > transaction abort mechanism.
>>
>> Good question.  :)
>>
>> I hadn't looked into this before, but I think the snapshots could be used
>> to achieve both atomicity and rollback.  If userspace uses an rw mutex to
>> quiesce writes, it can make sure all transactions complete before creating
>> a snapshot (commit).  The problem with this currently is the create
>> snapshot ioctl is relatively slow... it calls commit_transaction, which
>> blocks until everything reaches disk.  I think to perform well this
>> approach would need a hook to start a commit and then return as soon as it
>> can guarantee than any subsequent operation's start_transaction can't join
>> in that commit.
>>
>> This may be a better way to go about this, though.  Does that sound
>> reasonable, Chris?
>
> Yes, we could do this, but I don't think it will perform very well
> compared to your multi-operation ioctl.  It really does depend on how
> often you need to do atomic ops (my guess is very).
>
> Honestly you'll get better performance with a simple write-ahead log
> from userland:

Write-ahead logging is necessary anyway if the aim is to provide
transactional semantics to an application. But, at the same time, w/o
snapshot there is no synchronization between the log and file-system
state.

Regards,
Andrey

>
> step1: write redo log somewhere in the FS, with enough information to
> bring all the objects you're about to touch to a consistent state.
> step2: fsync the log
> step3: do your operations
> step4: append a record to the undo log that invalidates the last log
> op, or just truncate it to zero.
> step5: fsync the log.
>
> The big advantage of the log is that you won't be tied to btrfs, but
> it's two fsyncs where the big transaction framework does none.  This
> should allow you to turn on the fast fsync log again, but I think the
> multi-operation ioctl would do that as well.
>
> -chris
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] big fat transaction ioctl
  2009-11-11 15:41       ` Andrey Kuzmin
@ 2009-11-11 15:55         ` Chris Mason
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Mason @ 2009-11-11 15:55 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: Sage Weil, linux-btrfs

On Wed, Nov 11, 2009 at 06:41:06PM +0300, Andrey Kuzmin wrote:
> >> I hadn't looked into this before, but I think the snapshots could be used
> >> to achieve both atomicity and rollback.  If userspace uses an rw mutex to
> >> quiesce writes, it can make sure all transactions complete before creating
> >> a snapshot (commit).  The problem with this currently is the create
> >> snapshot ioctl is relatively slow... it calls commit_transaction, which
> >> blocks until everything reaches disk.  I think to perform well this
> >> approach would need a hook to start a commit and then return as soon as it
> >> can guarantee than any subsequent operation's start_transaction can't join
> >> in that commit.
> >>
> >> This may be a better way to go about this, though.  Does that sound
> >> reasonable, Chris?
> >
> > Yes, we could do this, but I don't think it will perform very well
> > compared to your multi-operation ioctl.  It really does depend on how
> > often you need to do atomic ops (my guess is very).
> >
> > Honestly you'll get better performance with a simple write-ahead log
> > from userland:
>
> Write-ahead logging is necessary anyway if the aim is to provide
> transactional semantics to an application.

Sage's big fat ioctl does provide the subset of transactional semantics
that ceph (and many other apps) require.  In this case, they just want
to know that a given set of operations will happen together.

> But, at the same time, w/o
> snapshot there is no synchronization between the log and file-system
> state.

Synchronizing the log and the filesystem state happens when the
application starts up after the crash (either app crash or system
crash).  The application would be in charge of applying the log to its
own files to get the system into whatever state the app thinks is
consistent.
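
As a rough illustration of that startup step, the application might first
check whether a redo record survived the crash.  This is a sketch under
assumptions, not anything from Ceph: wal_pending() is a hypothetical
helper, and the log is assumed to hold at most one record that is
truncated to zero once its transaction is finished.

```c
/* Hypothetical sketch; wal_pending() is an illustrative name. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read a pending redo record left behind by a crash.
 * Returns the record length, 0 if the log is absent or empty
 * (nothing to redo), or -1 on error.
 */
ssize_t wal_pending(const char *log_path, void *buf, size_t cap)
{
	size_t got = 0;
	int fd = open(log_path, O_RDONLY);

	if (fd < 0)
		return errno == ENOENT ? 0 : -1;
	assert(buf != NULL);

	while (got < cap) {
		ssize_t n = read(fd, (char *)buf + got, cap - got);
		if (n < 0) {
			close(fd);
			return -1;
		}
		if (n == 0)
			break;		/* end of log */
		got += (size_t)n;
	}
	close(fd);
	return (ssize_t)got;
}
```

A non-zero result means the app should reapply the recorded operations to
its own files and then truncate the log.  A real log would also need a
length or checksum in each record so that a record torn by the crash can
be told apart from a complete one; the sketch leaves that out.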

-chris


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] big fat transaction ioctl
  2009-11-11 15:03     ` Chris Mason
  2009-11-11 15:41       ` Andrey Kuzmin
@ 2009-11-11 17:19       ` Sage Weil
  2009-11-12  3:56         ` Andrey Kuzmin
  1 sibling, 1 reply; 13+ messages in thread
From: Sage Weil @ 2009-11-11 17:19 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrey Kuzmin, linux-btrfs

On Wed, 11 Nov 2009, Chris Mason wrote:

> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
> > On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
> > 
> > > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
> > > > Hi all,
> > > >
> > > > This is an alternative approach to atomic user transactions for btrfs.
> > > > The old start/end ioctls suffer from some basic limitations, namely
> > > >
> > > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
> > > > way through the transaction, and
> > > >  - The process may die (seg fault, SIGKILL) part way through the
> > > > transaction.  Currently when that happens the partial transaction will
> > > > commit.
> > > >
> > > > This patch implements an ioctl that lets the application completely
> > > > specify the entire transaction in a single syscall.  If the process gets
> > > > killed or seg faults part way through, the entire transaction will still
> > > > complete.
> > > >
> > > > The goal is to atomically commit updates to multiple files, xattrs,
> > > > directories.  But this is still a file system: we don't get rollback if
> > > > things go wrong.  Instead, do what we can up front to make sure things
> > > > will work out.  And if things do go wrong, optionally prevent a partial
> > > > result from reaching the disk.
> > > 
> > > Why not snapshot respective root (doesn't work if transaction spans
> > > multiple file-systems, but this doesn't look like a real-world
> > > limitation), run txn against that snapshot and rollback on failure
> > > instead? Snapshots are writable, cheap, and this looks like a real
> > > transaction abort mechanism.
> > 
> > Good question.  :)
> > 
> > I hadn't looked into this before, but I think the snapshots could be used 
> > to achieve both atomicity and rollback.  If userspace uses an rw mutex to 
> > quiesce writes, it can make sure all transactions complete before creating 
> > a snapshot (commit).  The problem with this currently is the create 
> > snapshot ioctl is relatively slow... it calls commit_transaction, which 
> > blocks until everything reaches disk.  I think to perform well this 
> > approach would need a hook to start a commit and then return as soon as it 
> > can guarantee than any subsequent operation's start_transaction can't join 
> > in that commit.
> > 
> > This may be a better way to go about this, though.  Does that sound 
> > reasonable, Chris?
> 
> Yes, we could do this, but I don't think it will perform very well
> compared to your multi-operation ioctl.  It really does depend on how
> often you need to do atomic ops (my guess is very).

The thing is, I'm not sure using snaps is that different from what I'm 
doing now.  Currently the ioctl transactions don't hit disk until each 
full commit (flushoncommit, no fsync).  Unless the presence of a snapshot
adds additional overhead (to the commit, or to cleaning up the slightly 
longer-living snapped roots), the difference would be that starting 
transactions would need to be blocked by the application instead of 
wait_current_trans in start_transaction, and (currently at least) they 
would wait longer (the extra writes between blocked = 0 and commit_done = 
1 in commit_transaction).  

The key, as now, is keeping the full fs syncs infrequent.  And, if 
possible, reducing the duration of the blocked == 1 period during 
commit_transaction.


> Honestly you'll get better performance with a simple write-ahead log
> from userland:

There actually is a log, but it's optional and not strictly write-ahead... 
it's only used to reduce the commit latency:

1- apply operations to fs (grouped into atomic transactions)
2- (optionally) write and flush log entry
...repeat...
3- periodically sync the fs, then trim the log.  or sync early if a 
client explicitly requests it.
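
The three-step loop above might be sketched as follows.  Everything named
here is an assumption for illustration (struct wb_state, wb_after_apply(),
the sync_every policy), and plain sync() stands in for whatever
btrfs-specific sync the real code uses.

```c
/* Hypothetical sketch of the optional write-behind log loop. */
#include <assert.h>
#include <unistd.h>

struct wb_state {
	int log_fd;		/* -1 when the optional log is disabled;
				 * assumed to be opened with O_APPEND */
	unsigned int applied;	/* transactions applied since the last sync */
	unsigned int sync_every;/* sync period, in transactions */
};

/*
 * Call after a transaction has been applied to the fs (step 1).
 * Returns 1 if a full sync was performed, 0 if not, -1 on error.
 */
int wb_after_apply(struct wb_state *s, const void *entry, size_t len,
		   int force_sync)
{
	assert(s->sync_every > 0);

	if (s->log_fd >= 0) {			/* step 2: optional log entry */
		if (write(s->log_fd, entry, len) != (ssize_t)len)
			return -1;
		if (fdatasync(s->log_fd) < 0)
			return -1;
	}
	if (++s->applied < s->sync_every && !force_sync)
		return 0;

	sync();					/* step 3: sync the fs ... */
	if (s->log_fd >= 0 && ftruncate(s->log_fd, 0) < 0)
		return -1;			/* ... then trim the log */
	s->applied = 0;
	return 1;
}
```

With log_fd set to -1 the log write is skipped entirely, which is the
throughput-oriented mode; force_sync models a client explicitly asking
for an early sync.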

But

1- I don't want to make the log required.  Sometimes you're more concerned 
about total throughput, not latency, and the log halves your write bw 
unless you add more spindles.

2- I don't want it strictly write-ahead because (in the absence of atomic
ops) it means you have to wait for the log to sync before applying the ops 
to the fs (to ensure the fs doesn't get a partial transaction ahead of the 
log).  This marries atomicity with your schedule for durability, which 
isn't necessarily what you want.  (e.g., Ceph makes a distinction between 
serialized and committed ops, allowing limited sharing of data before it
hits disk.  That's the nice thing about this ioctl... it's pretty common 
that atomicity is the only requirement.)

With the optional (write-behind?) log and transaction ioctls, IF you want 
low latency commits, enable the log and ideally give it its own spindle,
and infrequently sync btrfs to get good layout and low overhead. 


Unless you think I'm missing something with the snapshot approach, I can 
give that a try and see how it does.  It requires explicit management of 
the sync/commit schedule, but in my case at least I'm doing that already.  
A transaction ioctl is simpler for userland and would be more generically 
useful for other apps (particularly those who don't want to manage 
commits), but will always have some small possibility of partial 
failure/abort without rollback.

sage


> 
> step1: write redo log somewhere in the FS, with enough information to
> bring all the objects you're about to touch to a consistent state.
> step2: fsync the log
> step3: do your operations
> step4: append a record to the undo log that invalidates the last log
> op, or just truncate it to zero.
> step5: fsync the log.
> 
> The big advantage of the log is that you won't be tied to btrfs, but
> it's two fsyncs where the big transaction framework does none.  This
> should allow you to turn on the fast fsync log again, but I think the
> multi-operation ioctl would do that as well.
> 
> -chris

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] big fat transaction ioctl
  2009-11-11 14:54 ` Chris Mason
@ 2009-11-11 18:22   ` Zach Brown
  2009-11-11 22:22     ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Zach Brown @ 2009-11-11 18:22 UTC (permalink / raw)
  To: Chris Mason, Sage Weil, linux-btrfs, hch


> I like this much more than providing a journal start/stop to userland.
> If we can get Christoph to ack the exports we can work on the interface
> in general.

I'll note, briefly, that it seems dangerous to call right into the sys_
functions instead of going through the architecture's syscall number
dispatching path.  Do you know if the syscalls you're calling have
compat wrappers on some architectures for some userspace abis?

With that out of the way, though, I'll get on to my bigger point.

This interface for specifying an array of syscalls to call looks a whole
lot like the work that fs/aio.c, syslets, and acall have all done.  The
flags for stopping processing of the array based on errors from the
syscalls are remarkably similar to Ingo's atom structs.

So maybe there's an opportunity for a generic syscall for processing
batches of syscalls.  Maybe you'll bracket some of them with btrfs
ioctls for flagging the task_struct as being in a btrfs transaction, but
maybe you'll also flag some for concurrent acall processing or nutty
syslet thread spawning if they block.

It'll probably take some work to be able to call syscall handlers from C
on all architectures, and we'd have to be really careful about the
semantics if we start mixing btrfs ioctls and async flags, but it just
might be worth it.

- z

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] big fat transaction ioctl
  2009-11-11 18:22   ` Zach Brown
@ 2009-11-11 22:22     ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2009-11-11 22:22 UTC (permalink / raw)
  To: Zach Brown; +Cc: Chris Mason, linux-btrfs, hch

On Wed, 11 Nov 2009, Zach Brown wrote:
> > I like this much more than providing a journal start/stop to userland.
> > If we can get Christoph to ack the exports we can work on the interface
> > in general.
> 
> I'll note, briefly, that it seems dangerous to call right into the sys_
> functions instead of going through the architecture's syscall number
> dispatching path.  Do you know if the syscalls you're calling have
> compat wrappers on some architectures for some userspace abis?
> 
> With that out of the way, though, I'll get on to my bigger point.
> 
> This interface for specifying an array of syscalls to call looks a whole
> lot like the work that fs/aio.c, syslets, and acall have all done.  The
> flags for stopping processing of the array based on errors from the
> syscalls are remarkably similar to Ingo's atom structs.

Yeah, I think both syslets (with atoms) and acall provide a much more 
elegant interface than what I've described.  (I should have looked at them 
more closely before; I didn't care about being asynchronous.)  The only 
real requirement for the atomic user transactions is that the batch of 
operations not be interrupted by SIGKILL or seg fault (at least for the 
calls that are being used).

> So maybe there's an opportunity for a generic syscall for processing
> batches of syscalls.  Maybe you'll bracket some of them with btrfs
> ioctls for flagging the task_struct as being in a btrfs transaction, but
> maybe you'll also flag some for concurrent acall processing or nutty
> syslet thread spawning if they block.

Right.  In my case running things asynchronously isn't strictly necessary.  
And it seems like the asynchrony and batching (in both syslets and acall) 
are somewhat orthogonal.  Maybe the async part should really be completely 
independent, and not, say, a flag?

	sys_batch(a vector/graph of ops)

and then

	sys_acall(a single call)
or	sys_syslet(a single call)	

Both Ingo's atoms or the simpler acall op vector would suffice in my 
case.  Even current acall() would do; the thread overhead won't be 
significant over the relatively slow fs transactions.
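
To make the op-vector idea concrete, here is a userspace emulation of a
batch call.  It is a sketch only: struct batch_op, run_batch() and
BATCH_STOP_ON_ERR are hypothetical names, not the proposed btrfs ioctl,
syslets, or acall interfaces, and the single stop-on-error flag is a
simplification of per-op failure conditions.

```c
/* Hypothetical userspace emulation of a syscall op vector. */
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BATCH_STOP_ON_ERR 0x1	/* hypothetical: stop the batch if this op fails */

struct batch_op {
	long nr;		/* syscall number, e.g. SYS_getpid */
	long args[6];
	unsigned int flags;
	long ret;		/* result, filled in by run_batch() */
};

/* Run the ops in order; returns how many were executed. */
size_t run_batch(struct batch_op *ops, size_t n)
{
	size_t i;

	assert(ops != NULL || n == 0);
	for (i = 0; i < n; i++) {
		struct batch_op *op = &ops[i];

		op->ret = syscall(op->nr, op->args[0], op->args[1],
				  op->args[2], op->args[3],
				  op->args[4], op->args[5]);
		if (op->ret < 0 && (op->flags & BATCH_STOP_ON_ERR))
			return i + 1;
	}
	return n;
}
```

Because this loop runs in userspace, a SIGKILL or seg fault can still
interrupt it between ops.  It only illustrates the shape of the interface;
the guarantee discussed in this thread requires the loop to run inside the
kernel.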

sage

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] big fat transaction ioctl
  2009-11-11 17:19       ` Sage Weil
@ 2009-11-12  3:56         ` Andrey Kuzmin
  0 siblings, 0 replies; 13+ messages in thread
From: Andrey Kuzmin @ 2009-11-12  3:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chris Mason, linux-btrfs

On Wed, Nov 11, 2009 at 8:19 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 11 Nov 2009, Chris Mason wrote:
>
>> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
>> > On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>> >
>> > > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
>> > > > Hi all,
>> > > >
>> > > > This is an alternative approach to atomic user transactions for btrfs.
>> > > > The old start/end ioctls suffer from some basic limitations, namely
>> > > >
>> > > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
>> > > > way through the transaction, and
>> > > >  - The process may die (seg fault, SIGKILL) part way through the
>> > > > transaction.  Currently when that happens the partial transaction will
>> > > > commit.
>> > > >
>> > > > This patch implements an ioctl that lets the application completely
>> > > > specify the entire transaction in a single syscall.  If the process gets
>> > > > killed or seg faults part way through, the entire transaction will still
>> > > > complete.
>> > > >
>> > > > The goal is to atomically commit updates to multiple files, xattrs,
>> > > > directories.  But this is still a file system: we don't get rollback if
>> > > > things go wrong.  Instead, do what we can up front to make sure things
>> > > > will work out.  And if things do go wrong, optionally prevent a partial
>> > > > result from reaching the disk.
>> > >
>> > > Why not snapshot respective root (doesn't work if transaction spans
>> > > multiple file-systems, but this doesn't look like a real-world
>> > > limitation), run txn against that snapshot and rollback on failure
>> > > instead? Snapshots are writable, cheap, and this looks like a real
>> > > transaction abort mechanism.
>> >
>> > Good question.  :)
>> >
>> > I hadn't looked into this before, but I think the snapshots could be used
>> > to achieve both atomicity and rollback.  If userspace uses an rw mutex to
>> > quiesce writes, it can make sure all transactions complete before creating
>> > a snapshot (commit).  The problem with this currently is the create
>> > snapshot ioctl is relatively slow... it calls commit_transaction, which
>> > blocks until everything reaches disk.  I think to perform well this
>> > approach would need a hook to start a commit and then return as soon as it
>> > can guarantee than any subsequent operation's start_transaction can't join
>> > in that commit.
>> >
>> > This may be a better way to go about this, though.  Does that sound
>> > reasonable, Chris?
>>
>> Yes, we could do this, but I don't think it will perform very well
>> compared to your multi-operation ioctl.  It really does depend on how
>> often you need to do atomic ops (my guess is very).
>
> The thing is, I'm not sure using snaps is that different from what I'm
> doing now.  Currently the ioctl transactions don't hit disk until each
> full commit (flushoncommit, no fsync).  Unless the presense of a snapshot
> adds additional overhead (to the commit, or to cleaning up the slightly
> longer-living snapped roots), the difference would be that starting
> transactions would need to be blocked by the application instead of
> wait_current_trans in start_transaction, and (currently at least) they
> would wait longer (the extra writes between blocked = 0 and commit_done =
> 1 in commit_transaction).
>
> The key, as now, is keeping the full fs syncs infrequent.  And, if
> possible, reducing the duration of the blocked == 1 period during
> commit_transaction.

It took me some time to associate you with the Ceph project and to recall
what Ceph is, so my original snapshot suggestion was out of context.
Put into the Ceph context, it looks too heavy-weight and may turn out to
be overkill. Chris's write-ahead logging idea looks much more realistic
for your use case.

>
>
>> Honestly you'll get better performance with a simple write-ahead log
>> from userland:
>
> There actually is a log, but it's optional and not strictly write-ahead...
> it's only used to reduce the commit latency:
>
> 1- apply operations to fs (grouped into atomic transactions)
> 2- (optionally) write and flush log entry
> ...repeat...
> 3- periodically sync the fs, then trim the log.  or sync early if a
> client explicitly requests it.
>
> But
>
> 1- I don't want to make the log required.  Sometimes you're more concerned
> about total throughput, not latency, and the log halves your write bw
> unless you add more spindles.

The log-induced latency penalty is the price of transactional consistency
:). The traditional mitigation is a low-latency log device (NVRAM and,
recently, SLC flash). Since you specifically target distributed systems,
you also have a distributed in-memory logging option.

Regards,
Andrey

>
> 2- I don't want it strictly write-ahead because (in the absense of atomic
> ops) it means you have to wait for the log to sync before applying the ops
> to the fs (to ensure the fs doesn't get a partial transaction ahead of the
> log).  This marries atomicity with your schedule for durability, which
> isn't necessarily what you want.  (e.g., Ceph makes a distinction between
> serialized and commited ops, allowing limited sharing of data before it
> hits disk.  That's the nice thing about this ioctl... it's pretty common
> that atomicity is the only requirement.)
>
> With the optional (write-behind?) log and transaction ioctls, IF you want
> low latency commits, enable the log and ideally give it it's own spindle,
> and infrequently sync btrfs to get good layout and low overhead.
>
>
> Unless you think I'm missing something with the snapshot approach, I can
> give that a try and see how it does.  It requires explicit management of
> the sync/commit schedule, but in my case at least I'm doing that already.
> A transaction ioctl is simpler for userland and would be more generically
> useful for other apps (particularly those who don't want to manage
> commits), but will always have some small possibility of partial
> failure/abort without rollback.
>
> sage
>
>
>>
>> step1: write redo log somewhere in the FS, with enough information to
>> bring all the objects you're about to touch to a consistent state.
>> step2: fsync the log
>> step3: do your operations
>> step4: append a record to the undo log that invalidates the last log
>> op, or just truncate it to zero.
>> step5: fsync the log.
>>
>> The big advantage of the log is that you won't be tied to btrfs, but
>> it's two fsyncs where the big transaction framework does none.  This
>> should allow you to turn on the fast fsync log again, but I think the
>> multi-operation ioctl would do that as well.
>>
>> -chris

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-11-12  3:56 UTC | newest]

Thread overview: 13+ messages
2009-11-10 20:12 [RFC] big fat transaction ioctl Sage Weil
2009-11-10 20:44 ` Andrey Kuzmin
2009-11-10 22:13   ` Sage Weil
2009-11-11  0:49     ` Jeremy Fitzhardinge
2009-11-11  5:15       ` Sage Weil
2009-11-11 15:03     ` Chris Mason
2009-11-11 15:41       ` Andrey Kuzmin
2009-11-11 15:55         ` Chris Mason
2009-11-11 17:19       ` Sage Weil
2009-11-12  3:56         ` Andrey Kuzmin
2009-11-11 14:54 ` Chris Mason
2009-11-11 18:22   ` Zach Brown
2009-11-11 22:22     ` Sage Weil
