linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v2 0/2]  readmirror feature
@ 2019-08-26  9:04 Anand Jain
  2019-08-26  9:04 ` [PATCH RFC v2] btrfs: add readmirror framework and policy devid Anand Jain
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Anand Jain @ 2019-08-26  9:04 UTC (permalink / raw)
  To: linux-btrfs

Function call chain  __btrfs_map_block()->find_live_mirror() uses
thread pid to determine the %mirror_num when the mirror_num=0.

This patch introduces a framework so that we can add policies to determine
the %mirror_num. And also adds the devid as the readmirror policy.

The new property is stored as an item in the device tree as show below.
    (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)

To be able to set and get this new property also introduces new ioctls
BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
is defined as
        struct btrfs_ioctl_readmirror_args {
                __u64 type; /* RW */
                __u64 device_bitmap; /* RW */
        }

An usage example as follows:
        btrfs property set /btrfs readmirror devid:1,3
        btrfs property get /btrfs readmirror
          readmirror devid:1 3
        btrfs property set /btrfs readmirror ""
        btrfs property get /btrfs readmirror
          readmirror default

This patchset has been tested completely, however marked as RFC for the 
following reasons and comments on them (or any other) are appreciated as
usual.
 . The new objectid is defined as
      #define BTRFS_READMIRROR_OBJECTID -1ULL
   Need consent we are fine to use this value, and with this value it
   shall be placed just before the DEV_STATS_OBJECTID item which is more
   frequently used only during the device errors.

.  I am using a u64 bitmap to represent the devices id, so the max device
   id that we could represent is 63, its a kind of limitation which should
   be addressed before integration, I wonder if there is any suggestion?
   Kindly note that, multiple ioctls with each time representing a set of
   device(s) is not a choice because we need to make sure the readmirror
   changes happens in a commit transaction.

v1->RFC v2:
  . Property is stored as a dev-tree item instead of root inode extended
    attribute.
  . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
  . Changed format specifier from devid1,2,3.. to devid:1,2,3..

RFC->v1:
  Drops pid as one of the readmirror policy choices and as usual remains
  as default. And when the devid is reset the readmirror policy falls back
  to pid.
  Drops the mount -o readmirror idea, it can be added at a later point of
  time.
  Property now accepts more than 1 devid as readmirror device. As shown
  in the example above.

Anand Jain (1):
  btrfs: add readmirror framework and policy devid

 fs/btrfs/ctree.h                |   3 +
 fs/btrfs/disk-io.c              |   9 ++
 fs/btrfs/ioctl.c                | 108 ++++++++++++++++++++++++
 fs/btrfs/transaction.c          |   3 +
 fs/btrfs/volumes.c              | 145 +++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h              |   9 +-
 include/uapi/linux/btrfs.h      |  15 +++-
 include/uapi/linux/btrfs_tree.h |  11 +++
 8 files changed, 300 insertions(+), 3 deletions(-)

Anand Jain (1):
  btrfs_progs: add readmirror property and ioctl to set get

 ioctl.h | 14 +++++++++
 props.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 112 insertions(+)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH RFC v2] btrfs: add readmirror framework and policy devid
  2019-08-26  9:04 [PATCH RFC v2 0/2] readmirror feature Anand Jain
@ 2019-08-26  9:04 ` Anand Jain
  2019-08-26  9:04 ` [PATCH RFC v2] btrfs-progs: add readmirror property and ioctl to set get readmirror Anand Jain
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Anand Jain @ 2019-08-26  9:04 UTC (permalink / raw)
  To: linux-btrfs

Introduces devid readmirror property, to direct read IO to the specified
device(s).

The readmirror property is stored as an item in the dev-tree. The readmirror
input format is devid:1,2,3.. etc. And for the each devid provided, a new
flag BTRFS_DEV_STATE_READ_PREFERRED is set.

As of now readmirror by devid supports only raid1s. Raid10 support has
to leverage device grouping feature, which is yet to be implemented.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v1->RFC v2:
  . Property is stored as a dev-tree item instead of root inode extended
    attribute.
  . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.

 fs/btrfs/ctree.h                |   3 +
 fs/btrfs/disk-io.c              |   9 ++
 fs/btrfs/ioctl.c                | 108 +++++++++++++++++++++++
 fs/btrfs/transaction.c          |   3 +
 fs/btrfs/volumes.c              | 149 +++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h              |   9 +-
 include/uapi/linux/btrfs.h      |  15 +++-
 include/uapi/linux/btrfs_tree.h |  11 +++
 8 files changed, 304 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 357316173d84..a41b081846e3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2394,6 +2394,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cursor_left,
 BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cursor_right,
 			 struct btrfs_dev_replace_item, cursor_right, 64);
 
+/* btrfs_readmirror_item */
+BTRFS_SETGET_FUNCS(readmirror_type, struct btrfs_readmirror_item, type, 64);
+
 /* helper function to cast into the data area of the leaf. */
 #define btrfs_item_ptr(leaf, slot, type) \
 	((type *)(BTRFS_LEAF_DATA_OFFSET + \
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d59b79ee4f9b..80e13b088634 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3086,6 +3086,15 @@ int open_ctree(struct super_block *sb,
 			  ret);
 		goto fail_block_groups;
 	}
+
+	ret = btrfs_init_readmirror(fs_info);
+	if (ret)
+		/*
+		 * failed to init means we will use default readmirror policy, so
+		 * warning is fine
+		 */
+		btrfs_warn(fs_info, "failed to init readmirror policy: %d", ret);
+
 	ret = btrfs_recover_balance(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to recover balance: %d", ret);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 5ad41d4e113c..07aa47f70b55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5421,6 +5421,110 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
 	return ret;
 }
 
+static int btrfs_ioctl_get_readmirror(struct btrfs_root *root,
+				      void __user *argp)
+{
+	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+	struct btrfs_ioctl_readmirror_args readmirror;
+	u64 device_bitmap = 0;
+
+	if (copy_from_user(&readmirror, argp, sizeof(readmirror)))
+		return -EFAULT;
+
+	readmirror.type = BTRFS_READMIRROR_DEFAULT;
+	readmirror.device_bitmap = 0;
+
+	if (fs_devices->readmirror_type == BTRFS_READMIRROR_DEVID) {
+		struct btrfs_device *device;
+
+		/*
+		 * No need to hold device_list_mutext for a read especially from
+		 * the user, user can read again to see the transient change.
+		 */
+		list_for_each_entry(device, &fs_devices->devices, dev_list) {
+			if (device && test_bit(BTRFS_DEV_STATE_READ_PREFERRED,
+					       &device->dev_state))
+				device_bitmap = device_bitmap |
+						(1ULL << device->devid);
+		}
+		readmirror.type = fs_devices->readmirror_type;
+		readmirror.device_bitmap = device_bitmap;
+	}
+
+	if (copy_to_user(argp, &readmirror, sizeof(readmirror)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int btrfs_ioctl_set_readmirror(struct btrfs_root *root, void __user *argp)
+{
+	int ret;
+	u64 devid;
+	struct btrfs_ioctl_readmirror_args readmirror;
+	struct btrfs_device *device;
+	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&readmirror, argp, sizeof(readmirror)))
+		return -EFAULT;
+
+	if (readmirror.type != BTRFS_READMIRROR_DEFAULT &&
+	    readmirror.type != BTRFS_READMIRROR_DEVID)
+		return -EINVAL;
+
+	ret = 0;
+	mutex_lock(&fs_devices->device_list_mutex);
+	if (readmirror.type == BTRFS_READMIRROR_DEVID) {
+		int nr_devices = 0;
+
+		for (devid = 0; devid < 64; devid++) {
+			if (!((1ULL << devid) & readmirror.device_bitmap))
+				continue;
+
+			device = btrfs_find_device(fs_devices, devid, NULL, NULL,
+						   false);
+			if (!device) {
+				ret = -EINVAL;
+				goto unlock_out;
+			}
+			nr_devices++;
+		}
+		if (nr_devices == 0) {
+			ret = -EINVAL;
+			goto unlock_out;
+		}
+	}
+
+	/* First reset and then set */
+	fs_devices->readmirror_type = BTRFS_READMIRROR_DEFAULT;
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		clear_bit(BTRFS_DEV_STATE_READ_PREFERRED,
+			  &device->dev_state);
+	}
+
+	if (readmirror.type == BTRFS_READMIRROR_DEVID) {
+		for (devid = 0; devid < 64; devid++) {
+			if (!((1ULL << devid) & readmirror.device_bitmap))
+				continue;
+
+			device = btrfs_find_device(fs_devices, devid, NULL, NULL,
+						   false);
+			if (device)
+				set_bit(BTRFS_DEV_STATE_READ_PREFERRED,
+					&device->dev_state);
+		}
+		fs_devices->readmirror_type = BTRFS_READMIRROR_DEVID;
+	}
+	atomic_inc(&device->update_readmirror);
+
+unlock_out:
+	mutex_unlock(&fs_devices->device_list_mutex);
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -5567,6 +5671,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_subvol_rootref(file, argp);
 	case BTRFS_IOC_INO_LOOKUP_USER:
 		return btrfs_ioctl_ino_lookup_user(file, argp);
+	case BTRFS_IOC_GET_READMIRROR:
+		return btrfs_ioctl_get_readmirror(root, argp);
+	case BTRFS_IOC_SET_READMIRROR:
+		return btrfs_ioctl_set_readmirror(root, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2e3f6778bfa3..212cd0aec2cf 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1127,6 +1127,9 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 		return ret;
 
 	ret = btrfs_run_dev_stats(trans);
+	if (ret)
+		return ret;
+	ret = btrfs_run_readmirror(trans);
 	if (ret)
 		return ret;
 	ret = btrfs_run_dev_replace(trans);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e3460b47283e..94215f5ae0f9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -402,6 +402,7 @@ static struct btrfs_device *__alloc_device(void)
 
 	atomic_set(&dev->reada_in_flight, 0);
 	atomic_set(&dev->dev_stats_ccnt, 0);
+	atomic_set(&dev->update_readmirror, 0);
 	btrfs_device_data_ordered_init(dev);
 	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
@@ -5267,7 +5268,28 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	else
 		num_stripes = map->num_stripes;
 
-	preferred_mirror = first + current->pid % num_stripes;
+	switch(fs_info->fs_devices->readmirror_type) {
+	case BTRFS_READMIRROR_DEVID:
+		/*
+		 * choice of read a specific mirror is only for RAID1 as of now
+		 */
+		if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+			for (i = first; i < first + num_stripes; i++) {
+				if (test_bit(BTRFS_DEV_STATE_READ_PREFERRED,
+					     &map->stripes[i].dev->dev_state)) {
+					preferred_mirror = i;
+					break;
+				}
+			}
+		}
+		/* fall through */
+	case BTRFS_READMIRROR_DEFAULT:
+		/* fall through */
+	default:
+		/* readmirror as per thread pid */
+		preferred_mirror = first + current->pid % num_stripes;
+		break;
+	}
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
@@ -7604,3 +7626,128 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
 	spin_unlock(&fs_info->swapfile_pins_lock);
 	return node != NULL;
 }
+
+int btrfs_init_readmirror(struct btrfs_fs_info *fs_info)
+{
+	int ret;
+	int slot;
+	struct btrfs_key key;
+	struct extent_buffer *eb;
+	struct btrfs_device *device;
+	struct btrfs_path *path = NULL;
+	struct btrfs_readmirror_item *ptr;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+
+		key.objectid = BTRFS_READMIRROR_OBJECTID;
+		key.type = BTRFS_PERSISTENT_ITEM_KEY;
+		key.offset = device->devid;
+
+		ret = btrfs_search_slot(NULL, fs_info->dev_root, &key, path, 0, 0);
+		if (ret) {
+			btrfs_release_path(path);
+			continue;
+		}
+		slot = path->slots[0];
+		eb = path->nodes[0];
+
+		ptr = btrfs_item_ptr(eb, slot, struct btrfs_readmirror_item);
+
+		if (btrfs_readmirror_type(eb, ptr) == BTRFS_READMIRROR_DEVID) {
+			device->fs_devices->readmirror_type = BTRFS_READMIRROR_DEVID;
+			set_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
+		} else {
+			clear_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
+		}
+
+		btrfs_release_path(path);
+	}
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	btrfs_free_path(path);
+	return ret < 0 ? ret : 0;
+}
+
+static int update_readmirror_item(struct btrfs_trans_handle *trans,
+				  struct btrfs_device *device)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_root *dev_root = fs_info->dev_root;
+	struct btrfs_readmirror_item *ptr;
+	struct extent_buffer *eb;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 type;
+	int ret;
+
+	key.objectid = BTRFS_READMIRROR_OBJECTID;
+	key.type = BTRFS_PERSISTENT_ITEM_KEY;
+	key.offset = device->devid;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(trans, dev_root, &key, path, 0, 1);
+	if (ret < 0) {
+		btrfs_warn_in_rcu(fs_info,
+			"error %d while searching for readmirror item for device %s",
+				  ret, rcu_str_deref(device->name));
+		goto out;
+	}
+
+	if (ret == 1) {
+		btrfs_release_path(path);
+		ret = btrfs_insert_empty_item(trans, dev_root, path,
+					      &key, sizeof(*ptr));
+		if (ret < 0) {
+			btrfs_warn_in_rcu(fs_info,
+				"insert readmirror item for device %s failed %d",
+				rcu_str_deref(device->name), ret);
+			goto out;
+		}
+	}
+
+	eb = path->nodes[0];
+	ptr = btrfs_item_ptr(eb, path->slots[0], struct btrfs_readmirror_item);
+	if (test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state))
+		type = BTRFS_READMIRROR_DEVID;
+	else
+		type = BTRFS_READMIRROR_DEFAULT;
+
+	btrfs_set_readmirror_type(eb, ptr, type);
+	btrfs_mark_buffer_dirty(eb);
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+int btrfs_run_readmirror(struct btrfs_trans_handle *trans)
+{
+	int update;
+	int ret = 0;
+	struct btrfs_device *device;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		update = atomic_read(&device->update_readmirror);
+		if (update == 0)
+			continue;
+
+		ret = update_readmirror_item(trans, device);
+		if (!ret)
+			atomic_sub(update, &device->update_readmirror);
+	}
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	return ret;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7be62728812d..aff48f130e1a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -52,6 +52,7 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_MISSING		(2)
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
+#define BTRFS_DEV_STATE_READ_PREFERRED	(5)
 
 struct btrfs_device {
 	struct list_head dev_list; /* device_list_mutex */
@@ -141,6 +142,8 @@ struct btrfs_device {
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
 
 	struct extent_io_tree alloc_state;
+
+	atomic_t update_readmirror;
 };
 
 /*
@@ -260,6 +263,8 @@ struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *device_dir_kobj;
 	struct completion kobj_unregister;
+
+	int readmirror_type;
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
@@ -474,6 +479,7 @@ int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
 void btrfs_init_devices_late(struct btrfs_fs_info *fs_info);
 int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info);
 int btrfs_run_dev_stats(struct btrfs_trans_handle *trans);
+int btrfs_run_readmirror(struct btrfs_trans_handle *trans);
 void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev);
 void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev);
 void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev);
@@ -578,5 +584,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
-
+int btrfs_run_readmirror(struct btrfs_trans_handle *trans);
+int btrfs_init_readmirror(struct btrfs_fs_info *fs_info);
 #endif
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 3ee0678c0a83..7d6d0ebb612c 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -822,6 +822,16 @@ struct btrfs_ioctl_get_subvol_rootref_args {
 		__u8 align[7];
 };
 
+enum btrfs_readmirror_types {
+	BTRFS_READMIRROR_DEFAULT = 0,
+	BTRFS_READMIRROR_DEVID,
+};
+
+struct btrfs_ioctl_readmirror_args {
+	__u64 type; /* RW */
+	__u64 device_bitmap; /* RW */
+};
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
 	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
@@ -946,5 +956,8 @@ enum btrfs_err_code {
 				struct btrfs_ioctl_get_subvol_rootref_args)
 #define BTRFS_IOC_INO_LOOKUP_USER _IOWR(BTRFS_IOCTL_MAGIC, 62, \
 				struct btrfs_ioctl_ino_lookup_user_args)
-
+#define BTRFS_IOC_GET_READMIRROR _IOWR(BTRFS_IOCTL_MAGIC, 63, \
+				struct btrfs_ioctl_readmirror_args)
+#define BTRFS_IOC_SET_READMIRROR _IOWR(BTRFS_IOCTL_MAGIC, 64, \
+				struct btrfs_ioctl_readmirror_args)
 #endif /* _UAPI_LINUX_BTRFS_H */
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 97846fb57028..64cf120ac0f4 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -51,6 +51,9 @@
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
+/* store readmirror policy inforamtion in the device tree */
+#define BTRFS_READMIRROR_OBJECTID -3ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -977,4 +980,12 @@ struct btrfs_qgroup_limit_item {
 	__le64 rsv_excl;
 } __attribute__ ((__packed__));
 
+/*
+ * readmirror's persistent storage format
+ */
+struct btrfs_readmirror_item {
+	__le64 type;
+	__le64 unused[3];
+} __attribute__ ((__packed__));
+
 #endif /* _BTRFS_CTREE_H_ */
-- 
2.21.0 (Apple Git-120)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RFC v2] btrfs-progs: add readmirror property and ioctl to set get readmirror
  2019-08-26  9:04 [PATCH RFC v2 0/2] readmirror feature Anand Jain
  2019-08-26  9:04 ` [PATCH RFC v2] btrfs: add readmirror framework and policy devid Anand Jain
@ 2019-08-26  9:04 ` Anand Jain
  2019-08-29  3:39   ` [PATCH RFC v2.1] btrfs-progs: add readmirror property and ioctl to set get Anand Jain
  2019-09-11 16:37 ` [PATCH RFC v2 0/2] readmirror feature David Sterba
  2019-09-11 18:42 ` Josef Bacik
  3 siblings, 1 reply; 17+ messages in thread
From: Anand Jain @ 2019-08-26  9:04 UTC (permalink / raw)
  To: linux-btrfs

This patch adds readmirror property to be applied at the filesystem object.
And uses ioctl BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR to get
and set the property respectively.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v1->RFC v2:
  . Changed format specifier from devid1,2,3.. to devid:1,2,3..

 ioctl.h | 14 +++++++++
 props.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 112 insertions(+)

diff --git a/ioctl.h b/ioctl.h
index 66ee599f7a82..ccdf600bae77 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -765,6 +765,16 @@ struct btrfs_ioctl_get_subvol_rootref_args {
 };
 BUILD_ASSERT(sizeof(struct btrfs_ioctl_get_subvol_rootref_args) == 4096);
 
+enum btrfs_readmirror_policy {
+	BTRFS_READMIRROR_DEFAULT = 0,
+	BTRFS_READMIRROR_DEVID,
+};
+
+struct btrfs_ioctl_readmirror_args {
+	__u64 type; /* RW */
+	__u64 device_bitmap; /* RW */
+};
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
 	notused,
@@ -929,6 +939,10 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
 				struct btrfs_ioctl_get_subvol_rootref_args)
 #define BTRFS_IOC_INO_LOOKUP_USER _IOWR(BTRFS_IOCTL_MAGIC, 62, \
 				struct btrfs_ioctl_ino_lookup_user_args)
+#define BTRFS_IOC_GET_READMIRROR _IOWR(BTRFS_IOCTL_MAGIC, 63, \
+				struct btrfs_ioctl_readmirror_args)
+#define BTRFS_IOC_SET_READMIRROR _IOWR(BTRFS_IOCTL_MAGIC, 64, \
+				struct btrfs_ioctl_readmirror_args)
 #ifdef __cplusplus
 }
 #endif
diff --git a/props.c b/props.c
index 004022600d51..f077e55e2274 100644
--- a/props.c
+++ b/props.c
@@ -166,6 +166,98 @@ out:
 	return ret;
 }
 
+static int prop_readmirror(enum prop_object_type type, const char *object,
+			   const char *name, const char *value)
+{
+	int fd;
+	int ret;
+	DIR *dirstream = NULL;
+
+	fd = open_file_or_dir3(object, &dirstream, value ? O_RDWR : O_RDONLY);
+	if (fd < 0) {
+		ret = -errno;
+		error("failed to open %s: %m", object);
+		return ret;
+	}
+
+	if (value) {
+		int ret;
+		int final_ret;
+		size_t len = strlen(value);
+		char *value_dup;
+		char *devid_str;
+		struct btrfs_ioctl_readmirror_args readmirror;
+
+		if (len > 0 && (len <= 6 || strncmp("devid:", value, 6)))
+			return -EINVAL;
+
+		/* value format - an example: readmirror devid:1,2,3,.. */
+		value_dup = strndup(value + 6, len - 6);
+		if (!value_dup)
+			return -ENOMEM;
+
+		final_ret = 0;
+		if (len == 0) {
+			readmirror.type = BTRFS_READMIRROR_DEFAULT;
+			final_ret = ioctl(fd, BTRFS_IOC_SET_READMIRROR, &readmirror);
+			if (final_ret < 0)
+				error("failed to reset readmirror: (%m)");
+		} else {
+			__u64 device_bitmap = 0;
+
+			while ((devid_str = strsep(&value_dup, ",")) != NULL)
+				device_bitmap = device_bitmap |
+						(1ULL << arg_strtou64(devid_str));
+
+			readmirror.type = BTRFS_READMIRROR_DEVID;
+			readmirror.device_bitmap = device_bitmap;
+			ret = ioctl(fd, BTRFS_IOC_SET_READMIRROR, &readmirror);
+			if (ret < 0) {
+				error("failed to set readmirror: (%m)");
+				final_ret = ret;
+			}
+		}
+		return final_ret;
+	} else {
+		struct btrfs_ioctl_readmirror_args readmirror;
+		struct btrfs_ioctl_dev_info_args *device;
+		struct btrfs_ioctl_fs_info_args fs_devices;
+
+		ret = get_fs_info(object, &fs_devices, &device);
+		if (ret)
+			return ret;
+
+		free(device);
+		if (fs_devices.num_devices > 63) {
+			error("can't set readmirror for device id more than 63");
+			return -EOPNOTSUPP;
+		}
+
+		ret = ioctl(fd, BTRFS_IOC_GET_READMIRROR, &readmirror);
+		if (ret) {
+			error("readmirror ioctl failed: %m");
+			return ret;
+		}
+		if (readmirror.type == BTRFS_READMIRROR_DEFAULT) {
+			printf("readmirror default\n");
+		} else if (readmirror.type == BTRFS_READMIRROR_DEVID) {
+			u64 cnt;
+
+			printf("readmirror devid:");
+			for (cnt = 0; cnt < 64; cnt++) {
+				if ((1ULL << cnt) & readmirror.device_bitmap)
+					printf("%llu ", cnt);
+			}
+			printf("\n");
+		}
+		return 0;
+	}
+
+	close_file_or_dir(fd, dirstream);
+
+	return ret;
+}
+
 const struct prop_handler prop_handlers[] = {
 	{
 		.name ="ro",
@@ -187,5 +279,11 @@ const struct prop_handler prop_handlers[] = {
 		.read_only = 0,
 	 	.types = prop_object_inode, prop_compression
 	},
+	{
+		.name = "readmirror",
+		.desc = "specify the device to be used for read",
+		.read_only = 0,
+		.types = prop_object_root, prop_readmirror
+	},
 	{NULL, NULL, 0, 0, NULL}
 };
-- 
2.21.0 (Apple Git-120)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RFC v2.1] btrfs-progs: add readmirror property and ioctl to set get
  2019-08-26  9:04 ` [PATCH RFC v2] btrfs-progs: add readmirror property and ioctl to set get readmirror Anand Jain
@ 2019-08-29  3:39   ` Anand Jain
  0 siblings, 0 replies; 17+ messages in thread
From: Anand Jain @ 2019-08-29  3:39 UTC (permalink / raw)
  To: linux-btrfs

This patch adds readmirror property to be applied at the filesystem object.
And uses ioctl BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR to get
and set the property respectively.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
RFC v2->RFC v2.1:
  add dump-tree support, depends on the patch
     [PATCH] btrfs-progs: print-tree add missing DEV_STATS
v1->RFC v2:
  Changed format specifier from devid1,2,3.. to devid:1,2,3..

 ctree.h                   | 14 ++++++
 ioctl.h                   | 14 ++++++
 libbtrfsutil/btrfs_tree.h | 11 +++++
 print-tree.c              | 15 ++++++
 props.c                   | 98 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 152 insertions(+)

diff --git a/ctree.h b/ctree.h
index 0d12563b7261..f9523ea4cea6 100644
--- a/ctree.h
+++ b/ctree.h
@@ -92,6 +92,9 @@ struct btrfs_free_space_ctl;
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
+/* store readmirror policy inforamtion in the device tree */
+#define BTRFS_READMIRROR_OBJECTID -3ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -879,6 +882,14 @@ struct btrfs_balance_item {
 	__le64 unused[4];
 } __attribute__ ((__packed__));
 
+/*
+ * readmirror's persistent storage format
+ */
+struct btrfs_readmirror_item {
+       __le64 type;
+       __le64 unused[3];
+} __attribute__ ((__packed__));
+
 #define BTRFS_FILE_EXTENT_INLINE 0
 #define BTRFS_FILE_EXTENT_REG 1
 #define BTRFS_FILE_EXTENT_PREALLOC 2
@@ -2389,6 +2400,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_qgroup_limit_rsv_exclusive,
 /* btrfs_balance_item */
 BTRFS_SETGET_FUNCS(balance_item_flags, struct btrfs_balance_item, flags, 64);
 
+/* btrfs_readmirror_item */
+BTRFS_SETGET_FUNCS(readmirror_type, struct btrfs_readmirror_item, type, 64);
+
 static inline struct btrfs_disk_balance_args* btrfs_balance_item_data(
 		struct extent_buffer *eb, struct btrfs_balance_item *bi)
 {
diff --git a/ioctl.h b/ioctl.h
index 66ee599f7a82..ccdf600bae77 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -765,6 +765,16 @@ struct btrfs_ioctl_get_subvol_rootref_args {
 };
 BUILD_ASSERT(sizeof(struct btrfs_ioctl_get_subvol_rootref_args) == 4096);
 
+enum btrfs_readmirror_policy {
+	BTRFS_READMIRROR_DEFAULT = 0,
+	BTRFS_READMIRROR_DEVID,
+};
+
+struct btrfs_ioctl_readmirror_args {
+	__u64 type; /* RW */
+	__u64 device_bitmap; /* RW */
+};
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
 	notused,
@@ -929,6 +939,10 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
 				struct btrfs_ioctl_get_subvol_rootref_args)
 #define BTRFS_IOC_INO_LOOKUP_USER _IOWR(BTRFS_IOCTL_MAGIC, 62, \
 				struct btrfs_ioctl_ino_lookup_user_args)
+#define BTRFS_IOC_GET_READMIRROR _IOWR(BTRFS_IOCTL_MAGIC, 63, \
+				struct btrfs_ioctl_readmirror_args)
+#define BTRFS_IOC_SET_READMIRROR _IOWR(BTRFS_IOCTL_MAGIC, 64, \
+				struct btrfs_ioctl_readmirror_args)
 #ifdef __cplusplus
 }
 #endif
diff --git a/libbtrfsutil/btrfs_tree.h b/libbtrfsutil/btrfs_tree.h
index 8ea3e31d9b96..b6785fef4a8d 100644
--- a/libbtrfsutil/btrfs_tree.h
+++ b/libbtrfsutil/btrfs_tree.h
@@ -51,6 +51,9 @@
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
+/* store readmirror policy inforamtion in the device tree */
+#define BTRFS_READMIRROR_OBJECTID -3ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -962,4 +965,12 @@ struct btrfs_qgroup_limit_item {
 	__le64 rsv_excl;
 } __attribute__ ((__packed__));
 
+/*
+ * readmirror's persistent storage format
+ */
+struct btrfs_readmirror_item {
+	__le64 type;
+	__le64 unused[3];
+} __attribute__ ((__packed__));
+
 #endif /* _BTRFS_CTREE_H_ */
diff --git a/print-tree.c b/print-tree.c
index 15d8806b6f28..4ed698ee3a62 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -692,6 +692,8 @@ void print_objectid(FILE *stream, u64 objectid, u8 type)
 	case BTRFS_PERSISTENT_ITEM_KEY:
 		if (objectid == BTRFS_DEV_STATS_OBJECTID)
 			fprintf(stream, "DEV_STATS");
+		else if (objectid == BTRFS_READMIRROR_OBJECTID)
+			fprintf(stream, "READMIRROR");
 		else
 			fprintf(stream, "%llu", (unsigned long long)objectid);
 		return;
@@ -960,6 +962,16 @@ static void print_balance_item(struct extent_buffer *eb,
 	print_disk_balance_args(btrfs_balance_item_sys(eb, bi));
 }
 
+static void print_readmirror(struct extent_buffer *eb,
+			     struct btrfs_readmirror_item *rm, u32 size)
+{
+	if (btrfs_readmirror_type(eb, rm) == 1)
+		printf("\t\treadmirror.type devid\n");
+	else
+		printf("\t\treadmirror.type 0x%llx\n",
+			btrfs_readmirror_type(eb, rm));
+}
+
 static void print_dev_stats(struct extent_buffer *eb,
 		struct btrfs_dev_stats_item *stats, u32 size)
 {
@@ -1109,6 +1121,9 @@ static void print_persistent_item(struct extent_buffer *eb, void *ptr,
 	print_objectid(stdout, objectid, BTRFS_PERSISTENT_ITEM_KEY);
 	printf(" offset %llu\n", (unsigned long long)offset);
 	switch (objectid) {
+	case BTRFS_READMIRROR_OBJECTID:
+		print_readmirror(eb, ptr, item_size);
+		break;
 	case BTRFS_DEV_STATS_OBJECTID:
 		print_dev_stats(eb, ptr, item_size);
 		break;
diff --git a/props.c b/props.c
index 004022600d51..f077e55e2274 100644
--- a/props.c
+++ b/props.c
@@ -166,6 +166,98 @@ out:
 	return ret;
 }
 
+static int prop_readmirror(enum prop_object_type type, const char *object,
+			   const char *name, const char *value)
+{
+	int fd;
+	int ret;
+	DIR *dirstream = NULL;
+
+	fd = open_file_or_dir3(object, &dirstream, value ? O_RDWR : O_RDONLY);
+	if (fd < 0) {
+		ret = -errno;
+		error("failed to open %s: %m", object);
+		return ret;
+	}
+
+	if (value) {
+		int ret;
+		int final_ret;
+		size_t len = strlen(value);
+		char *value_dup;
+		char *devid_str;
+		struct btrfs_ioctl_readmirror_args readmirror;
+
+		if (len > 0 && (len <= 6 || strncmp("devid:", value, 6)))
+			return -EINVAL;
+
+		/* value format - an example: readmirror devid:1,2,3,.. */
+		value_dup = strndup(value + 6, len - 6);
+		if (!value_dup)
+			return -ENOMEM;
+
+		final_ret = 0;
+		if (len == 0) {
+			readmirror.type = BTRFS_READMIRROR_DEFAULT;
+			final_ret = ioctl(fd, BTRFS_IOC_SET_READMIRROR, &readmirror);
+			if (final_ret < 0)
+				error("failed to reset readmirror: (%m)");
+		} else {
+			__u64 device_bitmap = 0;
+
+			while ((devid_str = strsep(&value_dup, ",")) != NULL)
+				device_bitmap = device_bitmap |
+						(1ULL << arg_strtou64(devid_str));
+
+			readmirror.type = BTRFS_READMIRROR_DEVID;
+			readmirror.device_bitmap = device_bitmap;
+			ret = ioctl(fd, BTRFS_IOC_SET_READMIRROR, &readmirror);
+			if (ret < 0) {
+				error("failed to set readmirror: (%m)");
+				final_ret = ret;
+			}
+		}
+		return final_ret;
+	} else {
+		struct btrfs_ioctl_readmirror_args readmirror;
+		struct btrfs_ioctl_dev_info_args *device;
+		struct btrfs_ioctl_fs_info_args fs_devices;
+
+		ret = get_fs_info(object, &fs_devices, &device);
+		if (ret)
+			return ret;
+
+		free(device);
+		if (fs_devices.num_devices > 63) {
+			error("can't set readmirror for device id more than 63");
+			return -EOPNOTSUPP;
+		}
+
+		ret = ioctl(fd, BTRFS_IOC_GET_READMIRROR, &readmirror);
+		if (ret) {
+			error("readmirror ioctl failed: %m");
+			return ret;
+		}
+		if (readmirror.type == BTRFS_READMIRROR_DEFAULT) {
+			printf("readmirror default\n");
+		} else if (readmirror.type == BTRFS_READMIRROR_DEVID) {
+			u64 cnt;
+
+			printf("readmirror devid:");
+			for (cnt = 0; cnt < 64; cnt++) {
+				if ((1ULL << cnt) & readmirror.device_bitmap)
+					printf("%llu ", cnt);
+			}
+			printf("\n");
+		}
+		return 0;
+	}
+
+	close_file_or_dir(fd, dirstream);
+
+	return ret;
+}
+
 const struct prop_handler prop_handlers[] = {
 	{
 		.name ="ro",
@@ -187,5 +279,11 @@ const struct prop_handler prop_handlers[] = {
 		.read_only = 0,
 	 	.types = prop_object_inode, prop_compression
 	},
+	{
+		.name = "readmirror",
+		.desc = "specify the device to be used for read",
+		.read_only = 0,
+		.types = prop_object_root, prop_readmirror
+	},
 	{NULL, NULL, 0, 0, NULL}
 };
-- 
2.21.0 (Apple Git-120)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2]  readmirror feature
  2019-08-26  9:04 [PATCH RFC v2 0/2] readmirror feature Anand Jain
  2019-08-26  9:04 ` [PATCH RFC v2] btrfs: add readmirror framework and policy devid Anand Jain
  2019-08-26  9:04 ` [PATCH RFC v2] btrfs-progs: add readmirror property and ioctl to set get readmirror Anand Jain
@ 2019-09-11 16:37 ` David Sterba
  2019-09-12  7:48   ` Anand Jain
  2019-09-11 18:42 ` Josef Bacik
  3 siblings, 1 reply; 17+ messages in thread
From: David Sterba @ 2019-09-11 16:37 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
> 
> This patch introduces a framework so that we can add policies to determine
> the %mirror_num. And also adds the devid as the readmirror policy.
> 
> The new property is stored as an item in the device tree as show below.
>     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
> 
> To be able to set and get this new property also introduces new ioctls
> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
> is defined as
>         struct btrfs_ioctl_readmirror_args {
>                 __u64 type; /* RW */
>                 __u64 device_bitmap; /* RW */
>         }

I don't remember if there was a suggestion to use ioctls for read
mirror, but the property interface should be sufficient. Besides this
ioctl interafce is quite an anti-pattern: narrow use, non-extensible
structure, alternative and more convenient interface already available.

> An usage example as follows:
>         btrfs property set /btrfs readmirror devid:1,3
>         btrfs property get /btrfs readmirror
>           readmirror devid:1 3
>         btrfs property set /btrfs readmirror ""
>         btrfs property get /btrfs readmirror
>           readmirror default
> 
> This patchset has been tested completely, however marked as RFC for the 
> following reasons and comments on them (or any other) are appreciated as
> usual.
>  . The new objectid is defined as
>       #define BTRFS_READMIRROR_OBJECTID -1ULL
>    Need consent we are fine to use this value, and with this value it
>    shall be placed just before the DEV_STATS_OBJECTID item which is more
>    frequently used only during the device errors.

-1 can be considered a special value in other cases, not necessarily
here but if the ordering of items is the only reason I'd say no. The
keys/items will most likely land in the same node so there's no point
forcing the order.

> .  I am using a u64 bitmap to represent the devices id, so the max device
>    id that we could represent is 63, its a kind of limitation which should
>    be addressed before integration, I wonder if there is any suggestion?

Uh 63, so now an implementation detail is posing a global limit? That
sounds backwards.

>    Kindly note that, multiple ioctls with each time representing a set of
>    device(s) is not a choice because we need to make sure the readmirror
>    changes happens in a commit transaction.

I believe this can be guaranteed by the properties interface, ie. single
value gets processed at once and with some locking around the state of
devices can be updated atomically.

The open question is still how to store the readmirror property
per-device, it could be either an item or bit inside the device
structure.

Besides the interface, I'm kind of missing the usecase description what
is expected from the read mirror policies and how they could affect
various scenarios. Maybe it was in some of the previous iterations, it's
hard too track everything so this should be part of the cover letter (or
at leat linked if it's too much text).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2]  readmirror feature
  2019-08-26  9:04 [PATCH RFC v2 0/2] readmirror feature Anand Jain
                   ` (2 preceding siblings ...)
  2019-09-11 16:37 ` [PATCH RFC v2 0/2] readmirror feature David Sterba
@ 2019-09-11 18:42 ` Josef Bacik
  2019-09-11 19:13   ` Eli V
  3 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2019-09-11 18:42 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
> 
> This patch introduces a framework so that we can add policies to determine
> the %mirror_num. And also adds the devid as the readmirror policy.
> 
> The new property is stored as an item in the device tree as show below.
>     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
> 
> To be able to set and get this new property also introduces new ioctls
> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
> is defined as
>         struct btrfs_ioctl_readmirror_args {
>                 __u64 type; /* RW */
>                 __u64 device_bitmap; /* RW */
>         }
> 
> An usage example as follows:
>         btrfs property set /btrfs readmirror devid:1,3
>         btrfs property get /btrfs readmirror
>           readmirror devid:1 3
>         btrfs property set /btrfs readmirror ""
>         btrfs property get /btrfs readmirror
>           readmirror default
> 
> This patchset has been tested completely, however marked as RFC for the 
> following reasons and comments on them (or any other) are appreciated as
> usual.
>  . The new objectid is defined as
>       #define BTRFS_READMIRROR_OBJECTID -1ULL
>    Need consent we are fine to use this value, and with this value it
>    shall be placed just before the DEV_STATS_OBJECTID item which is more
>    frequently used only during the device errors.
> 
> .  I am using a u64 bitmap to represent the devices id, so the max device
>    id that we could represent is 63, its a kind of limitation which should
>    be addressed before integration, I wonder if there is any suggestion?
>    Kindly note that, multiple ioctls with each time representing a set of
>    device(s) is not a choice because we need to make sure the readmirror
>    changes happens in a commit transaction.
> 
> v1->RFC v2:
>   . Property is stored as a dev-tree item instead of root inode extended
>     attribute.
>   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
>   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
> 
> RFC->v1:
>   Drops pid as one of the readmirror policy choices and as usual remains
>   as default. And when the devid is reset the readmirror policy falls back
>   to pid.
>   Drops the mount -o readmirror idea, it can be added at a later point of
>   time.
>   Property now accepts more than 1 devid as readmirror device. As shown
>   in the example above.
> 

This is a lot of infrastructure to just change which mirror we read to based on
some arbitrary user policy.  I assume this is to solve the case where you have
slow and fast disks, so you can always read from the fast disk?  And then it's
only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
SSD and a normal disk?  I'm not seeing a point to this much code for one
particular obscure setup.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-11 18:42 ` Josef Bacik
@ 2019-09-11 19:13   ` Eli V
  2019-09-11 19:16     ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Eli V @ 2019-09-11 19:13 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Anand Jain, linux-btrfs

On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
> > Function call chain  __btrfs_map_block()->find_live_mirror() uses
> > thread pid to determine the %mirror_num when the mirror_num=0.
> >
> > This patch introduces a framework so that we can add policies to determine
> > the %mirror_num. And also adds the devid as the readmirror policy.
> >
> > The new property is stored as an item in the device tree as show below.
> >     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
> >
> > To be able to set and get this new property also introduces new ioctls
> > BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
> > is defined as
> >         struct btrfs_ioctl_readmirror_args {
> >                 __u64 type; /* RW */
> >                 __u64 device_bitmap; /* RW */
> >         }
> >
> > An usage example as follows:
> >         btrfs property set /btrfs readmirror devid:1,3
> >         btrfs property get /btrfs readmirror
> >           readmirror devid:1 3
> >         btrfs property set /btrfs readmirror ""
> >         btrfs property get /btrfs readmirror
> >           readmirror default
> >
> > This patchset has been tested completely, however marked as RFC for the
> > following reasons and comments on them (or any other) are appreciated as
> > usual.
> >  . The new objectid is defined as
> >       #define BTRFS_READMIRROR_OBJECTID -1ULL
> >    Need consent we are fine to use this value, and with this value it
> >    shall be placed just before the DEV_STATS_OBJECTID item which is more
> >    frequently used only during the device errors.
> >
> > .  I am using a u64 bitmap to represent the devices id, so the max device
> >    id that we could represent is 63, its a kind of limitation which should
> >    be addressed before integration, I wonder if there is any suggestion?
> >    Kindly note that, multiple ioctls with each time representing a set of
> >    device(s) is not a choice because we need to make sure the readmirror
> >    changes happens in a commit transaction.
> >
> > v1->RFC v2:
> >   . Property is stored as a dev-tree item instead of root inode extended
> >     attribute.
> >   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
> >   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
> >
> > RFC->v1:
> >   Drops pid as one of the readmirror policy choices and as usual remains
> >   as default. And when the devid is reset the readmirror policy falls back
> >   to pid.
> >   Drops the mount -o readmirror idea, it can be added at a later point of
> >   time.
> >   Property now accepts more than 1 devid as readmirror device. As shown
> >   in the example above.
> >
>
> This is a lot of infrastructure to just change which mirror we read to based on
> some arbitrary user policy.  I assume this is to solve the case where you have
> slow and fast disks, so you can always read from the fast disk?  And then it's
> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
> SSD and a normal disk?  I'm not seeing a point to this much code for one
> particular obscure setup.  Thanks,
>
> Josef

Not commenting on the code itself, but as a user I see this SSD RAID1
acceleration as a future much have feature. It's only obscure at the
moment because we don't have code to take advantage of it. But on
large btrfs filesystems with hundreds of GB of metadata, like I have
for backups, the usability of the filesystem is dramatically improved
having the metadata on an SSD( though currently only half of the time
due to the even/odd pid distribution.)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-11 19:13   ` Eli V
@ 2019-09-11 19:16     ` Josef Bacik
  2019-09-12  7:41       ` Anand Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2019-09-11 19:16 UTC (permalink / raw)
  To: Eli V; +Cc: Josef Bacik, Anand Jain, linux-btrfs

On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
> > > Function call chain  __btrfs_map_block()->find_live_mirror() uses
> > > thread pid to determine the %mirror_num when the mirror_num=0.
> > >
> > > This patch introduces a framework so that we can add policies to determine
> > > the %mirror_num. And also adds the devid as the readmirror policy.
> > >
> > > The new property is stored as an item in the device tree as show below.
> > >     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
> > >
> > > To be able to set and get this new property also introduces new ioctls
> > > BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
> > > is defined as
> > >         struct btrfs_ioctl_readmirror_args {
> > >                 __u64 type; /* RW */
> > >                 __u64 device_bitmap; /* RW */
> > >         }
> > >
> > > An usage example as follows:
> > >         btrfs property set /btrfs readmirror devid:1,3
> > >         btrfs property get /btrfs readmirror
> > >           readmirror devid:1 3
> > >         btrfs property set /btrfs readmirror ""
> > >         btrfs property get /btrfs readmirror
> > >           readmirror default
> > >
> > > This patchset has been tested completely, however marked as RFC for the
> > > following reasons and comments on them (or any other) are appreciated as
> > > usual.
> > >  . The new objectid is defined as
> > >       #define BTRFS_READMIRROR_OBJECTID -1ULL
> > >    Need consent we are fine to use this value, and with this value it
> > >    shall be placed just before the DEV_STATS_OBJECTID item which is more
> > >    frequently used only during the device errors.
> > >
> > > .  I am using a u64 bitmap to represent the devices id, so the max device
> > >    id that we could represent is 63, its a kind of limitation which should
> > >    be addressed before integration, I wonder if there is any suggestion?
> > >    Kindly note that, multiple ioctls with each time representing a set of
> > >    device(s) is not a choice because we need to make sure the readmirror
> > >    changes happens in a commit transaction.
> > >
> > > v1->RFC v2:
> > >   . Property is stored as a dev-tree item instead of root inode extended
> > >     attribute.
> > >   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
> > >   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
> > >
> > > RFC->v1:
> > >   Drops pid as one of the readmirror policy choices and as usual remains
> > >   as default. And when the devid is reset the readmirror policy falls back
> > >   to pid.
> > >   Drops the mount -o readmirror idea, it can be added at a later point of
> > >   time.
> > >   Property now accepts more than 1 devid as readmirror device. As shown
> > >   in the example above.
> > >
> >
> > This is a lot of infrastructure to just change which mirror we read to based on
> > some arbitrary user policy.  I assume this is to solve the case where you have
> > slow and fast disks, so you can always read from the fast disk?  And then it's
> > only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
> > SSD and a normal disk?  I'm not seeing a point to this much code for one
> > particular obscure setup.  Thanks,
> >
> > Josef
> 
> Not commenting on the code itself, but as a user I see this SSD RAID1
> acceleration as a future much have feature. It's only obscure at the
> moment because we don't have code to take advantage of it. But on
> large btrfs filesystems with hundreds of GB of metadata, like I have
> for backups, the usability of the filesystem is dramatically improved
> having the metadata on an SSD( though currently only half of the time
> due to the even/odd pid distribution.)

But that's different from a mirror.  100% it would be nice to say "put my
metadata on the ssd, data elsewhere".  That's not what this patch is about, this
patch is specifically about changing which drive we choose in a mirrored setup,
which is super unlikely to mirror a SSD with a slow drive, cause it's just going
to be slow no matter what.  Sure we could make it so reads always go to the SSD,
but we can accomplish that by just adding a check for nonrotational in the code,
and then we don't have to encode all this nonsense in the file system.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-11 19:16     ` Josef Bacik
@ 2019-09-12  7:41       ` Anand Jain
  2019-09-12  9:50         ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Anand Jain @ 2019-09-12  7:41 UTC (permalink / raw)
  To: Josef Bacik, Eli V; +Cc: linux-btrfs



Thanks for the comments. More below.

On 12/9/19 3:16 AM, Josef Bacik wrote:
> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>>
>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>
>>>> This patch introduces a framework so that we can add policies to determine
>>>> the %mirror_num. And also adds the devid as the readmirror policy.
>>>>
>>>> The new property is stored as an item in the device tree as show below.
>>>>      (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
>>>>
>>>> To be able to set and get this new property also introduces new ioctls
>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
>>>> is defined as
>>>>          struct btrfs_ioctl_readmirror_args {
>>>>                  __u64 type; /* RW */
>>>>                  __u64 device_bitmap; /* RW */
>>>>          }
>>>>
>>>> An usage example as follows:
>>>>          btrfs property set /btrfs readmirror devid:1,3
>>>>          btrfs property get /btrfs readmirror
>>>>            readmirror devid:1 3
>>>>          btrfs property set /btrfs readmirror ""
>>>>          btrfs property get /btrfs readmirror
>>>>            readmirror default
>>>>
>>>> This patchset has been tested completely, however marked as RFC for the
>>>> following reasons and comments on them (or any other) are appreciated as
>>>> usual.
>>>>   . The new objectid is defined as
>>>>        #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>     Need consent we are fine to use this value, and with this value it
>>>>     shall be placed just before the DEV_STATS_OBJECTID item which is more
>>>>     frequently used only during the device errors.
>>>>
>>>> .  I am using a u64 bitmap to represent the devices id, so the max device
>>>>     id that we could represent is 63, its a kind of limitation which should
>>>>     be addressed before integration, I wonder if there is any suggestion?
>>>>     Kindly note that, multiple ioctls with each time representing a set of
>>>>     device(s) is not a choice because we need to make sure the readmirror
>>>>     changes happens in a commit transaction.
>>>>
>>>> v1->RFC v2:
>>>>    . Property is stored as a dev-tree item instead of root inode extended
>>>>      attribute.
>>>>    . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
>>>>    . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>
>>>> RFC->v1:
>>>>    Drops pid as one of the readmirror policy choices and as usual remains
>>>>    as default. And when the devid is reset the readmirror policy falls back
>>>>    to pid.
>>>>    Drops the mount -o readmirror idea, it can be added at a later point of
>>>>    time.
>>>>    Property now accepts more than 1 devid as readmirror device. As shown
>>>>    in the example above.
>>>>
>>>
>>> This is a lot of infrastructure

   Ok. Any idea on a better implementation?
   How about extended attribute approach? v1 patches proposed
   it, but it abused the extended attribute as commented here [1]
   and v2 got changed to an item-key.

[1]
https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/


>>> to just change which mirror we read to based on
>>> some arbitrary user policy.  I assume this is to solve the case where you have
>>> slow and fast disks, so you can always read from the fast disk?  And then it's
>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
>>> SSD and a normal disk?  I'm not seeing a point to this much code for one
>>> particular obscure setup.  Thanks,
>>>
>>> Josef
>>
>> Not commenting on the code itself, but as a user I see this SSD RAID1
>> acceleration as a future much have feature. It's only obscure at the
>> moment because we don't have code to take advantage of it. But on
>> large btrfs filesystems with hundreds of GB of metadata, like I have
>> for backups, the usability of the filesystem is dramatically improved
>> having the metadata on an SSD( though currently only half of the time
>> due to the even/odd pid distribution.)
> 
> But that's different from a mirror.  100% it would be nice to say "put my
> metadata on the ssd, data elsewhere".  That's not what this patch is about, this
> patch is specifically about changing which drive we choose in a mirrored setup,
> which is super unlikely to mirror a SSD with a slow drive, cause it's just going
> to be slow no matter what.  Sure we could make it so reads always go to the SSD,
> but we can accomplish that by just adding a check for nonrotational in the code,
> and then we don't have to encode all this nonsense in the file system.  Thanks,

  I wrote about the readmirror policy framework here[2],
  I forgot to link it here, sorry about that, my mistake.

  [2]
 
https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/

  Readmirror policy is for raid1, raid10 and future N way mirror.
  Yes for now its only for raid1.

  Here the idea is to create a framework so that readmirror policy
  can be configured as needed. And nonrotational can be one such policy.

  The example of hard-coded nonrotational policy does not work in case
  of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
  device, as all these are still nonrotational devices. So hard-coded
  policy is not a good idea. If we have to hardcode then there is Q-depth
  based readmirror routing is better (patch in the ML), but that is
  not good enough, because some configs wants it based on the disk-LBA
  so that SAN storage target cache is balanced and not duplicated.
  So in short it must be a configurable policy.

  devid policy is the first policy which is for the advance users when
  they know what they are doing, which is sure to support any types
  of HW configurations/combinations (except for the cache balance).

  So in total potential configurable policies are:

  - pid     - original. dropped as a policy because of the comments
              received [3].
  - Q-depth - patches are in the ML this can be the default policy.
  - LBA     - to avoid duplicating the cache on the storage target
              in SAN
  - devid   - as discussed above.
  - nonrotational - as discussed above.

  So there are 5 ways to configure as needed, so a framework
  infrastructure is worth?

[3]
https://lore.kernel.org/linux-btrfs/20190409154840.GM29086@twin.jikos.cz/

Thanks, Anand

> Josef
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-11 16:37 ` [PATCH RFC v2 0/2] readmirror feature David Sterba
@ 2019-09-12  7:48   ` Anand Jain
  0 siblings, 0 replies; 17+ messages in thread
From: Anand Jain @ 2019-09-12  7:48 UTC (permalink / raw)
  To: dsterba, linux-btrfs

On 12/9/19 12:37 AM, David Sterba wrote:
> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>> thread pid to determine the %mirror_num when the mirror_num=0.
>>
>> This patch introduces a framework so that we can add policies to determine
>> the %mirror_num. And also adds the devid as the readmirror policy.
>>
>> The new property is stored as an item in the device tree as show below.
>>      (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
>>
>> To be able to set and get this new property also introduces new ioctls
>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
>> is defined as
>>          struct btrfs_ioctl_readmirror_args {
>>                  __u64 type; /* RW */
>>                  __u64 device_bitmap; /* RW */
>>          }
> 
> I don't remember if there was a suggestion to use ioctls for read
> mirror, but the property interface should be sufficient. Besides this
> ioctl interafce is quite an anti-pattern: narrow use, non-extensible
> structure, alternative and more convenient interface already available.

  Extended attribute interface f(get/set)attr is inode bound, but
  the readmirror property is filesystem bound. For the readmirror we
  can still use the extended attribute, but it might be considered as
  abuse which we haven't done so far, here below [1] is the list of
  property with the interface it uses and where the property is saved.

[1]
  property      interface  save-as
  ro            ioctl      root-item
  label         ioctl      super-block
  compression   xattr      xattr
  v1:readmirror xattr      xattr
  v2:readmirror ioctl      dev-tree-item

  You are asking for the combination of
    property   interface  save-as
    readmirror xattr      dev-tree-item

  I can give a try.

>> An usage example as follows:
>>          btrfs property set /btrfs readmirror devid:1,3
>>          btrfs property get /btrfs readmirror
>>            readmirror devid:1 3
>>          btrfs property set /btrfs readmirror ""
>>          btrfs property get /btrfs readmirror
>>            readmirror default
>>
>> This patchset has been tested completely, however marked as RFC for the
>> following reasons and comments on them (or any other) are appreciated as
>> usual.
>>   . The new objectid is defined as
>>        #define BTRFS_READMIRROR_OBJECTID -1ULL
>>     Need consent we are fine to use this value, and with this value it
>>     shall be placed just before the DEV_STATS_OBJECTID item which is more
>>     frequently used only during the device errors.
> 
> -1 can be considered a special value in other cases, not necessarily
> here but if the ordering of items is the only reason I'd say no. The
> keys/items will most likely land in the same node so there's no point
> forcing the order.

  Agreed. Any suggestion on the value for the BTRFS_READMIRROR_OBJECTID.

>> .  I am using a u64 bitmap to represent the devices id, so the max device
>>     id that we could represent is 63, its a kind of limitation which should
>>     be addressed before integration, I wonder if there is any suggestion?
> 
> Uh 63, so now an implementation detail is posing a global limit? That
> sounds backwards.

   Yes. And I was thinking of u64 array but that doesn't scale as well.
   Anyways I have in the list to try using xattr interface which may
   address this issue.

>>     Kindly note that, multiple ioctls with each time representing a set of
>>     device(s) is not a choice because we need to make sure the readmirror
>>     changes happens in a commit transaction.
> 
> I believe this can be guaranteed by the properties interface, ie. single
> value gets processed at once and with some locking around the state of
> devices can be updated atomically.
> 
> The open question is still how to store the readmirror property
> per-device, it could be either an item or bit inside the device
> structure.

  We discussed that before here.

 
https://lore.kernel.org/linux-btrfs/8d31c3a2-3fb0-63af-3173-948ed0e84de3@oracle.com/

> Besides the interface, I'm kind of missing the usecase description what
> is expected from the read mirror policies and how they could affect
> various scenarios. Maybe it was in some of the previous iterations, it's
> hard too track everything so this should be part of the cover letter (or
> at leat linked if it's too much text).
> 

  Yep. My mistake I missed it link it. Sorry about that.

Thanks, Anand

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-12  7:41       ` Anand Jain
@ 2019-09-12  9:50         ` Josef Bacik
  2019-09-12 10:00           ` Anand Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2019-09-12  9:50 UTC (permalink / raw)
  To: Anand Jain; +Cc: Josef Bacik, Eli V, linux-btrfs

On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
> 
> 
> Thanks for the comments. More below.
> 
> On 12/9/19 3:16 AM, Josef Bacik wrote:
> > On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
> > > On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > > 
> > > > On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
> > > > > Function call chain  __btrfs_map_block()->find_live_mirror() uses
> > > > > thread pid to determine the %mirror_num when the mirror_num=0.
> > > > > 
> > > > > This patch introduces a framework so that we can add policies to determine
> > > > > the %mirror_num. And also adds the devid as the readmirror policy.
> > > > > 
> > > > > The new property is stored as an item in the device tree as show below.
> > > > >      (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
> > > > > 
> > > > > To be able to set and get this new property also introduces new ioctls
> > > > > BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
> > > > > is defined as
> > > > >          struct btrfs_ioctl_readmirror_args {
> > > > >                  __u64 type; /* RW */
> > > > >                  __u64 device_bitmap; /* RW */
> > > > >          }
> > > > > 
> > > > > An usage example as follows:
> > > > >          btrfs property set /btrfs readmirror devid:1,3
> > > > >          btrfs property get /btrfs readmirror
> > > > >            readmirror devid:1 3
> > > > >          btrfs property set /btrfs readmirror ""
> > > > >          btrfs property get /btrfs readmirror
> > > > >            readmirror default
> > > > > 
> > > > > This patchset has been tested completely, however marked as RFC for the
> > > > > following reasons and comments on them (or any other) are appreciated as
> > > > > usual.
> > > > >   . The new objectid is defined as
> > > > >        #define BTRFS_READMIRROR_OBJECTID -1ULL
> > > > >     Need consent we are fine to use this value, and with this value it
> > > > >     shall be placed just before the DEV_STATS_OBJECTID item which is more
> > > > >     frequently used only during the device errors.
> > > > > 
> > > > > .  I am using a u64 bitmap to represent the devices id, so the max device
> > > > >     id that we could represent is 63, its a kind of limitation which should
> > > > >     be addressed before integration, I wonder if there is any suggestion?
> > > > >     Kindly note that, multiple ioctls with each time representing a set of
> > > > >     device(s) is not a choice because we need to make sure the readmirror
> > > > >     changes happens in a commit transaction.
> > > > > 
> > > > > v1->RFC v2:
> > > > >    . Property is stored as a dev-tree item instead of root inode extended
> > > > >      attribute.
> > > > >    . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
> > > > >    . Changed format specifier from devid1,2,3.. to devid:1,2,3..
> > > > > 
> > > > > RFC->v1:
> > > > >    Drops pid as one of the readmirror policy choices and as usual remains
> > > > >    as default. And when the devid is reset the readmirror policy falls back
> > > > >    to pid.
> > > > >    Drops the mount -o readmirror idea, it can be added at a later point of
> > > > >    time.
> > > > >    Property now accepts more than 1 devid as readmirror device. As shown
> > > > >    in the example above.
> > > > > 
> > > > 
> > > > This is a lot of infrastructure
> 
>   Ok. Any idea on a better implementation?
>   How about extended attribute approach? v1 patches proposed
>   it, but it abused the extended attribute as commented here [1]
>   and v2 got changed to an item-key.
> 
> [1]
> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/
>

That's a NAK on the prop interface.  This is a fs wide policy, not a
directory/inode policy.
 
> 
> > > > to just change which mirror we read to based on
> > > > some arbitrary user policy.  I assume this is to solve the case where you have
> > > > slow and fast disks, so you can always read from the fast disk?  And then it's
> > > > only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
> > > > SSD and a normal disk?  I'm not seeing a point to this much code for one
> > > > particular obscure setup.  Thanks,
> > > > 
> > > > Josef
> > > 
> > > Not commenting on the code itself, but as a user I see this SSD RAID1
> > > acceleration as a future much have feature. It's only obscure at the
> > > moment because we don't have code to take advantage of it. But on
> > > large btrfs filesystems with hundreds of GB of metadata, like I have
> > > for backups, the usability of the filesystem is dramatically improved
> > > having the metadata on an SSD( though currently only half of the time
> > > due to the even/odd pid distribution.)
> > 
> > But that's different from a mirror.  100% it would be nice to say "put my
> > metadata on the ssd, data elsewhere".  That's not what this patch is about, this
> > patch is specifically about changing which drive we choose in a mirrored setup,
> > which is super unlikely to mirror a SSD with a slow drive, cause it's just going
> > to be slow no matter what.  Sure we could make it so reads always go to the SSD,
> > but we can accomplish that by just adding a check for nonrotational in the code,
> > and then we don't have to encode all this nonsense in the file system.  Thanks,
> 
>  I wrote about the readmirror policy framework here[2],
>  I forgot to link it here, sorry about that, my mistake.
> 
>  [2]
> 
> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/
> 
>  Readmirror policy is for raid1, raid10 and future N way mirror.
>  Yes for now its only for raid1.
> 
>  Here the idea is to create a framework so that readmirror policy
>  can be configured as needed. And nonrotational can be one such policy.
> 
>  The example of hard-coded nonrotational policy does not work in case
>  of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
>  device, as all these are still nonrotational devices. So hard-coded
>  policy is not a good idea. If we have to hardcode then there is Q-depth
>  based readmirror routing is better (patch in the ML), but that is
>  not good enough, because some configs wants it based on the disk-LBA
>  so that SAN storage target cache is balanced and not duplicated.
>  So in short it must be a configurable policy.
> 

Again, if you are mixing disk types you likely always want non-rotational, but
still mixing different speed devices in a mirror setup is just asking for weird
latency problems.  I don't think solving this use case is necessary.  If you mix
ssd + network device in a serious production setup then you probably should be
fired cause you don't know what you are doing.  Having the generic
"nonrotational gets priority" is going to cover 99% of the actual use cases that
make sense.

The SAN usecase I can sort of see, but again I don't feel like it's a problem we
need to solve with on-disk format.  Add a priority to sysfs so you can change it
with udev or something on the fly.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-12  9:50         ` Josef Bacik
@ 2019-09-12 10:00           ` Anand Jain
  2019-09-12 10:03             ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Anand Jain @ 2019-09-12 10:00 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Anand Jain, Eli V, linux-btrfs



> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> 
> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>> 
>> 
>> Thanks for the comments. More below.
>> 
>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>>>> 
>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>> 
>>>>>> This patch introduces a framework so that we can add policies to determine
>>>>>> the %mirror_num. And also adds the devid as the readmirror policy.
>>>>>> 
>>>>>> The new property is stored as an item in the device tree as show below.
>>>>>>     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
>>>>>> 
>>>>>> To be able to set and get this new property also introduces new ioctls
>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
>>>>>> is defined as
>>>>>>         struct btrfs_ioctl_readmirror_args {
>>>>>>                 __u64 type; /* RW */
>>>>>>                 __u64 device_bitmap; /* RW */
>>>>>>         }
>>>>>> 
>>>>>> An usage example as follows:
>>>>>>         btrfs property set /btrfs readmirror devid:1,3
>>>>>>         btrfs property get /btrfs readmirror
>>>>>>           readmirror devid:1 3
>>>>>>         btrfs property set /btrfs readmirror ""
>>>>>>         btrfs property get /btrfs readmirror
>>>>>>           readmirror default
>>>>>> 
>>>>>> This patchset has been tested completely, however marked as RFC for the
>>>>>> following reasons and comments on them (or any other) are appreciated as
>>>>>> usual.
>>>>>>  . The new objectid is defined as
>>>>>>       #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>    Need consent we are fine to use this value, and with this value it
>>>>>>    shall be placed just before the DEV_STATS_OBJECTID item which is more
>>>>>>    frequently used only during the device errors.
>>>>>> 
>>>>>> .  I am using a u64 bitmap to represent the devices id, so the max device
>>>>>>    id that we could represent is 63, its a kind of limitation which should
>>>>>>    be addressed before integration, I wonder if there is any suggestion?
>>>>>>    Kindly note that, multiple ioctls with each time representing a set of
>>>>>>    device(s) is not a choice because we need to make sure the readmirror
>>>>>>    changes happens in a commit transaction.
>>>>>> 
>>>>>> v1->RFC v2:
>>>>>>   . Property is stored as a dev-tree item instead of root inode extended
>>>>>>     attribute.
>>>>>>   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>> 
>>>>>> RFC->v1:
>>>>>>   Drops pid as one of the readmirror policy choices and as usual remains
>>>>>>   as default. And when the devid is reset the readmirror policy falls back
>>>>>>   to pid.
>>>>>>   Drops the mount -o readmirror idea, it can be added at a later point of
>>>>>>   time.
>>>>>>   Property now accepts more than 1 devid as readmirror device. As shown
>>>>>>   in the example above.
>>>>>> 
>>>>> 
>>>>> This is a lot of infrastructure
>> 
>>  Ok. Any idea on a better implementation?
>>  How about extended attribute approach? v1 patches proposed
>>  it, but it abused the extended attribute as commented here [1]
>>  and v2 got changed to an item-key.
>> 
>> [1]
>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/
>> 
> 
> That's a NAK on the prop interface.  This is a fs wide policy, not a
> directory/inode policy.
> 
>> 
>>>>> to just change which mirror we read to based on
>>>>> some arbitrary user policy.  I assume this is to solve the case where you have
>>>>> slow and fast disks, so you can always read from the fast disk?  And then it's
>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
>>>>> SSD and a normal disk?  I'm not seeing a point to this much code for one
>>>>> particular obscure setup.  Thanks,
>>>>> 
>>>>> Josef
>>>> 
>>>> Not commenting on the code itself, but as a user I see this SSD RAID1
>>>> acceleration as a future much have feature. It's only obscure at the
>>>> moment because we don't have code to take advantage of it. But on
>>>> large btrfs filesystems with hundreds of GB of metadata, like I have
>>>> for backups, the usability of the filesystem is dramatically improved
>>>> having the metadata on an SSD( though currently only half of the time
>>>> due to the even/odd pid distribution.)
>>> 
>>> But that's different from a mirror.  100% it would be nice to say "put my
>>> metadata on the ssd, data elsewhere".  That's not what this patch is about, this
>>> patch is specifically about changing which drive we choose in a mirrored setup,
>>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going
>>> to be slow no matter what.  Sure we could make it so reads always go to the SSD,
>>> but we can accomplish that by just adding a check for nonrotational in the code,
>>> and then we don't have to encode all this nonsense in the file system.  Thanks,
>> 
>> I wrote about the readmirror policy framework here[2],
>> I forgot to link it here, sorry about that, my mistake.
>> 
>> [2]
>> 
>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/
>> 
>> Readmirror policy is for raid1, raid10 and future N way mirror.
>> Yes for now its only for raid1.
>> 
>> Here the idea is to create a framework so that readmirror policy
>> can be configured as needed. And nonrotational can be one such policy.
>> 
>> The example of hard-coded nonrotational policy does not work in case
>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
>> device, as all these are still nonrotational devices. So hard-coded
>> policy is not a good idea. If we have to hardcode then there is Q-depth
>> based readmirror routing is better (patch in the ML), but that is
>> not good enough, because some configs wants it based on the disk-LBA
>> so that SAN storage target cache is balanced and not duplicated.
>> So in short it must be a configurable policy.
>> 
> 
> Again, if you are mixing disk types you likely always want non-rotational, but
> still mixing different speed devices in a mirror setup is just asking for weird
> latency problems.  I don't think solving this use case is necessary.  If you mix
> ssd + network device in a serious production setup then you probably should be
> fired cause you don't know what you are doing.  Having the generic
> "nonrotational gets priority" is going to cover 99% of the actual use cases that
> make sense.
> 
> The SAN usecase I can sort of see, but again I don't feel like it's a problem we
> need to solve with on-disk format.  Add a priority to sysfs so you can change it
> with udev or something on the fly.  Thanks,
> 
 
 Ok.
 Sysfs is fine however we need configuration to be persistent across reboots.
 Any idea?

Thanks, Anand

> Josef


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-12 10:00           ` Anand Jain
@ 2019-09-12 10:03             ` Josef Bacik
  2019-09-12 10:10               ` Anand Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2019-09-12 10:03 UTC (permalink / raw)
  To: Anand Jain; +Cc: Josef Bacik, Eli V, linux-btrfs

On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
> 
> 
> > On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> > 
> > On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
> >> 
> >> 
> >> Thanks for the comments. More below.
> >> 
> >> On 12/9/19 3:16 AM, Josef Bacik wrote:
> >>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
> >>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >>>>> 
> >>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
> >>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> >>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
> >>>>>> 
> >>>>>> This patch introduces a framework so that we can add policies to determine
> >>>>>> the %mirror_num. And also adds the devid as the readmirror policy.
> >>>>>> 
> >>>>>> The new property is stored as an item in the device tree as show below.
> >>>>>>     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
> >>>>>> 
> >>>>>> To be able to set and get this new property also introduces new ioctls
> >>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
> >>>>>> is defined as
> >>>>>>         struct btrfs_ioctl_readmirror_args {
> >>>>>>                 __u64 type; /* RW */
> >>>>>>                 __u64 device_bitmap; /* RW */
> >>>>>>         }
> >>>>>> 
> >>>>>> An usage example as follows:
> >>>>>>         btrfs property set /btrfs readmirror devid:1,3
> >>>>>>         btrfs property get /btrfs readmirror
> >>>>>>           readmirror devid:1 3
> >>>>>>         btrfs property set /btrfs readmirror ""
> >>>>>>         btrfs property get /btrfs readmirror
> >>>>>>           readmirror default
> >>>>>> 
> >>>>>> This patchset has been tested completely, however marked as RFC for the
> >>>>>> following reasons and comments on them (or any other) are appreciated as
> >>>>>> usual.
> >>>>>>  . The new objectid is defined as
> >>>>>>       #define BTRFS_READMIRROR_OBJECTID -1ULL
> >>>>>>    Need consent we are fine to use this value, and with this value it
> >>>>>>    shall be placed just before the DEV_STATS_OBJECTID item which is more
> >>>>>>    frequently used only during the device errors.
> >>>>>> 
> >>>>>> .  I am using a u64 bitmap to represent the devices id, so the max device
> >>>>>>    id that we could represent is 63, its a kind of limitation which should
> >>>>>>    be addressed before integration, I wonder if there is any suggestion?
> >>>>>>    Kindly note that, multiple ioctls with each time representing a set of
> >>>>>>    device(s) is not a choice because we need to make sure the readmirror
> >>>>>>    changes happens in a commit transaction.
> >>>>>> 
> >>>>>> v1->RFC v2:
> >>>>>>   . Property is stored as a dev-tree item instead of root inode extended
> >>>>>>     attribute.
> >>>>>>   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
> >>>>>>   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
> >>>>>> 
> >>>>>> RFC->v1:
> >>>>>>   Drops pid as one of the readmirror policy choices and as usual remains
> >>>>>>   as default. And when the devid is reset the readmirror policy falls back
> >>>>>>   to pid.
> >>>>>>   Drops the mount -o readmirror idea, it can be added at a later point of
> >>>>>>   time.
> >>>>>>   Property now accepts more than 1 devid as readmirror device. As shown
> >>>>>>   in the example above.
> >>>>>> 
> >>>>> 
> >>>>> This is a lot of infrastructure
> >> 
> >>  Ok. Any idea on a better implementation?
> >>  How about extended attribute approach? v1 patches proposed
> >>  it, but it abused the extended attribute as commented here [1]
> >>  and v2 got changed to an item-key.
> >> 
> >> [1]
> >> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/
> >> 
> > 
> > That's a NAK on the prop interface.  This is a fs wide policy, not a
> > directory/inode policy.
> > 
> >> 
> >>>>> to just change which mirror we read to based on
> >>>>> some arbitrary user policy.  I assume this is to solve the case where you have
> >>>>> slow and fast disks, so you can always read from the fast disk?  And then it's
> >>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
> >>>>> SSD and a normal disk?  I'm not seeing a point to this much code for one
> >>>>> particular obscure setup.  Thanks,
> >>>>> 
> >>>>> Josef
> >>>> 
> >>>> Not commenting on the code itself, but as a user I see this SSD RAID1
> >>>> acceleration as a future much have feature. It's only obscure at the
> >>>> moment because we don't have code to take advantage of it. But on
> >>>> large btrfs filesystems with hundreds of GB of metadata, like I have
> >>>> for backups, the usability of the filesystem is dramatically improved
> >>>> having the metadata on an SSD( though currently only half of the time
> >>>> due to the even/odd pid distribution.)
> >>> 
> >>> But that's different from a mirror.  100% it would be nice to say "put my
> >>> metadata on the ssd, data elsewhere".  That's not what this patch is about, this
> >>> patch is specifically about changing which drive we choose in a mirrored setup,
> >>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going
> >>> to be slow no matter what.  Sure we could make it so reads always go to the SSD,
> >>> but we can accomplish that by just adding a check for nonrotational in the code,
> >>> and then we don't have to encode all this nonsense in the file system.  Thanks,
> >> 
> >> I wrote about the readmirror policy framework here[2],
> >> I forgot to link it here, sorry about that, my mistake.
> >> 
> >> [2]
> >> 
> >> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/
> >> 
> >> Readmirror policy is for raid1, raid10 and future N way mirror.
> >> Yes for now its only for raid1.
> >> 
> >> Here the idea is to create a framework so that readmirror policy
> >> can be configured as needed. And nonrotational can be one such policy.
> >> 
> >> The example of hard-coded nonrotational policy does not work in case
> >> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
> >> device, as all these are still nonrotational devices. So hard-coded
> >> policy is not a good idea. If we have to hardcode then there is Q-depth
> >> based readmirror routing is better (patch in the ML), but that is
> >> not good enough, because some configs wants it based on the disk-LBA
> >> so that SAN storage target cache is balanced and not duplicated.
> >> So in short it must be a configurable policy.
> >> 
> > 
> > Again, if you are mixing disk types you likely always want non-rotational, but
> > still mixing different speed devices in a mirror setup is just asking for weird
> > latency problems.  I don't think solving this use case is necessary.  If you mix
> > ssd + network device in a serious production setup then you probably should be
> > fired cause you don't know what you are doing.  Having the generic
> > "nonrotational gets priority" is going to cover 99% of the actual use cases that
> > make sense.
> > 
> > The SAN usecase I can sort of see, but again I don't feel like it's a problem we
> > need to solve with on-disk format.  Add a priority to sysfs so you can change it
> > with udev or something on the fly.  Thanks,
> > 
>  
>  Ok.
>  Sysfs is fine however we need configuration to be persistent across reboots.
>  Any idea?
>

Udev rules.  Thanks,

Josef 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-12 10:03             ` Josef Bacik
@ 2019-09-12 10:10               ` Anand Jain
  2019-09-12 10:13                 ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Anand Jain @ 2019-09-12 10:10 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Anand Jain, Eli V, linux-btrfs



> On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> 
> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
>> 
>> 
>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>> 
>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>>>> 
>>>> 
>>>> Thanks for the comments. More below.
>>>> 
>>>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>>>>>> 
>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>>>> 
>>>>>>>> This patch introduces a framework so that we can add policies to determine
>>>>>>>> the %mirror_num. And also adds the devid as the readmirror policy.
>>>>>>>> 
>>>>>>>> The new property is stored as an item in the device tree as show below.
>>>>>>>>    (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
>>>>>>>> 
>>>>>>>> To be able to set and get this new property also introduces new ioctls
>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
>>>>>>>> is defined as
>>>>>>>>        struct btrfs_ioctl_readmirror_args {
>>>>>>>>                __u64 type; /* RW */
>>>>>>>>                __u64 device_bitmap; /* RW */
>>>>>>>>        }
>>>>>>>> 
>>>>>>>> An usage example as follows:
>>>>>>>>        btrfs property set /btrfs readmirror devid:1,3
>>>>>>>>        btrfs property get /btrfs readmirror
>>>>>>>>          readmirror devid:1 3
>>>>>>>>        btrfs property set /btrfs readmirror ""
>>>>>>>>        btrfs property get /btrfs readmirror
>>>>>>>>          readmirror default
>>>>>>>> 
>>>>>>>> This patchset has been tested completely, however marked as RFC for the
>>>>>>>> following reasons and comments on them (or any other) are appreciated as
>>>>>>>> usual.
>>>>>>>> . The new objectid is defined as
>>>>>>>>      #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>>>   Need consent we are fine to use this value, and with this value it
>>>>>>>>   shall be placed just before the DEV_STATS_OBJECTID item which is more
>>>>>>>>   frequently used only during the device errors.
>>>>>>>> 
>>>>>>>> .  I am using a u64 bitmap to represent the devices id, so the max device
>>>>>>>>   id that we could represent is 63, its a kind of limitation which should
>>>>>>>>   be addressed before integration, I wonder if there is any suggestion?
>>>>>>>>   Kindly note that, multiple ioctls with each time representing a set of
>>>>>>>>   device(s) is not a choice because we need to make sure the readmirror
>>>>>>>>   changes happens in a commit transaction.
>>>>>>>> 
>>>>>>>> v1->RFC v2:
>>>>>>>>  . Property is stored as a dev-tree item instead of root inode extended
>>>>>>>>    attribute.
>>>>>>>>  . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>>>  . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>>>> 
>>>>>>>> RFC->v1:
>>>>>>>>  Drops pid as one of the readmirror policy choices and as usual remains
>>>>>>>>  as default. And when the devid is reset the readmirror policy falls back
>>>>>>>>  to pid.
>>>>>>>>  Drops the mount -o readmirror idea, it can be added at a later point of
>>>>>>>>  time.
>>>>>>>>  Property now accepts more than 1 devid as readmirror device. As shown
>>>>>>>>  in the example above.
>>>>>>>> 
>>>>>>> 
>>>>>>> This is a lot of infrastructure
>>>> 
>>>> Ok. Any idea on a better implementation?
>>>> How about extended attribute approach? v1 patches proposed
>>>> it, but it abused the extended attribute as commented here [1]
>>>> and v2 got changed to an item-key.
>>>> 
>>>> [1]
>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/
>>>> 
>>> 
>>> That's a NAK on the prop interface.  This is a fs wide policy, not a
>>> directory/inode policy.
>>> 
>>>> 
>>>>>>> to just change which mirror we read to based on
>>>>>>> some arbitrary user policy.  I assume this is to solve the case where you have
>>>>>>> slow and fast disks, so you can always read from the fast disk?  And then it's
>>>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
>>>>>>> SSD and a normal disk?  I'm not seeing a point to this much code for one
>>>>>>> particular obscure setup.  Thanks,
>>>>>>> 
>>>>>>> Josef
>>>>>> 
>>>>>> Not commenting on the code itself, but as a user I see this SSD RAID1
>>>>>> acceleration as a future much have feature. It's only obscure at the
>>>>>> moment because we don't have code to take advantage of it. But on
>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I have
>>>>>> for backups, the usability of the filesystem is dramatically improved
>>>>>> having the metadata on an SSD( though currently only half of the time
>>>>>> due to the even/odd pid distribution.)
>>>>> 
>>>>> But that's different from a mirror.  100% it would be nice to say "put my
>>>>> metadata on the ssd, data elsewhere".  That's not what this patch is about, this
>>>>> patch is specifically about changing which drive we choose in a mirrored setup,
>>>>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going
>>>>> to be slow no matter what.  Sure we could make it so reads always go to the SSD,
>>>>> but we can accomplish that by just adding a check for nonrotational in the code,
>>>>> and then we don't have to encode all this nonsense in the file system.  Thanks,
>>>> 
>>>> I wrote about the readmirror policy framework here[2],
>>>> I forgot to link it here, sorry about that, my mistake.
>>>> 
>>>> [2]
>>>> 
>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/
>>>> 
>>>> Readmirror policy is for raid1, raid10 and future N way mirror.
>>>> Yes for now its only for raid1.
>>>> 
>>>> Here the idea is to create a framework so that readmirror policy
>>>> can be configured as needed. And nonrotational can be one such policy.
>>>> 
>>>> The example of hard-coded nonrotational policy does not work in case
>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
>>>> device, as all these are still nonrotational devices. So hard-coded
>>>> policy is not a good idea. If we have to hardcode then there is Q-depth
>>>> based readmirror routing is better (patch in the ML), but that is
>>>> not good enough, because some configs wants it based on the disk-LBA
>>>> so that SAN storage target cache is balanced and not duplicated.
>>>> So in short it must be a configurable policy.
>>>> 
>>> 
>>> Again, if you are mixing disk types you likely always want non-rotational, but
>>> still mixing different speed devices in a mirror setup is just asking for weird
>>> latency problems.  I don't think solving this use case is necessary.  If you mix
>>> ssd + network device in a serious production setup then you probably should be
>>> fired cause you don't know what you are doing.  Having the generic
>>> "nonrotational gets priority" is going to cover 99% of the actual use cases that
>>> make sense.
>>> 
>>> The SAN usecase I can sort of see, but again I don't feel like it's a problem we
>>> need to solve with on-disk format.  Add a priority to sysfs so you can change it
>>> with udev or something on the fly.  Thanks,
>>> 
>> 
>> Ok.
>> Sysfs is fine however we need configuration to be persistent across reboots.
>> Any idea?
>> 
> 
> Udev rules.  Thanks,
> 

 Josef, configs moving along with the luns in san seems to be more
 easy to manage, otherwise when the host fails, each potential new
 server has to be pre-configured with the udev rules. 

Thanks, Anand

> Josef 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-12 10:10               ` Anand Jain
@ 2019-09-12 10:13                 ` Josef Bacik
  2019-09-16  8:19                   ` Anand Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2019-09-12 10:13 UTC (permalink / raw)
  To: Anand Jain; +Cc: Josef Bacik, Eli V, linux-btrfs

On Thu, Sep 12, 2019 at 06:10:08PM +0800, Anand Jain wrote:
> 
> 
> > On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> > 
> > On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
> >> 
> >> 
> >>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> >>> 
> >>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
> >>>> 
> >>>> 
> >>>> Thanks for the comments. More below.
> >>>> 
> >>>> On 12/9/19 3:16 AM, Josef Bacik wrote:
> >>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
> >>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >>>>>>> 
> >>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
> >>>>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> >>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
> >>>>>>>> 
> >>>>>>>> This patch introduces a framework so that we can add policies to determine
> >>>>>>>> the %mirror_num. And also adds the devid as the readmirror policy.
> >>>>>>>> 
> >>>>>>>> The new property is stored as an item in the device tree as show below.
> >>>>>>>>    (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
> >>>>>>>> 
> >>>>>>>> To be able to set and get this new property also introduces new ioctls
> >>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
> >>>>>>>> is defined as
> >>>>>>>>        struct btrfs_ioctl_readmirror_args {
> >>>>>>>>                __u64 type; /* RW */
> >>>>>>>>                __u64 device_bitmap; /* RW */
> >>>>>>>>        }
> >>>>>>>> 
> >>>>>>>> An usage example as follows:
> >>>>>>>>        btrfs property set /btrfs readmirror devid:1,3
> >>>>>>>>        btrfs property get /btrfs readmirror
> >>>>>>>>          readmirror devid:1 3
> >>>>>>>>        btrfs property set /btrfs readmirror ""
> >>>>>>>>        btrfs property get /btrfs readmirror
> >>>>>>>>          readmirror default
> >>>>>>>> 
> >>>>>>>> This patchset has been tested completely, however marked as RFC for the
> >>>>>>>> following reasons and comments on them (or any other) are appreciated as
> >>>>>>>> usual.
> >>>>>>>> . The new objectid is defined as
> >>>>>>>>      #define BTRFS_READMIRROR_OBJECTID -1ULL
> >>>>>>>>   Need consent we are fine to use this value, and with this value it
> >>>>>>>>   shall be placed just before the DEV_STATS_OBJECTID item which is more
> >>>>>>>>   frequently used only during the device errors.
> >>>>>>>> 
> >>>>>>>> .  I am using a u64 bitmap to represent the devices id, so the max device
> >>>>>>>>   id that we could represent is 63, its a kind of limitation which should
> >>>>>>>>   be addressed before integration, I wonder if there is any suggestion?
> >>>>>>>>   Kindly note that, multiple ioctls with each time representing a set of
> >>>>>>>>   device(s) is not a choice because we need to make sure the readmirror
> >>>>>>>>   changes happens in a commit transaction.
> >>>>>>>> 
> >>>>>>>> v1->RFC v2:
> >>>>>>>>  . Property is stored as a dev-tree item instead of root inode extended
> >>>>>>>>    attribute.
> >>>>>>>>  . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
> >>>>>>>>  . Changed format specifier from devid1,2,3.. to devid:1,2,3..
> >>>>>>>> 
> >>>>>>>> RFC->v1:
> >>>>>>>>  Drops pid as one of the readmirror policy choices and as usual remains
> >>>>>>>>  as default. And when the devid is reset the readmirror policy falls back
> >>>>>>>>  to pid.
> >>>>>>>>  Drops the mount -o readmirror idea, it can be added at a later point of
> >>>>>>>>  time.
> >>>>>>>>  Property now accepts more than 1 devid as readmirror device. As shown
> >>>>>>>>  in the example above.
> >>>>>>>> 
> >>>>>>> 
> >>>>>>> This is a lot of infrastructure
> >>>> 
> >>>> Ok. Any idea on a better implementation?
> >>>> How about extended attribute approach? v1 patches proposed
> >>>> it, but it abused the extended attribute as commented here [1]
> >>>> and v2 got changed to an item-key.
> >>>> 
> >>>> [1]
> >>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/
> >>>> 
> >>> 
> >>> That's a NAK on the prop interface.  This is a fs wide policy, not a
> >>> directory/inode policy.
> >>> 
> >>>> 
> >>>>>>> to just change which mirror we read to based on
> >>>>>>> some arbitrary user policy.  I assume this is to solve the case where you have
> >>>>>>> slow and fast disks, so you can always read from the fast disk?  And then it's
> >>>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
> >>>>>>> SSD and a normal disk?  I'm not seeing a point to this much code for one
> >>>>>>> particular obscure setup.  Thanks,
> >>>>>>> 
> >>>>>>> Josef
> >>>>>> 
> >>>>>> Not commenting on the code itself, but as a user I see this SSD RAID1
> >>>>>> acceleration as a future much have feature. It's only obscure at the
> >>>>>> moment because we don't have code to take advantage of it. But on
> >>>>>> large btrfs filesystems with hundreds of GB of metadata, like I have
> >>>>>> for backups, the usability of the filesystem is dramatically improved
> >>>>>> having the metadata on an SSD( though currently only half of the time
> >>>>>> due to the even/odd pid distribution.)
> >>>>> 
> >>>>> But that's different from a mirror.  100% it would be nice to say "put my
> >>>>> metadata on the ssd, data elsewhere".  That's not what this patch is about, this
> >>>>> patch is specifically about changing which drive we choose in a mirrored setup,
> >>>>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going
> >>>>> to be slow no matter what.  Sure we could make it so reads always go to the SSD,
> >>>>> but we can accomplish that by just adding a check for nonrotational in the code,
> >>>>> and then we don't have to encode all this nonsense in the file system.  Thanks,
> >>>> 
> >>>> I wrote about the readmirror policy framework here[2],
> >>>> I forgot to link it here, sorry about that, my mistake.
> >>>> 
> >>>> [2]
> >>>> 
> >>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/
> >>>> 
> >>>> Readmirror policy is for raid1, raid10 and future N way mirror.
> >>>> Yes for now its only for raid1.
> >>>> 
> >>>> Here the idea is to create a framework so that readmirror policy
> >>>> can be configured as needed. And nonrotational can be one such policy.
> >>>> 
> >>>> The example of hard-coded nonrotational policy does not work in case
> >>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
> >>>> device, as all these are still nonrotational devices. So hard-coded
> >>>> policy is not a good idea. If we have to hardcode then there is Q-depth
> >>>> based readmirror routing is better (patch in the ML), but that is
> >>>> not good enough, because some configs wants it based on the disk-LBA
> >>>> so that SAN storage target cache is balanced and not duplicated.
> >>>> So in short it must be a configurable policy.
> >>>> 
> >>> 
> >>> Again, if you are mixing disk types you likely always want non-rotational, but
> >>> still mixing different speed devices in a mirror setup is just asking for weird
> >>> latency problems.  I don't think solving this use case is necessary.  If you mix
> >>> ssd + network device in a serious production setup then you probably should be
> >>> fired cause you don't know what you are doing.  Having the generic
> >>> "nonrotational gets priority" is going to cover 99% of the actual use cases that
> >>> make sense.
> >>> 
> >>> The SAN usecase I can sort of see, but again I don't feel like it's a problem we
> >>> need to solve with on-disk format.  Add a priority to sysfs so you can change it
> >>> with udev or something on the fly.  Thanks,
> >>> 
> >> 
> >> Ok.
> >> Sysfs is fine however we need configuration to be persistent across reboots.
> >> Any idea?
> >> 
> > 
> > Udev rules.  Thanks,
> > 
> 
>  Josef, configs moving along with the luns in san seems to be more
>  easy to manage, otherwise when the host fails, each potential new
>  server has to be pre-configured with the udev rules. 
> 

It's 2019, if people haven't figured out how to do persistent configuration by
now then all hope is lost.  Facebook persistently configures millions of
machines, I'm sure people can figure out how to make sure a udev rule ends up on
the right host connected to a SAN that doesn't move.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-12 10:13                 ` Josef Bacik
@ 2019-09-16  8:19                   ` Anand Jain
  2019-09-24 14:27                     ` Anand Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Anand Jain @ 2019-09-16  8:19 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Eli V, linux-btrfs

On 12/9/19 6:13 PM, Josef Bacik wrote:
> On Thu, Sep 12, 2019 at 06:10:08PM +0800, Anand Jain wrote:
>>
>>
>>> On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>>
>>> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
>>>>
>>>>
>>>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>>>>
>>>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>>>>>>
>>>>>>
>>>>>> Thanks for the comments. More below.
>>>>>>
>>>>>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>>>>>>>>
>>>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>>>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>>>>>>
>>>>>>>>>> This patch introduces a framework so that we can add policies to determine
>>>>>>>>>> the %mirror_num. And also adds the devid as the readmirror policy.
>>>>>>>>>>
>>>>>>>>>> The new property is stored as an item in the device tree as show below.
>>>>>>>>>>     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
>>>>>>>>>>
>>>>>>>>>> To be able to set and get this new property also introduces new ioctls
>>>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
>>>>>>>>>> is defined as
>>>>>>>>>>         struct btrfs_ioctl_readmirror_args {
>>>>>>>>>>                 __u64 type; /* RW */
>>>>>>>>>>                 __u64 device_bitmap; /* RW */
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>> An usage example as follows:
>>>>>>>>>>         btrfs property set /btrfs readmirror devid:1,3
>>>>>>>>>>         btrfs property get /btrfs readmirror
>>>>>>>>>>           readmirror devid:1 3
>>>>>>>>>>         btrfs property set /btrfs readmirror ""
>>>>>>>>>>         btrfs property get /btrfs readmirror
>>>>>>>>>>           readmirror default
>>>>>>>>>>
>>>>>>>>>> This patchset has been tested completely, however marked as RFC for the
>>>>>>>>>> following reasons and comments on them (or any other) are appreciated as
>>>>>>>>>> usual.
>>>>>>>>>> . The new objectid is defined as
>>>>>>>>>>       #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>>>>>    Need consent we are fine to use this value, and with this value it
>>>>>>>>>>    shall be placed just before the DEV_STATS_OBJECTID item which is more
>>>>>>>>>>    frequently used only during the device errors.
>>>>>>>>>>
>>>>>>>>>> .  I am using a u64 bitmap to represent the devices id, so the max device
>>>>>>>>>>    id that we could represent is 63, its a kind of limitation which should
>>>>>>>>>>    be addressed before integration, I wonder if there is any suggestion?
>>>>>>>>>>    Kindly note that, multiple ioctls with each time representing a set of
>>>>>>>>>>    device(s) is not a choice because we need to make sure the readmirror
>>>>>>>>>>    changes happens in a commit transaction.
>>>>>>>>>>
>>>>>>>>>> v1->RFC v2:
>>>>>>>>>>   . Property is stored as a dev-tree item instead of root inode extended
>>>>>>>>>>     attribute.
>>>>>>>>>>   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>>>>>   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>>>>>>
>>>>>>>>>> RFC->v1:
>>>>>>>>>>   Drops pid as one of the readmirror policy choices and as usual remains
>>>>>>>>>>   as default. And when the devid is reset the readmirror policy falls back
>>>>>>>>>>   to pid.
>>>>>>>>>>   Drops the mount -o readmirror idea, it can be added at a later point of
>>>>>>>>>>   time.
>>>>>>>>>>   Property now accepts more than 1 devid as readmirror device. As shown
>>>>>>>>>>   in the example above.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is a lot of infrastructure
>>>>>>
>>>>>> Ok. Any idea on a better implementation?
>>>>>> How about extended attribute approach? v1 patches proposed
>>>>>> it, but it abused the extended attribute as commented here [1]
>>>>>> and v2 got changed to an item-key.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/
>>>>>>
>>>>>
>>>>> That's a NAK on the prop interface.  This is a fs wide policy, not a
>>>>> directory/inode policy.
>>>>>
>>>>>>
>>>>>>>>> to just change which mirror we read to based on
>>>>>>>>> some arbitrary user policy.  I assume this is to solve the case where you have
>>>>>>>>> slow and fast disks, so you can always read from the fast disk?  And then it's
>>>>>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup with a
>>>>>>>>> SSD and a normal disk?  I'm not seeing a point to this much code for one
>>>>>>>>> particular obscure setup.  Thanks,
>>>>>>>>>
>>>>>>>>> Josef
>>>>>>>>
>>>>>>>> Not commenting on the code itself, but as a user I see this SSD RAID1
>>>>>>>> acceleration as a future much have feature. It's only obscure at the
>>>>>>>> moment because we don't have code to take advantage of it. But on
>>>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I have
>>>>>>>> for backups, the usability of the filesystem is dramatically improved
>>>>>>>> having the metadata on an SSD( though currently only half of the time
>>>>>>>> due to the even/odd pid distribution.)
>>>>>>>
>>>>>>> But that's different from a mirror.  100% it would be nice to say "put my
>>>>>>> metadata on the ssd, data elsewhere".  That's not what this patch is about, this
>>>>>>> patch is specifically about changing which drive we choose in a mirrored setup,
>>>>>>> which is super unlikely to mirror a SSD with a slow drive, cause it's just going
>>>>>>> to be slow no matter what.  Sure we could make it so reads always go to the SSD,
>>>>>>> but we can accomplish that by just adding a check for nonrotational in the code,
>>>>>>> and then we don't have to encode all this nonsense in the file system.  Thanks,
>>>>>>
>>>>>> I wrote about the readmirror policy framework here[2],
>>>>>> I forgot to link it here, sorry about that, my mistake.
>>>>>>
>>>>>> [2]
>>>>>>
>>>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/
>>>>>>
>>>>>> Readmirror policy is for raid1, raid10 and future N way mirror.
>>>>>> Yes for now its only for raid1.
>>>>>>
>>>>>> Here the idea is to create a framework so that readmirror policy
>>>>>> can be configured as needed. And nonrotational can be one such policy.
>>>>>>
>>>>>> The example of hard-coded nonrotational policy does not work in case
>>>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
>>>>>> device, as all these are still nonrotational devices. So hard-coded
>>>>>> policy is not a good idea. If we have to hardcode then there is Q-depth
>>>>>> based readmirror routing is better (patch in the ML), but that is
>>>>>> not good enough, because some configs wants it based on the disk-LBA
>>>>>> so that SAN storage target cache is balanced and not duplicated.
>>>>>> So in short it must be a configurable policy.
>>>>>>
>>>>>
>>>>> Again, if you are mixing disk types you likely always want non-rotational, but
>>>>> still mixing different speed devices in a mirror setup is just asking for weird
>>>>> latency problems.  I don't think solving this use case is necessary.  If you mix
>>>>> ssd + network device in a serious production setup then you probably should be
>>>>> fired cause you don't know what you are doing.  Having the generic
>>>>> "nonrotational gets priority" is going to cover 99% of the actual use cases that
>>>>> make sense.
>>>>>
>>>>> The SAN usecase I can sort of see, but again I don't feel like it's a problem we
>>>>> need to solve with on-disk format.  Add a priority to sysfs so you can change it
>>>>> with udev or something on the fly.  Thanks,
>>>>>
>>>>
>>>> Ok.
>>>> Sysfs is fine however we need configuration to be persistent across reboots.
>>>> Any idea?
>>>>
>>>
>>> Udev rules.  Thanks,
>>>
>>
>>   Josef, configs moving along with the luns in san seems to be more
>>   easy to manage, otherwise when the host fails, each potential new
>>   server has to be pre-configured with the udev rules.
>>
> 
> It's 2019, if people haven't figured out how to do persistent configuration by
> now then all hope is lost.  Facebook persistently configures millions of
> machines, I'm sure people can figure out how to make sure a udev rule ends up on
> the right host connected to a SAN that doesn't move.  Thanks,
> 
> Josef
> 

yeah. sysfs interface should be fine for now, and based on the
usage feedback we could consider persistent storage for the readmirror.

Typical to WORM (Write Once Read Many times) setups they mainly need
good read performance as data is imported once from somewhere else.
Just out of curiosity ran read performance checks using the readmirror 
property on ssd and nvme device.

$ btrfs fi show
Label: none  uuid: e24bd8b5-a9e9-41d4-b28e-e95c0c545466
	Total devices 2 FS bytes used 391.64MiB
	devid    1 size 447.13GiB used 2.01GiB path /dev/sdf
	devid    2 size 5.82TiB used 2.01GiB path /dev/nvme0n1

$ cat /sys/block/sdf/queue/rotational
0
$ cat /sys/block/nvme0n1/queue/rotational
0

$ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:2 && time 
md5sum /btrfs/fOH2
6129ccf74f7a761e0c3e096e051ba7a2  /btrfs/fOH2

real	0m0.725s
user	0m0.604s
sys	0m0.113s

$ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:1 && time 
md5sum /btrfs/fOH2
6129ccf74f7a761e0c3e096e051ba7a2  /btrfs/fOH2

real	0m1.125s
user	0m0.643s
sys	0m0.087s

nvme provides ~40% improvement for 400mb read io, however there are no
other IOs. The question will be what kind of read io load balancer will
be required at the full throttle of the preferred readmirror
device IO Q depth [1]. Which means a heuristic is required to juggle
around the policies and will be another readmirror policy to be part
of the same framework when required.

[1]
$ cat /sys/block/nvme0n1/queue/nr_requests
1023
$ cat /sys/block/sdf/queue/nr_requests
64

Thanks, Anand

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v2 0/2] readmirror feature
  2019-09-16  8:19                   ` Anand Jain
@ 2019-09-24 14:27                     ` Anand Jain
  0 siblings, 0 replies; 17+ messages in thread
From: Anand Jain @ 2019-09-24 14:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Eli V, linux-btrfs

On 9/16/19 4:19 PM, Anand Jain wrote:
> On 12/9/19 6:13 PM, Josef Bacik wrote:
>> On Thu, Sep 12, 2019 at 06:10:08PM +0800, Anand Jain wrote:
>>>
>>>
>>>> On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>>>
>>>> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
>>>>>
>>>>>
>>>>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the comments. More below.
>>>>>>>
>>>>>>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>>>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik 
>>>>>>>>> <josef@toxicpanda.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>>>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() 
>>>>>>>>>>> uses
>>>>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>>>>>>>
>>>>>>>>>>> This patch introduces a framework so that we can add policies 
>>>>>>>>>>> to determine
>>>>>>>>>>> the %mirror_num. And also adds the devid as the readmirror 
>>>>>>>>>>> policy.
>>>>>>>>>>>
>>>>>>>>>>> The new property is stored as an item in the device tree as 
>>>>>>>>>>> show below.
>>>>>>>>>>>     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, 
>>>>>>>>>>> devid)
>>>>>>>>>>>
>>>>>>>>>>> To be able to set and get this new property also introduces 
>>>>>>>>>>> new ioctls
>>>>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The 
>>>>>>>>>>> ioctl argument
>>>>>>>>>>> is defined as
>>>>>>>>>>>         struct btrfs_ioctl_readmirror_args {
>>>>>>>>>>>                 __u64 type; /* RW */
>>>>>>>>>>>                 __u64 device_bitmap; /* RW */
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>> An usage example as follows:
>>>>>>>>>>>         btrfs property set /btrfs readmirror devid:1,3
>>>>>>>>>>>         btrfs property get /btrfs readmirror
>>>>>>>>>>>           readmirror devid:1 3
>>>>>>>>>>>         btrfs property set /btrfs readmirror ""
>>>>>>>>>>>         btrfs property get /btrfs readmirror
>>>>>>>>>>>           readmirror default
>>>>>>>>>>>
>>>>>>>>>>> This patchset has been tested completely, however marked as 
>>>>>>>>>>> RFC for the
>>>>>>>>>>> following reasons and comments on them (or any other) are 
>>>>>>>>>>> appreciated as
>>>>>>>>>>> usual.
>>>>>>>>>>> . The new objectid is defined as
>>>>>>>>>>>       #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>>>>>>    Need consent we are fine to use this value, and with this 
>>>>>>>>>>> value it
>>>>>>>>>>>    shall be placed just before the DEV_STATS_OBJECTID item 
>>>>>>>>>>> which is more
>>>>>>>>>>>    frequently used only during the device errors.
>>>>>>>>>>>
>>>>>>>>>>> .  I am using a u64 bitmap to represent the devices id, so 
>>>>>>>>>>> the max device
>>>>>>>>>>>    id that we could represent is 63, its a kind of limitation 
>>>>>>>>>>> which should
>>>>>>>>>>>    be addressed before integration, I wonder if there is any 
>>>>>>>>>>> suggestion?
>>>>>>>>>>>    Kindly note that, multiple ioctls with each time 
>>>>>>>>>>> representing a set of
>>>>>>>>>>>    device(s) is not a choice because we need to make sure the 
>>>>>>>>>>> readmirror
>>>>>>>>>>>    changes happens in a commit transaction.
>>>>>>>>>>>
>>>>>>>>>>> v1->RFC v2:
>>>>>>>>>>>   . Property is stored as a dev-tree item instead of root 
>>>>>>>>>>> inode extended
>>>>>>>>>>>     attribute.
>>>>>>>>>>>   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to 
>>>>>>>>>>> BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>>>>>>   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>>>>>>>
>>>>>>>>>>> RFC->v1:
>>>>>>>>>>>   Drops pid as one of the readmirror policy choices and as 
>>>>>>>>>>> usual remains
>>>>>>>>>>>   as default. And when the devid is reset the readmirror 
>>>>>>>>>>> policy falls back
>>>>>>>>>>>   to pid.
>>>>>>>>>>>   Drops the mount -o readmirror idea, it can be added at a 
>>>>>>>>>>> later point of
>>>>>>>>>>>   time.
>>>>>>>>>>>   Property now accepts more than 1 devid as readmirror 
>>>>>>>>>>> device. As shown
>>>>>>>>>>>   in the example above.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is a lot of infrastructure
>>>>>>>
>>>>>>> Ok. Any idea on a better implementation?
>>>>>>> How about extended attribute approach? v1 patches proposed
>>>>>>> it, but it abused the extended attribute as commented here [1]
>>>>>>> and v2 got changed to an item-key.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ 
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> That's a NAK on the prop interface.  This is a fs wide policy, not a
>>>>>> directory/inode policy.
>>>>>>
>>>>>>>
>>>>>>>>>> to just change which mirror we read to based on
>>>>>>>>>> some arbitrary user policy.  I assume this is to solve the 
>>>>>>>>>> case where you have
>>>>>>>>>> slow and fast disks, so you can always read from the fast 
>>>>>>>>>> disk?  And then it's
>>>>>>>>>> only used in RAID1, so the very narrow usecase of having a 
>>>>>>>>>> RAID1 setup with a
>>>>>>>>>> SSD and a normal disk?  I'm not seeing a point to this much 
>>>>>>>>>> code for one
>>>>>>>>>> particular obscure setup.  Thanks,
>>>>>>>>>>
>>>>>>>>>> Josef
>>>>>>>>>
>>>>>>>>> Not commenting on the code itself, but as a user I see this SSD 
>>>>>>>>> RAID1
>>>>>>>>> acceleration as a future much have feature. It's only obscure 
>>>>>>>>> at the
>>>>>>>>> moment because we don't have code to take advantage of it. But on
>>>>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I 
>>>>>>>>> have
>>>>>>>>> for backups, the usability of the filesystem is dramatically 
>>>>>>>>> improved
>>>>>>>>> having the metadata on an SSD( though currently only half of 
>>>>>>>>> the time
>>>>>>>>> due to the even/odd pid distribution.)
>>>>>>>>
>>>>>>>> But that's different from a mirror.  100% it would be nice to 
>>>>>>>> say "put my
>>>>>>>> metadata on the ssd, data elsewhere".  That's not what this 
>>>>>>>> patch is about, this
>>>>>>>> patch is specifically about changing which drive we choose in a 
>>>>>>>> mirrored setup,
>>>>>>>> which is super unlikely to mirror a SSD with a slow drive, cause 
>>>>>>>> it's just going
>>>>>>>> to be slow no matter what.  Sure we could make it so reads 
>>>>>>>> always go to the SSD,
>>>>>>>> but we can accomplish that by just adding a check for 
>>>>>>>> nonrotational in the code,
>>>>>>>> and then we don't have to encode all this nonsense in the file 
>>>>>>>> system.  Thanks,
>>>>>>>
>>>>>>> I wrote about the readmirror policy framework here[2],
>>>>>>> I forgot to link it here, sorry about that, my mistake.
>>>>>>>
>>>>>>> [2]
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ 
>>>>>>>
>>>>>>>
>>>>>>> Readmirror policy is for raid1, raid10 and future N way mirror.
>>>>>>> Yes for now its only for raid1.
>>>>>>>
>>>>>>> Here the idea is to create a framework so that readmirror policy
>>>>>>> can be configured as needed. And nonrotational can be one such 
>>>>>>> policy.
>>>>>>>
>>>>>>> The example of hard-coded nonrotational policy does not work in case
>>>>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME 
>>>>>>> block
>>>>>>> device, as all these are still nonrotational devices. So hard-coded
>>>>>>> policy is not a good idea. If we have to hardcode then there is 
>>>>>>> Q-depth
>>>>>>> based readmirror routing is better (patch in the ML), but that is
>>>>>>> not good enough, because some configs wants it based on the disk-LBA
>>>>>>> so that SAN storage target cache is balanced and not duplicated.
>>>>>>> So in short it must be a configurable policy.
>>>>>>>
>>>>>>
>>>>>> Again, if you are mixing disk types you likely always want 
>>>>>> non-rotational, but
>>>>>> still mixing different speed devices in a mirror setup is just 
>>>>>> asking for weird
>>>>>> latency problems.  I don't think solving this use case is 
>>>>>> necessary.  If you mix
>>>>>> ssd + network device in a serious production setup then you 
>>>>>> probably should be
>>>>>> fired cause you don't know what you are doing.  Having the generic
>>>>>> "nonrotational gets priority" is going to cover 99% of the actual 
>>>>>> use cases that
>>>>>> make sense.
>>>>>>
>>>>>> The SAN usecase I can sort of see, but again I don't feel like 
>>>>>> it's a problem we
>>>>>> need to solve with on-disk format.  Add a priority to sysfs so you 
>>>>>> can change it
>>>>>> with udev or something on the fly.  Thanks,
>>>>>>
>>>>>
>>>>> Ok.
>>>>> Sysfs is fine however we need configuration to be persistent across 
>>>>> reboots.
>>>>> Any idea?
>>>>>
>>>>
>>>> Udev rules.  Thanks,
>>>>
>>>
>>>   Josef, configs moving along with the luns in san seems to be more
>>>   easy to manage, otherwise when the host fails, each potential new
>>>   server has to be pre-configured with the udev rules.
>>>
>>
>> It's 2019, if people haven't figured out how to do persistent 
>> configuration by
>> now then all hope is lost.  Facebook persistently configures millions of
>> machines, I'm sure people can figure out how to make sure a udev rule 
>> ends up on
>> the right host connected to a SAN that doesn't move.  Thanks,
>>
>> Josef
>>
> 
> yeah. sysfs interface should be fine for now, and based on the
> usage feedback we could consider persistent storage for the readmirror.
> 
> Typical to WORM (Write Once Read Many times) setups they mainly need
> good read performance as data is imported once from somewhere else.
> Just out of curiosity ran read performance checks using the readmirror 
> property on ssd and nvme device.
> 
> $ btrfs fi show
> Label: none  uuid: e24bd8b5-a9e9-41d4-b28e-e95c0c545466
>      Total devices 2 FS bytes used 391.64MiB
>      devid    1 size 447.13GiB used 2.01GiB path /dev/sdf
>      devid    2 size 5.82TiB used 2.01GiB path /dev/nvme0n1
> 
> $ cat /sys/block/sdf/queue/rotational
> 0
> $ cat /sys/block/nvme0n1/queue/rotational
> 0
> 
> $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:2 && time 
> md5sum /btrfs/fOH2
> 6129ccf74f7a761e0c3e096e051ba7a2  /btrfs/fOH2
> 
> real    0m0.725s
> user    0m0.604s
> sys    0m0.113s
> 
> $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:1 && time 
> md5sum /btrfs/fOH2
> 6129ccf74f7a761e0c3e096e051ba7a2  /btrfs/fOH2
> 
> real    0m1.125s
> user    0m0.643s
> sys    0m0.087s
> 
> nvme provides ~40% improvement for 400mb read io, however there are no
> other IOs. The question will be what kind of read io load balancer will
> be required at the full throttle of the preferred readmirror
> device IO Q depth [1]. Which means a heuristic is required to juggle
> around the policies and will be another readmirror policy to be part
> of the same framework when required.
> 
> [1]
> $ cat /sys/block/nvme0n1/queue/nr_requests
> 1023
> $ cat /sys/block/sdf/queue/nr_requests
> 64
> 
> Thanks, Anand


Josef,

  Here I infer as follows .. I wonder if it makes sense.

  - Potential readmirror policies are: pid, devid/manual, lba/cache,
    Q-depth(patches are in the ml from facebook) each one of them has
    their own advantage so a framework is essential.

  - As the block device performance keeps evolving so its common to have
    heterogeneous block devices on the host. readmirror policy helps
    to configure for the best possible performance.

  - sysfs is a good interface for readmirror policy as in ioctl we can't
    pass all possible devices in the ioctl argument.

  - ondisk storage is essential for the readmirror, and key-item pair
    is the only way for its storage. (On the 2nd thought storage cannot
    be implemented at a later point as sysfs interface must be consistent
    in each version. So implementing together sysfs and ondisk key-item
    pair makes sense to me.)

Thanks, Anand

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2019-09-24 14:28 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-26  9:04 [PATCH RFC v2 0/2] readmirror feature Anand Jain
2019-08-26  9:04 ` [PATCH RFC v2] btrfs: add readmirror framework and policy devid Anand Jain
2019-08-26  9:04 ` [PATCH RFC v2] btrfs-progs: add readmirror property and ioctl to set get readmirror Anand Jain
2019-08-29  3:39   ` [PATCH RFC v2.1] btrfs-progs: add readmirror property and ioctl to set get Anand Jain
2019-09-11 16:37 ` [PATCH RFC v2 0/2] readmirror feature David Sterba
2019-09-12  7:48   ` Anand Jain
2019-09-11 18:42 ` Josef Bacik
2019-09-11 19:13   ` Eli V
2019-09-11 19:16     ` Josef Bacik
2019-09-12  7:41       ` Anand Jain
2019-09-12  9:50         ` Josef Bacik
2019-09-12 10:00           ` Anand Jain
2019-09-12 10:03             ` Josef Bacik
2019-09-12 10:10               ` Anand Jain
2019-09-12 10:13                 ` Josef Bacik
2019-09-16  8:19                   ` Anand Jain
2019-09-24 14:27                     ` Anand Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).