All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/5] readmirror feature
@ 2019-03-19 10:00 Anand Jain
  2019-03-19 10:00 ` [PATCH RFC 1/5] btrfs: add inode pointer to prop_handler::validate() Anand Jain
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Anand Jain @ 2019-03-19 10:00 UTC (permalink / raw)
  To: linux-btrfs

RFC patch as of now, appreciate your comments. This patch set has
been tested.

Function call chain  __btrfs_map_block()->find_live_mirror() uses
thread pid to determine the %mirror_num when the mirror_num=0.
    
The pid based mirror_num extrapolation has following disadvantages
 . A large single-process read IO will read only from one disk.
 . In a worst scenario all processes read accessing the FS could have
       either odd or even pid, the read IO gets skewed.
 . There is no deterministic way of knowing/controlling which copy will
       be used for reading.
 . May see performance variations for a given set of multi process
       workload ran at different times.
    
So we need other types of readmirror policies.
    
This patch introduces a framework so that we can add more policies, and
converts the existing %pid into as a configurable parameter using the
property. And also provides a transient readmirror mount option, so that
this property can be applied for the read io during mount and for
readonly FS.

 For example:
   btrfs property set <mnt> readmirror pid
   btrfs property set <mnt> readmirror ""
   btrfs property set <mnt> readmirror devid<n>

   mount -o readmirror=pid <mnt>
   mount -o readmirror=devid<n> <mnt>

Please note:
 The property storage is an extented attributes of root inode
 (BTRFS_FS_TREE_OBJECTID), the other choice is a new ondisk KEY,
 which is open for comments.

 This patch uses the unused btrfs_device::type, and does not use the
 bitwise ops because as type is u64 and bitwise ops need u32, so we
 need 32bits of the btrfs_device::type. Which is a kind of messy.
 Its open for suggestion for any better way.

Anand Jain (5):
  btrfs: add inode pointer to prop_handler::validate()
  btrfs: add readmirror pid property
  btrfs: add readmirror devid property
  btrfs: add readmirror pid mount option
  btrfs: add readmirror devid mount option

 fs/btrfs/props.c   | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/super.c   | 29 ++++++++++++++++
 fs/btrfs/volumes.c | 29 +++++++++++++++-
 fs/btrfs/volumes.h | 11 ++++++
 4 files changed, 163 insertions(+), 5 deletions(-)

Anand Jain (2):
  btrfs-progs: add helper to create xattr name
  btrfs-progs: add readmirror policy

 props.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 68 insertions(+), 7 deletions(-)
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH RFC 1/5] btrfs: add inode pointer to prop_handler::validate()
  2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
@ 2019-03-19 10:00 ` Anand Jain
  2019-03-19 10:00 ` [PATCH RFC 2/5] btrfs: add readmirror pid property Anand Jain
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2019-03-19 10:00 UTC (permalink / raw)
  To: linux-btrfs

In preparation to add the readmirror devid property, pass inode in the
prop_handler::validate() function vector, so that it can fetch
corresponding fs_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/props.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index 99fa2459ba61..1a13f10a6ef5 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -18,7 +18,7 @@
 struct prop_handler {
 	struct hlist_node node;
 	const char *xattr_name;
-	int (*validate)(const char *value, size_t len);
+	int (*validate)(struct inode *inode, const char *value, size_t len);
 	int (*apply)(struct inode *inode, const char *value, size_t len);
 	const char *(*extract)(struct inode *inode);
 	int inheritable;
@@ -89,7 +89,7 @@ int btrfs_set_prop(struct inode *inode, const char *name, const char *value,
 		goto out;
 	}
 
-	ret = handler->validate(value, value_len);
+	ret = handler->validate(inode, value, value_len);
 	if (ret)
 		return ret;
 
@@ -263,7 +263,8 @@ int btrfs_load_inode_props(struct inode *inode, struct btrfs_path *path)
 	return ret;
 }
 
-static int prop_compression_validate(const char *value, size_t len)
+static int prop_compression_validate(struct inode *inode, const char *value,
+				     size_t len)
 {
 	if (!value)
 		return 0;
@@ -361,7 +362,7 @@ static int inherit_props(struct btrfs_trans_handle *trans,
 			continue;
 
 		/* may be removed */
-		ret = h->validate(value, strlen(value));
+		ret = h->validate(inode, value, strlen(value));
 		if (ret)
 			continue;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 2/5] btrfs: add readmirror pid property
  2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
  2019-03-19 10:00 ` [PATCH RFC 1/5] btrfs: add inode pointer to prop_handler::validate() Anand Jain
@ 2019-03-19 10:00 ` Anand Jain
  2019-03-19 10:00 ` [PATCH RFC 3/5] btrfs: add readmirror devid property Anand Jain
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2019-03-19 10:00 UTC (permalink / raw)
  To: linux-btrfs

Function call chain  __btrfs_map_block()->find_live_mirror() uses
thread %pid to determine the %mirror_num for the read when the
mirror_num=0 in the argument.

This pid based mirror_num extrapolation has following disadvantages
 A single-process large read IO will read only from one disk.
 In a worst scenario all processes read accessing the FS could have
   either odd or even pid, the read IO gets skewed.
 There is no deterministic way of knowing/controlling which copy will
   be used for reading.
 May see performance variations for a given set of multi process
   workload ran at different times.

So we need other types of readmirror policies.

This patch introduces a framework so that we can add more policies, and
converts the existing %pid into as a configurable parameter using the
property.

 For example:
  btrfs property set /btrfs readmirror pid
  btrfs property set /btrfs readmirror ""

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/props.c   | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c | 11 ++++++++++-
 fs/btrfs/volumes.h |  7 +++++++
 3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index 1a13f10a6ef5..776cdf099f93 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -11,6 +11,7 @@
 #include "ctree.h"
 #include "xattr.h"
 #include "compression.h"
+#include "volumes.h"
 
 #define BTRFS_PROP_HANDLERS_HT_BITS 8
 static DEFINE_HASHTABLE(prop_handlers_ht, BTRFS_PROP_HANDLERS_HT_BITS);
@@ -326,6 +327,45 @@ static const char *prop_compression_extract(struct inode *inode)
 	return NULL;
 }
 
+static int prop_readmirror_validate(struct inode *inode, const char *value,
+				    size_t len)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+
+	if (root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
+		return -EINVAL;
+
+	if (!len)
+		return 0;
+
+	if (!strncmp("pid", value, 3))
+		return 0;
+
+	return -EINVAL;
+}
+
+static int prop_readmirror_apply(struct inode *inode, const char *value,
+				 size_t len)
+{
+	struct btrfs_fs_devices *fs_devices = btrfs_sb(inode->i_sb)->fs_devices;
+
+	if (!value)
+		fs_devices->readmirror_policy = BTRFS_READMIRROR_DEFAULT;
+	else if (!strncmp("pid", value, 3))
+		fs_devices->readmirror_policy = BTRFS_READMIRROR_PID;
+
+	return 0;
+}
+
+static const char *prop_readmirror_extract(struct inode *inode)
+{
+	/*
+	 * readmirror policy is applied for the whole FS, inheritance is not
+	 * applicable.
+	 */
+	return NULL;
+}
+
 static struct prop_handler prop_handlers[] = {
 	{
 		.xattr_name = XATTR_BTRFS_PREFIX "compression",
@@ -334,6 +374,13 @@ static const char *prop_compression_extract(struct inode *inode)
 		.extract = prop_compression_extract,
 		.inheritable = 1
 	},
+	{
+		.xattr_name = XATTR_BTRFS_PREFIX "readmirror",
+		.validate = prop_readmirror_validate,
+		.apply = prop_readmirror_apply,
+		.extract = prop_readmirror_extract,
+		.inheritable = 0
+	},
 };
 
 static int inherit_props(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9024eee889b9..e5072d46e181 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5562,7 +5562,16 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	else
 		num_stripes = map->num_stripes;
 
-	preferred_mirror = first + current->pid % num_stripes;
+	switch(fs_info->fs_devices->readmirror_policy) {
+	case BTRFS_READMIRROR_PID:
+		/* fall through */
+	case BTRFS_READMIRROR_DEFAULT:
+		/* fall through */
+	default:
+		/* readmirror as per thread pid */
+		preferred_mirror = first + current->pid % num_stripes;
+		break;
+	}
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 3ad9d58d1b66..27dce9242b55 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -208,6 +208,11 @@ struct btrfs_device {
 BTRFS_DEVICE_GETSET_FUNCS(disk_total_bytes);
 BTRFS_DEVICE_GETSET_FUNCS(bytes_used);
 
+enum btrfs_readmirror_policy {
+	BTRFS_READMIRROR_DEFAULT,
+	BTRFS_READMIRROR_PID,
+};
+
 struct btrfs_fs_devices {
 	u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
 	u8 metadata_uuid[BTRFS_FSID_SIZE];
@@ -254,6 +259,8 @@ struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *device_dir_kobj;
 	struct completion kobj_unregister;
+
+	int readmirror_policy;
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 3/5] btrfs: add readmirror devid property
  2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
  2019-03-19 10:00 ` [PATCH RFC 1/5] btrfs: add inode pointer to prop_handler::validate() Anand Jain
  2019-03-19 10:00 ` [PATCH RFC 2/5] btrfs: add readmirror pid property Anand Jain
@ 2019-03-19 10:00 ` Anand Jain
  2019-03-19 10:00 ` [PATCH RFC 4/5] btrfs: add readmirror pid mount option Anand Jain
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2019-03-19 10:00 UTC (permalink / raw)
  To: linux-btrfs

Introduces devid readmirror property, which directs all read IO to a
device.

For example:
  btrfs property set <mnt> readmirror devid<n>

As of now readmirror by devid supports only raid1s. Raid10 support has
to leverage device grouping feature to facilitate the setting of
readmirror by device set.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/props.c   | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/volumes.c | 16 ++++++++++++++++
 fs/btrfs/volumes.h |  3 +++
 3 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index 776cdf099f93..a17c5a69bcb3 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -338,8 +338,28 @@ static int prop_readmirror_validate(struct inode *inode, const char *value,
 	if (!len)
 		return 0;
 
-	if (!strncmp("pid", value, 3))
+	if (!strncmp("pid", value, 3)) {
 		return 0;
+	} else if (!strncmp("devid", value, 5)) {
+		u64 devid;
+		char *value_dup;
+
+		if (len <= 5)
+			return -EINVAL;
+
+		value_dup = kstrndup(value, len, GFP_KERNEL);
+		if (!value_dup)
+			return -ENOMEM;
+		if (kstrtoull(value_dup + 5, 10, &devid)) {
+			kfree(value_dup);
+			return -EINVAL;
+		}
+		kfree(value_dup);
+
+		if (btrfs_find_device(root->fs_info->fs_devices, devid,
+				      NULL, NULL, false))
+			return 0;
+	}
 
 	return -EINVAL;
 }
@@ -349,10 +369,33 @@ static int prop_readmirror_apply(struct inode *inode, const char *value,
 {
 	struct btrfs_fs_devices *fs_devices = btrfs_sb(inode->i_sb)->fs_devices;
 
-	if (!value)
+	if (!value) {
 		fs_devices->readmirror_policy = BTRFS_READMIRROR_DEFAULT;
-	else if (!strncmp("pid", value, 3))
+	} else if (!strncmp("pid", value, 3)) {
 		fs_devices->readmirror_policy = BTRFS_READMIRROR_PID;
+	} else if (!strncmp("devid", value, 5)) {
+		u64 devid;
+		char *value_dup;
+		struct btrfs_device *device;
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+
+		value_dup = kstrndup(value, len, GFP_KERNEL);
+		if (!value_dup)
+			return -ENOMEM;
+		if (kstrtoull(value_dup + 5, 10, &devid)) {
+			kfree(value_dup);
+			return -EINVAL;
+		}
+		kfree(value_dup);
+
+		fs_devices->readmirror_policy = BTRFS_READMIRROR_DEVID;
+		device = btrfs_find_device(root->fs_info->fs_devices, devid,
+					   NULL, NULL, false);
+		if (!device)
+			return -ENODEV;
+
+		device->type = BTRFS_DEVICE_TYPE_READ_OPTIMIZED;
+	}
 
 	return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e5072d46e181..d3b7427d6d96 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5553,6 +5553,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	int preferred_mirror;
 	int tolerance;
 	struct btrfs_device *srcdev;
+	bool found = false;
 
 	ASSERT((map->type &
 		 (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
@@ -5563,6 +5564,21 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 		num_stripes = map->num_stripes;
 
 	switch(fs_info->fs_devices->readmirror_policy) {
+	case BTRFS_READMIRROR_DEVID:
+		/* skip raid10 for now */
+		if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+			for (i = first; i < first + num_stripes; i++) {
+				if (map->stripes[i].dev->type ==
+				    BTRFS_DEVICE_TYPE_READ_OPTIMIZED) {
+					preferred_mirror = i;
+					found = true;
+					break;
+				}
+			}
+			if (found)
+				break;
+		}
+		/* fall through */
 	case BTRFS_READMIRROR_PID:
 		/* fall through */
 	case BTRFS_READMIRROR_DEFAULT:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 27dce9242b55..9c2e64548a11 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -208,9 +208,12 @@ struct btrfs_device {
 BTRFS_DEVICE_GETSET_FUNCS(disk_total_bytes);
 BTRFS_DEVICE_GETSET_FUNCS(bytes_used);
 
+#define BTRFS_DEVICE_TYPE_READ_OPTIMIZED  1ULL
+
 enum btrfs_readmirror_policy {
 	BTRFS_READMIRROR_DEFAULT,
 	BTRFS_READMIRROR_PID,
+	BTRFS_READMIRROR_DEVID,
 };
 
 struct btrfs_fs_devices {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 4/5] btrfs: add readmirror pid mount option
  2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
                   ` (2 preceding siblings ...)
  2019-03-19 10:00 ` [PATCH RFC 3/5] btrfs: add readmirror devid property Anand Jain
@ 2019-03-19 10:00 ` Anand Jain
  2019-03-19 10:00 ` [PATCH RFC 5/5] btrfs: add readmirror devid " Anand Jain
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2019-03-19 10:00 UTC (permalink / raw)
  To: linux-btrfs

This patch adds mount option -o readmirror, which provides a transient
readmirror feature, which can be used in a readonly mount as well.

For example..
  mount -o readmirror=pid

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/super.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index e01fc68af6c1..9d25a01bcecb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -343,6 +343,7 @@ enum {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	Opt_ref_verify,
 #endif
+	Opt_readmirror_policy,
 	Opt_err,
 };
 
@@ -413,6 +414,7 @@ enum {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	{Opt_ref_verify, "ref_verify"},
 #endif
+	{Opt_readmirror_policy, "readmirror=%s"},
 	{Opt_err, NULL},
 };
 
@@ -850,6 +852,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, REF_VERIFY);
 			break;
 #endif
+		case Opt_readmirror_policy:
+			if (strcmp(args[0].from, "pid") == 0) {
+				info->fs_devices->readmirror_policy =
+							BTRFS_READMIRROR_PID;
+				break;
+			}
+			ret = -EINVAL;
+			goto out;
 		case Opt_err:
 			btrfs_info(info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 5/5] btrfs: add readmirror devid mount option
  2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
                   ` (3 preceding siblings ...)
  2019-03-19 10:00 ` [PATCH RFC 4/5] btrfs: add readmirror pid mount option Anand Jain
@ 2019-03-19 10:00 ` Anand Jain
  2019-03-19 10:02 ` [PATCH RFC 1/2] btrfs-progs: add helper to create xattr name Anand Jain
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2019-03-19 10:00 UTC (permalink / raw)
  To: linux-btrfs

This patch provides the readmirror=devid<n> feature configurable through
the mount option which is transient and can be applied during readonly
mount as well.

For example:
mount -o readmirror=devid<n>

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/super.c   | 19 +++++++++++++++++++
 fs/btrfs/volumes.c |  2 ++
 fs/btrfs/volumes.h |  1 +
 3 files changed, 22 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9d25a01bcecb..5aa1a4d14cd7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -857,6 +857,25 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				info->fs_devices->readmirror_policy =
 							BTRFS_READMIRROR_PID;
 				break;
+			} else if (strncmp(args[0].from, "devid", 5) == 0) {
+				u64 devid;
+				struct btrfs_device *device;
+
+				if (kstrtoull(args[0].from + 5, 10, &devid)) {
+					ret = -EINVAL;
+					goto out;
+				}
+				info->fs_devices->readmirror_policy =
+							BTRFS_READMIRROR_DEVID;
+				device = btrfs_find_device(info->fs_devices,
+							   devid, NULL, NULL,
+							   false);
+				if (!device) {
+					ret = -ENODEV;
+					goto out;
+				}
+		                device->type_in_ram = BTRFS_DEVICE_TYPE_READ_OPTIMIZED;
+				break;
 			}
 			ret = -EINVAL;
 			goto out;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d3b7427d6d96..637617087c83 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5569,6 +5569,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 		if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
 			for (i = first; i < first + num_stripes; i++) {
 				if (map->stripes[i].dev->type ==
+				    BTRFS_DEVICE_TYPE_READ_OPTIMIZED ||
+				    map->stripes[i].dev->type_in_ram ==
 				    BTRFS_DEVICE_TYPE_READ_OPTIMIZED) {
 					preferred_mirror = i;
 					found = true;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 9c2e64548a11..3e534bcaf43e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -91,6 +91,7 @@ struct btrfs_device {
 	u32 io_width;
 	/* type and info about this device */
 	u64 type;
+	u64 type_in_ram;
 
 	/* minimal io size for this device */
 	u32 sector_size;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 1/2] btrfs-progs: add helper to create xattr name
  2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
                   ` (4 preceding siblings ...)
  2019-03-19 10:00 ` [PATCH RFC 5/5] btrfs: add readmirror devid " Anand Jain
@ 2019-03-19 10:02 ` Anand Jain
  2019-03-19 10:02   ` [PATCH RFC 2/2] btrfs-progs: add readmirror policy Anand Jain
  2019-03-21 16:26 ` [PATCH RFC 0/5] readmirror feature Steven Davies
  2019-04-09 15:48 ` David Sterba
  7 siblings, 1 reply; 13+ messages in thread
From: Anand Jain @ 2019-03-19 10:02 UTC (permalink / raw)
  To: linux-btrfs

This is a preparatory patch to add more xattr attributes, care a helper
function to alloc xattr name.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 props.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/props.c b/props.c
index efa11180d4c5..3a498bd9e904 100644
--- a/props.c
+++ b/props.c
@@ -94,6 +94,21 @@ static int prop_label(enum prop_object_type type,
 	return ret;
 }
 
+static char *alloc_xattr_name(const char *name)
+{
+	char *xattr_name;
+
+	xattr_name = malloc(XATTR_BTRFS_PREFIX_LEN + strlen(name) + 1);
+	if (!xattr_name)
+		ERR_PTR(-ENOMEM);
+
+	memcpy(xattr_name, XATTR_BTRFS_PREFIX, XATTR_BTRFS_PREFIX_LEN);
+	memcpy(xattr_name + XATTR_BTRFS_PREFIX_LEN, name, strlen(name));
+	xattr_name[XATTR_BTRFS_PREFIX_LEN + strlen(name)] = '\0';
+
+	return xattr_name;
+}
+
 static int prop_compression(enum prop_object_type type,
 			    const char *object,
 			    const char *name,
@@ -114,14 +129,11 @@ static int prop_compression(enum prop_object_type type,
 		goto out;
 	}
 
-	xattr_name = malloc(XATTR_BTRFS_PREFIX_LEN + strlen(name) + 1);
-	if (!xattr_name) {
-		ret = -ENOMEM;
-		goto out;
+	xattr_name = alloc_xattr_name(name);
+	if (IS_ERR(xattr_name)) {
+		error("failed to alloc xattr_name %s: %m", object);
+		return PTR_ERR(xattr_name);
 	}
-	memcpy(xattr_name, XATTR_BTRFS_PREFIX, XATTR_BTRFS_PREFIX_LEN);
-	memcpy(xattr_name + XATTR_BTRFS_PREFIX_LEN, name, strlen(name));
-	xattr_name[XATTR_BTRFS_PREFIX_LEN + strlen(name)] = '\0';
 
 	if (value) {
 		if (strcmp(value, "no") == 0 || strcmp(value, "none") == 0)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 2/2] btrfs-progs: add readmirror policy
  2019-03-19 10:02 ` [PATCH RFC 1/2] btrfs-progs: add helper to create xattr name Anand Jain
@ 2019-03-19 10:02   ` Anand Jain
  0 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2019-03-19 10:02 UTC (permalink / raw)
  To: linux-btrfs

This sets the readmirror=<parm> as a btrfs.<attr> extentded attribute.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 props.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/props.c b/props.c
index 3a498bd9e904..1d1a2c7f9d14 100644
--- a/props.c
+++ b/props.c
@@ -178,6 +178,53 @@ out:
 	return ret;
 }
 
+static int prop_readmirror(enum prop_object_type type, const char *object,
+			   const char *name, const char *value)
+{
+	int fd;
+	int ret;
+	char buf[256] = {0};
+	char *xattr_name;
+	DIR *dirstream = NULL;
+
+	fd = open_file_or_dir3(object, &dirstream, value ? O_RDWR : O_RDONLY);
+	if (fd < 0) {
+		ret = -errno;
+		error("failed to open %s: %m", object);
+		return ret;
+	}
+
+	xattr_name = alloc_xattr_name(name);
+	if (IS_ERR(xattr_name)) {
+		error("failed to alloc xattr_name %s: %m", object);
+		return PTR_ERR(xattr_name);
+	}
+
+	ret = 0;
+	if (value) {
+		if (fsetxattr(fd, xattr_name, value, strlen(value), 0) < 0) {
+			ret = -errno;
+			error("failed to set readmirror for %s: %m", object);
+		}
+	} else {
+		if (fgetxattr(fd, xattr_name, buf, 256) < 0) {
+			if (errno != ENOATTR) {
+				ret = -errno;
+				error("failed to get readmirror for %s: %m",
+				      object);
+			}
+		} else {
+			fprintf(stdout, "readmirror=%.*s\n", (int) strlen(buf),
+				buf);
+		}
+	}
+
+	free(xattr_name);
+	close_file_or_dir(fd, dirstream);
+
+	return ret;
+}
+
 const struct prop_handler prop_handlers[] = {
 	{"ro", "Set/get read-only flag of subvolume.", 0, prop_object_subvol,
 	 prop_read_only},
@@ -185,5 +232,7 @@ const struct prop_handler prop_handlers[] = {
 	 prop_object_dev | prop_object_root, prop_label},
 	{"compression", "Set/get compression for a file or directory", 0,
 	 prop_object_inode, prop_compression},
+	{"readmirror", "set/get readmirror policy for filesystem", 0,
+	 prop_object_root, prop_readmirror},
 	{NULL, NULL, 0, 0, NULL}
 };
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/5] readmirror feature
  2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
                   ` (5 preceding siblings ...)
  2019-03-19 10:02 ` [PATCH RFC 1/2] btrfs-progs: add helper to create xattr name Anand Jain
@ 2019-03-21 16:26 ` Steven Davies
  2019-03-21 18:26   ` waxhead
  2019-03-22  6:08   ` Anand Jain
  2019-04-09 15:48 ` David Sterba
  7 siblings, 2 replies; 13+ messages in thread
From: Steven Davies @ 2019-03-21 16:26 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On 2019-03-19 10:00, Anand Jain wrote:
> RFC patch as of now, appreciate your comments. This patch set has
> been tested.
> 
> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
> 
> The pid based mirror_num extrapolation has following disadvantages
>  . A large single-process read IO will read only from one disk.
>  . In a worst scenario all processes read accessing the FS could have
>        either odd or even pid, the read IO gets skewed.
>  . There is no deterministic way of knowing/controlling which copy will
>        be used for reading.
>  . May see performance variations for a given set of multi process
>        workload ran at different times.
> 
> So we need other types of readmirror policies.
> 
> This patch introduces a framework so that we can add more policies, and
> converts the existing %pid into as a configurable parameter using the
> property. And also provides a transient readmirror mount option, so 
> that
> this property can be applied for the read io during mount and for
> readonly FS.

Is it possible to set this property at mkfs time?

>  For example:
>    btrfs property set <mnt> readmirror pid
>    btrfs property set <mnt> readmirror ""
>    btrfs property set <mnt> readmirror devid<n>
> 
>    mount -o readmirror=pid <mnt>
>    mount -o readmirror=devid<n> <mnt>

This is an edge case but should we be allowed to set more than one 
device as a read mirror in a 3+ device array? In theory there could be 
two fast disks and one slow disk where all stripes are guaranteed to be 
on at least one fast disk.

I'll test these patches out when I have some spare time over the next 
few weeks. Do you have a tree I can pull / what are the patches based 
on?

Way beyond this patch series, considering a 3+ device raid1 array with 
mixed fast and slow disks perhaps there could also be a write preference 
for disks to fill up the fast disks first.

Steven Davies

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/5] readmirror feature
  2019-03-21 16:26 ` [PATCH RFC 0/5] readmirror feature Steven Davies
@ 2019-03-21 18:26   ` waxhead
  2019-03-22  6:08   ` Anand Jain
  1 sibling, 0 replies; 13+ messages in thread
From: waxhead @ 2019-03-21 18:26 UTC (permalink / raw)
  To: Steven Davies, Anand Jain; +Cc: linux-btrfs

Steven Davies wrote:
> On 2019-03-19 10:00, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> ....
>>
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into as a configurable parameter using the
>> property. And also provides a transient readmirror mount option, so that
>> this property can be applied for the read io during mount and for
>> readonly FS.
> 
> Is it possible to set this property at mkfs time?
> 
>>  For example:
>>    btrfs property set <mnt> readmirror pid
>>    btrfs property set <mnt> readmirror ""
>>    btrfs property set <mnt> readmirror devid<n>
>>
>>    mount -o readmirror=pid <mnt>
>>    mount -o readmirror=devid<n> <mnt>
> 
> This is an edge case but should we be allowed to set more than one 
> device as a read mirror in a 3+ device array? In theory there could be 
> two fast disks and one slow disk where all stripes are guaranteed to be 
> on at least one fast disk.
> 
> I'll test these patches out when I have some spare time over the next 
> few weeks. Do you have a tree I can pull / what are the patches based on?
> 
> Way beyond this patch series, considering a 3+ device raid1 array with 
> mixed fast and slow disks perhaps there could also be a write preference 
> for disks to fill up the fast disks first.
> 
> Steven Davies

This is more of less a feature request , but it feels right to mention 
this here...

If I remember correctly BTRFS does not currently scale to more than 32 
devices right now (due to some striping related stuff), but if BTRFS can 
have many more devices, and even now at >16 devices would it not make 
sense to be able to "tag" devices or make "device groups".

For example - when (if) subvolumes can have different "RAID" profiles it 
would perhaps make sense to ensure that certain subvolumes would prefer 
to use storage from a certain group of devices. In a mixed setup where 
you have both SSD's, HDD's and those new M2 things (or whatever) it 
could be nice to make a subvolume for /usr or /var and ask BTRFS to 
prefer to store that on a SSD's while a subvolume for /home could be 
preferred to be stored on HDD's.

Having device groups could also allow to define certain storage devices 
for "hot data" so that data that is read often could auto-migrate to the 
faster storage devices.. As I understand BTRFS it is "just" a matter of 
setting a flag pr. chunk so it would prefer to be allocated on device of 
type/group xyz...

In a N-way mirror setup you could would read mostly from SSD's while 
using HDD's for storing the mirror copy.

if I may suggest ... I think something along the lines of...

'btrfs device setgroup DEVID GROUPNAME'
'btrfs property set GROUPNAME readweight=100, writeweight=50'
'btrfs property set GROUPNAME readmirror=whatever_policy'
and
'btrfs subvolume setgroup GROUPNAME'

would to just fine...

Again , just a suggestion from a regular BTRFS user. Nothing more , but 
please consider something like this. The current readmirror idea sounds 
a tad limited as it does not account for subvolumes.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/5] readmirror feature
  2019-03-21 16:26 ` [PATCH RFC 0/5] readmirror feature Steven Davies
  2019-03-21 18:26   ` waxhead
@ 2019-03-22  6:08   ` Anand Jain
  1 sibling, 0 replies; 13+ messages in thread
From: Anand Jain @ 2019-03-22  6:08 UTC (permalink / raw)
  To: Steven Davies; +Cc: linux-btrfs



On 3/22/19 12:26 AM, Steven Davies wrote:
> On 2019-03-19 10:00, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>> thread pid to determine the %mirror_num when the mirror_num=0.
>>
>> The pid based mirror_num extrapolation has following disadvantages
>>  . A large single-process read IO will read only from one disk.
>>  . In a worst scenario all processes read accessing the FS could have
>>        either odd or even pid, the read IO gets skewed.
>>  . There is no deterministic way of knowing/controlling which copy will
>>        be used for reading.
>>  . May see performance variations for a given set of multi process
>>        workload ran at different times.
>>
>> So we need other types of readmirror policies.
>>
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into as a configurable parameter using the
>> property. And also provides a transient readmirror mount option, so that
>> this property can be applied for the read io during mount and for
>> readonly FS.
> 
> Is it possible to set this property at mkfs time?

  Possible.

>>  For example:
>>    btrfs property set <mnt> readmirror pid
>>    btrfs property set <mnt> readmirror ""
>>    btrfs property set <mnt> readmirror devid<n>
>>
>>    mount -o readmirror=pid <mnt>
>>    mount -o readmirror=devid<n> <mnt>
> 
> This is an edge case but should we be allowed to set more than one 
> device as a read mirror in a 3+ device array? In theory there could be 
> two fast disks and one slow disk where all stripes are guaranteed to be 
> on at least one fast disk.

  Right. We need to evolve the device grouping part to get this.

> I'll test these patches out when I have some spare time over the next 
> few weeks. Do you have a tree I can pull / what are the patches based on?

   Thanks for the tests. Sure will upload and send the link.

> Way beyond this patch series, considering a 3+ device raid1 array with 
> mixed fast and slow disks perhaps there could also be a write preference 
> for disks to fill up the fast disks first.

   Right.

Thanks
Anand

> Steven Davies

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/5] readmirror feature
  2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
                   ` (6 preceding siblings ...)
  2019-03-21 16:26 ` [PATCH RFC 0/5] readmirror feature Steven Davies
@ 2019-04-09 15:48 ` David Sterba
  2019-04-12  9:02   ` Anand Jain
  7 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2019-04-09 15:48 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
> RFC patch as of now, appreciate your comments. This patch set has
> been tested.
> 
> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
>     
> The pid based mirror_num extrapolation has following disadvantages
>  . A large single-process read IO will read only from one disk.

Please note that the pid refers to the metadata kernel threads that
submit the IO, so on average the IO is serviced by different threads
with random PID. In the end it gets distributed over multiple drives.
But it's nott ideal, no argument about that.

>  . In a worst scenario all processes read accessing the FS could have
>        either odd or even pid, the read IO gets skewed.
>  . There is no deterministic way of knowing/controlling which copy will
>        be used for reading.
>  . May see performance variations for a given set of multi process
>        workload ran at different times.
>     
> So we need other types of readmirror policies.
>     
> This patch introduces a framework so that we can add more policies, and
> converts the existing %pid into as a configurable parameter using the
> property. And also provides a transient readmirror mount option, so that
> this property can be applied for the read io during mount and for
> readonly FS.

I think the mirror selection by pid was a temporary solution (that
stuck, as it always happens). On average it does not work bad. Once we
have a better way then it can be dropped. So I'm not convinced it needs
to be one of the configuration options.

What exactly do you mean by 'transient'? I see that it could be needed
for mount or read-only, though I don't see the usefulness of that.

>  For example:
>    btrfs property set <mnt> readmirror pid
>    btrfs property set <mnt> readmirror ""
>    btrfs property set <mnt> readmirror devid<n>
> 
>    mount -o readmirror=pid <mnt>
>    mount -o readmirror=devid<n> <mnt>
> 
> Please note:
>  The property storage is an extented attributes of root inode
>  (BTRFS_FS_TREE_OBJECTID), the other choice is a new ondisk KEY,
>  which is open for comments.

A new key type is not going to be allocated for that, the extended
attributes for properties are there. Another question is where to attach
the property as this is a whole-filesystem property, we don't have an
example to follow so far.

>  This patch uses the unused btrfs_device::type, and does not use the
>  bitwise ops because as type is u64 and bitwise ops need u32, so we
>  need 32bits of the btrfs_device::type. Which is a kind of messy.
>  Its open for suggestion for any better way.

For a prototype and early testing, using the type is probably ok, but
I don't think it's right for permanent storage. Besides I'm sure that
the u64 will not suffice in the long run.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/5] readmirror feature
  2019-04-09 15:48 ` David Sterba
@ 2019-04-12  9:02   ` Anand Jain
  0 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2019-04-12  9:02 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs


Thanks for the comments.. more below.

On 9/4/19 11:48 PM, David Sterba wrote:
> On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
>> RFC patch as of now, appreciate your comments. This patch set has
>> been tested.
>>
>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>> thread pid to determine the %mirror_num when the mirror_num=0.
>>      
>> The pid based mirror_num extrapolation has following disadvantages
>>   . A large single-process read IO will read only from one disk.
> 
> Please note that the pid refers to the metadata kernel threads that
> submit the IO, so on average the IO is serviced by different threads
> with random PID. In the end it gets distributed over multiple drives.
> But it's nott ideal, no argument about that.

  ok.

>>   . In a worst scenario all processes read accessing the FS could have
>>         either odd or even pid, the read IO gets skewed.
>>   . There is no deterministic way of knowing/controlling which copy will
>>         be used for reading.
>>   . May see performance variations for a given set of multi process
>>         workload ran at different times.
>>      
>> So we need other types of readmirror policies.
>>      
>> This patch introduces a framework so that we can add more policies, and
>> converts the existing %pid into as a configurable parameter using the
>> property. And also provides a transient readmirror mount option, so that
>> this property can be applied for the read io during mount and for
>> readonly FS.
> 
> I think the mirror selection by pid was a temporary solution (that
> stuck, as it always happens). On average it does not work bad. Once we
> have a better way then it can be dropped. So I'm not convinced it needs
> to be one of the configuration options.

  Ok. Thanks for confirming, we could replace pid with something better,
  as of now we have devid-based and device-qdepth based, IMO qdepth based
  should be default as I commented in the respective patch.

> What exactly do you mean by 'transient'? I see that it could be needed
> for mount or read-only, though I don't see the usefulness of that.

  I mean to say does not persist across reboot. To override the
  original setting lets say device-qdepth based, but momentarily
  to avoid known read errors from a disk.

>>   For example:
>>     btrfs property set <mnt> readmirror pid
>>     btrfs property set <mnt> readmirror ""
>>     btrfs property set <mnt> readmirror devid<n>
>>
>>     mount -o readmirror=pid <mnt>
>>     mount -o readmirror=devid<n> <mnt>
>>
>> Please note:
>>   The property storage is an extented attributes of root inode
>>   (BTRFS_FS_TREE_OBJECTID), the other choice is a new ondisk KEY,
>>   which is open for comments.
> 
> A new key type is not going to be allocated for that, the extended
> attributes for properties are there. Another question is where to attach
> the property as this is a whole-filesystem property, we don't have an
> example to follow so far.

  Ok no key.
  Yep its a whole filesystem property. As of now it goes to the root
  inode as an extended attribute. I find it reasonable.

>>   This patch uses the unused btrfs_device::type, and does not use the
>>   bitwise ops because as type is u64 and bitwise ops need u32, so we
>>   need 32bits of the btrfs_device::type. Which is a kind of messy.
>>   Its open for suggestion for any better way.
> 
> For a prototype and early testing, using the type is probably ok, but
> I don't think it's right for permanent storage. Besides I'm sure that
> the u64 will not suffice in the long run.

Hmm.. I don't know. I thought
   <upper 32 bits> for disk groups
   <lower 32 bits> for disk types
should suffice. Not too sure what I am missing.

Thanks, Anand.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-04-12  9:02 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-19 10:00 [PATCH RFC 0/5] readmirror feature Anand Jain
2019-03-19 10:00 ` [PATCH RFC 1/5] btrfs: add inode pointer to prop_handler::validate() Anand Jain
2019-03-19 10:00 ` [PATCH RFC 2/5] btrfs: add readmirror pid property Anand Jain
2019-03-19 10:00 ` [PATCH RFC 3/5] btrfs: add readmirror devid property Anand Jain
2019-03-19 10:00 ` [PATCH RFC 4/5] btrfs: add readmirror pid mount option Anand Jain
2019-03-19 10:00 ` [PATCH RFC 5/5] btrfs: add readmirror devid " Anand Jain
2019-03-19 10:02 ` [PATCH RFC 1/2] btrfs-progs: add helper to create xattr name Anand Jain
2019-03-19 10:02   ` [PATCH RFC 2/2] btrfs-progs: add readmirror policy Anand Jain
2019-03-21 16:26 ` [PATCH RFC 0/5] readmirror feature Steven Davies
2019-03-21 18:26   ` waxhead
2019-03-22  6:08   ` Anand Jain
2019-04-09 15:48 ` David Sterba
2019-04-12  9:02   ` Anand Jain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.