* [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range
@ 2020-01-27 16:43 Luis Henriques
  2020-01-27 16:43 ` [RFC PATCH 1/3] libceph: add non-blocking version of ceph_osdc_copy_from() Luis Henriques
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Luis Henriques @ 2020-01-27 16:43 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng, Gregory Farnum
  Cc: ceph-devel, linux-kernel, Luis Henriques

Hi,

As discussed here[1] I'm sending an RFC patchset that does the
parallelization of the requests sent to the OSDs during a copy_file_range
syscall in CephFS.

  [1] https://lore.kernel.org/lkml/20200108100353.23770-1-lhenriques@suse.com/

I also have some performance numbers that I wanted to share. Here's a
description of the very simple tests I've run:

 - create a file with 200 objects in it
   * i.e. tests with different object sizes mean different file sizes
 - drop all caches and umount the filesystem
 - Measure:
   * mount filesystem
   * full file copy (with copy_file_range)
   * umount filesystem

Tests were repeated several times and the average value was used for
comparison.
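
The copy itself is just a plain copy_file_range(2) loop over the whole
file.  A minimal sketch of such a copier (illustrative only -- paths and
error handling are simplified, and this is not the actual test harness):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          int src, dst;
          struct stat st;
          off_t left;
          ssize_t n;

          if (argc != 3) {
                  fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
                  return 1;
          }
          src = open(argv[1], O_RDONLY);
          dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
                  perror("open/fstat");
                  return 1;
          }
          for (left = st.st_size; left > 0; left -= n) {
                  /* NULL offsets: use and update the file offsets */
                  n = copy_file_range(src, NULL, dst, NULL, left, 0);
                  if (n < 0) {
                          perror("copy_file_range");
                          return 1;
                  }
                  if (n == 0)     /* unexpected EOF */
                          break;
          }
          close(src);
          close(dst);
          return 0;
  }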

  DISCLAIMER:
  These numbers are only indicative, and different clusters and client
  configs will certainly show different performance!  More rigorous tests
  would be required to validate these results.

Having as baseline a full read+write (basically, a copy_file_range
operation within a filesystem mounted without the 'copyfrom' option),
here are some values for different object sizes:

			  8M	  4M	  1M	  65k
read+write		100%	100%	100%	 100%
sequential		 51%	 52%	 83%	>100%
parallel (throttle=1)	 51%	 52%	 83%	>100%
parallel (throttle=0)	 17%	 17%	 83%	>100%

Notes:

- 'parallel (throttle=0)' was a test where *all* the requests (i.e. 200
  requests to copy the 200 objects in the file) were sent to the OSDs and
  the wait for their completion was done only at the end.

- 'parallel (throttle=1)' was just a control test, where the wait for
  completion is done immediately after a request is sent.  It was expected
  to be very similar to the non-optimized ('sequential') tests.

- These tests were executed on a cluster with 40 OSDs, spread across 5
  (bare-metal) nodes.

- The tests with object size of 65k show that copy_file_range definitely
  doesn't scale to files with small object sizes.  '> 100%' actually means
  more than 10x slower.

Measuring the mount+copy+umount masks the actual difference between
different throttle values due to the time spent in mount+umount.  Thus,
there was no real difference between throttle=0 (send all and wait) and
throttle=20 (send 20, wait, send 20, ...).  But here's what I observed
when measuring only the copy operation (4M object size):

read+write		100%
parallel (throttle=1)	 56%
parallel (throttle=5)	 23%
parallel (throttle=10)	 14%
parallel (throttle=20)	  9%
parallel (throttle=5)	  5%

Anyway, I'll still need to revisit patch 0003 as it doesn't follow the
suggestion made by Jeff to *not* add another knob to fine-tune the
throttle value -- this patch adds a kernel parameter for a knob that I
wanted to use in my testing to observe different values of this throttle
limit.

The goal is probably to drop this patch and do the throttling in patch
0002.  I just need to come up with a decent heuristic.  Jeff's suggestion
was to use rsize/wsize, which are set to 64M by default IIRC.  Somehow I
feel that it should be related to the number of OSDs in the cluster
instead, but I'm not sure how.  And testing this sort of heuristic would
require different clusters, which isn't particularly easy to get.  Anyway,
comments are welcome!

Cheers,
--
Luis

Luis Henriques (3):
  libceph: add non-blocking version of ceph_osdc_copy_from()
  ceph: parallelize all copy-from requests in copy_file_range
  ceph: add module param to throttle 'copy-from2' operations

 fs/ceph/file.c                  | 48 ++++++++++++++++++++++++++--
 fs/ceph/super.c                 |  4 +++
 fs/ceph/super.h                 |  2 ++
 include/linux/ceph/osd_client.h | 14 +++++++++
 net/ceph/osd_client.c           | 55 +++++++++++++++++++++++++--------
 5 files changed, 108 insertions(+), 15 deletions(-)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH 1/3] libceph: add non-blocking version of ceph_osdc_copy_from()
  2020-01-27 16:43 [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Luis Henriques
@ 2020-01-27 16:43 ` Luis Henriques
  2020-01-27 17:47   ` Ilya Dryomov
  2020-01-27 16:43 ` [RFC PATCH 2/3] ceph: parallelize all copy-from requests in copy_file_range Luis Henriques
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2020-01-27 16:43 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng, Gregory Farnum
  Cc: ceph-devel, linux-kernel, Luis Henriques

A non-blocking version of ceph_osdc_copy_from() will allow callers to
send 'copy-from' requests in bulk and wait for all of them to complete at
the end.
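
The intended calling pattern is roughly the following (sketch only; see
the next patch for the real caller):

	struct ceph_osd_request *req;
	int ret;

	req = ceph_osdc_copy_from_nowait(osdc, src_snapid, src_version,
					 &src_oid, &src_oloc,
					 src_fadvise_flags,
					 &dst_oid, &dst_oloc,
					 dst_fadvise_flags,
					 truncate_seq, truncate_size,
					 copy_from_flags);
	if (IS_ERR(req))
		return PTR_ERR(req);

	/* ... queue further requests ... */

	/* and, once everything has been sent, for each queued request: */
	ret = ceph_osdc_wait_request(osdc, req);
	ceph_osdc_put_request(req);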

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 include/linux/ceph/osd_client.h | 12 ++++++++
 net/ceph/osd_client.c           | 54 +++++++++++++++++++++++++--------
 2 files changed, 53 insertions(+), 13 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 5a62dbd3f4c2..7916a178d137 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -537,6 +537,18 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
 			u32 truncate_seq, u64 truncate_size,
 			u8 copy_from_flags);
 
+struct ceph_osd_request *ceph_osdc_copy_from_nowait(
+			struct ceph_osd_client *osdc,
+			u64 src_snapid, u64 src_version,
+			struct ceph_object_id *src_oid,
+			struct ceph_object_locator *src_oloc,
+			u32 src_fadvise_flags,
+			struct ceph_object_id *dst_oid,
+			struct ceph_object_locator *dst_oloc,
+			u32 dst_fadvise_flags,
+			u32 truncate_seq, u64 truncate_size,
+			u8 copy_from_flags);
+
 /* watch/notify */
 struct ceph_osd_linger_request *
 ceph_osdc_watch(struct ceph_osd_client *osdc,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b68b376d8c2f..7f984532f37c 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -5346,23 +5346,24 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
 	return 0;
 }
 
-int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
-			u64 src_snapid, u64 src_version,
-			struct ceph_object_id *src_oid,
-			struct ceph_object_locator *src_oloc,
-			u32 src_fadvise_flags,
-			struct ceph_object_id *dst_oid,
-			struct ceph_object_locator *dst_oloc,
-			u32 dst_fadvise_flags,
-			u32 truncate_seq, u64 truncate_size,
-			u8 copy_from_flags)
+struct ceph_osd_request *ceph_osdc_copy_from_nowait(
+		struct ceph_osd_client *osdc,
+		u64 src_snapid, u64 src_version,
+		struct ceph_object_id *src_oid,
+		struct ceph_object_locator *src_oloc,
+		u32 src_fadvise_flags,
+		struct ceph_object_id *dst_oid,
+		struct ceph_object_locator *dst_oloc,
+		u32 dst_fadvise_flags,
+		u32 truncate_seq, u64 truncate_size,
+		u8 copy_from_flags)
 {
 	struct ceph_osd_request *req;
 	int ret;
 
 	req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
 	if (!req)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	req->r_flags = CEPH_OSD_FLAG_WRITE;
 
@@ -5381,11 +5382,38 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
 		goto out;
 
 	ceph_osdc_start_request(osdc, req, false);
-	ret = ceph_osdc_wait_request(osdc, req);
+	return req;
 
 out:
 	ceph_osdc_put_request(req);
-	return ret;
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(ceph_osdc_copy_from_nowait);
+
+int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
+			u64 src_snapid, u64 src_version,
+			struct ceph_object_id *src_oid,
+			struct ceph_object_locator *src_oloc,
+			u32 src_fadvise_flags,
+			struct ceph_object_id *dst_oid,
+			struct ceph_object_locator *dst_oloc,
+			u32 dst_fadvise_flags,
+			u32 truncate_seq, u64 truncate_size,
+			u8 copy_from_flags)
+{
+	struct ceph_osd_request *req;
+
+	req = ceph_osdc_copy_from_nowait(osdc,
+			src_snapid, src_version,
+			src_oid, src_oloc,
+			src_fadvise_flags,
+			dst_oid, dst_oloc,
+			dst_fadvise_flags,
+			truncate_seq, truncate_size,
+			copy_from_flags);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+	return ceph_osdc_wait_request(osdc, req);
 }
 EXPORT_SYMBOL(ceph_osdc_copy_from);
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 2/3] ceph: parallelize all copy-from requests in copy_file_range
  2020-01-27 16:43 [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Luis Henriques
  2020-01-27 16:43 ` [RFC PATCH 1/3] libceph: add non-blocking version of ceph_osdc_copy_from() Luis Henriques
@ 2020-01-27 16:43 ` Luis Henriques
  2020-01-27 17:58   ` Ilya Dryomov
  2020-01-27 16:43 ` [RFC PATCH 3/3] ceph: add module param to throttle 'copy-from2' operations Luis Henriques
  2020-01-27 18:16 ` [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Ilya Dryomov
  3 siblings, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2020-01-27 16:43 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng, Gregory Farnum
  Cc: ceph-devel, linux-kernel, Luis Henriques

Right now the copy_file_range syscall serializes all the OSD 'copy-from'
operations, waiting for each request to complete before sending the next
one.  This patch modifies copy_file_range so that all the 'copy-from'
operations are sent in bulk and their completion is awaited only at the
end.  This allows significant speed-ups, especially when the requests
target different OSDs.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 fs/ceph/file.c                  | 38 +++++++++++++++++++++++++++++++--
 include/linux/ceph/osd_client.h |  2 ++
 net/ceph/osd_client.c           |  1 +
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 1e6cdf2dfe90..5d8f0ba11719 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1931,6 +1931,28 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
 	return 0;
 }
 
+static int wait_copy_from_reqs(struct list_head *osd_reqs)
+{
+	struct ceph_osd_request *req;
+	int ret = 0, err;
+
+	while (!list_empty(osd_reqs)) {
+		req = list_first_entry(osd_reqs,
+				       struct ceph_osd_request,
+				       r_copy_item);
+		list_del_init(&req->r_copy_item);
+		err = ceph_osdc_wait_request(req->r_osdc, req);
+		if (err) {
+			if (!ret)
+				ret = err;
+			dout("copy request failed (err=%d)\n", err);
+		}
+		ceph_osdc_put_request(req);
+	}
+
+	return ret;
+}
+
 static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 				      struct file *dst_file, loff_t dst_off,
 				      size_t len, unsigned int flags)
@@ -1943,12 +1965,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
 	struct ceph_object_locator src_oloc, dst_oloc;
 	struct ceph_object_id src_oid, dst_oid;
+	struct ceph_osd_request *req;
 	loff_t endoff = 0, size;
 	ssize_t ret = -EIO;
 	u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
 	u32 src_objlen, dst_objlen, object_size;
 	int src_got = 0, dst_got = 0, err, dirty;
 	bool do_final_copy = false;
+	LIST_HEAD(osd_reqs);
 
 	if (src_inode->i_sb != dst_inode->i_sb) {
 		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
@@ -2097,7 +2121,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 		ceph_oid_printf(&dst_oid, "%llx.%08llx",
 				dst_ci->i_vino.ino, dst_objnum);
 		/* Do an object remote copy */
-		err = ceph_osdc_copy_from(
+		req = ceph_osdc_copy_from_nowait(
 			&src_fsc->client->osdc,
 			src_ci->i_vino.snap, 0,
 			&src_oid, &src_oloc,
@@ -2108,7 +2132,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 			CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
 			dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
 			CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
-		if (err) {
+		if (IS_ERR(req)) {
+			err = PTR_ERR(req);
 			if (err == -EOPNOTSUPP) {
 				src_fsc->have_copy_from2 = false;
 				pr_notice("OSDs don't support 'copy-from2'; "
@@ -2117,14 +2142,23 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 			dout("ceph_osdc_copy_from returned %d\n", err);
 			if (!ret)
 				ret = err;
+			/* wait for all queued requests */
+			wait_copy_from_reqs(&osd_reqs);
 			goto out_caps;
 		}
+		list_add(&req->r_copy_item, &osd_reqs);
 		len -= object_size;
 		src_off += object_size;
 		dst_off += object_size;
 		ret += object_size;
 	}
 
+	err = wait_copy_from_reqs(&osd_reqs);
+	if (err) {
+		if (!ret)
+			ret = err;
+		goto out_caps;
+	}
 	if (len)
 		/* We still need one final local copy */
 		do_final_copy = true;
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 7916a178d137..2b4a14bc6154 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -210,6 +210,8 @@ struct ceph_osd_request {
 	u64 r_data_offset;                    /* ditto */
 	bool r_linger;                        /* don't resend on failure */
 
+	struct list_head r_copy_item;         /* used for copy-from operations */
+
 	/* internal */
 	unsigned long r_stamp;                /* jiffies, send or check time */
 	unsigned long r_start_stamp;          /* jiffies */
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 7f984532f37c..16f38c3d606e 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -531,6 +531,7 @@ static void request_init(struct ceph_osd_request *req)
 	RB_CLEAR_NODE(&req->r_node);
 	RB_CLEAR_NODE(&req->r_mc_node);
 	INIT_LIST_HEAD(&req->r_private_item);
+	INIT_LIST_HEAD(&req->r_copy_item);
 
 	target_init(&req->r_t);
 }

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 3/3] ceph: add module param to throttle 'copy-from2' operations
  2020-01-27 16:43 [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Luis Henriques
  2020-01-27 16:43 ` [RFC PATCH 1/3] libceph: add non-blocking version of ceph_osdc_copy_from() Luis Henriques
  2020-01-27 16:43 ` [RFC PATCH 2/3] ceph: parallelize all copy-from requests in copy_file_range Luis Henriques
@ 2020-01-27 16:43 ` Luis Henriques
  2020-01-27 18:16 ` [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Ilya Dryomov
  3 siblings, 0 replies; 12+ messages in thread
From: Luis Henriques @ 2020-01-27 16:43 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng, Gregory Farnum
  Cc: ceph-devel, linux-kernel, Luis Henriques

This patch adds a ceph kernel module parameter that throttles the number
of parallel requests that can be sent to the OSDs before waiting for their
completion.  This helps prevent DoS'ing the OSDs with too many requests at
once when copying a big file.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 fs/ceph/file.c  | 10 ++++++++++
 fs/ceph/super.c |  4 ++++
 fs/ceph/super.h |  2 ++
 3 files changed, 16 insertions(+)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5d8f0ba11719..bf18712f3bd3 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1973,6 +1973,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	int src_got = 0, dst_got = 0, err, dirty;
 	bool do_final_copy = false;
 	LIST_HEAD(osd_reqs);
+	unsigned int ncopies = cfr_throttle;
 
 	if (src_inode->i_sb != dst_inode->i_sb) {
 		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
@@ -2151,6 +2152,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 		src_off += object_size;
 		dst_off += object_size;
 		ret += object_size;
+		if (cfr_throttle && (--ncopies == 0)) {
+			err = wait_copy_from_reqs(&osd_reqs);
+			if (err) {
+				if (!ret)
+					ret = err;
+				goto out_caps;
+			}
+			ncopies = cfr_throttle;
+		}
 	}
 
 	err = wait_copy_from_reqs(&osd_reqs);
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index b62c487a53af..02e8b6f93d50 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1238,6 +1238,10 @@ static void __exit exit_ceph(void)
 	destroy_caches();
 }
 
+unsigned int cfr_throttle = 0;
+module_param(cfr_throttle, uint, 0644);
+MODULE_PARM_DESC(cfr_throttle, "copy_file_range throttle value.");
+
 module_init(init_ceph);
 module_exit(exit_ceph);
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index b2f86bed5c2c..fb98b4b1ec72 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -72,6 +72,8 @@
 #define CEPH_CAPS_WANTED_DELAY_MIN_DEFAULT      5  /* cap release delay */
 #define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT     60  /* cap release delay */
 
+extern unsigned int cfr_throttle;
+
 struct ceph_mount_options {
 	unsigned int flags;
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/3] libceph: add non-blocking version of ceph_osdc_copy_from()
  2020-01-27 16:43 ` [RFC PATCH 1/3] libceph: add non-blocking version of ceph_osdc_copy_from() Luis Henriques
@ 2020-01-27 17:47   ` Ilya Dryomov
  2020-01-27 18:39     ` Luis Henriques
  0 siblings, 1 reply; 12+ messages in thread
From: Ilya Dryomov @ 2020-01-27 17:47 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> A non-blocking version of ceph_osdc_copy_from will allow for callers to
> send 'copy-from' requests in bulk and wait for all of them to complete in
> the end.
>
> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> ---
>  include/linux/ceph/osd_client.h | 12 ++++++++
>  net/ceph/osd_client.c           | 54 +++++++++++++++++++++++++--------
>  2 files changed, 53 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 5a62dbd3f4c2..7916a178d137 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -537,6 +537,18 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
>                         u32 truncate_seq, u64 truncate_size,
>                         u8 copy_from_flags);
>
> +struct ceph_osd_request *ceph_osdc_copy_from_nowait(
> +                       struct ceph_osd_client *osdc,
> +                       u64 src_snapid, u64 src_version,
> +                       struct ceph_object_id *src_oid,
> +                       struct ceph_object_locator *src_oloc,
> +                       u32 src_fadvise_flags,
> +                       struct ceph_object_id *dst_oid,
> +                       struct ceph_object_locator *dst_oloc,
> +                       u32 dst_fadvise_flags,
> +                       u32 truncate_seq, u64 truncate_size,
> +                       u8 copy_from_flags);
> +
>  /* watch/notify */
>  struct ceph_osd_linger_request *
>  ceph_osdc_watch(struct ceph_osd_client *osdc,
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index b68b376d8c2f..7f984532f37c 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -5346,23 +5346,24 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
>         return 0;
>  }
>
> -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> -                       u64 src_snapid, u64 src_version,
> -                       struct ceph_object_id *src_oid,
> -                       struct ceph_object_locator *src_oloc,
> -                       u32 src_fadvise_flags,
> -                       struct ceph_object_id *dst_oid,
> -                       struct ceph_object_locator *dst_oloc,
> -                       u32 dst_fadvise_flags,
> -                       u32 truncate_seq, u64 truncate_size,
> -                       u8 copy_from_flags)
> +struct ceph_osd_request *ceph_osdc_copy_from_nowait(
> +               struct ceph_osd_client *osdc,
> +               u64 src_snapid, u64 src_version,
> +               struct ceph_object_id *src_oid,
> +               struct ceph_object_locator *src_oloc,
> +               u32 src_fadvise_flags,
> +               struct ceph_object_id *dst_oid,
> +               struct ceph_object_locator *dst_oloc,
> +               u32 dst_fadvise_flags,
> +               u32 truncate_seq, u64 truncate_size,
> +               u8 copy_from_flags)
>  {
>         struct ceph_osd_request *req;
>         int ret;
>
>         req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
>         if (!req)
> -               return -ENOMEM;
> +               return ERR_PTR(-ENOMEM);
>
>         req->r_flags = CEPH_OSD_FLAG_WRITE;
>
> @@ -5381,11 +5382,38 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
>                 goto out;
>
>         ceph_osdc_start_request(osdc, req, false);
> -       ret = ceph_osdc_wait_request(osdc, req);
> +       return req;
>
>  out:
>         ceph_osdc_put_request(req);
> -       return ret;
> +       return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL(ceph_osdc_copy_from_nowait);
> +
> +int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> +                       u64 src_snapid, u64 src_version,
> +                       struct ceph_object_id *src_oid,
> +                       struct ceph_object_locator *src_oloc,
> +                       u32 src_fadvise_flags,
> +                       struct ceph_object_id *dst_oid,
> +                       struct ceph_object_locator *dst_oloc,
> +                       u32 dst_fadvise_flags,
> +                       u32 truncate_seq, u64 truncate_size,
> +                       u8 copy_from_flags)
> +{
> +       struct ceph_osd_request *req;
> +
> +       req = ceph_osdc_copy_from_nowait(osdc,
> +                       src_snapid, src_version,
> +                       src_oid, src_oloc,
> +                       src_fadvise_flags,
> +                       dst_oid, dst_oloc,
> +                       dst_fadvise_flags,
> +                       truncate_seq, truncate_size,
> +                       copy_from_flags);
> +       if (IS_ERR(req))
> +               return PTR_ERR(req);
> +       return ceph_osdc_wait_request(osdc, req);

I don't think we need a blocking version.  Change ceph_osdc_copy_from()
and keep the name -- no need for async or nowait suffixes.
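
I.e. keep the existing name but return the request, something like this
(sketch of the prototype only):

  struct ceph_osd_request *
  ceph_osdc_copy_from(struct ceph_osd_client *osdc,
                      u64 src_snapid, u64 src_version,
                      struct ceph_object_id *src_oid,
                      struct ceph_object_locator *src_oloc,
                      u32 src_fadvise_flags,
                      struct ceph_object_id *dst_oid,
                      struct ceph_object_locator *dst_oloc,
                      u32 dst_fadvise_flags,
                      u32 truncate_seq, u64 truncate_size,
                      u8 copy_from_flags);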

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 2/3] ceph: parallelize all copy-from requests in copy_file_range
  2020-01-27 16:43 ` [RFC PATCH 2/3] ceph: parallelize all copy-from requests in copy_file_range Luis Henriques
@ 2020-01-27 17:58   ` Ilya Dryomov
  2020-01-27 18:44     ` Luis Henriques
  0 siblings, 1 reply; 12+ messages in thread
From: Ilya Dryomov @ 2020-01-27 17:58 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> operations, waiting for each request to complete before sending the next
> one.  This patch modifies copy_file_range so that all the 'copy-from'
> operations are sent in bulk and waits for its completion at the end.  This
> will allow significant speed-ups, specially when sending requests for
> different target OSDs.
>
> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> ---
>  fs/ceph/file.c                  | 38 +++++++++++++++++++++++++++++++--
>  include/linux/ceph/osd_client.h |  2 ++
>  net/ceph/osd_client.c           |  1 +
>  3 files changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 1e6cdf2dfe90..5d8f0ba11719 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1931,6 +1931,28 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
>         return 0;
>  }
>
> +static int wait_copy_from_reqs(struct list_head *osd_reqs)
> +{
> +       struct ceph_osd_request *req;
> +       int ret = 0, err;
> +
> +       while (!list_empty(osd_reqs)) {
> +               req = list_first_entry(osd_reqs,
> +                                      struct ceph_osd_request,
> +                                      r_copy_item);
> +               list_del_init(&req->r_copy_item);
> +               err = ceph_osdc_wait_request(req->r_osdc, req);
> +               if (err) {
> +                       if (!ret)
> +                               ret = err;
> +                       dout("copy request failed (err=%d)\n", err);
> +               }
> +               ceph_osdc_put_request(req);
> +       }
> +
> +       return ret;
> +}

This should probably go into libceph, as ceph_osdc_wait_requests().
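
Something along these lines, i.e. essentially the helper above moved into
net/ceph/osd_client.c and exported (sketch only; the exact signature and
list field are up for discussion):

  int ceph_osdc_wait_requests(struct list_head *osd_reqs)
  {
          struct ceph_osd_request *req;
          int ret = 0, err;

          while (!list_empty(osd_reqs)) {
                  req = list_first_entry(osd_reqs, struct ceph_osd_request,
                                         r_copy_item);
                  list_del_init(&req->r_copy_item);
                  err = ceph_osdc_wait_request(req->r_osdc, req);
                  if (err && !ret)
                          ret = err;
                  ceph_osdc_put_request(req);
          }
          return ret;
  }
  EXPORT_SYMBOL(ceph_osdc_wait_requests);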

> +
>  static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>                                       struct file *dst_file, loff_t dst_off,
>                                       size_t len, unsigned int flags)
> @@ -1943,12 +1965,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>         struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>         struct ceph_object_locator src_oloc, dst_oloc;
>         struct ceph_object_id src_oid, dst_oid;
> +       struct ceph_osd_request *req;
>         loff_t endoff = 0, size;
>         ssize_t ret = -EIO;
>         u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
>         u32 src_objlen, dst_objlen, object_size;
>         int src_got = 0, dst_got = 0, err, dirty;
>         bool do_final_copy = false;
> +       LIST_HEAD(osd_reqs);
>
>         if (src_inode->i_sb != dst_inode->i_sb) {
>                 struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> @@ -2097,7 +2121,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>                 ceph_oid_printf(&dst_oid, "%llx.%08llx",
>                                 dst_ci->i_vino.ino, dst_objnum);
>                 /* Do an object remote copy */
> -               err = ceph_osdc_copy_from(
> +               req = ceph_osdc_copy_from_nowait(
>                         &src_fsc->client->osdc,
>                         src_ci->i_vino.snap, 0,
>                         &src_oid, &src_oloc,
> @@ -2108,7 +2132,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>                         CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
>                         dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
>                         CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> -               if (err) {
> +               if (IS_ERR(req)) {
> +                       err = PTR_ERR(req);
>                         if (err == -EOPNOTSUPP) {
>                                 src_fsc->have_copy_from2 = false;
>                                 pr_notice("OSDs don't support 'copy-from2'; "
> @@ -2117,14 +2142,23 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>                         dout("ceph_osdc_copy_from returned %d\n", err);
>                         if (!ret)
>                                 ret = err;
> +                       /* wait for all queued requests */
> +                       wait_copy_from_reqs(&osd_reqs);
>                         goto out_caps;
>                 }
> +               list_add(&req->r_copy_item, &osd_reqs);
>                 len -= object_size;
>                 src_off += object_size;
>                 dst_off += object_size;
>                 ret += object_size;
>         }
>
> +       err = wait_copy_from_reqs(&osd_reqs);
> +       if (err) {
> +               if (!ret)
> +                       ret = err;
> +               goto out_caps;
> +       }
>         if (len)
>                 /* We still need one final local copy */
>                 do_final_copy = true;
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 7916a178d137..2b4a14bc6154 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -210,6 +210,8 @@ struct ceph_osd_request {
>         u64 r_data_offset;                    /* ditto */
>         bool r_linger;                        /* don't resend on failure */
>
> +       struct list_head r_copy_item;         /* used for copy-from operations */
> +

We have r_private_item for exactly this kind of ad-hoc lists, already
used in rbd and ceph_direct_read_write().
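
I.e. the new field could be dropped and the caller would just do something
like the following, with the wait helper iterating over r_private_item as
well (sketch):

		/* reuse the list_head reserved for callers, no new field */
		list_add(&req->r_private_item, &osd_reqs);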

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range
  2020-01-27 16:43 [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Luis Henriques
                   ` (2 preceding siblings ...)
  2020-01-27 16:43 ` [RFC PATCH 3/3] ceph: add module param to throttle 'copy-from2' operations Luis Henriques
@ 2020-01-27 18:16 ` Ilya Dryomov
  2020-01-27 18:52   ` Luis Henriques
  3 siblings, 1 reply; 12+ messages in thread
From: Ilya Dryomov @ 2020-01-27 18:16 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> Hi,
>
> As discussed here[1] I'm sending an RFC patchset that does the
> parallelization of the requests sent to the OSDs during a copy_file_range
> syscall in CephFS.
>
>   [1] https://lore.kernel.org/lkml/20200108100353.23770-1-lhenriques@suse.com/
>
> I've also some performance numbers that I wanted to share. Here's a
> description of the very simple tests I've run:
>
>  - create a file with 200 objects in it
>    * i.e. tests with different object sizes mean different file sizes
>  - drop all caches and umount the filesystem
>  - Measure:
>    * mount filesystem
>    * full file copy (with copy_file_range)
>    * umount filesystem
>
> Tests were repeated several times and the average value was used for
> comparison.
>
>   DISCLAIMER:
>   These numbers are only indicative, and different clusters and client
>   configs will for sure show different performance!  More rigorous tests
>   would be require to validate these results.
>
> Having as baseline a full read+write (basically, a copy_file_range
> operation within a filesystem mounted without the 'copyfrom' option),
> here's some values for different object sizes:
>
>                           8M      4M      1M      65k
> read+write              100%    100%    100%     100%
> sequential               51%     52%     83%    >100%
> parallel (throttle=1)    51%     52%     83%    >100%
> parallel (throttle=0)    17%     17%     83%    >100%
>
> Notes:
>
> - 'parallel (throttle=0)' was a test where *all* the requests (i.e. 200
>   requests to copy the 200 objects in the file) were sent to the OSDs and
>   the wait for requests completion is done at the end only.
>
> - 'parallel (throttle=1)' was just a control test, where the wait for
>   completion is done immediately after a request is sent.  It was expected
>   to be very similar to the non-optimized ('sequential') tests.
>
> - These tests were executed on a cluster with 40 OSDs, spread across 5
>   (bare-metal) nodes.
>
> - The tests with object size of 65k show that copy_file_range definitely
>   doesn't scale to files with small object sizes.  '> 100%' actually means
>   more than 10x slower.
>
> Measuring the mount+copy+umount masks the actual difference between
> different throttle values due to the time spent in mount+umount.  Thus,
> there was no real difference between throttle=0 (send all and wait) and
> throttle=20 (send 20, wait, send 20, ...).  But here's what I observed
> when measuring only the copy operation (4M object size):
>
> read+write              100%
> parallel (throttle=1)    56%
> parallel (throttle=5)    23%
> parallel (throttle=10)   14%
> parallel (throttle=20)    9%
> parallel (throttle=5)     5%

Was this supposed to be throttle=50?

>
> Anyway, I'll still need to revisit patch 0003 as it doesn't follow the
> suggestion done by Jeff to *not* add another knob to fine-tune the
> throttle value -- this patch adds a kernel parameter for a knob that I
> wanted to use in my testing to observe different values of this throttle
> limit.
>
> The goal is to probably to drop this patch and do the throttling in patch
> 0002.  I just need to come up with a decent heuristic.  Jeff's suggestion
> was to use rsize/wsize, which are set to 64M by default IIRC.  Somehow I
> feel that it should be related to the number of OSDs in the cluster
> instead, but I'm not sure how.  And testing these sort of heuristics would
> require different clusters, which isn't particularly easy to get.  Anyway,
> comments are welcome!

I agree with Jeff, this throttle is certainly not worth a module
parameter (or a mount option).  I would start with something like
C * (wsize / object size) and pick C between 1 and 4.
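
A sketch of such a heuristic (the constant and the clamping are of course
arbitrary and would need testing):

  /*
   * Cap the number of in-flight copy-from requests.  E.g. with the
   * default wsize of 64M, 4M objects and C == 2 this gives 32.
   */
  static u32 copy_from_throttle(u32 wsize, u32 object_size)
  {
          u32 n = 2 * (wsize / object_size);      /* C == 2 */

          return n ? n : 1;   /* always allow at least one request */
  }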

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 1/3] libceph: add non-blocking version of ceph_osdc_copy_from()
  2020-01-27 17:47   ` Ilya Dryomov
@ 2020-01-27 18:39     ` Luis Henriques
  0 siblings, 0 replies; 12+ messages in thread
From: Luis Henriques @ 2020-01-27 18:39 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Mon, Jan 27, 2020 at 06:47:02PM +0100, Ilya Dryomov wrote:
> On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@suse.com> wrote:
> >
> > A non-blocking version of ceph_osdc_copy_from will allow for callers to
> > send 'copy-from' requests in bulk and wait for all of them to complete in
> > the end.
> >
> > Signed-off-by: Luis Henriques <lhenriques@suse.com>
> > ---
> >  include/linux/ceph/osd_client.h | 12 ++++++++
> >  net/ceph/osd_client.c           | 54 +++++++++++++++++++++++++--------
> >  2 files changed, 53 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index 5a62dbd3f4c2..7916a178d137 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -537,6 +537,18 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> >                         u32 truncate_seq, u64 truncate_size,
> >                         u8 copy_from_flags);
> >
> > +struct ceph_osd_request *ceph_osdc_copy_from_nowait(
> > +                       struct ceph_osd_client *osdc,
> > +                       u64 src_snapid, u64 src_version,
> > +                       struct ceph_object_id *src_oid,
> > +                       struct ceph_object_locator *src_oloc,
> > +                       u32 src_fadvise_flags,
> > +                       struct ceph_object_id *dst_oid,
> > +                       struct ceph_object_locator *dst_oloc,
> > +                       u32 dst_fadvise_flags,
> > +                       u32 truncate_seq, u64 truncate_size,
> > +                       u8 copy_from_flags);
> > +
> >  /* watch/notify */
> >  struct ceph_osd_linger_request *
> >  ceph_osdc_watch(struct ceph_osd_client *osdc,
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index b68b376d8c2f..7f984532f37c 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -5346,23 +5346,24 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
> >         return 0;
> >  }
> >
> > -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> > -                       u64 src_snapid, u64 src_version,
> > -                       struct ceph_object_id *src_oid,
> > -                       struct ceph_object_locator *src_oloc,
> > -                       u32 src_fadvise_flags,
> > -                       struct ceph_object_id *dst_oid,
> > -                       struct ceph_object_locator *dst_oloc,
> > -                       u32 dst_fadvise_flags,
> > -                       u32 truncate_seq, u64 truncate_size,
> > -                       u8 copy_from_flags)
> > +struct ceph_osd_request *ceph_osdc_copy_from_nowait(
> > +               struct ceph_osd_client *osdc,
> > +               u64 src_snapid, u64 src_version,
> > +               struct ceph_object_id *src_oid,
> > +               struct ceph_object_locator *src_oloc,
> > +               u32 src_fadvise_flags,
> > +               struct ceph_object_id *dst_oid,
> > +               struct ceph_object_locator *dst_oloc,
> > +               u32 dst_fadvise_flags,
> > +               u32 truncate_seq, u64 truncate_size,
> > +               u8 copy_from_flags)
> >  {
> >         struct ceph_osd_request *req;
> >         int ret;
> >
> >         req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
> >         if (!req)
> > -               return -ENOMEM;
> > +               return ERR_PTR(-ENOMEM);
> >
> >         req->r_flags = CEPH_OSD_FLAG_WRITE;
> >
> > @@ -5381,11 +5382,38 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> >                 goto out;
> >
> >         ceph_osdc_start_request(osdc, req, false);
> > -       ret = ceph_osdc_wait_request(osdc, req);
> > +       return req;
> >
> >  out:
> >         ceph_osdc_put_request(req);
> > -       return ret;
> > +       return ERR_PTR(ret);
> > +}
> > +EXPORT_SYMBOL(ceph_osdc_copy_from_nowait);
> > +
> > +int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> > +                       u64 src_snapid, u64 src_version,
> > +                       struct ceph_object_id *src_oid,
> > +                       struct ceph_object_locator *src_oloc,
> > +                       u32 src_fadvise_flags,
> > +                       struct ceph_object_id *dst_oid,
> > +                       struct ceph_object_locator *dst_oloc,
> > +                       u32 dst_fadvise_flags,
> > +                       u32 truncate_seq, u64 truncate_size,
> > +                       u8 copy_from_flags)
> > +{
> > +       struct ceph_osd_request *req;
> > +
> > +       req = ceph_osdc_copy_from_nowait(osdc,
> > +                       src_snapid, src_version,
> > +                       src_oid, src_oloc,
> > +                       src_fadvise_flags,
> > +                       dst_oid, dst_oloc,
> > +                       dst_fadvise_flags,
> > +                       truncate_seq, truncate_size,
> > +                       copy_from_flags);
> > +       if (IS_ERR(req))
> > +               return PTR_ERR(req);
> > +       return ceph_osdc_wait_request(osdc, req);
> 
> I don't think we need a blocking version.  Change ceph_osdc_copy_from()
> and keep the name -- no need for async or nowait suffixes.

Yeah, since the only user is copy_file_range we definitely don't need the
original version.  I'll include this change in the next iteration.

Cheers,
--
Luís

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 2/3] ceph: parallelize all copy-from requests in copy_file_range
  2020-01-27 17:58   ` Ilya Dryomov
@ 2020-01-27 18:44     ` Luis Henriques
  2020-01-28  8:39       ` Ilya Dryomov
  0 siblings, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2020-01-27 18:44 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Mon, Jan 27, 2020 at 06:58:46PM +0100, Ilya Dryomov wrote:
> On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@suse.com> wrote:
> >
> > Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> > operations, waiting for each request to complete before sending the next
> > one.  This patch modifies copy_file_range so that all the 'copy-from'
> > operations are sent in bulk and waits for its completion at the end.  This
> > will allow significant speed-ups, specially when sending requests for
> > different target OSDs.
> >
> > Signed-off-by: Luis Henriques <lhenriques@suse.com>
> > ---
> >  fs/ceph/file.c                  | 38 +++++++++++++++++++++++++++++++--
> >  include/linux/ceph/osd_client.h |  2 ++
> >  net/ceph/osd_client.c           |  1 +
> >  3 files changed, 39 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 1e6cdf2dfe90..5d8f0ba11719 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1931,6 +1931,28 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
> >         return 0;
> >  }
> >
> > +static int wait_copy_from_reqs(struct list_head *osd_reqs)
> > +{
> > +       struct ceph_osd_request *req;
> > +       int ret = 0, err;
> > +
> > +       while (!list_empty(osd_reqs)) {
> > +               req = list_first_entry(osd_reqs,
> > +                                      struct ceph_osd_request,
> > +                                      r_copy_item);
> > +               list_del_init(&req->r_copy_item);
> > +               err = ceph_osdc_wait_request(req->r_osdc, req);
> > +               if (err) {
> > +                       if (!ret)
> > +                               ret = err;
> > +                       dout("copy request failed (err=%d)\n", err);
> > +               }
> > +               ceph_osdc_put_request(req);
> > +       }
> > +
> > +       return ret;
> > +}
> 
> This should probably go into libceph, as ceph_osdc_wait_requests().

Sure, that makes sense.

> > +
> >  static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >                                       struct file *dst_file, loff_t dst_off,
> >                                       size_t len, unsigned int flags)
> > @@ -1943,12 +1965,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >         struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> >         struct ceph_object_locator src_oloc, dst_oloc;
> >         struct ceph_object_id src_oid, dst_oid;
> > +       struct ceph_osd_request *req;
> >         loff_t endoff = 0, size;
> >         ssize_t ret = -EIO;
> >         u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> >         u32 src_objlen, dst_objlen, object_size;
> >         int src_got = 0, dst_got = 0, err, dirty;
> >         bool do_final_copy = false;
> > +       LIST_HEAD(osd_reqs);
> >
> >         if (src_inode->i_sb != dst_inode->i_sb) {
> >                 struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > @@ -2097,7 +2121,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >                 ceph_oid_printf(&dst_oid, "%llx.%08llx",
> >                                 dst_ci->i_vino.ino, dst_objnum);
> >                 /* Do an object remote copy */
> > -               err = ceph_osdc_copy_from(
> > +               req = ceph_osdc_copy_from_nowait(
> >                         &src_fsc->client->osdc,
> >                         src_ci->i_vino.snap, 0,
> >                         &src_oid, &src_oloc,
> > @@ -2108,7 +2132,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >                         CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> >                         dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> >                         CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > -               if (err) {
> > +               if (IS_ERR(req)) {
> > +                       err = PTR_ERR(req);
> >                         if (err == -EOPNOTSUPP) {
> >                                 src_fsc->have_copy_from2 = false;
> >                                 pr_notice("OSDs don't support 'copy-from2'; "
> > @@ -2117,14 +2142,23 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >                         dout("ceph_osdc_copy_from returned %d\n", err);
> >                         if (!ret)
> >                                 ret = err;
> > +                       /* wait for all queued requests */
> > +                       wait_copy_from_reqs(&osd_reqs);
> >                         goto out_caps;
> >                 }
> > +               list_add(&req->r_copy_item, &osd_reqs);
> >                 len -= object_size;
> >                 src_off += object_size;
> >                 dst_off += object_size;
> >                 ret += object_size;
> >         }
> >
> > +       err = wait_copy_from_reqs(&osd_reqs);
> > +       if (err) {
> > +               if (!ret)
> > +                       ret = err;
> > +               goto out_caps;
> > +       }
> >         if (len)
> >                 /* We still need one final local copy */
> >                 do_final_copy = true;
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index 7916a178d137..2b4a14bc6154 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -210,6 +210,8 @@ struct ceph_osd_request {
> >         u64 r_data_offset;                    /* ditto */
> >         bool r_linger;                        /* don't resend on failure */
> >
> > +       struct list_head r_copy_item;         /* used for copy-from operations */
> > +
> 
> We have r_private_item for exactly this kind of ad-hoc lists, already
> used in rbd and ceph_direct_read_write().

Ah!  I actually considered using it, but thought that using the same
field for different purposes might be a source of confusion and bugs in
the future.  But sure, it can be used.  I'll add a comment in the struct
definition regarding this ad-hoc usage.

Cheers,
--
Luís

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range
  2020-01-27 18:16 ` [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Ilya Dryomov
@ 2020-01-27 18:52   ` Luis Henriques
  2020-01-28 17:15     ` Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2020-01-27 18:52 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Mon, Jan 27, 2020 at 07:16:17PM +0100, Ilya Dryomov wrote:
> On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@suse.com> wrote:
> >
> > Hi,
> >
> > As discussed here[1] I'm sending an RFC patchset that does the
> > parallelization of the requests sent to the OSDs during a copy_file_range
> > syscall in CephFS.
> >
> >   [1] https://lore.kernel.org/lkml/20200108100353.23770-1-lhenriques@suse.com/
> >
> > I've also some performance numbers that I wanted to share. Here's a
> > description of the very simple tests I've run:
> >
> >  - create a file with 200 objects in it
> >    * i.e. tests with different object sizes mean different file sizes
> >  - drop all caches and umount the filesystem
> >  - Measure:
> >    * mount filesystem
> >    * full file copy (with copy_file_range)
> >    * umount filesystem
> >
> > Tests were repeated several times and the average value was used for
> > comparison.
> >
> >   DISCLAIMER:
> >   These numbers are only indicative, and different clusters and client
> >   configs will for sure show different performance!  More rigorous tests
> >   would be require to validate these results.
> >
> > Having as baseline a full read+write (basically, a copy_file_range
> > operation within a filesystem mounted without the 'copyfrom' option),
> > here's some values for different object sizes:
> >
> >                           8M      4M      1M      65k
> > read+write              100%    100%    100%     100%
> > sequential               51%     52%     83%    >100%
> > parallel (throttle=1)    51%     52%     83%    >100%
> > parallel (throttle=0)    17%     17%     83%    >100%
> >
> > Notes:
> >
> > - 'parallel (throttle=0)' was a test where *all* the requests (i.e. 200
> >   requests to copy the 200 objects in the file) were sent to the OSDs and
> >   the wait for requests completion is done at the end only.
> >
> > - 'parallel (throttle=1)' was just a control test, where the wait for
> >   completion is done immediately after a request is sent.  It was expected
> >   to be very similar to the non-optimized ('sequential') tests.
> >
> > - These tests were executed on a cluster with 40 OSDs, spread across 5
> >   (bare-metal) nodes.
> >
> > - The tests with object size of 65k show that copy_file_range definitely
> >   doesn't scale to files with small object sizes.  '> 100%' actually means
> >   more than 10x slower.
> >
> > Measuring the mount+copy+umount masks the actual difference between
> > different throttle values due to the time spent in mount+umount.  Thus,
> > there was no real difference between throttle=0 (send all and wait) and
> > throttle=20 (send 20, wait, send 20, ...).  But here's what I observed
> > when measuring only the copy operation (4M object size):
> >
> > read+write              100%
> > parallel (throttle=1)    56%
> > parallel (throttle=5)    23%
> > parallel (throttle=10)   14%
> > parallel (throttle=20)    9%
> > parallel (throttle=5)     5%
> 
> Was this supposed to be throttle=50?

Oops, no, it should be throttle=0 (i.e. no throttle).

> >
> > Anyway, I'll still need to revisit patch 0003 as it doesn't follow the
> > suggestion done by Jeff to *not* add another knob to fine-tune the
> > throttle value -- this patch adds a kernel parameter for a knob that I
> > wanted to use in my testing to observe different values of this throttle
> > limit.
> >
> > The goal is to probably to drop this patch and do the throttling in patch
> > 0002.  I just need to come up with a decent heuristic.  Jeff's suggestion
> > was to use rsize/wsize, which are set to 64M by default IIRC.  Somehow I
> > feel that it should be related to the number of OSDs in the cluster
> > instead, but I'm not sure how.  And testing these sort of heuristics would
> > require different clusters, which isn't particularly easy to get.  Anyway,
> > comments are welcome!
> 
> I agree with Jeff, this throttle is certainly not worth a module
> parameter (or a mount option).  I would start with something like
> C * (wsize / object size) and pick C between 1 and 4.

Sure, I also agree with not adding the new parameter or mount option.
It's just tricky to pick (and test!) the best formula to use.  From your
proposal the throttle value would be by default between 16 and 64; those
probably work fine in some situations (for example, in the cluster I used
for running my tests).  But for a really big cluster, with hundreds of
OSDs, it's difficult to say.

Anyway, I'll come up with a proposal for the next revision.  And thanks a
lot for your feedback.

Cheers,
--
Luís

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 2/3] ceph: parallelize all copy-from requests in copy_file_range
  2020-01-27 18:44     ` Luis Henriques
@ 2020-01-28  8:39       ` Ilya Dryomov
  0 siblings, 0 replies; 12+ messages in thread
From: Ilya Dryomov @ 2020-01-28  8:39 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Mon, Jan 27, 2020 at 7:44 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> On Mon, Jan 27, 2020 at 06:58:46PM +0100, Ilya Dryomov wrote:
> > On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@suse.com> wrote:
> > >
> > > Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> > > operations, waiting for each request to complete before sending the next
> > > one.  This patch modifies copy_file_range so that all the 'copy-from'
> > > operations are sent in bulk and waits for its completion at the end.  This
> > > will allow significant speed-ups, specially when sending requests for
> > > different target OSDs.
> > >
> > > Signed-off-by: Luis Henriques <lhenriques@suse.com>
> > > ---
> > >  fs/ceph/file.c                  | 38 +++++++++++++++++++++++++++++++--
> > >  include/linux/ceph/osd_client.h |  2 ++
> > >  net/ceph/osd_client.c           |  1 +
> > >  3 files changed, 39 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 1e6cdf2dfe90..5d8f0ba11719 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -1931,6 +1931,28 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
> > >         return 0;
> > >  }
> > >
> > > +static int wait_copy_from_reqs(struct list_head *osd_reqs)
> > > +{
> > > +       struct ceph_osd_request *req;
> > > +       int ret = 0, err;
> > > +
> > > +       while (!list_empty(osd_reqs)) {
> > > +               req = list_first_entry(osd_reqs,
> > > +                                      struct ceph_osd_request,
> > > +                                      r_copy_item);
> > > +               list_del_init(&req->r_copy_item);
> > > +               err = ceph_osdc_wait_request(req->r_osdc, req);
> > > +               if (err) {
> > > +                       if (!ret)
> > > +                               ret = err;
> > > +                       dout("copy request failed (err=%d)\n", err);
> > > +               }
> > > +               ceph_osdc_put_request(req);
> > > +       }
> > > +
> > > +       return ret;
> > > +}
> >
> > This should probably go into libceph, as ceph_osdc_wait_requests().
>
> Sure, that makes sense.
>
> > > +
> > >  static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >                                       struct file *dst_file, loff_t dst_off,
> > >                                       size_t len, unsigned int flags)
> > > @@ -1943,12 +1965,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >         struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > >         struct ceph_object_locator src_oloc, dst_oloc;
> > >         struct ceph_object_id src_oid, dst_oid;
> > > +       struct ceph_osd_request *req;
> > >         loff_t endoff = 0, size;
> > >         ssize_t ret = -EIO;
> > >         u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> > >         u32 src_objlen, dst_objlen, object_size;
> > >         int src_got = 0, dst_got = 0, err, dirty;
> > >         bool do_final_copy = false;
> > > +       LIST_HEAD(osd_reqs);
> > >
> > >         if (src_inode->i_sb != dst_inode->i_sb) {
> > >                 struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > @@ -2097,7 +2121,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >                 ceph_oid_printf(&dst_oid, "%llx.%08llx",
> > >                                 dst_ci->i_vino.ino, dst_objnum);
> > >                 /* Do an object remote copy */
> > > -               err = ceph_osdc_copy_from(
> > > +               req = ceph_osdc_copy_from_nowait(
> > >                         &src_fsc->client->osdc,
> > >                         src_ci->i_vino.snap, 0,
> > >                         &src_oid, &src_oloc,
> > > @@ -2108,7 +2132,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >                         CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > >                         dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > >                         CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > -               if (err) {
> > > +               if (IS_ERR(req)) {
> > > +                       err = PTR_ERR(req);
> > >                         if (err == -EOPNOTSUPP) {
> > >                                 src_fsc->have_copy_from2 = false;
> > >                                 pr_notice("OSDs don't support 'copy-from2'; "
> > > @@ -2117,14 +2142,23 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >                         dout("ceph_osdc_copy_from returned %d\n", err);
> > >                         if (!ret)
> > >                                 ret = err;
> > > +                       /* wait for all queued requests */
> > > +                       wait_copy_from_reqs(&osd_reqs);
> > >                         goto out_caps;
> > >                 }
> > > +               list_add(&req->r_copy_item, &osd_reqs);
> > >                 len -= object_size;
> > >                 src_off += object_size;
> > >                 dst_off += object_size;
> > >                 ret += object_size;
> > >         }
> > >
> > > +       err = wait_copy_from_reqs(&osd_reqs);
> > > +       if (err) {
> > > +               if (!ret)
> > > +                       ret = err;
> > > +               goto out_caps;
> > > +       }
> > >         if (len)
> > >                 /* We still need one final local copy */
> > >                 do_final_copy = true;
> > > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > > index 7916a178d137..2b4a14bc6154 100644
> > > --- a/include/linux/ceph/osd_client.h
> > > +++ b/include/linux/ceph/osd_client.h
> > > @@ -210,6 +210,8 @@ struct ceph_osd_request {
> > >         u64 r_data_offset;                    /* ditto */
> > >         bool r_linger;                        /* don't resend on failure */
> > >
> > > +       struct list_head r_copy_item;         /* used for copy-from operations */
> > > +
> >
> > We have r_private_item for exactly this kind of ad-hoc lists, already
> > used in rbd and ceph_direct_read_write().
>
> Ah!  I've actually considered using it, but thought that using the same
> field for different purposes may be a source of confusion and bugs in the
> future.  But sure, it can be used.  I'll add a comment in the struct
> definition regarding this ad-hoc usage.

It's a means of passing some state through libceph, just like the r_priv
pointer.  The comment is already there; it says "for use by callbacks",
meaning that libceph won't touch these fields.
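
As a minimal sketch of that approach (not the actual code from patch
0002): the copy loop would queue each started request with
list_add_tail(&req->r_private_item, &osd_reqs), and a wait helper along
these lines would drain the list.  The helper name matches the diff
above, but its body here is only an assumption built on the existing
ceph_osdc_wait_request() and ceph_osdc_put_request() helpers:

/*
 * Sketch only -- assumes the requests on the list were already started
 * by ceph_osdc_copy_from_nowait() and are strung on r_private_item.
 */
static int wait_copy_from_reqs(struct list_head *osd_reqs)
{
	struct ceph_osd_request *req;
	int ret = 0, err;

	while (!list_empty(osd_reqs)) {
		req = list_first_entry(osd_reqs, struct ceph_osd_request,
				       r_private_item);
		list_del_init(&req->r_private_item);
		/* block until this copy-from2 request completes */
		err = ceph_osdc_wait_request(req->r_osdc, req);
		if (err && !ret)
			ret = err;
		ceph_osdc_put_request(req);
	}
	return ret;
}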

Thanks,

                Ilya


* Re: [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range
  2020-01-27 18:52   ` Luis Henriques
@ 2020-01-28 17:15     ` Gregory Farnum
  0 siblings, 0 replies; 12+ messages in thread
From: Gregory Farnum @ 2020-01-28 17:15 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Ilya Dryomov, Jeff Layton, Sage Weil, Yan, Zheng, Ceph Development, LKML

On Mon, Jan 27, 2020 at 7:52 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> On Mon, Jan 27, 2020 at 07:16:17PM +0100, Ilya Dryomov wrote:
> > On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@suse.com> wrote:
> > >
> > > Hi,
> > >
> > > As discussed here[1] I'm sending an RFC patchset that does the
> > > parallelization of the requests sent to the OSDs during a copy_file_range
> > > syscall in CephFS.
> > >
> > >   [1] https://lore.kernel.org/lkml/20200108100353.23770-1-lhenriques@suse.com/
> > >
> > > I've also some performance numbers that I wanted to share. Here's a
> > > description of the very simple tests I've run:
> > >
> > >  - create a file with 200 objects in it
> > >    * i.e. tests with different object sizes mean different file sizes
> > >  - drop all caches and umount the filesystem
> > >  - Measure:
> > >    * mount filesystem
> > >    * full file copy (with copy_file_range)
> > >    * umount filesystem
> > >
> > > Tests were repeated several times and the average value was used for
> > > comparison.
> > >
> > >   DISCLAIMER:
> > >   These numbers are only indicative, and different clusters and client
> > >   configs will for sure show different performance!  More rigorous tests
> > >   would be require to validate these results.
> > >
> > > Having as baseline a full read+write (basically, a copy_file_range
> > > operation within a filesystem mounted without the 'copyfrom' option),
> > > here's some values for different object sizes:
> > >
> > >                           8M      4M      1M      65k
> > > read+write              100%    100%    100%     100%
> > > sequential               51%     52%     83%    >100%
> > > parallel (throttle=1)    51%     52%     83%    >100%
> > > parallel (throttle=0)    17%     17%     83%    >100%
> > >
> > > Notes:
> > >
> > > - 'parallel (throttle=0)' was a test where *all* the requests (i.e. 200
> > >   requests to copy the 200 objects in the file) were sent to the OSDs and
> > >   the wait for requests completion is done at the end only.
> > >
> > > - 'parallel (throttle=1)' was just a control test, where the wait for
> > >   completion is done immediately after a request is sent.  It was expected
> > >   to be very similar to the non-optimized ('sequential') tests.
> > >
> > > - These tests were executed on a cluster with 40 OSDs, spread across 5
> > >   (bare-metal) nodes.
> > >
> > > - The tests with object size of 65k show that copy_file_range definitely
> > >   doesn't scale to files with small object sizes.  '> 100%' actually means
> > >   more than 10x slower.
> > >
> > > Measuring the mount+copy+umount masks the actual difference between
> > > different throttle values due to the time spent in mount+umount.  Thus,
> > > there was no real difference between throttle=0 (send all and wait) and
> > > throttle=20 (send 20, wait, send 20, ...).  But here's what I observed
> > > when measuring only the copy operation (4M object size):
> > >
> > > read+write              100%
> > > parallel (throttle=1)    56%
> > > parallel (throttle=5)    23%
> > > parallel (throttle=10)   14%
> > > parallel (throttle=20)    9%
> > > parallel (throttle=5)     5%
> >
> > Was this supposed to be throttle=50?
>
> Ups, no it should be throttle=0 (i.e. no throttle).
>
> > >
> > > Anyway, I'll still need to revisit patch 0003 as it doesn't follow the
> > > suggestion done by Jeff to *not* add another knob to fine-tune the
> > > throttle value -- this patch adds a kernel parameter for a knob that I
> > > wanted to use in my testing to observe different values of this throttle
> > > limit.
> > >
> > > The goal is to probably to drop this patch and do the throttling in patch
> > > 0002.  I just need to come up with a decent heuristic.  Jeff's suggestion
> > > was to use rsize/wsize, which are set to 64M by default IIRC.  Somehow I
> > > feel that it should be related to the number of OSDs in the cluster
> > > instead, but I'm not sure how.  And testing these sort of heuristics would
> > > require different clusters, which isn't particularly easy to get.  Anyway,
> > > comments are welcome!
> >
> > I agree with Jeff, this throttle is certainly not worth a module
> > parameter (or a mount option).  I would start with something like
> > C * (wsize / object size) and pick C between 1 and 4.
>
> Sure, I also agree with not adding the new parameter or mount option.
> It's just tricky to pick (and test!) the best formula to use.  From your
> proposal the throttle value would be by default between 16 and 64; those
> probably work fine in some situations (for example, in the cluster I used
> for running my tests).  But for a really big cluster, with hundreds of
> OSDs, it's difficult to say.

We don't really need a single client to be capable of spraying the
entire cluster in a single operation.  Since wsize is already an
effective control over how parallel a single write is allowed to be, I
think we're okay using it as the basis for copy_file_range as well,
without worrying about scaling it up.
-Greg
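
To put rough numbers on the wsize-based approach (the formula and the
64M default wsize are taken from the discussion above; the function
name and the clamp to a minimum of one request are assumptions for
illustration only, not code from the patches):

static u64 copy_from_throttle(u64 wsize, u64 object_size)
{
	u64 c = 1;	/* the constant factor suggested above, 1..4 */

	/* never drop below one in-flight request */
	return max_t(u64, 1, c * (wsize / object_size));
}

/*
 * With the default wsize of 64M and c = 1:
 *   8M objects ->  8 in-flight copy-from2 requests
 *   4M objects -> 16 in-flight requests
 *   1M objects -> 64 in-flight requests
 */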

>
> Anyway, I'll come up with a proposal for the next revision.  And thanks a
> lot for your feedback.
>
> Cheers,
> --
> Luís
>




Thread overview: 12+ messages
2020-01-27 16:43 [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Luis Henriques
2020-01-27 16:43 ` [RFC PATCH 1/3] libceph: add non-blocking version of ceph_osdc_copy_from() Luis Henriques
2020-01-27 17:47   ` Ilya Dryomov
2020-01-27 18:39     ` Luis Henriques
2020-01-27 16:43 ` [RFC PATCH 2/3] ceph: parallelize all copy-from requests in copy_file_range Luis Henriques
2020-01-27 17:58   ` Ilya Dryomov
2020-01-27 18:44     ` Luis Henriques
2020-01-28  8:39       ` Ilya Dryomov
2020-01-27 16:43 ` [RFC PATCH 3/3] ceph: add module param to throttle 'copy-from2' operations Luis Henriques
2020-01-27 18:16 ` [RFC PATCH 0/3] parallel 'copy-from' Ops in copy_file_range Ilya Dryomov
2020-01-27 18:52   ` Luis Henriques
2020-01-28 17:15     ` Gregory Farnum
