linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/1] parallel 'copy-from' Ops in copy_file_range
From: Luis Henriques @ 2020-01-29 18:20 UTC
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng, Gregory Farnum
  Cc: ceph-devel, linux-kernel, Luis Henriques

Hi,

Here's a re-spin of the patchset (now a single patch!) to improve
copy_file_range performance by parallelizing the OSD requests.

Changelog since v1 (the RFC):

- Dropped synchronous version of ceph_osdc_copy_from
  * This was the reason for merging the patchset into a single patch, as
    changing the ceph_osdc_copy_from() definition would break bisectability

- Moved wait_copy_from_reqs into libceph, renaming it to
  ceph_osdc_wait_requests

- Use ->r_private_item instead of adding a new list_head into struct
  ceph_osd_request

- Dropped the module parameter (used for testing) and added the throttling
  mechanism using the formula suggested by Ilya.

For reference, here's the original RFC cover letter (fixed typo in results
table: 'throttle=5' should be 'throttle=0').

--------------------------------------------------------------------------

As discussed here[1], I'm sending an RFC patchset that parallelizes the
requests sent to the OSDs during a copy_file_range syscall in CephFS.

  [1] https://lore.kernel.org/lkml/20200108100353.23770-1-lhenriques@suse.com/

I also have some performance numbers that I wanted to share.  Here's a
description of the very simple tests I've run:

 - create a file with 200 objects in it
   * i.e. tests with different object sizes mean different file sizes
 - drop all caches and umount the filesystem
 - Measure:
   * mount filesystem
   * full file copy (with copy_file_range)
   * umount filesystem

Tests were repeated several times and the average value was used for
comparison.
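
For reference, the measured copy step boils down to a single
copy_file_range(2) loop over the whole file.  A minimal sketch of such a
test program (illustrative only -- the actual harness isn't included in
this posting):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Copy 'len' bytes from fd_in to fd_out with copy_file_range(2);
   * this loop is the step whose wall-clock time was measured. */
  static int copy_whole_file(int fd_in, int fd_out, size_t len)
  {
          loff_t off_in = 0, off_out = 0;

          while (len > 0) {
                  ssize_t n = copy_file_range(fd_in, &off_in,
                                              fd_out, &off_out, len, 0);
                  if (n < 0) {
                          perror("copy_file_range");
                          return -1;
                  }
                  if (n == 0)     /* unexpected EOF */
                          break;
                  len -= n;
          }
          return len ? -1 : 0;
  }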

  DISCLAIMER:
  These numbers are only indicative, and different clusters and client
  configs will certainly show different performance!  More rigorous tests
  would be required to validate these results.

Using as a baseline a full read+write (basically, a copy_file_range
operation within a filesystem mounted without the 'copyfrom' option),
here are some values for different object sizes:

			  8M	  4M	  1M	  65k
read+write		100%	100%	100%	 100%
sequential		 51%	 52%	 83%	>100%
parallel (throttle=1)	 51%	 52%	 83%	>100%
parallel (throttle=0)	 17%	 17%	 83%	>100%

Notes:

- 'parallel (throttle=0)' was a test where *all* the requests (i.e. 200
  requests to copy the 200 objects in the file) were sent to the OSDs, and
  the wait for their completion was done only at the end.

- 'parallel (throttle=1)' was just a control test, where the wait for
  completion was done immediately after each request was sent.  It was
  expected to be very similar to the non-optimized ('sequential') tests.

- These tests were executed on a cluster with 40 OSDs, spread across 5
  (bare-metal) nodes.

- The tests with a 65k object size show that copy_file_range definitely
  doesn't scale to files with small object sizes: '> 100%' actually means
  more than 10x slower.

Measuring the mount+copy+umount masks the actual difference between
different throttle values due to the time spent in mount+umount.  Thus,
there was no real difference between throttle=0 (send all and wait) and
throttle=20 (send 20, wait, send 20, ...).  But here's what I observed
when measuring only the copy operation (4M object size):

read+write		100%
parallel (throttle=1)	 56%
parallel (throttle=5)	 23%
parallel (throttle=10)	 14%
parallel (throttle=20)	  9%
parallel (throttle=0)	  5%

Anyway, I'll still need to revisit patch 0003, as it doesn't follow Jeff's
suggestion to *not* add another knob for fine-tuning the throttle value --
this patch adds a kernel parameter for a knob that I wanted to use in my
testing to try out different values of the throttle limit.

The goal is probably to drop this patch and do the throttling in patch
0002.  I just need to come up with a decent heuristic.  Jeff's suggestion
was to use rsize/wsize, which are set to 64M by default IIRC.  Somehow I
feel that it should instead be related to the number of OSDs in the
cluster, but I'm not sure how.  And testing this sort of heuristic would
require different clusters, which aren't particularly easy to get.
Anyway, comments are welcome!
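
For concreteness, the two candidate heuristics would look roughly like
this (a sketch with illustrative variable names, not code from any of the
patches):

  /* wsize-based limit (the formula the v2 patch below adopts): with
   * the default 64M wsize and 4M objects this allows
   * (64M / 4M) * 4 = 64 in-flight 'copy-from' requests */
  ncopies = (wsize / object_size) * 4;

  /* OSD-count-based alternative (pure speculation -- the scaling
   * factor, and how to count usable OSDs, are exactly the open
   * questions) */
  ncopies = num_osds * scale_factor;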

Cheers,
--
Luis


Luis Henriques (1):
  ceph: parallelize all copy-from requests in copy_file_range

 fs/ceph/file.c                  | 34 ++++++++++++++++++++--
 include/linux/ceph/osd_client.h |  5 +++-
 net/ceph/osd_client.c           | 50 ++++++++++++++++++++++++---------
 3 files changed, 72 insertions(+), 17 deletions(-)



* [PATCH 1/1] ceph: parallelize all copy-from requests in copy_file_range
From: Luis Henriques @ 2020-01-29 18:20 UTC
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng, Gregory Farnum
  Cc: ceph-devel, linux-kernel, Luis Henriques

Right now the copy_file_range syscall serializes all the OSD 'copy-from'
operations, waiting for each request to complete before sending the next
one.  This patch modifies copy_file_range so that all the 'copy-from'
operations are sent in bulk, waiting for their completion only at the end.
This allows significant speed-ups, especially when sending requests to
different target OSDs.

There's also a throttling mechanism so that OSDs aren't flooded with
requests when a client performs a big file copy.  Currently, the
throttling mechanism simply waits for the in-flight requests when their
number reaches (wsize / object size) * 4.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 fs/ceph/file.c                  | 34 ++++++++++++++++++++--
 include/linux/ceph/osd_client.h |  5 +++-
 net/ceph/osd_client.c           | 50 ++++++++++++++++++++++++---------
 3 files changed, 72 insertions(+), 17 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 1e6cdf2dfe90..77a16324dcb4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1943,12 +1943,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
 	struct ceph_object_locator src_oloc, dst_oloc;
 	struct ceph_object_id src_oid, dst_oid;
+	struct ceph_osd_request *req;
 	loff_t endoff = 0, size;
 	ssize_t ret = -EIO;
 	u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
 	u32 src_objlen, dst_objlen, object_size;
-	int src_got = 0, dst_got = 0, err, dirty;
+	int src_got = 0, dst_got = 0, err, dirty, ncopies;
 	bool do_final_copy = false;
+	LIST_HEAD(osd_reqs);
 
 	if (src_inode->i_sb != dst_inode->i_sb) {
 		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
@@ -2083,6 +2085,12 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 			goto out_caps;
 	}
 	object_size = src_ci->i_layout.object_size;
+
+	/*
+	 * Throttle the object copies: ncopies holds the number of allowed
+	 * in-flight 'copy-from' requests before waiting for their completion
+	 */
+	ncopies = (src_fsc->mount_options->wsize / object_size) * 4;
 	while (len >= object_size) {
 		ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
 					      object_size, &src_objnum,
@@ -2097,7 +2105,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 		ceph_oid_printf(&dst_oid, "%llx.%08llx",
 				dst_ci->i_vino.ino, dst_objnum);
 		/* Do an object remote copy */
-		err = ceph_osdc_copy_from(
+		req = ceph_osdc_copy_from(
 			&src_fsc->client->osdc,
 			src_ci->i_vino.snap, 0,
 			&src_oid, &src_oloc,
@@ -2108,7 +2116,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 			CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
 			dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
 			CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
-		if (err) {
+		if (IS_ERR(req)) {
+			err = PTR_ERR(req);
 			if (err == -EOPNOTSUPP) {
 				src_fsc->have_copy_from2 = false;
 				pr_notice("OSDs don't support 'copy-from2'; "
@@ -2117,14 +2126,33 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 			dout("ceph_osdc_copy_from returned %d\n", err);
 			if (!ret)
 				ret = err;
+			/* wait for all queued requests */
+			ceph_osdc_wait_requests(&osd_reqs);
 			goto out_caps;
 		}
+		list_add(&req->r_private_item, &osd_reqs);
 		len -= object_size;
 		src_off += object_size;
 		dst_off += object_size;
 		ret += object_size;
+		if (--ncopies == 0) {
+			err = ceph_osdc_wait_requests(&osd_reqs);
+			if (err) {
+				if (!ret)
+					ret = err;
+				goto out_caps;
+			}
+			ncopies = (src_fsc->mount_options->wsize /
+				   object_size) * 4;
+		}
 	}
 
+	err = ceph_osdc_wait_requests(&osd_reqs);
+	if (err) {
+		if (!ret)
+			ret = err;
+		goto out_caps;
+	}
 	if (len)
 		/* We still need one final local copy */
 		do_final_copy = true;
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 5a62dbd3f4c2..25565dbfd65a 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -526,7 +526,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
 				struct timespec64 *mtime,
 				struct page **pages, int nr_pages);
 
-int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
+struct ceph_osd_request *ceph_osdc_copy_from(
+			struct ceph_osd_client *osdc,
 			u64 src_snapid, u64 src_version,
 			struct ceph_object_id *src_oid,
 			struct ceph_object_locator *src_oloc,
@@ -537,6 +538,8 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
 			u32 truncate_seq, u64 truncate_size,
 			u8 copy_from_flags);
 
+int ceph_osdc_wait_requests(struct list_head *osd_reqs);
+
 /* watch/notify */
 struct ceph_osd_linger_request *
 ceph_osdc_watch(struct ceph_osd_client *osdc,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b68b376d8c2f..c123e231eaf4 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -5346,23 +5346,47 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
 	return 0;
 }
 
-int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
-			u64 src_snapid, u64 src_version,
-			struct ceph_object_id *src_oid,
-			struct ceph_object_locator *src_oloc,
-			u32 src_fadvise_flags,
-			struct ceph_object_id *dst_oid,
-			struct ceph_object_locator *dst_oloc,
-			u32 dst_fadvise_flags,
-			u32 truncate_seq, u64 truncate_size,
-			u8 copy_from_flags)
+int ceph_osdc_wait_requests(struct list_head *osd_reqs)
+{
+	struct ceph_osd_request *req;
+	int ret = 0, err;
+
+	while (!list_empty(osd_reqs)) {
+		req = list_first_entry(osd_reqs,
+				       struct ceph_osd_request,
+				       r_private_item);
+		list_del_init(&req->r_private_item);
+		err = ceph_osdc_wait_request(req->r_osdc, req);
+		if (err) {
+			if (!ret)
+				ret = err;
+			dout("copy request failed (err=%d)\n", err);
+		}
+		ceph_osdc_put_request(req);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(ceph_osdc_wait_requests);
+
+struct ceph_osd_request *ceph_osdc_copy_from(
+		struct ceph_osd_client *osdc,
+		u64 src_snapid, u64 src_version,
+		struct ceph_object_id *src_oid,
+		struct ceph_object_locator *src_oloc,
+		u32 src_fadvise_flags,
+		struct ceph_object_id *dst_oid,
+		struct ceph_object_locator *dst_oloc,
+		u32 dst_fadvise_flags,
+		u32 truncate_seq, u64 truncate_size,
+		u8 copy_from_flags)
 {
 	struct ceph_osd_request *req;
 	int ret;
 
 	req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
 	if (!req)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	req->r_flags = CEPH_OSD_FLAG_WRITE;
 
@@ -5381,11 +5405,11 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
 		goto out;
 
 	ceph_osdc_start_request(osdc, req, false);
-	ret = ceph_osdc_wait_request(osdc, req);
+	return req;
 
 out:
 	ceph_osdc_put_request(req);
-	return ret;
+	return ERR_PTR(ret);
 }
 EXPORT_SYMBOL(ceph_osdc_copy_from);
 


* Re: [PATCH 1/1] ceph: parallelize all copy-from requests in copy_file_range
From: Ilya Dryomov @ 2020-01-30 14:15 UTC
  To: Luis Henriques
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Wed, Jan 29, 2020 at 7:20 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> Right now the copy_file_range syscall serializes all the OSD 'copy-from'
> operations, waiting for each request to complete before sending the next
> one.  This patch modifies copy_file_range so that all the 'copy-from'
> operations are sent in bulk, waiting for their completion only at the end.
> This allows significant speed-ups, especially when sending requests to
> different target OSDs.
>
> There's also a throttling mechanism so that OSDs aren't flooded with
> requests when a client performs a big file copy.  Currently, the
> throttling mechanism simply waits for the in-flight requests when their
> number reaches (wsize / object size) * 4.
>
> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> ---
>  fs/ceph/file.c                  | 34 ++++++++++++++++++++--
>  include/linux/ceph/osd_client.h |  5 +++-
>  net/ceph/osd_client.c           | 50 ++++++++++++++++++++++++---------
>  3 files changed, 72 insertions(+), 17 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 1e6cdf2dfe90..77a16324dcb4 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1943,12 +1943,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>         struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>         struct ceph_object_locator src_oloc, dst_oloc;
>         struct ceph_object_id src_oid, dst_oid;
> +       struct ceph_osd_request *req;
>         loff_t endoff = 0, size;
>         ssize_t ret = -EIO;
>         u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
>         u32 src_objlen, dst_objlen, object_size;
> -       int src_got = 0, dst_got = 0, err, dirty;
> +       int src_got = 0, dst_got = 0, err, dirty, ncopies;
>         bool do_final_copy = false;
> +       LIST_HEAD(osd_reqs);
>
>         if (src_inode->i_sb != dst_inode->i_sb) {
>                 struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> @@ -2083,6 +2085,12 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>                         goto out_caps;
>         }
>         object_size = src_ci->i_layout.object_size;
> +
> +       /*
> +        * Throttle the object copies: ncopies holds the number of allowed
> +        * in-flight 'copy-from' requests before waiting for their completion
> +        */
> +       ncopies = (src_fsc->mount_options->wsize / object_size) * 4;
>         while (len >= object_size) {
>                 ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
>                                               object_size, &src_objnum,
> @@ -2097,7 +2105,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>                 ceph_oid_printf(&dst_oid, "%llx.%08llx",
>                                 dst_ci->i_vino.ino, dst_objnum);
>                 /* Do an object remote copy */
> -               err = ceph_osdc_copy_from(
> +               req = ceph_osdc_copy_from(
>                         &src_fsc->client->osdc,
>                         src_ci->i_vino.snap, 0,
>                         &src_oid, &src_oloc,
> @@ -2108,7 +2116,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>                         CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
>                         dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
>                         CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> -               if (err) {
> +               if (IS_ERR(req)) {
> +                       err = PTR_ERR(req);
>                         if (err == -EOPNOTSUPP) {

No point in checking for EOPNOTSUPP here, because ceph_osdc_copy_from()
won't ever return that.  This loop needs more massaging and more testing
on old OSDs...

>                                 src_fsc->have_copy_from2 = false;
>                                 pr_notice("OSDs don't support 'copy-from2'; "
> @@ -2117,14 +2126,33 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>                         dout("ceph_osdc_copy_from returned %d\n", err);
>                         if (!ret)
>                                 ret = err;
> +                       /* wait for all queued requests */
> +                       ceph_osdc_wait_requests(&osd_reqs);
>                         goto out_caps;
>                 }
> +               list_add(&req->r_private_item, &osd_reqs);
>                 len -= object_size;
>                 src_off += object_size;
>                 dst_off += object_size;
>                 ret += object_size;

So ret is incremented here, but you have numerous tests where ret is
assigned an error only if ret is 0.  Unless I'm missing something, this
interferes with returning errors from __ceph_copy_file_range().

> +               if (--ncopies == 0) {
> +                       err = ceph_osdc_wait_requests(&osd_reqs);
> +                       if (err) {
> +                               if (!ret)
> +                                       ret = err;
> +                               goto out_caps;
> +                       }
> +                       ncopies = (src_fsc->mount_options->wsize /
> +                                  object_size) * 4;

The object size is constant within a file, so ncopies should be too.
Perhaps introduce a counter instead of recalculating ncopies here?

> +               }
>         }
>
> +       err = ceph_osdc_wait_requests(&osd_reqs);
> +       if (err) {
> +               if (!ret)
> +                       ret = err;
> +               goto out_caps;
> +       }
>         if (len)
>                 /* We still need one final local copy */
>                 do_final_copy = true;
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 5a62dbd3f4c2..25565dbfd65a 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -526,7 +526,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
>                                 struct timespec64 *mtime,
>                                 struct page **pages, int nr_pages);
>
> -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> +struct ceph_osd_request *ceph_osdc_copy_from(
> +                       struct ceph_osd_client *osdc,
>                         u64 src_snapid, u64 src_version,
>                         struct ceph_object_id *src_oid,
>                         struct ceph_object_locator *src_oloc,
> @@ -537,6 +538,8 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
>                         u32 truncate_seq, u64 truncate_size,
>                         u8 copy_from_flags);
>
> +int ceph_osdc_wait_requests(struct list_head *osd_reqs);
> +
>  /* watch/notify */
>  struct ceph_osd_linger_request *
>  ceph_osdc_watch(struct ceph_osd_client *osdc,
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index b68b376d8c2f..c123e231eaf4 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -5346,23 +5346,47 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
>         return 0;
>  }
>
> -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> -                       u64 src_snapid, u64 src_version,
> -                       struct ceph_object_id *src_oid,
> -                       struct ceph_object_locator *src_oloc,
> -                       u32 src_fadvise_flags,
> -                       struct ceph_object_id *dst_oid,
> -                       struct ceph_object_locator *dst_oloc,
> -                       u32 dst_fadvise_flags,
> -                       u32 truncate_seq, u64 truncate_size,
> -                       u8 copy_from_flags)
> +int ceph_osdc_wait_requests(struct list_head *osd_reqs)
> +{
> +       struct ceph_osd_request *req;
> +       int ret = 0, err;
> +
> +       while (!list_empty(osd_reqs)) {
> +               req = list_first_entry(osd_reqs,
> +                                      struct ceph_osd_request,
> +                                      r_private_item);
> +               list_del_init(&req->r_private_item);
> +               err = ceph_osdc_wait_request(req->r_osdc, req);
> +               if (err) {
> +                       if (!ret)
> +                               ret = err;
> +                       dout("copy request failed (err=%d)\n", err);

This dout needs updating, but I'd just remove it.  The error code is
there in other messages.

> +               }
> +               ceph_osdc_put_request(req);
> +       }
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL(ceph_osdc_wait_requests);

Move this function after ceph_osdc_wait_request(), so that they are
close to each other (and osd_req_op_copy_from_init() isn't separated
from ceph_osdc_copy_from() by something unrelated).

Thanks,

                Ilya


* Re: [PATCH 1/1] ceph: parallelize all copy-from requests in copy_file_range
From: Luis Henriques @ 2020-01-30 15:37 UTC
  To: Ilya Dryomov
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Thu, Jan 30, 2020 at 03:15:52PM +0100, Ilya Dryomov wrote:
> On Wed, Jan 29, 2020 at 7:20 PM Luis Henriques <lhenriques@suse.com> wrote:
> >
> > Right now the copy_file_range syscall serializes all the OSD 'copy-from'
> > operations, waiting for each request to complete before sending the next
> > one.  This patch modifies copy_file_range so that all the 'copy-from'
> > operations are sent in bulk, waiting for their completion only at the end.
> > This allows significant speed-ups, especially when sending requests to
> > different target OSDs.
> >
> > There's also a throttling mechanism so that OSDs aren't flooded with
> > requests when a client performs a big file copy.  Currently, the
> > throttling mechanism simply waits for the in-flight requests when their
> > number reaches (wsize / object size) * 4.
> >
> > Signed-off-by: Luis Henriques <lhenriques@suse.com>
> > ---
> >  fs/ceph/file.c                  | 34 ++++++++++++++++++++--
> >  include/linux/ceph/osd_client.h |  5 +++-
> >  net/ceph/osd_client.c           | 50 ++++++++++++++++++++++++---------
> >  3 files changed, 72 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 1e6cdf2dfe90..77a16324dcb4 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1943,12 +1943,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >         struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> >         struct ceph_object_locator src_oloc, dst_oloc;
> >         struct ceph_object_id src_oid, dst_oid;
> > +       struct ceph_osd_request *req;
> >         loff_t endoff = 0, size;
> >         ssize_t ret = -EIO;
> >         u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> >         u32 src_objlen, dst_objlen, object_size;
> > -       int src_got = 0, dst_got = 0, err, dirty;
> > +       int src_got = 0, dst_got = 0, err, dirty, ncopies;
> >         bool do_final_copy = false;
> > +       LIST_HEAD(osd_reqs);
> >
> >         if (src_inode->i_sb != dst_inode->i_sb) {
> >                 struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > @@ -2083,6 +2085,12 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >                         goto out_caps;
> >         }
> >         object_size = src_ci->i_layout.object_size;
> > +
> > +       /*
> > +        * Throttle the object copies: ncopies holds the number of allowed
> > +        * in-flight 'copy-from' requests before waiting for their completion
> > +        */
> > +       ncopies = (src_fsc->mount_options->wsize / object_size) * 4;
> >         while (len >= object_size) {
> >                 ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> >                                               object_size, &src_objnum,
> > @@ -2097,7 +2105,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >                 ceph_oid_printf(&dst_oid, "%llx.%08llx",
> >                                 dst_ci->i_vino.ino, dst_objnum);
> >                 /* Do an object remote copy */
> > -               err = ceph_osdc_copy_from(
> > +               req = ceph_osdc_copy_from(
> >                         &src_fsc->client->osdc,
> >                         src_ci->i_vino.snap, 0,
> >                         &src_oid, &src_oloc,
> > @@ -2108,7 +2116,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >                         CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> >                         dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> >                         CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > -               if (err) {
> > +               if (IS_ERR(req)) {
> > +                       err = PTR_ERR(req);
> >                         if (err == -EOPNOTSUPP) {
> 
> No point in checking for EOPNOTSUPP here, because ceph_osdc_copy_from()
> won't ever return that.  This loop needs more massaging and more testing
> on old OSDs...

Right, I missed that.  Setting src_fsc->have_copy_from2 to false should be
moved into the two 'if (err)' statements following the calls to
ceph_osdc_wait_requests().  I'll go fix that, and test it on a cluster
with OSDs that don't have this copy-from2 operation.
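
Something along these lines, presumably (a rough sketch of the fix
described above, not the final code):

  /* -EOPNOTSUPP now surfaces on request completion, so check for it
   * after waiting instead of after ceph_osdc_copy_from() */
  err = ceph_osdc_wait_requests(&osd_reqs);
  if (err) {
          if (err == -EOPNOTSUPP) {
                  src_fsc->have_copy_from2 = false;
                  pr_notice("OSDs don't support 'copy-from2'; "
                            "disabling copy_file_range\n");
          }
          if (!ret)
                  ret = err;
          goto out_caps;
  }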

> >                                 src_fsc->have_copy_from2 = false;
> >                                 pr_notice("OSDs don't support 'copy-from2'; "
> > @@ -2117,14 +2126,33 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >                         dout("ceph_osdc_copy_from returned %d\n", err);
> >                         if (!ret)
> >                                 ret = err;
> > +                       /* wait for all queued requests */
> > +                       ceph_osdc_wait_requests(&osd_reqs);
> >                         goto out_caps;
> >                 }
> > +               list_add(&req->r_private_item, &osd_reqs);
> >                 len -= object_size;
> >                 src_off += object_size;
> >                 dst_off += object_size;
> >                 ret += object_size;
> 
> So ret is incremented here, but you have numerious tests where ret is
> assigned an error only if ret is 0.  Unless I'm missing something, this
> interferes with returning errors from __ceph_copy_file_range().

Well, the problem is that an error may occur *after* we have already done
some copies.  In that case we need to return the number of bytes that have
been successfully copied instead of an error; eventually, subsequent calls
to complete the copy_file_range will then return the error.  At least this
is how I understood the man page (i.e. similar to the write(2) syscall).
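
I.e., roughly this convention (a sketch mirroring what the patch already
does at each wait point):

  /* write(2)-like semantics: if some objects were already copied,
   * report that byte count; only return the error when nothing was
   * copied, and let a subsequent copy_file_range() call hit it */
  if (err) {
          if (!ret)
                  ret = err;
          goto out_caps;
  }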

> > +               if (--ncopies == 0) {
> > +                       err = ceph_osdc_wait_requests(&osd_reqs);
> > +                       if (err) {
> > +                               if (!ret)
> > +                                       ret = err;
> > +                               goto out_caps;
> > +                       }
> > +                       ncopies = (src_fsc->mount_options->wsize /
> > +                                  object_size) * 4;
> 
> The object size is constant within a file, so ncopies should be too.
> Perhaps introduce a counter instead of recalculating ncopies here?

Not sure I understood your comment.  You would rather have:

 * ncopies initialized only once outside the loop
 * have a counter counting the number of objects copied
 * call ceph_osdc_wait_requests() when this counter is a multiple of
   ncopies

Is that it?

Cheers,
--
Luís

> > +               }
> >         }
> >
> > +       err = ceph_osdc_wait_requests(&osd_reqs);
> > +       if (err) {
> > +               if (!ret)
> > +                       ret = err;
> > +               goto out_caps;
> > +       }
> >         if (len)
> >                 /* We still need one final local copy */
> >                 do_final_copy = true;
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index 5a62dbd3f4c2..25565dbfd65a 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -526,7 +526,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
> >                                 struct timespec64 *mtime,
> >                                 struct page **pages, int nr_pages);
> >
> > -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> > +struct ceph_osd_request *ceph_osdc_copy_from(
> > +                       struct ceph_osd_client *osdc,
> >                         u64 src_snapid, u64 src_version,
> >                         struct ceph_object_id *src_oid,
> >                         struct ceph_object_locator *src_oloc,
> > @@ -537,6 +538,8 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> >                         u32 truncate_seq, u64 truncate_size,
> >                         u8 copy_from_flags);
> >
> > +int ceph_osdc_wait_requests(struct list_head *osd_reqs);
> > +
> >  /* watch/notify */
> >  struct ceph_osd_linger_request *
> >  ceph_osdc_watch(struct ceph_osd_client *osdc,
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index b68b376d8c2f..c123e231eaf4 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -5346,23 +5346,47 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
> >         return 0;
> >  }
> >
> > -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> > -                       u64 src_snapid, u64 src_version,
> > -                       struct ceph_object_id *src_oid,
> > -                       struct ceph_object_locator *src_oloc,
> > -                       u32 src_fadvise_flags,
> > -                       struct ceph_object_id *dst_oid,
> > -                       struct ceph_object_locator *dst_oloc,
> > -                       u32 dst_fadvise_flags,
> > -                       u32 truncate_seq, u64 truncate_size,
> > -                       u8 copy_from_flags)
> > +int ceph_osdc_wait_requests(struct list_head *osd_reqs)
> > +{
> > +       struct ceph_osd_request *req;
> > +       int ret = 0, err;
> > +
> > +       while (!list_empty(osd_reqs)) {
> > +               req = list_first_entry(osd_reqs,
> > +                                      struct ceph_osd_request,
> > +                                      r_private_item);
> > +               list_del_init(&req->r_private_item);
> > +               err = ceph_osdc_wait_request(req->r_osdc, req);
> > +               if (err) {
> > +                       if (!ret)
> > +                               ret = err;
> > +                       dout("copy request failed (err=%d)\n", err);
> 
> This dout needs updating, but I'd just remove it.  The error code is
> there in other messages.
> 
> > +               }
> > +               ceph_osdc_put_request(req);
> > +       }
> > +
> > +       return ret;
> > +}
> > +EXPORT_SYMBOL(ceph_osdc_wait_requests);
> 
> Move this function after ceph_osdc_wait_request(), so that they are
> close to each other (and osd_req_op_copy_from_init() isn't separated
> from ceph_osdc_copy_from() by something unrelated).
> 
> Thanks,
> 
>                 Ilya


* Re: [PATCH 1/1] ceph: parallelize all copy-from requests in copy_file_range
From: Ilya Dryomov @ 2020-01-30 15:59 UTC
  To: Luis Henriques
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Thu, Jan 30, 2020 at 4:37 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> On Thu, Jan 30, 2020 at 03:15:52PM +0100, Ilya Dryomov wrote:
> > On Wed, Jan 29, 2020 at 7:20 PM Luis Henriques <lhenriques@suse.com> wrote:
> > >
> > > Right now the copy_file_range syscall serializes all the OSD 'copy-from'
> > > operations, waiting for each request to complete before sending the next
> > > one.  This patch modifies copy_file_range so that all the 'copy-from'
> > > operations are sent in bulk, waiting for their completion only at the end.
> > > This allows significant speed-ups, especially when sending requests to
> > > different target OSDs.
> > >
> > > There's also a throttling mechanism so that OSDs aren't flooded with
> > > requests when a client performs a big file copy.  Currently, the
> > > throttling mechanism simply waits for the in-flight requests when their
> > > number reaches (wsize / object size) * 4.
> > >
> > > Signed-off-by: Luis Henriques <lhenriques@suse.com>
> > > ---
> > >  fs/ceph/file.c                  | 34 ++++++++++++++++++++--
> > >  include/linux/ceph/osd_client.h |  5 +++-
> > >  net/ceph/osd_client.c           | 50 ++++++++++++++++++++++++---------
> > >  3 files changed, 72 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 1e6cdf2dfe90..77a16324dcb4 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -1943,12 +1943,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >         struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > >         struct ceph_object_locator src_oloc, dst_oloc;
> > >         struct ceph_object_id src_oid, dst_oid;
> > > +       struct ceph_osd_request *req;
> > >         loff_t endoff = 0, size;
> > >         ssize_t ret = -EIO;
> > >         u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> > >         u32 src_objlen, dst_objlen, object_size;
> > > -       int src_got = 0, dst_got = 0, err, dirty;
> > > +       int src_got = 0, dst_got = 0, err, dirty, ncopies;
> > >         bool do_final_copy = false;
> > > +       LIST_HEAD(osd_reqs);
> > >
> > >         if (src_inode->i_sb != dst_inode->i_sb) {
> > >                 struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > @@ -2083,6 +2085,12 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >                         goto out_caps;
> > >         }
> > >         object_size = src_ci->i_layout.object_size;
> > > +
> > > +       /*
> > > +        * Throttle the object copies: ncopies holds the number of allowed
> > > +        * in-flight 'copy-from' requests before waiting for their completion
> > > +        */
> > > +       ncopies = (src_fsc->mount_options->wsize / object_size) * 4;
> > >         while (len >= object_size) {
> > >                 ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> > >                                               object_size, &src_objnum,
> > > @@ -2097,7 +2105,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >                 ceph_oid_printf(&dst_oid, "%llx.%08llx",
> > >                                 dst_ci->i_vino.ino, dst_objnum);
> > >                 /* Do an object remote copy */
> > > -               err = ceph_osdc_copy_from(
> > > +               req = ceph_osdc_copy_from(
> > >                         &src_fsc->client->osdc,
> > >                         src_ci->i_vino.snap, 0,
> > >                         &src_oid, &src_oloc,
> > > @@ -2108,7 +2116,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >                         CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > >                         dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > >                         CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > -               if (err) {
> > > +               if (IS_ERR(req)) {
> > > +                       err = PTR_ERR(req);
> > >                         if (err == -EOPNOTSUPP) {
> >
> > No point in checking for EOPNOTSUPP here, because ceph_osdc_copy_from()
> > won't ever return that.  This loop needs more massaging and more testing
> > on old OSDs...
>
> Right, I missed that.  Setting src_fsc->have_copy_from2 to false should be
> moved into the two 'if (err)' statements following the calls to
> ceph_osdc_wait_requests().  I'll go fix that, and test it on a cluster
> with OSDs that don't have this copy-from2 operation.
>
> > >                                 src_fsc->have_copy_from2 = false;
> > >                                 pr_notice("OSDs don't support 'copy-from2'; "
> > > @@ -2117,14 +2126,33 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >                         dout("ceph_osdc_copy_from returned %d\n", err);
> > >                         if (!ret)
> > >                                 ret = err;
> > > +                       /* wait for all queued requests */
> > > +                       ceph_osdc_wait_requests(&osd_reqs);
> > >                         goto out_caps;
> > >                 }
> > > +               list_add(&req->r_private_item, &osd_reqs);
> > >                 len -= object_size;
> > >                 src_off += object_size;
> > >                 dst_off += object_size;
> > >                 ret += object_size;
> >
> > So ret is incremented here, but you have numerous tests where ret is
> > assigned an error only if ret is 0.  Unless I'm missing something, this
> > interferes with returning errors from __ceph_copy_file_range().
>
> Well, the problem is that an error may occur *after* we have already done
> some copies.  In that case we need to return the number of bytes that have
> been successfully copied instead of an error; eventually, subsequent calls
> to complete the copy_file_range will then return the error.  At least this
> is how I understood the man page (i.e. similar to the write(2) syscall).

AFAICS ret is incremented before you know that *any* of the copies were
successful.  If the first copy fails, how do you report that error?

>
> > > +               if (--ncopies == 0) {
> > > +                       err = ceph_osdc_wait_requests(&osd_reqs);
> > > +                       if (err) {
> > > +                               if (!ret)
> > > +                                       ret = err;
> > > +                               goto out_caps;
> > > +                       }
> > > +                       ncopies = (src_fsc->mount_options->wsize /
> > > +                                  object_size) * 4;
> >
> > The object size is constant within a file, so ncopies should be too.
> > Perhaps introduce a counter instead of recalculating ncopies here?
>
> Not sure I understood your comment.  You would rather have:
>
>  * ncopies initialized only once outside the loop
>  * have a counter counting the number of objects copied
>  * call ceph_osdc_wait_requests() when this counter is a multiple of
>    ncopies

I was thinking of a counter that is initialized to ncopies and reset to
ncopies any time it reaches 0.  This is just a nit though.
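
E.g., roughly (sketch only; 'copies_left' is an illustrative name):

  /* compute the limit once; reset a counter instead of recomputing */
  ncopies = (src_fsc->mount_options->wsize / object_size) * 4;
  copies_left = ncopies;
  while (len >= object_size) {
          /* ... build and send one 'copy-from' request, queue it on
           * osd_reqs ... */
          if (--copies_left == 0) {
                  err = ceph_osdc_wait_requests(&osd_reqs);
                  if (err)
                          break;
                  copies_left = ncopies;
          }
  }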

Thanks,

                Ilya


* Re: [PATCH 1/1] ceph: parallelize all copy-from requests in copy_file_range
From: Luis Henriques @ 2020-01-30 16:31 UTC
  To: Ilya Dryomov
  Cc: Jeff Layton, Sage Weil, Yan, Zheng, Gregory Farnum,
	Ceph Development, LKML

On Thu, Jan 30, 2020 at 04:59:59PM +0100, Ilya Dryomov wrote:
...
> > > > @@ -2117,14 +2126,33 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > >                         dout("ceph_osdc_copy_from returned %d\n", err);
> > > >                         if (!ret)
> > > >                                 ret = err;
> > > > +                       /* wait for all queued requests */
> > > > +                       ceph_osdc_wait_requests(&osd_reqs);
> > > >                         goto out_caps;
> > > >                 }
> > > > +               list_add(&req->r_private_item, &osd_reqs);
> > > >                 len -= object_size;
> > > >                 src_off += object_size;
> > > >                 dst_off += object_size;
> > > >                 ret += object_size;
> > >
> > > So ret is incremented here, but you have numerous tests where ret is
> > > assigned an error only if ret is 0.  Unless I'm missing something, this
> > > interferes with returning errors from __ceph_copy_file_range().
> >
> > Well, the problem is that an error may occur *after* we have already done
> > some copies.  In that case we need to return the number of bytes that have
> > been successfully copied instead of an error; eventually, subsequent calls
> > to complete the copy_file_range will then return the error.  At least this
> > is how I understood the man page (i.e. similar to the write(2) syscall).
> 
> AFAICS ret is incremented before you know that *any* of the copies were
> successful.  If the first copy fails, how do you report that error?

Ah, got it.  So, a solution would be to have the ceph_osdc_wait_requests
interface changed so that we can get the number of successful requests.
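
Something like this, perhaps (an illustrative sketch only -- the exact
signature and semantics would still need to be worked out):

  /* hypothetical variant: report how many requests succeeded via
   * *nr_ok, so the caller can compute the bytes actually copied */
  int ceph_osdc_wait_requests(struct list_head *osd_reqs, int *nr_ok)
  {
          struct ceph_osd_request *req;
          int ret = 0, err;

          *nr_ok = 0;
          while (!list_empty(osd_reqs)) {
                  req = list_first_entry(osd_reqs,
                                         struct ceph_osd_request,
                                         r_private_item);
                  list_del_init(&req->r_private_item);
                  err = ceph_osdc_wait_request(req->r_osdc, req);
                  if (!err)
                          (*nr_ok)++;
                  else if (!ret)
                          ret = err;
                  ceph_osdc_put_request(req);
          }
          return ret;
  }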

/me takes a deep breath and goes look at the *whole* thing to prevent
these mistakes.  Thanks a lot for your review, Ilya.

> 
> >
> > > > +               if (--ncopies == 0) {
> > > > +                       err = ceph_osdc_wait_requests(&osd_reqs);
> > > > +                       if (err) {
> > > > +                               if (!ret)
> > > > +                                       ret = err;
> > > > +                               goto out_caps;
> > > > +                       }
> > > > +                       ncopies = (src_fsc->mount_options->wsize /
> > > > +                                  object_size) * 4;
> > >
> > > The object size is constant within a file, so ncopies should be too.
> > > Perhaps introduce a counter instead of recalculating ncopies here?
> >
> > Not sure I understood your comment.  You would rather have:
> >
> >  * ncopies initialized only once outside the loop
> >  * have a counter counting the number of objects copied
> >  * call ceph_osdc_wait_requests() when this counter is a multiple of
> >    ncopies
> 
> I was thinking of a counter that is initialized to ncopies and reset to
> ncopies any time it reaches 0.  This is just a nit though.

Sure, no problem.

Cheers,
--
Luís

