linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster
       [not found] ` <30b09cb015563913d073c488c8de8ba0cceedd7b.camel@kernel.org>
@ 2019-09-06 16:26   ` Luis Henriques
  2019-09-07 13:53     ` Jeff Layton
  0 siblings, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2019-09-06 16:26 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, linux-kernel

"Jeff Layton" <jlayton@kernel.org> writes:

> On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
>> OSDs are able to perform object copies across different pools.  Thus,
>> there's no need to prevent copy_file_range from doing remote copies if the
>> source and destination superblocks are different.  Only return -EXDEV if
>> they have different fsid (the cluster ID).
>> 
>> Signed-off-by: Luis Henriques <lhenriques@suse.com>
>> ---
>>  fs/ceph/file.c | 23 +++++++++++++++++++----
>>  1 file changed, 19 insertions(+), 4 deletions(-)
>> 
>> Hi!
>> 
>> I've finally managed to run some tests using multiple filesystems, both
>> within a single cluster and also using two different clusters.  The
>> behaviour of copy_file_range (with this patch, of course) was what I
>> expected:
>> 
>>   - Object copies work fine across different filesystems within the same
>>     cluster (even with pools in different PGs);
>>   - -EXDEV is returned if the fsid is different
>> 
>> (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
>>  Because this is actually what's in ceph.conf fsid in "[global]"
>>  section.  Anyway...)
>> 
>> So, what's missing right now is (I always mention this when I have the
>> opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
>> And add the corresponding support for the new flag to the kernel
>> client, of course.
>> 
>> Cheers,
>> --
>> Luis
>> 
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 685a03cc4b77..88d116893c2b 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>>  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>>  	struct ceph_cap_flush *prealloc_cf;
>> +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>>  	struct ceph_object_locator src_oloc, dst_oloc;
>>  	struct ceph_object_id src_oid, dst_oid;
>>  	loff_t endoff = 0, size;
>> @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  
>>  	if (src_inode == dst_inode)
>>  		return -EINVAL;
>> -	if (src_inode->i_sb != dst_inode->i_sb)
>> -		return -EXDEV;
>> +	if (src_inode->i_sb != dst_inode->i_sb) {
>> +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> +
>> +		if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
>> +			dout("No fsid in a fs client\n");
>> +			return -EXDEV;
>> +		}
>
> In what situation is there no fsid? Old cluster version?
>
> If there is no fsid, can we take that to indicate that there is only a
> single filesystem possible in the cluster and that we should attempt the
> copy anyway?

TBH I'm not sure if 'have_fsid' can ever be 'false' in this call.  It is
set to 'true' when handling the monmap, and it's never changed back to
'false'.  Since I don't think copy_file_range will be invoked *before*
we get the monmap, it should be safe to drop this check.  Maybe it could
be replaced by a WARN_ON()?
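
A rough userspace illustration of the WARN_ON() idea, not kernel code: sketch_warn_on() is a made-up stand-in that only prints a line (the real macro also dumps a stack trace), and struct sketch_client is a hypothetical, trimmed-down client holding nothing but the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's WARN_ON(): report the condition
 * and hand its truth value back to the caller. */
static bool sketch_warn_on(bool cond, const char *what)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", what);
	return cond;
}

/* Hypothetical client: only the flag under discussion. */
struct sketch_client {
	bool have_fsid;
};

/* Returns true in the expected case (both clients have seen a monmap);
 * warns and returns false otherwise. */
static bool sketch_both_have_fsid(const struct sketch_client *src,
				  const struct sketch_client *dst)
{
	assert(src && dst);
	return !sketch_warn_on(!src->have_fsid || !dst->have_fsid,
			       "fs client without fsid");
}
```

The point of the pattern is that a state believed to be impossible is reported loudly instead of being handled as an ordinary error path.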

Cheers,
-- 
Luis

>
>> +		if (ceph_fsid_compare(&src_fsc->client->fsid,
>> +				      &dst_fsc->client->fsid)) {
>> +			dout("Copying object across different clusters:");
>> +			dout("  src fsid: %*ph\n  dst fsid: %*ph\n",
>> +			     16, &src_fsc->client->fsid,
>> +			     16, &dst_fsc->client->fsid);
>> +			return -EXDEV;
>> +		}
>> +	}
>>  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>>  		return -EROFS;
>>  
>> @@ -1928,7 +1943,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  	 * efficient).
>>  	 */
>>  
>> -	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> +	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>>  		return -EOPNOTSUPP;
>>  
>>  	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> @@ -2044,7 +2059,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  				dst_ci->i_vino.ino, dst_objnum);
>>  		/* Do an object remote copy */
>>  		err = ceph_osdc_copy_from(
>> -			&ceph_inode_to_client(src_inode)->client->osdc,
>> +			&src_fsc->client->osdc,
>>  			src_ci->i_vino.snap, 0,
>>  			&src_oid, &src_oloc,
>>  			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


* Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster
  2019-09-06 16:26   ` [PATCH] ceph: allow object copies across different filesystems in the same cluster Luis Henriques
@ 2019-09-07 13:53     ` Jeff Layton
  2019-09-09 10:18       ` Luis Henriques
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2019-09-07 13:53 UTC (permalink / raw)
  To: Luis Henriques; +Cc: ceph-devel, linux-kernel

On Fri, 2019-09-06 at 17:26 +0100, Luis Henriques wrote:
> "Jeff Layton" <jlayton@kernel.org> writes:
> 
> > On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
> > > OSDs are able to perform object copies across different pools.  Thus,
> > > there's no need to prevent copy_file_range from doing remote copies if the
> > > source and destination superblocks are different.  Only return -EXDEV if
> > > they have different fsid (the cluster ID).
> > > 
> > > Signed-off-by: Luis Henriques <lhenriques@suse.com>
> > > ---
> > >  fs/ceph/file.c | 23 +++++++++++++++++++----
> > >  1 file changed, 19 insertions(+), 4 deletions(-)
> > > 
> > > Hi!
> > > 
> > > I've finally managed to run some tests using multiple filesystems, both
> > > within a single cluster and also using two different clusters.  The
> > > behaviour of copy_file_range (with this patch, of course) was what I
> > > expected:
> > > 
> > >   - Object copies work fine across different filesystems within the same
> > >     cluster (even with pools in different PGs);
> > >   - -EXDEV is returned if the fsid is different
> > > 
> > > (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
> > >  Because this is actually what's in ceph.conf fsid in "[global]"
> > >  section.  Anyway...)
> > > 
> > > So, what's missing right now is (I always mention this when I have the
> > > opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
> > > And add the corresponding support for the new flag to the kernel
> > > client, of course.
> > > 
> > > Cheers,
> > > --
> > > Luis
> > > 
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 685a03cc4b77..88d116893c2b 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
> > >  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
> > >  	struct ceph_cap_flush *prealloc_cf;
> > > +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > >  	struct ceph_object_locator src_oloc, dst_oloc;
> > >  	struct ceph_object_id src_oid, dst_oid;
> > >  	loff_t endoff = 0, size;
> > > @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >  
> > >  	if (src_inode == dst_inode)
> > >  		return -EINVAL;
> > > -	if (src_inode->i_sb != dst_inode->i_sb)
> > > -		return -EXDEV;
> > > +	if (src_inode->i_sb != dst_inode->i_sb) {
> > > +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > +
> > > +		if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
> > > +			dout("No fsid in a fs client\n");
> > > +			return -EXDEV;
> > > +		}
> > 
> > In what situation is there no fsid? Old cluster version?
> > 
> > If there is no fsid, can we take that to indicate that there is only a
> > single filesystem possible in the cluster and that we should attempt the
> > copy anyway?
> 
> TBH I'm not sure if 'have_fsid' can ever be 'false' in this call.  It is
> set to 'true' when handling the monmap, and it's never changed back to
> 'false'.  Since I don't think copy_file_range will be invoked *before*
> we get the monmap, it should be safe to drop this check.  Maybe it could
> be replaced by a WARN_ON()?
> 

Yeah. I think the have_fsid flag just allows us to avoid the pr_err msg
in ceph_check_fsid when the client is initially created. Maybe there is
some better way to achieve that?

In any case, I'd just drop that condition here.
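
A simplified userspace model of what such a flag buys. This is a guess at the shape of the logic, not a copy of ceph_check_fsid: the sketch_* names are invented, and setting the flag inside the check is a simplification (per the discussion above, the real flag is set when the monmap is handled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct sketch_fsid {
	unsigned char fsid[16];
};

struct sketch_client {
	struct sketch_fsid fsid;
	bool have_fsid;
};

/* Returns 0 if the fsid is accepted, -1 on a mismatch. */
static int sketch_check_fsid(struct sketch_client *client,
			     const struct sketch_fsid *fsid)
{
	assert(client && fsid);
	if (client->have_fsid) {
		if (memcmp(&client->fsid, fsid, sizeof(*fsid)) != 0) {
			fprintf(stderr, "bad fsid\n");
			return -1;
		}
	} else {
		/* First fsid seen: there is nothing to compare against,
		 * so no error is printed; just remember it. */
		client->fsid = *fsid;
		client->have_fsid = true;
	}
	return 0;
}
```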
-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster
  2019-09-07 13:53     ` Jeff Layton
@ 2019-09-09 10:18       ` Luis Henriques
  2019-09-09 10:28         ` [PATCH v2] " Luis Henriques
  0 siblings, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2019-09-09 10:18 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, linux-kernel

"Jeff Layton" <jlayton@kernel.org> writes:

> On Fri, 2019-09-06 at 17:26 +0100, Luis Henriques wrote:
>> "Jeff Layton" <jlayton@kernel.org> writes:
>> 
>> > On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
>> > > OSDs are able to perform object copies across different pools.  Thus,
>> > > there's no need to prevent copy_file_range from doing remote copies if the
>> > > source and destination superblocks are different.  Only return -EXDEV if
>> > > they have different fsid (the cluster ID).
>> > > 
>> > > Signed-off-by: Luis Henriques <lhenriques@suse.com>
>> > > ---
>> > >  fs/ceph/file.c | 23 +++++++++++++++++++----
>> > >  1 file changed, 19 insertions(+), 4 deletions(-)
>> > > 
>> > > Hi!
>> > > 
>> > > I've finally managed to run some tests using multiple filesystems, both
>> > > within a single cluster and also using two different clusters.  The
>> > > behaviour of copy_file_range (with this patch, of course) was what I
>> > > expected:
>> > > 
>> > >   - Object copies work fine across different filesystems within the same
>> > >     cluster (even with pools in different PGs);
>> > >   - -EXDEV is returned if the fsid is different
>> > > 
>> > > (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
>> > >  Because this is actually what's in ceph.conf fsid in "[global]"
>> > >  section.  Anyway...)
>> > > 
>> > > So, what's missing right now is (I always mention this when I have the
>> > > opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
>> > > And add the corresponding support for the new flag to the kernel
>> > > client, of course.
>> > > 
>> > > Cheers,
>> > > --
>> > > Luis
>> > > 
>> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> > > index 685a03cc4b77..88d116893c2b 100644
>> > > --- a/fs/ceph/file.c
>> > > +++ b/fs/ceph/file.c
>> > > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> > >  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> > >  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> > >  	struct ceph_cap_flush *prealloc_cf;
>> > > +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>> > >  	struct ceph_object_locator src_oloc, dst_oloc;
>> > >  	struct ceph_object_id src_oid, dst_oid;
>> > >  	loff_t endoff = 0, size;
>> > > @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> > >  
>> > >  	if (src_inode == dst_inode)
>> > >  		return -EINVAL;
>> > > -	if (src_inode->i_sb != dst_inode->i_sb)
>> > > -		return -EXDEV;
>> > > +	if (src_inode->i_sb != dst_inode->i_sb) {
>> > > +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> > > +
>> > > +		if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
>> > > +			dout("No fsid in a fs client\n");
>> > > +			return -EXDEV;
>> > > +		}
>> > 
>> > In what situation is there no fsid? Old cluster version?
>> > 
>> > If there is no fsid, can we take that to indicate that there is only a
>> > single filesystem possible in the cluster and that we should attempt the
>> > copy anyway?
>> 
>> TBH I'm not sure if 'have_fsid' can ever be 'false' in this call.  It is
>> set to 'true' when handling the monmap, and it's never changed back to
>> 'false'.  Since I don't think copy_file_range will be invoked *before*
>> we get the monmap, it should be safe to drop this check.  Maybe it could
>> be replaced by a WARN_ON()?
>> 
>
> Yeah. I think the have_fsid flag just allows us to avoid the pr_err msg
> in ceph_check_fsid when the client is initially created. Maybe there is
> some better way to achieve that?

I guess the struct ceph_fsid embedded in the client(s) could be changed
into a pointer initialized to NULL (and later dynamically allocated).
Then, the have_fsid check could be replaced by a NULL check.  Not sure
if it would bring any real benefit, though.  Want me to give that a try?
Or maybe I misunderstood your question.
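
A minimal userspace sketch of the pointer variant, assuming a hypothetical struct sketch_client; malloc() stands in for whatever allocator the kernel code would use, and none of these names exist in the real client.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sketch_fsid {
	unsigned char fsid[16];
};

struct sketch_client {
	struct sketch_fsid *fsid;	/* NULL until the first fsid is seen */
};

/* Record the fsid on first use, compare on later calls.
 * Returns 0 on success, -1 on mismatch or allocation failure. */
static int sketch_set_fsid(struct sketch_client *client,
			   const struct sketch_fsid *fsid)
{
	assert(client && fsid);
	if (client->fsid)	/* replaces the have_fsid test */
		return memcmp(client->fsid, fsid, sizeof(*fsid)) ? -1 : 0;

	client->fsid = malloc(sizeof(*fsid));
	if (!client->fsid)
		return -1;
	*client->fsid = *fsid;
	return 0;
}

static void sketch_client_cleanup(struct sketch_client *client)
{
	free(client->fsid);
	client->fsid = NULL;
}
```

The cost is visible even in the sketch: an extra allocation, an extra failure path, and a cleanup step that the embedded struct does not need.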

> In any case, I'd just drop that condition here.

Ok, I'll send v2 in a second, without this check.

[ BTW, looks like my initial post didn't make it into vger.kernel.org.
  It was probably dropped because I screwed up the 'To:' field in my
  email (no idea how I did that, TBH). ]

Cheers,
-- 
Luis


* [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 10:18       ` Luis Henriques
@ 2019-09-09 10:28         ` Luis Henriques
  2019-09-09 10:35           ` Jeff Layton
  2019-09-09 10:51           ` Ilya Dryomov
  0 siblings, 2 replies; 12+ messages in thread
From: Luis Henriques @ 2019-09-09 10:28 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov
  Cc: ceph-devel, linux-kernel, Luis Henriques

OSDs are able to perform object copies across different pools.  Thus,
there's no need to prevent copy_file_range from doing remote copies if the
source and destination superblocks are different.  Only return -EXDEV if
they have different fsid (the cluster ID).

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 fs/ceph/file.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Hi,

Here's the patch changelog since initial submission:

- Dropped have_fsid checks on client structs
- Use %pU to print the fsid instead of raw hex strings (%*ph)
- Fixed 'To:' field in email so that this time the patch hits vger

Cheers,
--
Luis

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 685a03cc4b77..4a624a1dd0bb 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
 	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
 	struct ceph_cap_flush *prealloc_cf;
+	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
 	struct ceph_object_locator src_oloc, dst_oloc;
 	struct ceph_object_id src_oid, dst_oid;
 	loff_t endoff = 0, size;
@@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 
 	if (src_inode == dst_inode)
 		return -EINVAL;
-	if (src_inode->i_sb != dst_inode->i_sb)
-		return -EXDEV;
+	if (src_inode->i_sb != dst_inode->i_sb) {
+		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
+
+		if (ceph_fsid_compare(&src_fsc->client->fsid,
+				      &dst_fsc->client->fsid)) {
+			dout("Copying object across different clusters:");
+			dout("  src fsid: %pU dst fsid: %pU\n",
+			     &src_fsc->client->fsid, &dst_fsc->client->fsid);
+			return -EXDEV;
+		}
+	}
 	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
 		return -EROFS;
 
@@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	 * efficient).
 	 */
 
-	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
+	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
 		return -EOPNOTSUPP;
 
 	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
@@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 				dst_ci->i_vino.ino, dst_objnum);
 		/* Do an object remote copy */
 		err = ceph_osdc_copy_from(
-			&ceph_inode_to_client(src_inode)->client->osdc,
+			&src_fsc->client->osdc,
 			src_ci->i_vino.snap, 0,
 			&src_oid, &src_oloc,
 			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


* Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 10:28         ` [PATCH v2] " Luis Henriques
@ 2019-09-09 10:35           ` Jeff Layton
  2019-09-09 11:05             ` Jeff Layton
  2019-09-09 11:15             ` Luis Henriques
  2019-09-09 10:51           ` Ilya Dryomov
  1 sibling, 2 replies; 12+ messages in thread
From: Jeff Layton @ 2019-09-09 10:35 UTC (permalink / raw)
  To: Luis Henriques, Sage Weil, Ilya Dryomov; +Cc: ceph-devel, linux-kernel

On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
> OSDs are able to perform object copies across different pools.  Thus,
> there's no need to prevent copy_file_range from doing remote copies if the
> source and destination superblocks are different.  Only return -EXDEV if
> they have different fsid (the cluster ID).
> 
> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> ---
>  fs/ceph/file.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> Hi,
> 
> Here's the patch changelog since initial submission:
> 
> - Dropped have_fsid checks on client structs
> - Use %pU to print the fsid instead of raw hex strings (%*ph)
> - Fixed 'To:' field in email so that this time the patch hits vger
> 
> Cheers,
> --
> Luis
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 685a03cc4b77..4a624a1dd0bb 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>  	struct ceph_cap_flush *prealloc_cf;
> +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>  	struct ceph_object_locator src_oloc, dst_oloc;
>  	struct ceph_object_id src_oid, dst_oid;
>  	loff_t endoff = 0, size;
> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>  
>  	if (src_inode == dst_inode)
>  		return -EINVAL;
> -	if (src_inode->i_sb != dst_inode->i_sb)
> -		return -EXDEV;
> +	if (src_inode->i_sb != dst_inode->i_sb) {
> +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> +
> +		if (ceph_fsid_compare(&src_fsc->client->fsid,
> +				      &dst_fsc->client->fsid)) {
> +			dout("Copying object across different clusters:");
> +			dout("  src fsid: %pU dst fsid: %pU\n",
> +			     &src_fsc->client->fsid, &dst_fsc->client->fsid);
> +			return -EXDEV;
> +		}
> +	}

Just to be clear: what happens here if I mount two entirely separate
clusters, and their OSDs don't have any access to one another? Will this
fail at some later point with an error that we can catch so that we can
fall back?


>  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>  		return -EROFS;
>  
> @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>  	 * efficient).
>  	 */
>  
> -	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
> +	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>  		return -EOPNOTSUPP;
>  
>  	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
> @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>  				dst_ci->i_vino.ino, dst_objnum);
>  		/* Do an object remote copy */
>  		err = ceph_osdc_copy_from(
> -			&ceph_inode_to_client(src_inode)->client->osdc,
> +			&src_fsc->client->osdc,
>  			src_ci->i_vino.snap, 0,
>  			&src_oid, &src_oloc,
>  			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 10:28         ` [PATCH v2] " Luis Henriques
  2019-09-09 10:35           ` Jeff Layton
@ 2019-09-09 10:51           ` Ilya Dryomov
  1 sibling, 0 replies; 12+ messages in thread
From: Ilya Dryomov @ 2019-09-09 10:51 UTC (permalink / raw)
  To: Luis Henriques; +Cc: Jeff Layton, Sage Weil, Ceph Development, LKML

On Mon, Sep 9, 2019 at 12:29 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> OSDs are able to perform object copies across different pools.  Thus,
> there's no need to prevent copy_file_range from doing remote copies if the
> source and destination superblocks are different.  Only return -EXDEV if
> they have different fsid (the cluster ID).
>
> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> ---
>  fs/ceph/file.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> Hi,
>
> Here's the patch changelog since initial submission:
>
> - Dropped have_fsid checks on client structs
> - Use %pU to print the fsid instead of raw hex strings (%*ph)
> - Fixed 'To:' field in email so that this time the patch hits vger
>
> Cheers,
> --
> Luis
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 685a03cc4b77..4a624a1dd0bb 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>         struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>         struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>         struct ceph_cap_flush *prealloc_cf;
> +       struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>         struct ceph_object_locator src_oloc, dst_oloc;
>         struct ceph_object_id src_oid, dst_oid;
>         loff_t endoff = 0, size;
> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>
>         if (src_inode == dst_inode)
>                 return -EINVAL;
> -       if (src_inode->i_sb != dst_inode->i_sb)
> -               return -EXDEV;
> +       if (src_inode->i_sb != dst_inode->i_sb) {
> +               struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> +
> +               if (ceph_fsid_compare(&src_fsc->client->fsid,
> +                                     &dst_fsc->client->fsid)) {
> +                       dout("Copying object across different clusters:");
> +                       dout("  src fsid: %pU dst fsid: %pU\n",
> +                            &src_fsc->client->fsid, &dst_fsc->client->fsid);

Hi Luis,

This should be a single dout.
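
A userspace sketch of the merged message, presumably wanted because each dout call emits its own log record and the first one lacks a trailing newline. The kernel's %pU is not available outside the kernel, so the hypothetical helper sketch_fsid_to_str() formats the 16 bytes by hand; all sketch_* names are made up.

```c
#include <assert.h>
#include <stdio.h>

struct sketch_fsid {
	unsigned char fsid[16];
};

/* Format 16 raw bytes in the 8-4-4-4-12 UUID layout.
 * out must hold at least 37 bytes (36 characters plus NUL). */
static void sketch_fsid_to_str(const struct sketch_fsid *f, char out[37])
{
	const unsigned char *b = f->fsid;

	snprintf(out, 37,
		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
		 "%02x%02x%02x%02x%02x%02x",
		 b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
		 b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
}

/* One call, one complete line: the header and both fsids together.
 * Returns what snprintf() returns. */
static int sketch_log_mismatch(char *buf, size_t len,
			       const struct sketch_fsid *src,
			       const struct sketch_fsid *dst)
{
	char s[37], d[37];

	assert(buf && src && dst);
	sketch_fsid_to_str(src, s);
	sketch_fsid_to_str(dst, d);
	return snprintf(buf, len,
			"Copying object across different clusters: "
			"src fsid: %s dst fsid: %s\n", s, d);
}
```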

Thanks,

                Ilya


* Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 10:35           ` Jeff Layton
@ 2019-09-09 11:05             ` Jeff Layton
  2019-09-09 13:55               ` Luis Henriques
  2019-09-09 11:15             ` Luis Henriques
  1 sibling, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2019-09-09 11:05 UTC (permalink / raw)
  To: Luis Henriques, Sage Weil, Ilya Dryomov; +Cc: ceph-devel, linux-kernel

On Mon, 2019-09-09 at 06:35 -0400, Jeff Layton wrote:
> On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
> > OSDs are able to perform object copies across different pools.  Thus,
> > there's no need to prevent copy_file_range from doing remote copies if the
> > source and destination superblocks are different.  Only return -EXDEV if
> > they have different fsid (the cluster ID).
> > 
> > Signed-off-by: Luis Henriques <lhenriques@suse.com>
> > ---
> >  fs/ceph/file.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > Hi,
> > 
> > Here's the patch changelog since initial submission:
> > 
> > - Dropped have_fsid checks on client structs
> > - Use %pU to print the fsid instead of raw hex strings (%*ph)
> > - Fixed 'To:' field in email so that this time the patch hits vger
> > 
> > Cheers,
> > --
> > Luis
> > 
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 685a03cc4b77..4a624a1dd0bb 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
> >  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
> >  	struct ceph_cap_flush *prealloc_cf;
> > +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> >  	struct ceph_object_locator src_oloc, dst_oloc;
> >  	struct ceph_object_id src_oid, dst_oid;
> >  	loff_t endoff = 0, size;
> > @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >  
> >  	if (src_inode == dst_inode)
> >  		return -EINVAL;
> > -	if (src_inode->i_sb != dst_inode->i_sb)
> > -		return -EXDEV;
> > +	if (src_inode->i_sb != dst_inode->i_sb) {
> > +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > +
> > +		if (ceph_fsid_compare(&src_fsc->client->fsid,
> > +				      &dst_fsc->client->fsid)) {
> > +			dout("Copying object across different clusters:");
> > +			dout("  src fsid: %pU dst fsid: %pU\n",
> > +			     &src_fsc->client->fsid, &dst_fsc->client->fsid);
> > +			return -EXDEV;
> > +		}
> > +	}
> 
> Just to be clear: what happens here if I mount two entirely separate
> clusters, and their OSDs don't have any access to one another? Will this
> fail at some later point with an error that we can catch so that we can
> fall back?
> 

Duh, sorry I asked before I had a cup of coffee this morning. The whole
point is to skip that case.

That said...I wonder if it's possible to have an fsid collision across
two separate clusters and this fail to catch that case? Aren't these
things just allocated via a simple counter increment?

Probably not worth worrying about overmuch, but might be good to
understand what would happen in that case if only to field mailing list
reports.

Other than that, this looks fine, modulo Ilya's comment about the two
dout messages.

> 
> >  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> >  		return -EROFS;
> >  
> > @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >  	 * efficient).
> >  	 */
> >  
> > -	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
> > +	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
> >  		return -EOPNOTSUPP;
> >  
> >  	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
> > @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >  				dst_ci->i_vino.ino, dst_objnum);
> >  		/* Do an object remote copy */
> >  		err = ceph_osdc_copy_from(
> > -			&ceph_inode_to_client(src_inode)->client->osdc,
> > +			&src_fsc->client->osdc,
> >  			src_ci->i_vino.snap, 0,
> >  			&src_oid, &src_oloc,
> >  			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 10:35           ` Jeff Layton
  2019-09-09 11:05             ` Jeff Layton
@ 2019-09-09 11:15             ` Luis Henriques
  2019-09-09 22:22               ` Gregory Farnum
  1 sibling, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2019-09-09 11:15 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Ilya Dryomov, Sage Weil, ceph-devel, linux-kernel

"Jeff Layton" <jlayton@kernel.org> writes:

> On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
>> OSDs are able to perform object copies across different pools.  Thus,
>> there's no need to prevent copy_file_range from doing remote copies if the
>> source and destination superblocks are different.  Only return -EXDEV if
>> they have different fsid (the cluster ID).
>> 
>> Signed-off-by: Luis Henriques <lhenriques@suse.com>
>> ---
>>  fs/ceph/file.c | 18 ++++++++++++++----
>>  1 file changed, 14 insertions(+), 4 deletions(-)
>> 
>> Hi,
>> 
>> Here's the patch changelog since initial submission:
>> 
>> - Dropped have_fsid checks on client structs
>> - Use %pU to print the fsid instead of raw hex strings (%*ph)
>> - Fixed 'To:' field in email so that this time the patch hits vger
>> 
>> Cheers,
>> --
>> Luis
>> 
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 685a03cc4b77..4a624a1dd0bb 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>>  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>>  	struct ceph_cap_flush *prealloc_cf;
>> +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>>  	struct ceph_object_locator src_oloc, dst_oloc;
>>  	struct ceph_object_id src_oid, dst_oid;
>>  	loff_t endoff = 0, size;
>> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  
>>  	if (src_inode == dst_inode)
>>  		return -EINVAL;
>> -	if (src_inode->i_sb != dst_inode->i_sb)
>> -		return -EXDEV;
>> +	if (src_inode->i_sb != dst_inode->i_sb) {
>> +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> +
>> +		if (ceph_fsid_compare(&src_fsc->client->fsid,
>> +				      &dst_fsc->client->fsid)) {
>> +			dout("Copying object across different clusters:");
>> +			dout("  src fsid: %pU dst fsid: %pU\n",
>> +			     &src_fsc->client->fsid, &dst_fsc->client->fsid);
>> +			return -EXDEV;
>> +		}
>> +	}
>
> Just to be clear: what happens here if I mount two entirely separate
> clusters, and their OSDs don't have any access to one another? Will this
> fail at some later point with an error that we can catch so that we can
> fall back?

This is exactly what this check prevents: if we have two CephFS from two
unrelated clusters mounted and we try to copy a file across them, the
operation will fail with -EXDEV[1] because the FSIDs for these two
ceph_fs_client will be different.  OTOH, if these two filesystems are
within the same cluster (and thus with the same FSID), then the OSDs are
able to do 'copy-from' operations between them.
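
The decision described above can be modelled in userspace roughly as follows. This is only a sketch: struct sketch_mount is hypothetical, an integer id stands in for the superblock pointer, and memcmp() stands in for ceph_fsid_compare().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct sketch_fsid {
	unsigned char fsid[16];
};

/* Hypothetical stand-in for a mounted CephFS: an identity (in place of
 * the superblock pointer) plus the fsid of the cluster behind it. */
struct sketch_mount {
	int sb_id;
	struct sketch_fsid fsid;
};

/* Returns 0 if a remote object copy may be attempted, -EXDEV if the two
 * mounts belong to different clusters. */
static int sketch_remote_copy_allowed(const struct sketch_mount *src,
				      const struct sketch_mount *dst)
{
	assert(src && dst);
	if (src->sb_id == dst->sb_id)
		return 0;	/* same filesystem: nothing to check */
	if (memcmp(&src->fsid, &dst->fsid, sizeof(src->fsid)) != 0)
		return -EXDEV;	/* different clusters */
	return 0;		/* different filesystems, same cluster */
}
```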

I've tested all these scenarios and they seem to be handled correctly.
Now, I'm assuming that *all* OSDs within the same ceph cluster can
communicate between themselves; if this assumption is false, then this
patch is broken.  But again, I'm not aware of any mechanism that
prevents 2 OSDs from communicating between them.

[1] Actually, the files will still be copied because we'll fall back to
the default VFS generic_copy_file_range behaviour, which is to do
read+write operations.
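
A small userspace model of the fallback in footnote [1]. The function-pointer type and the stub copy paths are invented for illustration; they only mimic a fast path that refuses with -EXDEV and a generic read+write path that always works.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical copy path: returns bytes copied or a negative errno. */
typedef long (*sketch_copy_fn)(size_t len);

static int sketch_generic_calls;	/* how often the slow path ran */

/* Stub fast path that refuses, as a cross-cluster copy would. */
static long sketch_fast_exdev(size_t len)
{
	(void)len;
	return -EXDEV;
}

/* Stub fast path that succeeds. */
static long sketch_fast_ok(size_t len)
{
	return (long)len;
}

/* Stub generic path: pretend to read and write everything. */
static long sketch_generic_rw(size_t len)
{
	sketch_generic_calls++;
	return (long)len;
}

/* Try the fast path; on -EXDEV fall back to the generic one. */
static long sketch_copy_with_fallback(sketch_copy_fn fast,
				      sketch_copy_fn generic, size_t len)
{
	long ret;

	assert(fast && generic);
	ret = fast(len);
	if (ret == -EXDEV)
		ret = generic(len);
	return ret;
}
```

From the caller's point of view the copy succeeds either way; only the amount of data moved through the client differs.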

Cheers,
-- 
Luis


>
>
>>  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>>  		return -EROFS;
>>  
>> @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  	 * efficient).
>>  	 */
>>  
>> -	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> +	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>>  		return -EOPNOTSUPP;
>>  
>>  	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  				dst_ci->i_vino.ino, dst_objnum);
>>  		/* Do an object remote copy */
>>  		err = ceph_osdc_copy_from(
>> -			&ceph_inode_to_client(src_inode)->client->osdc,
>> +			&src_fsc->client->osdc,
>>  			src_ci->i_vino.snap, 0,
>>  			&src_oid, &src_oloc,
>>  			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


* Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 11:05             ` Jeff Layton
@ 2019-09-09 13:55               ` Luis Henriques
  2019-09-09 15:21                 ` Jeff Layton
  0 siblings, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2019-09-09 13:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Ilya Dryomov, Sage Weil, ceph-devel, linux-kernel

"Jeff Layton" <jlayton@kernel.org> writes:

> On Mon, 2019-09-09 at 06:35 -0400, Jeff Layton wrote:
>> On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
>> > OSDs are able to perform object copies across different pools.  Thus,
>> > there's no need to prevent copy_file_range from doing remote copies if the
>> > source and destination superblocks are different.  Only return -EXDEV if
>> > they have different fsid (the cluster ID).
>> > 
>> > Signed-off-by: Luis Henriques <lhenriques@suse.com>
>> > ---
>> >  fs/ceph/file.c | 18 ++++++++++++++----
>> >  1 file changed, 14 insertions(+), 4 deletions(-)
>> > 
>> > Hi,
>> > 
>> > Here's the patch changelog since initial submission:
>> > 
>> > - Dropped have_fsid checks on client structs
>> > - Use %pU to print the fsid instead of raw hex strings (%*ph)
>> > - Fixed 'To:' field in email so that this time the patch hits vger
>> > 
>> > Cheers,
>> > --
>> > Luis
>> > 
>> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> > index 685a03cc4b77..4a624a1dd0bb 100644
>> > --- a/fs/ceph/file.c
>> > +++ b/fs/ceph/file.c
>> > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> >  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> >  	struct ceph_cap_flush *prealloc_cf;
>> > +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>> >  	struct ceph_object_locator src_oloc, dst_oloc;
>> >  	struct ceph_object_id src_oid, dst_oid;
>> >  	loff_t endoff = 0, size;
>> > @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >  
>> >  	if (src_inode == dst_inode)
>> >  		return -EINVAL;
>> > -	if (src_inode->i_sb != dst_inode->i_sb)
>> > -		return -EXDEV;
>> > +	if (src_inode->i_sb != dst_inode->i_sb) {
>> > +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> > +
>> > +		if (ceph_fsid_compare(&src_fsc->client->fsid,
>> > +				      &dst_fsc->client->fsid)) {
>> > +			dout("Copying object across different clusters:");
>> > +			dout("  src fsid: %pU dst fsid: %pU\n",
>> > +			     &src_fsc->client->fsid, &dst_fsc->client->fsid);
>> > +			return -EXDEV;
>> > +		}
>> > +	}
>> 
>> Just to be clear: what happens here if I mount two entirely separate
>> clusters, and their OSDs don't have any access to one another? Will this
>> fail at some later point with an error that we can catch so that we can
>> fall back?
>> 
>
> Duh, sorry I asked before I had a cup of coffee this morning. The whole
> point is to skip that case.
>
> That said...I wonder if it's possible to have an fsid collision across
> two separate clusters and this fail to catch that case? Aren't these
> things just allocated via a simple counter increment?

My understanding is that this is some sort of UUID.  Looking at
doc/install/manual-deployment.rst it says that the fsid is a unique ID
that should be generated using uuidgen (I believe that's what vstart.sh
clusters use).

That said, it's obviously possible to reuse an fsid in two clusters.
And mounting both filesystems with the same fsid on the same client may
already cause some troubles without even trying to copy_file_range files
across them (e.g., the fscache code seems to assume unique fsids).  But I
have never tested this sort of thing (probably no one has) and I really
don't know what the consequences would be.  In this specific case, I would
expect the 'copy-from' operation to fail with some error from the OSDs.
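
To make the limitation concrete, here is a small userspace sketch (not
kernel code) of the decision the patch makes.  It assumes the fsid is an
opaque 16-byte value that is compared byte-for-byte; struct demo_fsid,
demo_fsid_compare and demo_check_same_cluster are hypothetical stand-ins,
not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for a UUID-sized cluster ID. */
struct demo_fsid {
	unsigned char bytes[16];
};

/* memcmp-style result: 0 means the two IDs are byte-for-byte identical. */
static int demo_fsid_compare(const struct demo_fsid *a,
			     const struct demo_fsid *b)
{
	return memcmp(a->bytes, b->bytes, sizeof(a->bytes));
}

/*
 * Same shape as the check in the patch: refuse the remote copy with
 * -EXDEV when the IDs differ, otherwise let it proceed (0).
 */
static int demo_check_same_cluster(const struct demo_fsid *src,
				   const struct demo_fsid *dst)
{
	assert(src && dst);
	return demo_fsid_compare(src, dst) ? -EXDEV : 0;
}
```

With a check of this kind, two clusters that were deployed with the same
fsid look identical to the client, so it returns 0 for them and anything
further has to come from the OSDs.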

> Probably not worth worrying about overmuch, but might be good to
> understand what would happen in that case if only to field mailing list
> reports.

If there are concerns regarding this, I'm OK with simply dropping the patch
for now and continuing to forbid object copies when superblocks are
different.  I just thought this was low-hanging fruit, and didn't
realize that it's not very easy to ensure that 2 cephfs instances
actually belong to the same cluster.  Maybe there are other checks that
could be done...?

Cheers,
-- 
Luis

> Other than that, this looks fine, modulo Ilya's comment about the two
> dout messages.
>
>> 
>> >  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>> >  		return -EROFS;
>> >  
>> > @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >  	 * efficient).
>> >  	 */
>> >  
>> > -	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> > +	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>> >  		return -EOPNOTSUPP;
>> >  
>> >  	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> > @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >  				dst_ci->i_vino.ino, dst_objnum);
>> >  		/* Do an object remote copy */
>> >  		err = ceph_osdc_copy_from(
>> > -			&ceph_inode_to_client(src_inode)->client->osdc,
>> > +			&src_fsc->client->osdc,
>> >  			src_ci->i_vino.snap, 0,
>> >  			&src_oid, &src_oloc,
>> >  			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


* Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 13:55               ` Luis Henriques
@ 2019-09-09 15:21                 ` Jeff Layton
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff Layton @ 2019-09-09 15:21 UTC (permalink / raw)
  To: Luis Henriques; +Cc: Ilya Dryomov, Sage Weil, ceph-devel, linux-kernel

On Mon, 2019-09-09 at 14:55 +0100, Luis Henriques wrote:
> "Jeff Layton" <jlayton@kernel.org> writes:
> 
> > On Mon, 2019-09-09 at 06:35 -0400, Jeff Layton wrote:
> > > On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
> > > > OSDs are able to perform object copies across different pools.  Thus,
> > > > there's no need to prevent copy_file_range from doing remote copies if the
> > > > source and destination superblocks are different.  Only return -EXDEV if
> > > > they have different fsid (the cluster ID).
> > > > 
> > > > Signed-off-by: Luis Henriques <lhenriques@suse.com>
> > > > ---
> > > >  fs/ceph/file.c | 18 ++++++++++++++----
> > > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > > > 
> > > > Hi,
> > > > 
> > > > Here's the patch changelog since initial submission:
> > > > 
> > > > - Dropped have_fsid checks on client structs
> > > > - Use %pU to print the fsid instead of raw hex strings (%*ph)
> > > > - Fixed 'To:' field in email so that this time the patch hits vger
> > > > 
> > > > Cheers,
> > > > --
> > > > Luis
> > > > 
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 685a03cc4b77..4a624a1dd0bb 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > >  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
> > > >  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
> > > >  	struct ceph_cap_flush *prealloc_cf;
> > > > +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > > >  	struct ceph_object_locator src_oloc, dst_oloc;
> > > >  	struct ceph_object_id src_oid, dst_oid;
> > > >  	loff_t endoff = 0, size;
> > > > @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > >  
> > > >  	if (src_inode == dst_inode)
> > > >  		return -EINVAL;
> > > > -	if (src_inode->i_sb != dst_inode->i_sb)
> > > > -		return -EXDEV;
> > > > +	if (src_inode->i_sb != dst_inode->i_sb) {
> > > > +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > > +
> > > > +		if (ceph_fsid_compare(&src_fsc->client->fsid,
> > > > +				      &dst_fsc->client->fsid)) {
> > > > +			dout("Copying object across different clusters:");
> > > > +			dout("  src fsid: %pU dst fsid: %pU\n",
> > > > +			     &src_fsc->client->fsid, &dst_fsc->client->fsid);
> > > > +			return -EXDEV;
> > > > +		}
> > > > +	}
> > > 
> > > Just to be clear: what happens here if I mount two entirely separate
> > > clusters, and their OSDs don't have any access to one another? Will this
> > > fail at some later point with an error that we can catch so that we can
> > > fall back?
> > > 
> > 
> > Duh, sorry I asked before I had a cup of coffee this morning. The whole
> > point is to skip that case.
> > 
> > That said...I wonder if it's possible to have an fsid collision across
> > two separate clusters and this fail to catch that case? Aren't these
> > things just allocated via a simple counter increment?
> 
> My understanding is that this is some sort of UUID.  Looking at
> doc/install/manual-deployment.rst it says that the fsid is a unique ID
> that should be generated using uuidgen (I believe that's what vstart.sh
> clusters use).
> 
> That said, it's obviously possible to reuse an fsid in two clusters.
> And mounting both filesystems with the same fsid on the same client may
> already cause some troubles without even trying to copy_file_range files
> across them (e.g., the fscache code seems to assume unique fsids).  But I
> have never tested this sort of thing (probably no one has) and I really
> don't know what the consequences would be.  In this specific case, I would
> expect the 'copy-from' operation to fail with some error from the OSDs.
> 

Makes sense. I suppose the worst possible case is data corruption due to
copying to/from the wrong object, but the risk here seems quite low.

> > Probably not worth worrying about overmuch, but might be good to
> > understand what would happen in that case if only to field mailing list
> > reports.
> 
> If there are concerns regarding this, I'm OK with simply dropping the patch
> for now and continuing to forbid object copies when superblocks are
> different.  I just thought this was low-hanging fruit, and didn't
> realize that it's not very easy to ensure that 2 cephfs instances
> actually belong to the same cluster.  Maybe there are other checks that
> could be done...?
> 

I'm not really concerned about it, particularly if these values are
usually generated as uuids. If we get reports that involve collisions
here, then we can revisit it then.

IMO, it's up to the admin to guarantee that the fsid is unique within a
multi-cluster environment.
-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 11:15             ` Luis Henriques
@ 2019-09-09 22:22               ` Gregory Farnum
  2019-09-10 10:45                 ` Luis Henriques
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2019-09-09 22:22 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Jeff Layton, Ilya Dryomov, Sage Weil, ceph-devel, linux-kernel

On Mon, Sep 9, 2019 at 4:15 AM Luis Henriques <lhenriques@suse.com> wrote:
>
> "Jeff Layton" <jlayton@kernel.org> writes:
>
> > On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
> >> OSDs are able to perform object copies across different pools.  Thus,
> >> there's no need to prevent copy_file_range from doing remote copies if the
> >> source and destination superblocks are different.  Only return -EXDEV if
> >> they have different fsid (the cluster ID).
> >>
> >> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> >> ---
> >>  fs/ceph/file.c | 18 ++++++++++++++----
> >>  1 file changed, 14 insertions(+), 4 deletions(-)
> >>
> >> Hi,
> >>
> >> Here's the patch changelog since initial submission:
> >>
> >> - Dropped have_fsid checks on client structs
> >> - Use %pU to print the fsid instead of raw hex strings (%*ph)
> >> - Fixed 'To:' field in email so that this time the patch hits vger
> >>
> >> Cheers,
> >> --
> >> Luis
> >>
> >> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> >> index 685a03cc4b77..4a624a1dd0bb 100644
> >> --- a/fs/ceph/file.c
> >> +++ b/fs/ceph/file.c
> >> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>      struct ceph_inode_info *src_ci = ceph_inode(src_inode);
> >>      struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
> >>      struct ceph_cap_flush *prealloc_cf;
> >> +    struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> >>      struct ceph_object_locator src_oloc, dst_oloc;
> >>      struct ceph_object_id src_oid, dst_oid;
> >>      loff_t endoff = 0, size;
> >> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>
> >>      if (src_inode == dst_inode)
> >>              return -EINVAL;
> >> -    if (src_inode->i_sb != dst_inode->i_sb)
> >> -            return -EXDEV;
> >> +    if (src_inode->i_sb != dst_inode->i_sb) {
> >> +            struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> >> +
> >> +            if (ceph_fsid_compare(&src_fsc->client->fsid,
> >> +                                  &dst_fsc->client->fsid)) {
> >> +                    dout("Copying object across different clusters:");
> >> +                    dout("  src fsid: %pU dst fsid: %pU\n",
> >> +                         &src_fsc->client->fsid, &dst_fsc->client->fsid);
> >> +                    return -EXDEV;
> >> +            }
> >> +    }
> >
> > Just to be clear: what happens here if I mount two entirely separate
> > clusters, and their OSDs don't have any access to one another? Will this
> > fail at some later point with an error that we can catch so that we can
> > fall back?
>
> This is exactly what this check prevents: if we have two CephFS from two
> unrelated clusters mounted and we try to copy a file across them, the
> operation will fail with -EXDEV[1] because the FSIDs for these two
> ceph_fs_client will be different.  OTOH, if these two filesystems are
> within the same cluster (and thus with the same FSID), then the OSDs are
> able to do 'copy-from' operations between them.
>
> I've tested all these scenarios and they seem to be handled correctly.
> Now, I'm assuming that *all* OSDs within the same ceph cluster can
> communicate between themselves; if this assumption is false, then this
> patch is broken.  But again, I'm not aware of any mechanism that
> prevents 2 OSDs from communicating between them.

Your assumption is correct: all OSDs in a Ceph cluster can communicate
with each other. I'm not aware of any plans to change this.

I spent a bit of time trying to figure out how this could break
security models and things and didn't come up with anything, so I
think functionally it's fine even though I find it a bit scary.

Also, yes, cluster FSIDs are UUIDs so they shouldn't collide.
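
As a back-of-the-envelope check on "shouldn't collide": assuming (this is
an assumption, not something stated in this thread) that each fsid is a
random version 4 UUID with 122 random bits, and that n clusters pick
theirs independently, the birthday bound gives roughly:

```latex
P(\text{collision among } n \text{ clusters})
  \approx 1 - \exp\!\left(-\frac{n(n-1)}{2 \cdot 2^{122}}\right)
  \approx \frac{n^{2}}{2^{123}},
\qquad
n = 10^{6} \;\Rightarrow\; P \approx \frac{10^{12}}{1.06 \times 10^{37}}
  \approx 9.4 \times 10^{-26}.
```

This says nothing about fsids that were copied or reused by hand, which
is the case discussed earlier in the thread.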
-Greg

>
> [1] Actually, the files will still be copied because we'll fall back to
> the default VFS generic_copy_file_range behaviour, which is to do
> read+write operations.
>
> Cheers,
> --
> Luis
>
>
> >
> >
> >>      if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> >>              return -EROFS;
> >>
> >> @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>       * efficient).
> >>       */
> >>
> >> -    if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
> >> +    if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
> >>              return -EOPNOTSUPP;
> >>
> >>      if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
> >> @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >>                              dst_ci->i_vino.ino, dst_objnum);
> >>              /* Do an object remote copy */
> >>              err = ceph_osdc_copy_from(
> >> -                    &ceph_inode_to_client(src_inode)->client->osdc,
> >> +                    &src_fsc->client->osdc,
> >>                      src_ci->i_vino.snap, 0,
> >>                      &src_oid, &src_oloc,
> >>                      CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |


* Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
  2019-09-09 22:22               ` Gregory Farnum
@ 2019-09-10 10:45                 ` Luis Henriques
  0 siblings, 0 replies; 12+ messages in thread
From: Luis Henriques @ 2019-09-10 10:45 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Ilya Dryomov, Jeff Layton, Sage Weil, ceph-devel, linux-kernel

Gregory Farnum <gfarnum@redhat.com> writes:

> On Mon, Sep 9, 2019 at 4:15 AM Luis Henriques <lhenriques@suse.com> wrote:
>>
>> "Jeff Layton" <jlayton@kernel.org> writes:
>>
>> > On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
>> >> OSDs are able to perform object copies across different pools.  Thus,
>> >> there's no need to prevent copy_file_range from doing remote copies if the
>> >> source and destination superblocks are different.  Only return -EXDEV if
>> >> they have different fsid (the cluster ID).
>> >>
>> >> Signed-off-by: Luis Henriques <lhenriques@suse.com>
>> >> ---
>> >>  fs/ceph/file.c | 18 ++++++++++++++----
>> >>  1 file changed, 14 insertions(+), 4 deletions(-)
>> >>
>> >> Hi,
>> >>
>> >> Here's the patch changelog since initial submission:
>> >>
>> >> - Dropped have_fsid checks on client structs
>> >> - Use %pU to print the fsid instead of raw hex strings (%*ph)
>> >> - Fixed 'To:' field in email so that this time the patch hits vger
>> >>
>> >> Cheers,
>> >> --
>> >> Luis
>> >>
>> >> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> >> index 685a03cc4b77..4a624a1dd0bb 100644
>> >> --- a/fs/ceph/file.c
>> >> +++ b/fs/ceph/file.c
>> >> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >>      struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> >>      struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> >>      struct ceph_cap_flush *prealloc_cf;
>> >> +    struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>> >>      struct ceph_object_locator src_oloc, dst_oloc;
>> >>      struct ceph_object_id src_oid, dst_oid;
>> >>      loff_t endoff = 0, size;
>> >> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >>
>> >>      if (src_inode == dst_inode)
>> >>              return -EINVAL;
>> >> -    if (src_inode->i_sb != dst_inode->i_sb)
>> >> -            return -EXDEV;
>> >> +    if (src_inode->i_sb != dst_inode->i_sb) {
>> >> +            struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> >> +
>> >> +            if (ceph_fsid_compare(&src_fsc->client->fsid,
>> >> +                                  &dst_fsc->client->fsid)) {
>> >> +                    dout("Copying object across different clusters:");
>> >> +                    dout("  src fsid: %pU dst fsid: %pU\n",
>> >> +                         &src_fsc->client->fsid, &dst_fsc->client->fsid);
>> >> +                    return -EXDEV;
>> >> +            }
>> >> +    }
>> >
>> > Just to be clear: what happens here if I mount two entirely separate
>> > clusters, and their OSDs don't have any access to one another? Will this
>> > fail at some later point with an error that we can catch so that we can
>> > fall back?
>>
>> This is exactly what this check prevents: if we have two CephFS from two
>> unrelated clusters mounted and we try to copy a file across them, the
>> operation will fail with -EXDEV[1] because the FSIDs for these two
>> ceph_fs_client will be different.  OTOH, if these two filesystems are
>> within the same cluster (and thus with the same FSID), then the OSDs are
>> able to do 'copy-from' operations between them.
>>
>> I've tested all these scenarios and they seem to be handled correctly.
>> Now, I'm assuming that *all* OSDs within the same ceph cluster can
>> communicate between themselves; if this assumption is false, then this
>> patch is broken.  But again, I'm not aware of any mechanism that
>> prevents 2 OSDs from communicating between them.
>
> Your assumption is correct: all OSDs in a Ceph cluster can communicate
> with each other. I'm not aware of any plans to change this.
>
> I spent a bit of time trying to figure out how this could break
> security models and things and didn't come up with anything, so I
> think functionally it's fine even though I find it a bit scary.
>
> Also, yes, cluster FSIDs are UUIDs so they shouldn't collide.

Awesome, thanks for clarifying these points!

Cheers,
-- 
Luis


> -Greg
>
>>
>> [1] Actually, the files will still be copied because we'll fall back to
>> the default VFS generic_copy_file_range behaviour, which is to do
>> read+write operations.
>>
>> Cheers,
>> --
>> Luis
>>
>>
>> >
>> >
>> >>      if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>> >>              return -EROFS;
>> >>
>> >> @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >>       * efficient).
>> >>       */
>> >>
>> >> -    if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> >> +    if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>> >>              return -EOPNOTSUPP;
>> >>
>> >>      if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> >> @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >>                              dst_ci->i_vino.ino, dst_objnum);
>> >>              /* Do an object remote copy */
>> >>              err = ceph_osdc_copy_from(
>> >> -                    &ceph_inode_to_client(src_inode)->client->osdc,
>> >> +                    &src_fsc->client->osdc,
>> >>                      src_ci->i_vino.snap, 0,
>> >>                      &src_oid, &src_oloc,
>> >>                      CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
>


end of thread, other threads:[~2019-09-10 10:45 UTC | newest]

Thread overview: 12+ messages
     [not found] <20190906135750.29543-1-lhenriques@suse.com>
     [not found] ` <30b09cb015563913d073c488c8de8ba0cceedd7b.camel@kernel.org>
2019-09-06 16:26   ` [PATCH] ceph: allow object copies across different filesystems in the same cluster Luis Henriques
2019-09-07 13:53     ` Jeff Layton
2019-09-09 10:18       ` Luis Henriques
2019-09-09 10:28         ` [PATCH v2] " Luis Henriques
2019-09-09 10:35           ` Jeff Layton
2019-09-09 11:05             ` Jeff Layton
2019-09-09 13:55               ` Luis Henriques
2019-09-09 15:21                 ` Jeff Layton
2019-09-09 11:15             ` Luis Henriques
2019-09-09 22:22               ` Gregory Farnum
2019-09-10 10:45                 ` Luis Henriques
2019-09-09 10:51           ` Ilya Dryomov
