From: Luis Henriques <lhenriques@suse.com>
To: Gregory Farnum <gfarnum@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>,
Jeff Layton <jlayton@kernel.org>, Sage Weil <sage@redhat.com>,
ceph-devel <ceph-devel@vger.kernel.org>,
linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] ceph: allow object copies across different filesystems in the same cluster
Date: Tue, 10 Sep 2019 11:45:41 +0100
Message-ID: <871rwoo2ei.fsf@suse.com>
In-Reply-To: <CAJ4mKGZVjJxQA69s92C+7DFbDxv87SOj10AUfyLXwVe9b+SDTw@mail.gmail.com> (Gregory Farnum's message of "Mon, 9 Sep 2019 15:22:10 -0700")
Gregory Farnum <gfarnum@redhat.com> writes:
> On Mon, Sep 9, 2019 at 4:15 AM Luis Henriques <lhenriques@suse.com> wrote:
>>
>> "Jeff Layton" <jlayton@kernel.org> writes:
>>
>> > On Mon, 2019-09-09 at 11:28 +0100, Luis Henriques wrote:
>> >> OSDs are able to perform object copies across different pools. Thus,
>> >> there's no need to prevent copy_file_range from doing remote copies if the
>> >> source and destination superblocks are different. Only return -EXDEV if
>> >> they have different fsid (the cluster ID).
>> >>
>> >> Signed-off-by: Luis Henriques <lhenriques@suse.com>
>> >> ---
>> >> fs/ceph/file.c | 18 ++++++++++++++----
>> >> 1 file changed, 14 insertions(+), 4 deletions(-)
>> >>
>> >> Hi,
>> >>
>> >> Here's the patch changelog since the initial submission:
>> >>
>> >> - Dropped have_fsid checks on client structs
>> >> - Use %pU to print the fsid instead of raw hex strings (%*ph)
>> >> - Fixed 'To:' field in email so that this time the patch hits vger
>> >>
>> >> Cheers,
>> >> --
>> >> Luis
>> >>
>> >> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> >> index 685a03cc4b77..4a624a1dd0bb 100644
>> >> --- a/fs/ceph/file.c
>> >> +++ b/fs/ceph/file.c
>> >> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >> struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> >> struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> >> struct ceph_cap_flush *prealloc_cf;
>> >> + struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>> >> struct ceph_object_locator src_oloc, dst_oloc;
>> >> struct ceph_object_id src_oid, dst_oid;
>> >> loff_t endoff = 0, size;
>> >> @@ -1915,8 +1916,17 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >>
>> >> if (src_inode == dst_inode)
>> >> return -EINVAL;
>> >> - if (src_inode->i_sb != dst_inode->i_sb)
>> >> - return -EXDEV;
>> >> + if (src_inode->i_sb != dst_inode->i_sb) {
>> >> + struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> >> +
>> >> + if (ceph_fsid_compare(&src_fsc->client->fsid,
>> >> + &dst_fsc->client->fsid)) {
>> >> + dout("Copying object across different clusters:");
>> >> + dout(" src fsid: %pU dst fsid: %pU\n",
>> >> + &src_fsc->client->fsid, &dst_fsc->client->fsid);
>> >> + return -EXDEV;
>> >> + }
>> >> + }
>> >
>> > Just to be clear: what happens here if I mount two entirely separate
>> > clusters, and their OSDs don't have any access to one another? Will this
>> > fail at some later point with an error that we can catch so that we can
>> > fall back?
>>
>> This is exactly what this check prevents: if we have two CephFS filesystems from two
>> unrelated clusters mounted and we try to copy a file across them, the
>> operation will fail with -EXDEV[1] because the FSIDs for these two
>> ceph_fs_client will be different. OTOH, if these two filesystems are
>> within the same cluster (and thus with the same FSID), then the OSDs are
>> able to do 'copy-from' operations between them.
>>
>> I've tested all these scenarios and they seem to be handled correctly.
>> Now, I'm assuming that *all* OSDs within the same ceph cluster can
>> communicate between themselves; if this assumption is false, then this
>> patch is broken. But again, I'm not aware of any mechanism that
>> prevents 2 OSDs from communicating between them.
>
> Your assumption is correct: all OSDs in a Ceph cluster can communicate
> with each other. I'm not aware of any plans to change this.
>
> I spent a bit of time trying to figure out how this could break
> security models and things and didn't come up with anything, so I
> think functionally it's fine even though I find it a bit scary.
>
> Also, yes, cluster FSIDs are UUIDs so they shouldn't collide.

Awesome, thanks for clarifying these points!

Cheers,
--
Luis

> -Greg
>
>>
>> [1] Actually, the files will still be copied because we'll fall back to
>> the default VFS generic_copy_file_range behaviour, which is to do
>> read+write operations.
>>
>> Cheers,
>> --
>> Luis
>>
>>
>> >
>> >
>> >> if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>> >> return -EROFS;
>> >>
>> >> @@ -1928,7 +1938,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >> * efficient).
>> >> */
>> >>
>> >> - if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> >> + if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>> >> return -EOPNOTSUPP;
>> >>
>> >> if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> >> @@ -2044,7 +2054,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> >> dst_ci->i_vino.ino, dst_objnum);
>> >> /* Do an object remote copy */
>> >> err = ceph_osdc_copy_from(
>> >> - &ceph_inode_to_client(src_inode)->client->osdc,
>> >> + &src_fsc->client->osdc,
>> >> src_ci->i_vino.snap, 0,
>> >> &src_oid, &src_oloc,
>> >> CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
>
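
For reference, the decision discussed in this thread can be modelled in
user space roughly as below. This is only a sketch under stated
assumptions, not the kernel code: struct fsid, struct fs_mount,
fsid_compare(), remote_copy_check() and choose_copy_path() are
hypothetical names invented for illustration, and the integer sb_id
merely stands in for the superblock pointer comparison in the patch.

```c
/* User-space model of the cluster check; NOT the kernel implementation. */
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for the 16-byte cluster fsid (a UUID). */
struct fsid {
	unsigned char bytes[16];
};

/* Hypothetical stand-in for one mounted filesystem (one superblock). */
struct fs_mount {
	int sb_id;           /* distinct per mount, like inode->i_sb */
	struct fsid cluster; /* identical for all mounts of one cluster */
};

/* memcmp-style result: 0 when both fsids are equal, non-zero otherwise. */
int fsid_compare(const struct fsid *a, const struct fsid *b)
{
	return memcmp(a->bytes, b->bytes, sizeof(a->bytes));
}

/*
 * Decide whether an OSD-side ("remote") copy may be attempted.
 * Different superblocks are acceptable as long as the fsid matches;
 * only a different cluster yields -EXDEV.
 */
int remote_copy_check(const struct fs_mount *src, const struct fs_mount *dst)
{
	if (src->sb_id != dst->sb_id &&
	    fsid_compare(&src->cluster, &dst->cluster))
		return -EXDEV;
	return 0;
}

enum copy_path { COPY_REMOTE, COPY_GENERIC };

/* Model of footnote [1]: -EXDEV selects the generic read+write path. */
enum copy_path choose_copy_path(const struct fs_mount *src,
				const struct fs_mount *dst)
{
	return remote_copy_check(src, dst) == -EXDEV ? COPY_GENERIC
						     : COPY_REMOTE;
}
```

With two mounts that share an fsid, choose_copy_path() picks the remote
path even though the superblocks differ; with mounts from unrelated
clusters it picks the generic path, so the data is still copied, just
without the 'copy-from' optimisation.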