From: Ilya Dryomov <idryomov@gmail.com>
To: Luis Henriques <lhenriques@suse.com>
Cc: Jeff Layton <jlayton@kernel.org>, Sage Weil <sage@redhat.com>,
	"Yan, Zheng" <ukernel@gmail.com>,
	Ceph Development <ceph-devel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/2] ceph: safely use 'copy-from' Op on Octopus OSDs
Date: Fri, 8 Nov 2019 16:15:35 +0100
Message-ID: <CAOi1vP-sVQKvpiPLoZ=9s7Hy=c2eQRocxSs1nPrXAUCbbZUZ-g@mail.gmail.com>
In-Reply-To: <20191108141555.31176-1-lhenriques@suse.com>

On Fri, Nov 8, 2019 at 3:15 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> Hi!
>
> (Sorry for the long cover letter!)

This is exactly what cover letters are for!

>
> Since the fix for [1] has finally been merged and should be available
> in the next (Octopus) Ceph release, I'm trying to clean up my kernel
> client patch, which tries to find out whether or not it's safe to use
> the 'copy-from' RADOS operation for copy_file_range.
>
> So, the fix for [1] was to modify the 'copy-from' operation to allow
> clients to optionally (using the CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ
> flag) send the extra truncate_seq and truncate_size parameters.  Since
> only Octopus will have this fix (no backports planned), the client
> simply needs to ensure the OSDs being used have SERVER_OCTOPUS in their
> features.
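>
> For illustration, the encoding change would look roughly like this (a
> sketch with simplified names, not the actual patch):
>
>     /*
>      * Only encode the extra truncate parameters when the new flag is
>      * set, so that pre-Octopus OSDs never see fields they can't
>      * decode.
>      */
>     ceph_encode_8(&p, copy_from_flags);
>     if (copy_from_flags & CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ) {
>         ceph_encode_32(&p, truncate_seq);   /* new in Octopus */
>         ceph_encode_64(&p, truncate_size);  /* new in Octopus */
>     }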
>
> My initial solution was to add an extra test in __submit_request,
> looping over all the request ops and checking whether the connection
> has the required features for each of them.  Obviously, at the moment
> only the copy-from operation has such a restriction, but I guess
> others may be added in the future.  I believe that doing the check at
> this point (__submit_request) allows us to cover cases where a cluster
> is being upgraded to Octopus and we have different OSDs running with
> different feature bits.
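>
> In code, the check would be something along these lines (sketch only;
> the exact field and feature names may differ):
>
>     /* In __submit_request(), before sending: walk the ops and bail
>      * out if the peer OSD lacks a feature that an op depends on. */
>     for (i = 0; i < req->r_num_ops; i++) {
>         if (req->r_ops[i].op == CEPH_OSD_OP_COPY_FROM &&
>             !(con->peer_features & CEPH_FEATURE_SERVER_OCTOPUS)) {
>             complete_request(req, -EOPNOTSUPP);
>             return;
>         }
>     }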
>
> Unfortunately, this solution is racy because the connection state
> machine may be changing and the peer_features field isn't yet set.  For
> example: if the connection to an OSD is being re-opened when we're about
> to check the features, the con->state will be CON_STATE_PREOPEN and the
> con->peer_features will be 0.  I tried to find ways to move the feature
> check further down in the stack, but that can't be easily done without
> adding more infrastructure.  A solution that came to mind was to add
> a new callback to con->ops, invoked in the context of ceph_con_workfn,
> under the
> con->mutex.  This callback could then verify the available features,
> aborting the operation if needed.
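>
> Sketched out, the hook could look like this (hypothetical, just to
> show the shape of it):
>
>     /* New entry in ceph_connection_operations, invoked from
>      * ceph_con_workfn() with con->mutex held, once peer_features is
>      * known to be valid.  Returning an error aborts the operation. */
>     int (*check_features)(struct ceph_connection *con,
>                           u64 peer_features);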
>
> Note that the race in this patchset doesn't seem to be a huge problem,
> other than occasionally falling back to the generic VFS
> copy_file_range, as
> -EOPNOTSUPP will be returned here.  But it's still a race, and there are
> probably other cases that I'm missing.
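>
> In other words, the failure mode is just the usual fallback path,
> along the lines of (sketch):
>
>     ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
>                                  len, flags);
>     if (ret == -EOPNOTSUPP)
>         /* OSD-side copy unavailable, do a plain read/write copy */
>         ret = generic_copy_file_range(src_file, src_off, dst_file,
>                                       dst_off, len, flags);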
>
> Anyway, maybe I'm missing an obvious solution for checking these OSD
> features, but I'm open to suggestions on other options (or to feedback
> on the option of adding a new callback to ceph_connection_operations).
>
> [1] https://tracker.ceph.com/issues/37378

If the OSD checked for unknown flags, like newer syscalls do, it would
be super easy, but it looks like it doesn't.
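
I mean the usual pattern at the top of a new syscall, i.e. (with
SUPPORTED_FLAGS standing in for whatever mask applies):

    /* Reject flag bits we don't know about, so that callers can probe
     * for support instead of having unknown flags silently ignored. */
    if (flags & ~SUPPORTED_FLAGS)
        return -EINVAL;

If the 'copy-from' handler did that, the client could simply try the
new flag and fall back on error.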

An obvious solution is to look at require_osd_release in the osdmap,
but we don't decode that in the kernel because it lives in the OSD
portion of the osdmap.  We could add that and consider the fact that
the client now needs to decode more than just the client portion
a design mistake.  I'm not sure what can of worms that opens and
whether copy-from alone is worth it, though.  Perhaps that field could
be moved to (or a copy of it be replicated in) the client portion of
the osdmap starting with Octopus?  We seem to be running into it on
the client side more and more...
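
If we did decode it, the gate itself would be trivial (a sketch,
assuming we grow a require_osd_release field and release constants on
the kernel side):

    /* Only attempt 'copy-from' with the truncate_seq flag when the
     * whole cluster is known to be running Octopus or later. */
    if (osdc->osdmap->require_osd_release < CEPH_RELEASE_OCTOPUS)
        return -EOPNOTSUPP;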

Given the track record of this feature (the fix for the most recently
discovered data-corrupting bug hasn't even been merged yet), I would
be very
hesitant to turn it back on by default even if we figure out a good
solution for the feature check.  In my opinion, it should stay opt-in.

Thanks,

                Ilya

Thread overview: 13+ messages
2019-11-08 14:15 [RFC PATCH 0/2] ceph: safely use 'copy-from' Op on Octopus OSDs Luis Henriques
2019-11-08 14:15 ` [RFC PATCH 1/2] ceph: add support for sending truncate_{seq,size} in 'copy-from' Op Luis Henriques
2019-11-08 14:15 ` [RFC PATCH 2/2] ceph: make 'copyfrom' a default mount option again Luis Henriques
2019-11-08 15:15 ` Ilya Dryomov [this message]
2019-11-08 16:47   ` [RFC PATCH 0/2] ceph: safely use 'copy-from' Op on Octopus OSDs Luis Henriques
2019-11-08 16:57     ` Sage Weil
2019-11-08 17:16       ` Luis Henriques
2019-11-08 17:22         ` Sage Weil
2019-11-08 17:31           ` Luis Henriques
2019-11-11 16:30             ` Luis Henriques
2019-11-11 20:51               ` Ilya Dryomov
2019-11-12 10:42                 ` Luis Henriques
2019-11-08 16:59     ` Ilya Dryomov
