From: Jeff Layton <jlayton@redhat.com>
To: John Spray <jspray@redhat.com>
Cc: "Yan, Zheng" <zyan@redhat.com>,
	ceph-devel <ceph-devel@vger.kernel.org>,
	Ilya Dryomov <idryomov@gmail.com>, Sage Weil <sage@redhat.com>
Subject: Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
Date: Thu, 02 Feb 2017 11:07:20 -0500
Message-ID: <1486051640.2812.16.camel@redhat.com>
In-Reply-To: <CALe9h7cj_vqu-Kt2jauC0pA2Fpp-cjfg30fDQKeN7r55iXeL7A@mail.gmail.com>

On Wed, 2017-02-01 at 19:55 +0000, John Spray wrote:
> On Wed, Feb 1, 2017 at 7:50 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Sun, 2017-01-22 at 17:40 +0800, Yan, Zheng wrote:
> > > > On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > This patch is heavily inspired by John Spray's earlier work, but
> > > > implemented in a different way.
> > > > 
> > > > Create and register a new map_cb for cephfs, to allow it to handle
> > > > changes to the osdmap.
> > > > 
> > > > In version 5 of CLIENT_CAPS messages, the barrier field is added as an
> > > > instruction to clients that they may not use the attached capabilities
> > > > until they have a particular OSD map epoch.
> > > > 
> > > > When we get a message with such a field and don't have the requisite map
> > > > epoch yet, we put that message on a list in the session, to be run when
> > > > the map does come in.
> > > > 
> > > > When we get a new map update, the map_cb routine first checks whether
> > > > there may be an OSD or pool full condition. If so, we walk the list of
> > > > outstanding OSD requests and kill off any writes to full OSDs or pools
> > > > with -ENOSPC. While cancelling, we store the latest OSD epoch seen in
> > > > each request; this will be used later in the CAPRELEASE messages.
> > > > 
> > > > Then it walks the session list and queues the workqueue job for each
> > > > session. When the workqueue job runs, it walks the list of delayed caps
> > > > and tries to rerun each one. If the epoch is still not high enough, the
> > > > message just gets put back on the delay queue until the map comes in.
> > > > 
> > > > Suggested-by: John Spray <john.spray@redhat.com>
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > > fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
> > > > fs/ceph/debugfs.c    |  3 +++
> > > > fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > fs/ceph/mds_client.h |  3 +++
> > > > 4 files changed, 120 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index d941c48e8bff..f33d424b5e12 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > >     /* inline data size */
> > > >     ceph_encode_32(&p, 0);
> > > >     /* osd_epoch_barrier (version 5) */
> > > > -   ceph_encode_32(&p, 0);
> > > > +   ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
> > > >     /* oldest_flush_tid (version 6) */
> > > >     ceph_encode_64(&p, arg->oldest_flush_tid);
> > > > 
> > > > @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > > >     void *snaptrace;
> > > >     size_t snaptrace_len;
> > > >     void *p, *end;
> > > > +   u32 epoch_barrier = 0;
> > > > 
> > > >     dout("handle_caps from mds%d\n", mds);
> > > > 
> > > > +   WARN_ON_ONCE(!list_empty(&msg->list_head));
> > > > +
> > > >     /* decode */
> > > >     end = msg->front.iov_base + msg->front.iov_len;
> > > >     tid = le64_to_cpu(msg->hdr.tid);
> > > > @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > > >             p += inline_len;
> > > >     }
> > > > 
> > > > +   if (le16_to_cpu(msg->hdr.version) >= 5) {
> > > > +           struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
> > > > +
> > > > +           ceph_decode_32_safe(&p, end, epoch_barrier, bad);
> > > > +
> > > > +           /* Do lockless check first to avoid mutex if we can */
> > > > +           if (epoch_barrier > mdsc->cap_epoch_barrier) {
> > > > +                   mutex_lock(&mdsc->mutex);
> > > > +                   if (epoch_barrier > mdsc->cap_epoch_barrier)
> > > > +                           mdsc->cap_epoch_barrier = epoch_barrier;
> > > > +                   mutex_unlock(&mdsc->mutex);
> > > > +           }
> > > > +
> > > > +           down_read(&osdc->lock);
> > > > +           if (osdc->osdmap->epoch < epoch_barrier) {
> > > > +                   dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
> > > > +                   ceph_msg_get(msg);
> > > > +                   spin_lock(&session->s_cap_lock);
> > > > +                   list_add(&msg->list_head, &session->s_delayed_caps);
> > > > +                   spin_unlock(&session->s_cap_lock);
> > > > +
> > > > +                   // Kick OSD client to get the latest map
> > > > +                   __ceph_osdc_maybe_request_map(osdc);
> > > > +
> > > > +                   up_read(&osdc->lock);
> > > > +                   return;
> > > > +           }
> > > 
> > > Cap messages need to be handled in the same order as they were sent. I'm
> > > worried that the delay breaks something or makes cap revocation slow. Why
> > > not use the same approach as the userspace client: pass the epoch_barrier
> > > to libceph and let libceph delay sending the OSD requests?
> > > 
> > > Regards
> > > Yan, Zheng
> > > 
> > 
> > Now that I've looked at this in more detail, I'm not sure I understand
> > what you're suggesting.
> > 
> > In this case, we have gotten a cap message from the MDS, and we can't
> > use any caps granted in there until the right map epoch comes in. Isn't
> > it wrong to do all of the stuff in handle_cap_grant (for instance)
> > before we've received that map epoch?
> > 
> > The userland client seems to just idle OSD requests in the
> > objecter layer until the right map comes in. Is that really sufficient
> > here?
> 
> Yes -- the key thing is that the client sees the effects of *another*
> client's blacklisting (the client that might have previously held caps
> on the file we're about to write/read).  So the client receiving the
> barrier message is completely OK and entitled to the caps; it just
> needs to make sure its OSD ops don't go out until it has seen the
> epoch, so that its operations are properly ordered with respect to the
> blacklisting of this notional other client.
> 
> John

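Ok, that makes sense -- so the kernel client presumably wants the same
thing: stash the barrier in the osd_client itself and have the request
submission path hold anything that arrives before the map catches up.
A rough sketch of what I have in mind (the epoch_barrier field and the
helper below are illustrative only, not existing libceph API;
__ceph_osdc_maybe_request_map is from patch 3/7 of this series):

	/* Sketch only: track the highest barrier seen in the osd_client */
	void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
	{
		down_write(&osdc->lock);
		if (eb > osdc->epoch_barrier) {
			osdc->epoch_barrier = eb;
			/* kick off a map request if ours is too old */
			if (eb > osdc->osdmap->epoch)
				__ceph_osdc_maybe_request_map(osdc);
		}
		up_write(&osdc->lock);
	}

	/*
	 * ...and the submission path would then check, with osdc->lock held:
	 *
	 *	if (osdc->osdmap->epoch < osdc->epoch_barrier)
	 *		// park the request until a new enough map arrives
	 */
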
One more question...how is epoch_t wraparound handled?

The epoch is only 32 bits, so it seems like wraparound could happen with
enough clients flapping, and 0 has a bit of a special meaning here,
AFAICS. Most of the code seems to assume that it can never occur. Could
it?
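
If it can, I'd think all of these bare ">" comparisons on epochs would
need to be done serial-number style, along the lines of the time_after()
macros in <linux/jiffies.h>. Something like this (illustrative only):

	/* wraparound-safe: true iff epoch a is newer than epoch b */
	static inline bool epoch_after(u32 a, u32 b)
	{
		return (s32)(b - a) < 0;
	}

...though that only works as long as two live epochs never differ by
more than 2^31.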

-- 
Jeff Layton <jlayton@redhat.com>
