* reducing s_mutex coverage in kcephfs client
@ 2020-03-26 16:58 Jeff Layton
  2020-03-27 14:31 ` Yan, Zheng
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Layton @ 2020-03-26 16:58 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Gregory Farnum, Ceph Development, Patrick Donnelly, Sage Weil

I had mentioned this in standup this morning, but it's a bit of a
complex topic and Zheng asked me to send email instead. I'm also cc'ing
ceph-devel for posterity...

The locking in the cap handling code is extremely hairy, with many
places where we need to take sleeping locks while we're in atomic
context (under spinlock, mostly). A lot of the problem is due to the
need to take the session->s_mutex.

For instance, there's this in ceph_check_caps:

a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 1999)              if (session && session != cap->session) {
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2000)                      dout("oops, wrong session %p mutex\n", session);
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2001)                      mutex_unlock(&session->s_mutex);
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2002)                      session = NULL;
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2003)              }
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2004)              if (!session) {
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2005)                      session = cap->session;
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2006)                      if (mutex_trylock(&session->s_mutex) == 0) {
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2007)                              dout("inverting session/ino locks on %p\n",
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2008)                                   session);
be655596b3de5 (Sage Weil           2011-11-30 09:47:09 -0800 2009)                              spin_unlock(&ci->i_ceph_lock);
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2010)                              if (took_snap_rwsem) {
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2011)                                      up_read(&mdsc->snap_rwsem);
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2012)                                      took_snap_rwsem = 0;
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2013)                              }
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2014)                              mutex_lock(&session->s_mutex);
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2015)                              goto retry;
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2016)                      }
a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2017)              }

At this point, we're walking the inode's caps rbtree, while holding the
inode->i_ceph_lock. We're eventually going to need to send a cap message
to the MDS for this cap, but that requires the cap->session->s_mutex. We
try to take it without blocking first, but if that fails, we have to
unwind all of the locking and start over. Gross. That also makes the
handling of snap_rwsem much more complex than it really should be.

It does this, despite the fact that the cap message doesn't actually
need much from the session (just the session->s_con, mostly). Most of
the info in the message comes from the inode and cap objects.

My question is: What is the s_mutex guaranteeing at this point?

More to the point, is it strictly required that we hold that mutex as we
marshal up the outgoing request? It would be much cleaner to be able to
just drop the spinlock after getting the ceph_msg_args ready to send,
then take the session mutex and send the request.

The state of the MDS session is not checked in this codepath before the
send, so it doesn't seem like ordering vs. session state messages is
very important. This _is_ ordered vs. regular MDS requests, but a
per-session mutex seems like a very heavyweight way to do that.

If we're concerned about reordering cap messages that involve the same
inode, then there are other ways to ensure that ordering without
resorting to a coarse-grained mutex.

It's just not clear to me what data this mutex is protecting in this
case.

Any hints would be welcome!
-- 
Jeff Layton <jlayton@redhat.com>


* Re: reducing s_mutex coverage in kcephfs client
  2020-03-26 16:58 reducing s_mutex coverage in kcephfs client Jeff Layton
@ 2020-03-27 14:31 ` Yan, Zheng
  2020-03-27 19:47   ` Jeff Layton
  0 siblings, 1 reply; 5+ messages in thread
From: Yan, Zheng @ 2020-03-27 14:31 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Gregory Farnum, Ceph Development, Patrick Donnelly, Sage Weil

On Fri, Mar 27, 2020 at 12:58 AM Jeff Layton <jlayton@redhat.com> wrote:
>
> I had mentioned this in standup this morning, but it's a bit of a
> complex topic and Zheng asked me to send email instead. I'm also cc'ing
> ceph-devel for posterity...
>
> The locking in the cap handling code is extremely hairy, with many
> places where we need to take sleeping locks while we're in atomic
> context (under spinlock, mostly). A lot of the problem is due to the
> need to take the session->s_mutex.
>
> For instance, there's this in ceph_check_caps:
>
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 1999)              if (session && session != cap->session) {
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2000)                      dout("oops, wrong session %p mutex\n", session);
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2001)                      mutex_unlock(&session->s_mutex);
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2002)                      session = NULL;
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2003)              }
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2004)              if (!session) {
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2005)                      session = cap->session;
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2006)                      if (mutex_trylock(&session->s_mutex) == 0) {
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2007)                              dout("inverting session/ino locks on %p\n",
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2008)                                   session);
> be655596b3de5 (Sage Weil           2011-11-30 09:47:09 -0800 2009)                              spin_unlock(&ci->i_ceph_lock);
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2010)                              if (took_snap_rwsem) {
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2011)                                      up_read(&mdsc->snap_rwsem);
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2012)                                      took_snap_rwsem = 0;
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2013)                              }
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2014)                              mutex_lock(&session->s_mutex);
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2015)                              goto retry;
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2016)                      }
> a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2017)              }
>
> At this point, we're walking the inode's caps rbtree, while holding the
> inode->i_ceph_lock. We're eventually going to need to send a cap message
> to the MDS for this cap, but that requires the cap->session->s_mutex. We
> try to take it without blocking first, but if that fails, we have to
> unwind all of the locking and start over. Gross. That also makes the
> handling of snap_rwsem much more complex than it really should be too.
>
> It does this, despite the fact that the cap message doesn't actually
> need much from the session (just the session->s_con, mostly). Most of
> the info in the message comes from the inode and cap objects.
>
> My question is: What is the s_mutex guaranteeing at this point?
>
> More to the point, is it strictly required that we hold that mutex as we
> marshal up the outgoing request? It would be much cleaner to be able to
> just drop the spinlock after getting the ceph_msg_args ready to send,
> then take the session mutex and send the request.
>
> The state of the MDS session is not checked in this codepath before the
> send, so it doesn't seem like ordering vs. session state messages is
> very important. This _is_ ordered vs. regular MDS requests, but a
> per-session mutex seems like a very heavyweight way to do that.
>
> If we're concerned about reordering cap messages that involve the same
> inode, then there are other ways to ensure that ordering that don't
> require a coarse-grained mutex.
>
> It's just not clear to me what data this mutex is protecting in this
> case.

I think it's mainly for message ordering. For example, a request may
release multiple inodes' caps (via ceph_encode_inode_release). Before
sending the request out, we need to prevent ceph_check_caps() from
touching these inodes' caps and sending cap messages.

>
> Any hints would be welcome!
> --
> Jeff Layton <jlayton@redhat.com>
>


* Re: reducing s_mutex coverage in kcephfs client
  2020-03-27 14:31 ` Yan, Zheng
@ 2020-03-27 19:47   ` Jeff Layton
  2020-03-29 15:10     ` Yan, Zheng
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Layton @ 2020-03-27 19:47 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Gregory Farnum, Ceph Development, Patrick Donnelly, Sage Weil

On Fri, 2020-03-27 at 22:31 +0800, Yan, Zheng wrote:
> On Fri, Mar 27, 2020 at 12:58 AM Jeff Layton <jlayton@redhat.com> wrote:
> > I had mentioned this in standup this morning, but it's a bit of a
> > complex topic and Zheng asked me to send email instead. I'm also cc'ing
> > ceph-devel for posterity...
> > 
> > The locking in the cap handling code is extremely hairy, with many
> > places where we need to take sleeping locks while we're in atomic
> > context (under spinlock, mostly). A lot of the problem is due to the
> > need to take the session->s_mutex.
> > 
> > For instance, there's this in ceph_check_caps:
> > 
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 1999)              if (session && session != cap->session) {
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2000)                      dout("oops, wrong session %p mutex\n", session);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2001)                      mutex_unlock(&session->s_mutex);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2002)                      session = NULL;
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2003)              }
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2004)              if (!session) {
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2005)                      session = cap->session;
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2006)                      if (mutex_trylock(&session->s_mutex) == 0) {
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2007)                              dout("inverting session/ino locks on %p\n",
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2008)                                   session);
> > be655596b3de5 (Sage Weil           2011-11-30 09:47:09 -0800 2009)                              spin_unlock(&ci->i_ceph_lock);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2010)                              if (took_snap_rwsem) {
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2011)                                      up_read(&mdsc->snap_rwsem);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2012)                                      took_snap_rwsem = 0;
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2013)                              }
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2014)                              mutex_lock(&session->s_mutex);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2015)                              goto retry;
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2016)                      }
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2017)              }
> > 
> > At this point, we're walking the inode's caps rbtree, while holding the
> > inode->i_ceph_lock. We're eventually going to need to send a cap message
> > to the MDS for this cap, but that requires the cap->session->s_mutex. We
> > try to take it without blocking first, but if that fails, we have to
> > unwind all of the locking and start over. Gross. That also makes the
> > handling of snap_rwsem much more complex than it really should be too.
> > 
> > It does this, despite the fact that the cap message doesn't actually
> > need much from the session (just the session->s_con, mostly). Most of
> > the info in the message comes from the inode and cap objects.
> > 
> > My question is: What is the s_mutex guaranteeing at this point?
> > 
> > More to the point, is it strictly required that we hold that mutex as we
> > marshal up the outgoing request? It would be much cleaner to be able to
> > just drop the spinlock after getting the ceph_msg_args ready to send,
> > then take the session mutex and send the request.
> > 
> > The state of the MDS session is not checked in this codepath before the
> > send, so it doesn't seem like ordering vs. session state messages is
> > very important. This _is_ ordered vs. regular MDS requests, but a
> > per-session mutex seems like a very heavyweight way to do that.
> > 
> > If we're concerned about reordering cap messages that involve the same
> > inode, then there are other ways to ensure that ordering that don't
> > require a coarse-grained mutex.
> > 
> > It's just not clear to me what data this mutex is protecting in this
> > case.
> 
> I think it's mainly for message ordering. For example, a request may
> release multiple inodes' caps (via ceph_encode_inode_release). Before
> sending the request out, we need to prevent ceph_check_caps() from
> touching these inodes' caps and sending cap messages.

I don't get it.

AFAICT, ceph_encode_inode_release is called while holding the
mdsc->mutex, not the s_mutex. That is serialized on the i_ceph_lock, but
I don't think there's any guarantee about the order in which (e.g.) a
racing cap update and release would be sent.

--
Jeff Layton <jlayton@redhat.com>


* Re: reducing s_mutex coverage in kcephfs client
  2020-03-27 19:47   ` Jeff Layton
@ 2020-03-29 15:10     ` Yan, Zheng
  2020-03-30 18:59       ` Jeff Layton
  0 siblings, 1 reply; 5+ messages in thread
From: Yan, Zheng @ 2020-03-29 15:10 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Gregory Farnum, Ceph Development, Patrick Donnelly, Sage Weil

On Sat, Mar 28, 2020 at 3:47 AM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Fri, 2020-03-27 at 22:31 +0800, Yan, Zheng wrote:
> > On Fri, Mar 27, 2020 at 12:58 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > I had mentioned this in standup this morning, but it's a bit of a
> > > complex topic and Zheng asked me to send email instead. I'm also cc'ing
> > > ceph-devel for posterity...
> > >
> > > The locking in the cap handling code is extremely hairy, with many
> > > places where we need to take sleeping locks while we're in atomic
> > > context (under spinlock, mostly). A lot of the problem is due to the
> > > need to take the session->s_mutex.
> > >
> > > For instance, there's this in ceph_check_caps:
> > >
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 1999)              if (session && session != cap->session) {
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2000)                      dout("oops, wrong session %p mutex\n", session);
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2001)                      mutex_unlock(&session->s_mutex);
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2002)                      session = NULL;
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2003)              }
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2004)              if (!session) {
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2005)                      session = cap->session;
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2006)                      if (mutex_trylock(&session->s_mutex) == 0) {
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2007)                              dout("inverting session/ino locks on %p\n",
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2008)                                   session);
> > > be655596b3de5 (Sage Weil           2011-11-30 09:47:09 -0800 2009)                              spin_unlock(&ci->i_ceph_lock);
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2010)                              if (took_snap_rwsem) {
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2011)                                      up_read(&mdsc->snap_rwsem);
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2012)                                      took_snap_rwsem = 0;
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2013)                              }
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2014)                              mutex_lock(&session->s_mutex);
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2015)                              goto retry;
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2016)                      }
> > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2017)              }
> > >
> > > At this point, we're walking the inode's caps rbtree, while holding the
> > > inode->i_ceph_lock. We're eventually going to need to send a cap message
> > > to the MDS for this cap, but that requires the cap->session->s_mutex. We
> > > try to take it without blocking first, but if that fails, we have to
> > > unwind all of the locking and start over. Gross. That also makes the
> > > handling of snap_rwsem much more complex than it really should be too.
> > >
> > > It does this, despite the fact that the cap message doesn't actually
> > > need much from the session (just the session->s_con, mostly). Most of
> > > the info in the message comes from the inode and cap objects.
> > >
> > > My question is: What is the s_mutex guaranteeing at this point?
> > >
> > > More to the point, is it strictly required that we hold that mutex as we
> > > marshal up the outgoing request? It would be much cleaner to be able to
> > > just drop the spinlock after getting the ceph_msg_args ready to send,
> > > then take the session mutex and send the request.
> > >
> > > The state of the MDS session is not checked in this codepath before the
> > > send, so it doesn't seem like ordering vs. session state messages is
> > > very important. This _is_ ordered vs. regular MDS requests, but a
> > > per-session mutex seems like a very heavyweight way to do that.
> > >
> > > If we're concerned about reordering cap messages that involve the same
> > > inode, then there are other ways to ensure that ordering that don't
> > > require a coarse-grained mutex.
> > >
> > > It's just not clear to me what data this mutex is protecting in this
> > > case.
> >
> > I think it's mainly for message ordering. For example, a request may
> > release multiple inodes' caps (via ceph_encode_inode_release). Before
> > sending the request out, we need to prevent ceph_check_caps() from
> > touching these inodes' caps and sending cap messages.
>
> I don't get it.
>
> AFAICT, ceph_encode_inode_release is called while holding the
> mdsc->mutex, not the s_mutex. That is serialized on the i_ceph_lock, but
> I don't think there's any guarantee about the order in which (e.g.) a
> racing cap update and release would be sent.
>

You are right. Cap messages can be slightly out of order in the above case.

I checked the code again. I think s_mutex is mainly for:

- cap message senders use s_mutex to ensure the session does not get
closed by CEPH_SESSION_CLOSE. For example, __mark_caps_flushing() may
race with remove_session_caps().
- some functions, such as ceph_iterate_session_caps, are not reentrant;
s_mutex is used to protect them.
- send_mds_reconnect() uses s_mutex to prevent other threads from
modifying cap states while it is composing the reconnect message.

Regards,
Yan, Zheng



> --
> Jeff Layton <jlayton@redhat.com>
>


* Re: reducing s_mutex coverage in kcephfs client
  2020-03-29 15:10     ` Yan, Zheng
@ 2020-03-30 18:59       ` Jeff Layton
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff Layton @ 2020-03-30 18:59 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Gregory Farnum, Ceph Development, Patrick Donnelly, Sage Weil

On Sun, 2020-03-29 at 23:10 +0800, Yan, Zheng wrote:
> On Sat, Mar 28, 2020 at 3:47 AM Jeff Layton <jlayton@redhat.com> wrote:
> > On Fri, 2020-03-27 at 22:31 +0800, Yan, Zheng wrote:
> > > On Fri, Mar 27, 2020 at 12:58 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > I had mentioned this in standup this morning, but it's a bit of a
> > > > complex topic and Zheng asked me to send email instead. I'm also cc'ing
> > > > ceph-devel for posterity...
> > > > 
> > > > The locking in the cap handling code is extremely hairy, with many
> > > > places where we need to take sleeping locks while we're in atomic
> > > > context (under spinlock, mostly). A lot of the problem is due to the
> > > > need to take the session->s_mutex.
> > > > 
> > > > For instance, there's this in ceph_check_caps:
> > > > 
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 1999)              if (session && session != cap->session) {
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2000)                      dout("oops, wrong session %p mutex\n", session);
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2001)                      mutex_unlock(&session->s_mutex);
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2002)                      session = NULL;
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2003)              }
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2004)              if (!session) {
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2005)                      session = cap->session;
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2006)                      if (mutex_trylock(&session->s_mutex) == 0) {
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2007)                              dout("inverting session/ino locks on %p\n",
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2008)                                   session);
> > > > be655596b3de5 (Sage Weil           2011-11-30 09:47:09 -0800 2009)                              spin_unlock(&ci->i_ceph_lock);
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2010)                              if (took_snap_rwsem) {
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2011)                                      up_read(&mdsc->snap_rwsem);
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2012)                                      took_snap_rwsem = 0;
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2013)                              }
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2014)                              mutex_lock(&session->s_mutex);
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2015)                              goto retry;
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2016)                      }
> > > > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2017)              }
> > > > 
> > > > At this point, we're walking the inode's caps rbtree, while holding the
> > > > inode->i_ceph_lock. We're eventually going to need to send a cap message
> > > > to the MDS for this cap, but that requires the cap->session->s_mutex. We
> > > > try to take it without blocking first, but if that fails, we have to
> > > > unwind all of the locking and start over. Gross. That also makes the
> > > > handling of snap_rwsem much more complex than it really should be too.
> > > > 
> > > > It does this, despite the fact that the cap message doesn't actually
> > > > need much from the session (just the session->s_con, mostly). Most of
> > > > the info in the message comes from the inode and cap objects.
> > > > 
> > > > My question is: What is the s_mutex guaranteeing at this point?
> > > > 
> > > > More to the point, is it strictly required that we hold that mutex as we
> > > > marshal up the outgoing request? It would be much cleaner to be able to
> > > > just drop the spinlock after getting the ceph_msg_args ready to send,
> > > > then take the session mutex and send the request.
> > > > 
> > > > The state of the MDS session is not checked in this codepath before the
> > > > send, so it doesn't seem like ordering vs. session state messages is
> > > > very important. This _is_ ordered vs. regular MDS requests, but a
> > > > per-session mutex seems like a very heavyweight way to do that.
> > > > 
> > > > If we're concerned about reordering cap messages that involve the same
> > > > inode, then there are other ways to ensure that ordering that don't
> > > > require a coarse-grained mutex.
> > > > 
> > > > It's just not clear to me what data this mutex is protecting in this
> > > > case.
> > > 
> > > I think it's mainly for message ordering. For example, a request may
> > > release multiple inodes' caps (via ceph_encode_inode_release). Before
> > > sending the request out, we need to prevent ceph_check_caps() from
> > > touching these inodes' caps and sending cap messages.
> > 
> > I don't get it.
> > 
> > AFAICT, ceph_encode_inode_release is called while holding the
> > mdsc->mutex, not the s_mutex. That is serialized on the i_ceph_lock, but
> > I don't think there's any guarantee about the order in which (e.g.) a
> > racing cap update and release would be sent.
> > 
> 
> You are right. Cap messages can be slightly out of order in the above case.
> 

Thanks for confirming it.

> I checked the code again.  I think s_mutex is mainly for:
> 
> - cap message senders use s_mutex to ensure the session does not get
> closed by CEPH_SESSION_CLOSE. For example, __mark_caps_flushing() may
> race with remove_session_caps().

Ok, we should definitely try to prevent that race.

> - some functions, such as ceph_iterate_session_caps, are not reentrant;
> s_mutex is used to protect them.

Good point. I'll have a look at that.

> - send_mds_reconnect() uses s_mutex to prevent other threads from
> modifying cap states while it is composing the reconnect message.
> 

Ok. I think we can reduce the coverage of the s_mutex somewhat, and
leave it in place for some of the more rare, heavyweight operations.

At this point, I realize that most of the cap message setup can (and
probably should) be done under the i_ceph_lock. It's only the sending
of the message that seems to require the s_mutex... which is a shame,
because the con has its own mutex, which could also be used to ensure
ordering.

Ideally, the queueing of outgoing messages would not require a mutex at
all, so we could queue them up in atomic context. That would do a lot
to simplify the cap handling, and it would probably be a lot more
efficient to boot.

It also stands out to me that most, if not all, of the operations under
ceph_con_send won't block. It doesn't seem like queueing up a message
and kicking the workqueue ought to require a mutex.
-- 
Jeff Layton <jlayton@redhat.com>

