* [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments
@ 2020-03-23 16:07 Jeff Layton
  2020-03-23 16:07 ` [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse Jeff Layton
                   ` (9 more replies)
  0 siblings, 10 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

I've been going over the cap handling code with an aim toward
simplifying the locking. There's one fix for a potential use-after-free
race in here. This also eliminates a number of __acquires and __releases
annotations by reorganizing the code, and adds some (hopefully helpful)
comments.

There should be no behavioral changes with this set.
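
For context, __acquires and __releases are sparse annotations that mark
functions which return with a lock in a different state than the one
they were entered with. A minimal sketch of the pattern this series
removes (illustrative only, not code from the series):

	static void update_and_send(struct ceph_inode_info *ci)
		__releases(ci->i_ceph_lock)
	{
		/* update state while the caller's spinlock is still held */
		spin_unlock(&ci->i_ceph_lock);
		/* ...do work that must not run under the spinlock... */
	}

Functions like this unlock on the caller's behalf, so every caller has
to remember that the lock is gone on return, which is easy to get
wrong.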

Jeff Layton (8):
  ceph: reorganize __send_cap for less spinlock abuse
  ceph: split up __finish_cap_flush
  ceph: add comments for handle_cap_flush_ack logic
  ceph: don't release i_ceph_lock in handle_cap_trunc
  ceph: don't take i_ceph_lock in handle_cap_import
  ceph: document what protects i_dirty_item and i_flushing_item
  ceph: fix potential race in ceph_check_caps
  ceph: throw a warning if we destroy session with mutex still locked

 fs/ceph/caps.c       | 292 ++++++++++++++++++++++++-------------------
 fs/ceph/mds_client.c |   1 +
 fs/ceph/super.h      |   4 +-
 3 files changed, 170 insertions(+), 127 deletions(-)

-- 
2.25.1

* [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
@ 2020-03-23 16:07 ` Jeff Layton
  2020-03-24 14:48   ` Yan, Zheng
  2020-03-25 13:57   ` [PATCH v2] " Jeff Layton
  2020-03-23 16:07 ` [PATCH 2/8] ceph: split up __finish_cap_flush Jeff Layton
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

Get rid of the __releases annotation by breaking __send_cap up into two
functions: __prep_cap, which is done under the spinlock, and
__send_cap, which is done outside it.

Nothing checks the return value from __send_cap, so make it return
void.
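
The resulting calling pattern, taken from the ceph_check_caps hunk
below, looks like this (entered with i_ceph_lock held):

	__prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
		   retain, flushing, flush_tid, oldest_flush_tid,
		   &old_blob);
	wake = should_wake_cap_waiters(cap);
	spin_unlock(&ci->i_ceph_lock);

	ceph_buffer_put(old_blob);	/* safe to release outside the lock */
	__send_cap(mdsc, &arg, ci, wake);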

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 193 ++++++++++++++++++++++++++++---------------------
 1 file changed, 110 insertions(+), 83 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 779433847f20..5bdca0da58a4 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1318,44 +1318,27 @@ void __ceph_remove_caps(struct ceph_inode_info *ci)
 }
 
 /*
- * Send a cap msg on the given inode.  Update our caps state, then
- * drop i_ceph_lock and send the message.
- *
- * Make note of max_size reported/requested from mds, revoked caps
- * that have now been implemented.
- *
- * Return non-zero if delayed release, or we experienced an error
- * such that the caller should requeue + retry later.
- *
- * called with i_ceph_lock, then drops it.
- * caller should hold snap_rwsem (read), s_mutex.
+ * Prepare to send a cap message to an MDS. Update the cap state, and populate
+ * the arg struct with the parameters that will need to be sent. This should
+ * be done under the i_ceph_lock to guard against changes to cap state.
  */
-static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
-		      int op, int flags, int used, int want, int retain,
-		      int flushing, u64 flush_tid, u64 oldest_flush_tid)
-	__releases(cap->ci->i_ceph_lock)
+static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
+		       int op, int flags, int used, int want, int retain,
+		       int flushing, u64 flush_tid, u64 oldest_flush_tid,
+		       struct ceph_buffer **old_blob)
 {
 	struct ceph_inode_info *ci = cap->ci;
 	struct inode *inode = &ci->vfs_inode;
-	struct ceph_buffer *old_blob = NULL;
-	struct cap_msg_args arg;
 	int held, revoking;
-	int wake = 0;
-	int ret;
 
-	/* Don't send anything if it's still being created. Return delayed */
-	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		spin_unlock(&ci->i_ceph_lock);
-		dout("%s async create in flight for %p\n", __func__, inode);
-		return 1;
-	}
+	lockdep_assert_held(&ci->i_ceph_lock);
 
 	held = cap->issued | cap->implemented;
 	revoking = cap->implemented & ~cap->issued;
 	retain &= ~revoking;
 
-	dout("__send_cap %p cap %p session %p %s -> %s (revoking %s)\n",
-	     inode, cap, cap->session,
+	dout("%s %p cap %p session %p %s -> %s (revoking %s)\n",
+	     __func__, inode, cap, cap->session,
 	     ceph_cap_string(held), ceph_cap_string(held & retain),
 	     ceph_cap_string(revoking));
 	BUG_ON((retain & CEPH_CAP_PIN) == 0);
@@ -1363,60 +1346,51 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	ci->i_ceph_flags &= ~CEPH_I_FLUSH;
 
 	cap->issued &= retain;  /* drop bits we don't want */
-	if (cap->implemented & ~cap->issued) {
-		/*
-		 * Wake up any waiters on wanted -> needed transition.
-		 * This is due to the weird transition from buffered
-		 * to sync IO... we need to flush dirty pages _before_
-		 * allowing sync writes to avoid reordering.
-		 */
-		wake = 1;
-	}
 	cap->implemented &= cap->issued | used;
 	cap->mds_wanted = want;
 
-	arg.session = cap->session;
-	arg.ino = ceph_vino(inode).ino;
-	arg.cid = cap->cap_id;
-	arg.follows = flushing ? ci->i_head_snapc->seq : 0;
-	arg.flush_tid = flush_tid;
-	arg.oldest_flush_tid = oldest_flush_tid;
+	arg->session = cap->session;
+	arg->ino = ceph_vino(inode).ino;
+	arg->cid = cap->cap_id;
+	arg->follows = flushing ? ci->i_head_snapc->seq : 0;
+	arg->flush_tid = flush_tid;
+	arg->oldest_flush_tid = oldest_flush_tid;
 
-	arg.size = inode->i_size;
-	ci->i_reported_size = arg.size;
-	arg.max_size = ci->i_wanted_max_size;
+	arg->size = inode->i_size;
+	ci->i_reported_size = arg->size;
+	arg->max_size = ci->i_wanted_max_size;
 	if (cap == ci->i_auth_cap)
-		ci->i_requested_max_size = arg.max_size;
+		ci->i_requested_max_size = arg->max_size;
 
 	if (flushing & CEPH_CAP_XATTR_EXCL) {
-		old_blob = __ceph_build_xattrs_blob(ci);
-		arg.xattr_version = ci->i_xattrs.version;
-		arg.xattr_buf = ci->i_xattrs.blob;
+		*old_blob = __ceph_build_xattrs_blob(ci);
+		arg->xattr_version = ci->i_xattrs.version;
+		arg->xattr_buf = ci->i_xattrs.blob;
 	} else {
-		arg.xattr_buf = NULL;
+		arg->xattr_buf = NULL;
 	}
 
-	arg.mtime = inode->i_mtime;
-	arg.atime = inode->i_atime;
-	arg.ctime = inode->i_ctime;
-	arg.btime = ci->i_btime;
-	arg.change_attr = inode_peek_iversion_raw(inode);
+	arg->mtime = inode->i_mtime;
+	arg->atime = inode->i_atime;
+	arg->ctime = inode->i_ctime;
+	arg->btime = ci->i_btime;
+	arg->change_attr = inode_peek_iversion_raw(inode);
 
-	arg.op = op;
-	arg.caps = cap->implemented;
-	arg.wanted = want;
-	arg.dirty = flushing;
+	arg->op = op;
+	arg->caps = cap->implemented;
+	arg->wanted = want;
+	arg->dirty = flushing;
 
-	arg.seq = cap->seq;
-	arg.issue_seq = cap->issue_seq;
-	arg.mseq = cap->mseq;
-	arg.time_warp_seq = ci->i_time_warp_seq;
+	arg->seq = cap->seq;
+	arg->issue_seq = cap->issue_seq;
+	arg->mseq = cap->mseq;
+	arg->time_warp_seq = ci->i_time_warp_seq;
 
-	arg.uid = inode->i_uid;
-	arg.gid = inode->i_gid;
-	arg.mode = inode->i_mode;
+	arg->uid = inode->i_uid;
+	arg->gid = inode->i_gid;
+	arg->mode = inode->i_mode;
 
-	arg.inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
+	arg->inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
 	if (!(flags & CEPH_CLIENT_CAPS_PENDING_CAPSNAP) &&
 	    !list_empty(&ci->i_cap_snaps)) {
 		struct ceph_cap_snap *capsnap;
@@ -1429,18 +1403,46 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 			}
 		}
 	}
-	arg.flags = flags;
+	arg->flags = flags;
+}
 
-	spin_unlock(&ci->i_ceph_lock);
+/*
+ * Wake up any waiters on wanted -> needed transition. This is due to the weird
+ * transition from buffered to sync IO... we need to flush dirty pages _before_
+ * allowing sync writes to avoid reordering.
+ */
+static inline bool should_wake_cap_waiters(struct ceph_cap *cap)
+{
+	lockdep_assert_held(&cap->ci->i_ceph_lock);
 
-	ceph_buffer_put(old_blob);
+	return cap->implemented & ~cap->issued;
+}
 
-	ret = send_cap_msg(&arg);
+/*
+ * Send a cap msg on the given inode.  Update our caps state, then
+ * drop i_ceph_lock and send the message.
+ *
+ * Make note of max_size reported/requested from mds, revoked caps
+ * that have now been implemented.
+ *
+ * Return non-zero if delayed release, or we experienced an error
+ * such that the caller should requeue + retry later.
+ *
+ * called with i_ceph_lock, then drops it.
+ * caller should hold snap_rwsem (read), s_mutex.
+ */
+static void __send_cap(struct ceph_mds_client *mdsc, struct cap_msg_args *arg,
+		       struct ceph_inode_info *ci, bool wake)
+{
+	struct inode *inode = &ci->vfs_inode;
+	int ret;
+
+	ret = send_cap_msg(arg);
 	if (ret < 0) {
 		pr_err("error sending cap msg, ino (%llx.%llx) "
 		       "flushing %s tid %llu, requeue\n",
-		       ceph_vinop(inode), ceph_cap_string(flushing),
-		       flush_tid);
+		       ceph_vinop(inode), ceph_cap_string(arg->dirty),
+		       arg->flush_tid);
 		spin_lock(&ci->i_ceph_lock);
 		__cap_delay_requeue(mdsc, ci);
 		spin_unlock(&ci->i_ceph_lock);
@@ -1448,8 +1450,6 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 
 	if (wake)
 		wake_up_all(&ci->i_cap_wq);
-
-	return ret;
 }
 
 static inline int __send_flush_snap(struct inode *inode,
@@ -1967,6 +1967,10 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 	}
 
 	for (p = rb_first(&ci->i_caps); p; p = rb_next(p)) {
+		struct cap_msg_args arg;
+		struct ceph_buffer *old_blob = NULL;
+		bool wake;
+
 		cap = rb_entry(p, struct ceph_cap, ci_node);
 
 		/* avoid looping forever */
@@ -2094,9 +2098,15 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 
 		mds = cap->mds;  /* remember mds, so we don't repeat */
 
-		/* __send_cap drops i_ceph_lock */
-		__send_cap(mdsc, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
-			   retain, flushing, flush_tid, oldest_flush_tid);
+		__prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
+			   retain, flushing, flush_tid, oldest_flush_tid,
+			   &old_blob);
+		wake = should_wake_cap_waiters(cap);
+		spin_unlock(&ci->i_ceph_lock);
+
+		ceph_buffer_put(old_blob);
+		__send_cap(mdsc, &arg, ci, wake);
+
 		goto retry; /* retake i_ceph_lock and restart our cap scan. */
 	}
 
@@ -2135,6 +2145,9 @@ static int try_flush_caps(struct inode *inode, u64 *ptid)
 retry_locked:
 	if (ci->i_dirty_caps && ci->i_auth_cap) {
 		struct ceph_cap *cap = ci->i_auth_cap;
+		struct ceph_buffer *old_blob = NULL;
+		struct cap_msg_args arg;
+		bool wake;
 
 		if (session != cap->session) {
 			spin_unlock(&ci->i_ceph_lock);
@@ -2162,11 +2175,15 @@ static int try_flush_caps(struct inode *inode, u64 *ptid)
 		flush_tid = __mark_caps_flushing(inode, session, true,
 						 &oldest_flush_tid);
 
-		/* __send_cap drops i_ceph_lock */
-		__send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH, CEPH_CLIENT_CAPS_SYNC,
+		__prep_cap(&arg, cap, CEPH_CAP_OP_FLUSH, CEPH_CLIENT_CAPS_SYNC,
 			   __ceph_caps_used(ci), __ceph_caps_wanted(ci),
 			   (cap->issued | cap->implemented),
-			   flushing, flush_tid, oldest_flush_tid);
+			   flushing, flush_tid, oldest_flush_tid, &old_blob);
+		wake = should_wake_cap_waiters(cap);
+		spin_unlock(&ci->i_ceph_lock);
+
+		ceph_buffer_put(old_blob);
+		__send_cap(mdsc, &arg, ci, wake);
 	} else {
 		if (!list_empty(&ci->i_cap_flush_list)) {
 			struct ceph_cap_flush *cf =
@@ -2368,15 +2385,25 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
 		first_tid = cf->tid + 1;
 
 		if (cf->caps) {
+			struct ceph_buffer *old_blob = NULL;
+			struct cap_msg_args arg;
+			bool wake;
+
 			dout("kick_flushing_caps %p cap %p tid %llu %s\n",
 			     inode, cap, cf->tid, ceph_cap_string(cf->caps));
-			__send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH,
+			__prep_cap(&arg, cap, CEPH_CAP_OP_FLUSH,
 					 (cf->tid < last_snap_flush ?
 					  CEPH_CLIENT_CAPS_PENDING_CAPSNAP : 0),
 					  __ceph_caps_used(ci),
 					  __ceph_caps_wanted(ci),
 					  (cap->issued | cap->implemented),
-					  cf->caps, cf->tid, oldest_flush_tid);
+					  cf->caps, cf->tid, oldest_flush_tid,
+					  &old_blob);
+			wake = should_wake_cap_waiters(cap);
+			spin_unlock(&ci->i_ceph_lock);
+
+			ceph_buffer_put(old_blob);
+			__send_cap(mdsc, &arg, ci, wake);
 		} else {
 			struct ceph_cap_snap *capsnap =
 					container_of(cf, struct ceph_cap_snap,
-- 
2.25.1

* [PATCH 2/8] ceph: split up __finish_cap_flush
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
  2020-03-23 16:07 ` [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse Jeff Layton
@ 2020-03-23 16:07 ` Jeff Layton
  2020-03-23 16:07 ` [PATCH 3/8] ceph: add comments for handle_cap_flush_ack logic Jeff Layton
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

This function takes an mdsc argument or a ci argument, but if both are
passed in, it ignores the ci arg. Fortunately, nothing does that, but
there's no good reason to have one function handle both cases.

Also, get rid of some branches and just use |= to set the wake_*
values.
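
For example, where handle_cap_flush_ack previously needed:

	if (__finish_cap_flush(NULL, ci, cf))
		wake_ci = true;

it can now simply do (both forms appear in the hunks below):

	wake_ci |= __detach_cap_flush_from_ci(ci, cf);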

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 60 ++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 31 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 5bdca0da58a4..01877f91b85b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1745,30 +1745,33 @@ static u64 __get_oldest_flush_tid(struct ceph_mds_client *mdsc)
  * Remove cap_flush from the mdsc's or inode's flushing cap list.
  * Return true if caller needs to wake up flush waiters.
  */
-static bool __finish_cap_flush(struct ceph_mds_client *mdsc,
-			       struct ceph_inode_info *ci,
-			       struct ceph_cap_flush *cf)
+static bool __detach_cap_flush_from_mdsc(struct ceph_mds_client *mdsc,
+					 struct ceph_cap_flush *cf)
 {
 	struct ceph_cap_flush *prev;
 	bool wake = cf->wake;
-	if (mdsc) {
-		/* are there older pending cap flushes? */
-		if (wake && cf->g_list.prev != &mdsc->cap_flush_list) {
-			prev = list_prev_entry(cf, g_list);
-			prev->wake = true;
-			wake = false;
-		}
-		list_del(&cf->g_list);
-	} else if (ci) {
-		if (wake && cf->i_list.prev != &ci->i_cap_flush_list) {
-			prev = list_prev_entry(cf, i_list);
-			prev->wake = true;
-			wake = false;
-		}
-		list_del(&cf->i_list);
-	} else {
-		BUG_ON(1);
+
+	if (wake && cf->g_list.prev != &mdsc->cap_flush_list) {
+		prev = list_prev_entry(cf, g_list);
+		prev->wake = true;
+		wake = false;
 	}
+	list_del(&cf->g_list);
+	return wake;
+}
+
+static bool __detach_cap_flush_from_ci(struct ceph_inode_info *ci,
+				       struct ceph_cap_flush *cf)
+{
+	struct ceph_cap_flush *prev;
+	bool wake = cf->wake;
+
+	if (wake && cf->i_list.prev != &ci->i_cap_flush_list) {
+		prev = list_prev_entry(cf, i_list);
+		prev->wake = true;
+		wake = false;
+	}
+	list_del(&cf->i_list);
 	return wake;
 }
 
@@ -3493,8 +3496,7 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
 		if (cf->caps == 0) /* capsnap */
 			continue;
 		if (cf->tid <= flush_tid) {
-			if (__finish_cap_flush(NULL, ci, cf))
-				wake_ci = true;
+			wake_ci |= __detach_cap_flush_from_ci(ci, cf);
 			list_add_tail(&cf->i_list, &to_remove);
 		} else {
 			cleaned &= ~cf->caps;
@@ -3516,10 +3518,8 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
 
 	spin_lock(&mdsc->cap_dirty_lock);
 
-	list_for_each_entry(cf, &to_remove, i_list) {
-		if (__finish_cap_flush(mdsc, NULL, cf))
-			wake_mdsc = true;
-	}
+	list_for_each_entry(cf, &to_remove, i_list)
+		wake_mdsc |= __detach_cap_flush_from_mdsc(mdsc, cf);
 
 	if (ci->i_flushing_caps == 0) {
 		if (list_empty(&ci->i_cap_flush_list)) {
@@ -3611,17 +3611,15 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
 		dout(" removing %p cap_snap %p follows %lld\n",
 		     inode, capsnap, follows);
 		list_del(&capsnap->ci_item);
-		if (__finish_cap_flush(NULL, ci, &capsnap->cap_flush))
-			wake_ci = true;
+		wake_ci |= __detach_cap_flush_from_ci(ci, &capsnap->cap_flush);
 
 		spin_lock(&mdsc->cap_dirty_lock);
 
 		if (list_empty(&ci->i_cap_flush_list))
 			list_del_init(&ci->i_flushing_item);
 
-		if (__finish_cap_flush(mdsc, NULL, &capsnap->cap_flush))
-			wake_mdsc = true;
-
+		wake_mdsc |= __detach_cap_flush_from_mdsc(mdsc,
+							  &capsnap->cap_flush);
 		spin_unlock(&mdsc->cap_dirty_lock);
 	}
 	spin_unlock(&ci->i_ceph_lock);
-- 
2.25.1

* [PATCH 3/8] ceph: add comments for handle_cap_flush_ack logic
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
  2020-03-23 16:07 ` [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse Jeff Layton
  2020-03-23 16:07 ` [PATCH 2/8] ceph: split up __finish_cap_flush Jeff Layton
@ 2020-03-23 16:07 ` Jeff Layton
  2020-03-23 16:07 ` [PATCH 4/8] ceph: don't release i_ceph_lock in handle_cap_trunc Jeff Layton
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 01877f91b85b..6220425e2a9c 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3491,14 +3491,26 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
 	bool wake_mdsc = false;
 
 	list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
+		/* Is this the one that was flushed? */
 		if (cf->tid == flush_tid)
 			cleaned = cf->caps;
-		if (cf->caps == 0) /* capsnap */
+
+		/* Is this a capsnap? */
+		if (cf->caps == 0)
 			continue;
+
 		if (cf->tid <= flush_tid) {
+			/*
+			 * An earlier or current tid. The FLUSH_ACK should
+			 * represent a superset of this flush's caps.
+			 */
 			wake_ci |= __detach_cap_flush_from_ci(ci, cf);
 			list_add_tail(&cf->i_list, &to_remove);
 		} else {
+			/*
+			 * This is a later one. Any caps in it are still dirty
+			 * so don't count them as cleaned.
+			 */
 			cleaned &= ~cf->caps;
 			if (!cleaned)
 				break;
-- 
2.25.1

* [PATCH 4/8] ceph: don't release i_ceph_lock in handle_cap_trunc
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
                   ` (2 preceding siblings ...)
  2020-03-23 16:07 ` [PATCH 3/8] ceph: add comments for handle_cap_flush_ack logic Jeff Layton
@ 2020-03-23 16:07 ` Jeff Layton
  2020-03-23 16:07 ` [PATCH 5/8] ceph: don't take i_ceph_lock in handle_cap_import Jeff Layton
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

There's no reason to release i_ceph_lock here. Just have the caller
handle it. Also, add a lockdep assertion.
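
The CEPH_CAP_OP_TRUNC case in ceph_handle_caps then owns both the
unlock and the deferred truncate, as in the hunk below:

	case CEPH_CAP_OP_TRUNC:
		queue_trunc = handle_cap_trunc(inode, h, session);
		spin_unlock(&ci->i_ceph_lock);
		if (queue_trunc)
			ceph_queue_vmtruncate(inode);
		break;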

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 6220425e2a9c..e112c1c802cf 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3651,10 +3651,9 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
  *
  * caller hold s_mutex.
  */
-static void handle_cap_trunc(struct inode *inode,
+static bool handle_cap_trunc(struct inode *inode,
 			     struct ceph_mds_caps *trunc,
 			     struct ceph_mds_session *session)
-	__releases(ci->i_ceph_lock)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	int mds = session->s_mds;
@@ -3665,7 +3664,9 @@ static void handle_cap_trunc(struct inode *inode,
 	int implemented = 0;
 	int dirty = __ceph_caps_dirty(ci);
 	int issued = __ceph_caps_issued(ceph_inode(inode), &implemented);
-	int queue_trunc = 0;
+	bool queue_trunc = false;
+
+	lockdep_assert_held(&ci->i_ceph_lock);
 
 	issued |= implemented | dirty;
 
@@ -3673,10 +3674,7 @@ static void handle_cap_trunc(struct inode *inode,
 	     inode, mds, seq, truncate_size, truncate_seq);
 	queue_trunc = ceph_fill_file_size(inode, issued,
 					  truncate_seq, truncate_size, size);
-	spin_unlock(&ci->i_ceph_lock);
-
-	if (queue_trunc)
-		ceph_queue_vmtruncate(inode);
+	return queue_trunc;
 }
 
 /*
@@ -3924,6 +3922,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 	size_t snaptrace_len;
 	void *p, *end;
 	struct cap_extra_info extra_info = {};
+	bool queue_trunc;
 
 	dout("handle_caps from mds%d\n", session->s_mds);
 
@@ -4107,7 +4106,10 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		break;
 
 	case CEPH_CAP_OP_TRUNC:
-		handle_cap_trunc(inode, h, session);
+		queue_trunc = handle_cap_trunc(inode, h, session);
+		spin_unlock(&ci->i_ceph_lock);
+		if (queue_trunc)
+			ceph_queue_vmtruncate(inode);
 		break;
 
 	default:
-- 
2.25.1

* [PATCH 5/8] ceph: don't take i_ceph_lock in handle_cap_import
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
                   ` (3 preceding siblings ...)
  2020-03-23 16:07 ` [PATCH 4/8] ceph: don't release i_ceph_lock in handle_cap_trunc Jeff Layton
@ 2020-03-23 16:07 ` Jeff Layton
  2020-03-23 16:07 ` [PATCH 6/8] ceph: document what protects i_dirty_item and i_flushing_item Jeff Layton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

Just take the lock before calling handle_cap_import. This means we
have to do a couple of minor in-memory operations under the spinlock
now, but those shouldn't be an issue.
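
The one subtlety is the cap allocation retry loop, which must now
retake the lock itself before retrying (from the hunk below):

	retry:
		cap = __get_cap_for_mds(ci, mds);
		if (!cap) {
			if (!new_cap) {
				/* allocation may sleep, so drop the spinlock */
				spin_unlock(&ci->i_ceph_lock);
				new_cap = ceph_get_cap(mdsc, NULL);
				spin_lock(&ci->i_ceph_lock);
				goto retry;
			}
			cap = new_cap;
		}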

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index e112c1c802cf..3eab905ba74b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3824,7 +3824,6 @@ static void handle_cap_import(struct ceph_mds_client *mdsc,
 			      struct ceph_mds_cap_peer *ph,
 			      struct ceph_mds_session *session,
 			      struct ceph_cap **target_cap, int *old_issued)
-	__acquires(ci->i_ceph_lock)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_cap *cap, *ocap, *new_cap = NULL;
@@ -3849,14 +3848,13 @@ static void handle_cap_import(struct ceph_mds_client *mdsc,
 
 	dout("handle_cap_import inode %p ci %p mds%d mseq %d peer %d\n",
 	     inode, ci, mds, mseq, peer);
-
 retry:
-	spin_lock(&ci->i_ceph_lock);
 	cap = __get_cap_for_mds(ci, mds);
 	if (!cap) {
 		if (!new_cap) {
 			spin_unlock(&ci->i_ceph_lock);
 			new_cap = ceph_get_cap(mdsc, NULL);
+			spin_lock(&ci->i_ceph_lock);
 			goto retry;
 		}
 		cap = new_cap;
@@ -4070,6 +4068,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		} else {
 			down_read(&mdsc->snap_rwsem);
 		}
+		spin_lock(&ci->i_ceph_lock);
 		handle_cap_import(mdsc, inode, h, peer, session,
 				  &cap, &extra_info.issued);
 		handle_cap_grant(inode, session, cap,
-- 
2.25.1

* [PATCH 6/8] ceph: document what protects i_dirty_item and i_flushing_item
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
                   ` (4 preceding siblings ...)
  2020-03-23 16:07 ` [PATCH 5/8] ceph: don't take i_ceph_lock in handle_cap_import Jeff Layton
@ 2020-03-23 16:07 ` Jeff Layton
  2020-03-23 16:07 ` [PATCH 7/8] ceph: fix potential race in ceph_check_caps Jeff Layton
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/super.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 47cfd8935b9c..bb372859c0ad 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -351,7 +351,9 @@ struct ceph_inode_info {
 	struct rb_root i_caps;           /* cap list */
 	struct ceph_cap *i_auth_cap;     /* authoritative cap, if any */
 	unsigned i_dirty_caps, i_flushing_caps;     /* mask of dirtied fields */
-	struct list_head i_dirty_item, i_flushing_item;
+	struct list_head i_dirty_item, i_flushing_item; /* protected by
+							 * mdsc->cap_dirty_lock
+							 */
 	/* we need to track cap writeback on a per-cap-bit basis, to allow
 	 * overlapping, pipelined cap flushes to the mds.  we can probably
 	 * reduce the tid to 8 bits if we're concerned about inode size. */
-- 
2.25.1

* [PATCH 7/8] ceph: fix potential race in ceph_check_caps
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
                   ` (5 preceding siblings ...)
  2020-03-23 16:07 ` [PATCH 6/8] ceph: document what protects i_dirty_item and i_flushing_item Jeff Layton
@ 2020-03-23 16:07 ` Jeff Layton
  2020-03-26 13:25   ` [PATCH v2] " Jeff Layton
  2020-03-23 16:07 ` [PATCH 8/8] ceph: throw a warning if we destroy session with mutex still locked Jeff Layton
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

Nothing ensures that the session will still be valid by the time we
dereference the pointer after dropping i_ceph_lock. Take a reference
before unlocking and put it once the session mutex is held.
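
Concretely, the session pointer is now pinned across the window where
i_ceph_lock is dropped (see the hunk below):

	session = ceph_get_mds_session(session);	/* pin across unlock */
	spin_unlock(&ci->i_ceph_lock);
	if (took_snap_rwsem) {
		up_read(&mdsc->snap_rwsem);
		took_snap_rwsem = 0;
	}
	mutex_lock(&session->s_mutex);
	ceph_put_mds_session(session);	/* s_mutex is held from here on */
	goto retry;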

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 3eab905ba74b..061e52912991 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2051,12 +2051,14 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 			if (mutex_trylock(&session->s_mutex) == 0) {
 				dout("inverting session/ino locks on %p\n",
 				     session);
+				session = ceph_get_mds_session(session);
 				spin_unlock(&ci->i_ceph_lock);
 				if (took_snap_rwsem) {
 					up_read(&mdsc->snap_rwsem);
 					took_snap_rwsem = 0;
 				}
 				mutex_lock(&session->s_mutex);
+				ceph_put_mds_session(session);
 				goto retry;
 			}
 		}
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 8/8] ceph: throw a warning if we destroy session with mutex still locked
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
                   ` (6 preceding siblings ...)
  2020-03-23 16:07 ` [PATCH 7/8] ceph: fix potential race in ceph_check_caps Jeff Layton
@ 2020-03-23 16:07 ` Jeff Layton
  2020-03-24 15:05 ` [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Yan, Zheng
  2020-03-31 12:24 ` Luis Henriques
  9 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-23 16:07 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index acce04483471..9a8e7013aca1 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -659,6 +659,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	if (refcount_dec_and_test(&s->s_ref)) {
 		if (s->s_auth.authorizer)
 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
+		WARN_ON(mutex_is_locked(&s->s_mutex));
 		xa_destroy(&s->s_delegated_inos);
 		kfree(s);
 	}
-- 
2.25.1

* Re: [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse
  2020-03-23 16:07 ` [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse Jeff Layton
@ 2020-03-24 14:48   ` Yan, Zheng
  2020-03-25 10:23     ` Jeff Layton
  2020-03-25 13:57   ` [PATCH v2] " Jeff Layton
  1 sibling, 1 reply; 15+ messages in thread
From: Yan, Zheng @ 2020-03-24 14:48 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, Ilya Dryomov, Sage Weil

On Tue, Mar 24, 2020 at 12:07 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> [...]
>
> @@ -1967,6 +1967,10 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
>         }
>
>         for (p = rb_first(&ci->i_caps); p; p = rb_next(p)) {
> +               struct cap_msg_args arg;
> +               struct ceph_buffer *old_blob = NULL;
> +               bool wake;
> +

How about defining 'wake' and 'old_blob' in cap_msg_args, and moving
the should_wake_cap_waiters() code into __prep_cap?

> [...]

* Re: [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
                   ` (7 preceding siblings ...)
  2020-03-23 16:07 ` [PATCH 8/8] ceph: throw a warning if we destroy session with mutex still locked Jeff Layton
@ 2020-03-24 15:05 ` Yan, Zheng
  2020-03-31 12:24 ` Luis Henriques
  9 siblings, 0 replies; 15+ messages in thread
From: Yan, Zheng @ 2020-03-24 15:05 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, Ilya Dryomov, Sage Weil

On Tue, Mar 24, 2020 at 12:07 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> I've been going over the cap handling code with an aim toward
> simplifying the locking. There's one fix for a potential use-after-free
> race in here. This also eliminates a number of __acquires and __releases
> annotations by reorganizing the code, and adds some (hopefully helpful)
> comments.
>
> There should be no behavioral changes with this set.
>
> Jeff Layton (8):
>   ceph: reorganize __send_cap for less spinlock abuse
>   ceph: split up __finish_cap_flush
>   ceph: add comments for handle_cap_flush_ack logic
>   ceph: don't release i_ceph_lock in handle_cap_trunc
>   ceph: don't take i_ceph_lock in handle_cap_import
>   ceph: document what protects i_dirty_item and i_flushing_item
>   ceph: fix potential race in ceph_check_caps
>   ceph: throw a warning if we destroy session with mutex still locked
>
>  fs/ceph/caps.c       | 292 ++++++++++++++++++++++++-------------------
>  fs/ceph/mds_client.c |   1 +
>  fs/ceph/super.h      |   4 +-
>  3 files changed, 170 insertions(+), 127 deletions(-)
>

Other than a minor comment on the first commit, this series looks good.

> --
> 2.25.1
>

* Re: [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse
  2020-03-24 14:48   ` Yan, Zheng
@ 2020-03-25 10:23     ` Jeff Layton
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-25 10:23 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, Ilya Dryomov, Sage Weil

On Tue, 2020-03-24 at 22:48 +0800, Yan, Zheng wrote:
> On Tue, Mar 24, 2020 at 12:07 AM Jeff Layton <jlayton@kernel.org> wrote:
> > [...]
> >
> > @@ -1967,6 +1967,10 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> >         }
> > 
> >         for (p = rb_first(&ci->i_caps); p; p = rb_next(p)) {
> > +               struct cap_msg_args arg;
> > +               struct ceph_buffer *old_blob = NULL;
> > +               bool wake;
> > +
> 
> How about defining 'wake' and 'old_blob' in cap_msg_args, and moving
> the should_wake_cap_waiters() code into __prep_cap?
> 

Good idea. I'm testing this out now and will resend this one once I'm
done.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>

* [PATCH v2] ceph: reorganize __send_cap for less spinlock abuse
  2020-03-23 16:07 ` [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse Jeff Layton
  2020-03-24 14:48   ` Yan, Zheng
@ 2020-03-25 13:57   ` Jeff Layton
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-25 13:57 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

Get rid of the __releases annotation by breaking __send_cap up into two
functions: __prep_cap, which is done under the spinlock, and
__send_cap, which is done outside it. Add new fields to cap_msg_args
for the wake boolean and old_xattr_buf pointer.

Nothing checks the return value from __send_cap, so make it return
void.
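
With wake and old_xattr_buf folded into the struct, the call sites
shrink to roughly the following (a sketch only; the caller hunks fall
outside this excerpt, so the exact __send_cap signature here is an
assumption):

	spin_lock(&ci->i_ceph_lock);
	__prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
		   retain, flushing, flush_tid, oldest_flush_tid);
	spin_unlock(&ci->i_ceph_lock);

	/* __send_cap puts arg.old_xattr_buf and honors arg.wake */
	__send_cap(mdsc, &arg, ci);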

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 164 ++++++++++++++++++++++++++-----------------------
 1 file changed, 86 insertions(+), 78 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 779433847f20..2146433a1850 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1181,6 +1181,7 @@ struct cap_msg_args {
 	u64			xattr_version;
 	u64			change_attr;
 	struct ceph_buffer	*xattr_buf;
+	struct ceph_buffer	*old_xattr_buf;
 	struct timespec64	atime, mtime, ctime, btime;
 	int			op, caps, wanted, dirty;
 	u32			seq, issue_seq, mseq, time_warp_seq;
@@ -1189,6 +1190,7 @@ struct cap_msg_args {
 	kgid_t			gid;
 	umode_t			mode;
 	bool			inline_data;
+	bool			wake;
 };
 
 /*
@@ -1318,44 +1320,29 @@ void __ceph_remove_caps(struct ceph_inode_info *ci)
 }
 
 /*
- * Send a cap msg on the given inode.  Update our caps state, then
- * drop i_ceph_lock and send the message.
+ * Prepare to send a cap message to an MDS. Update the cap state, and populate
+ * the arg struct with the parameters that will need to be sent. This should
+ * be done under the i_ceph_lock to guard against changes to cap state.
  *
  * Make note of max_size reported/requested from mds, revoked caps
  * that have now been implemented.
- *
- * Return non-zero if delayed release, or we experienced an error
- * such that the caller should requeue + retry later.
- *
- * called with i_ceph_lock, then drops it.
- * caller should hold snap_rwsem (read), s_mutex.
  */
-static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
-		      int op, int flags, int used, int want, int retain,
-		      int flushing, u64 flush_tid, u64 oldest_flush_tid)
-	__releases(cap->ci->i_ceph_lock)
+static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
+		       int op, int flags, int used, int want, int retain,
+		       int flushing, u64 flush_tid, u64 oldest_flush_tid)
 {
 	struct ceph_inode_info *ci = cap->ci;
 	struct inode *inode = &ci->vfs_inode;
-	struct ceph_buffer *old_blob = NULL;
-	struct cap_msg_args arg;
 	int held, revoking;
-	int wake = 0;
-	int ret;
 
-	/* Don't send anything if it's still being created. Return delayed */
-	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		spin_unlock(&ci->i_ceph_lock);
-		dout("%s async create in flight for %p\n", __func__, inode);
-		return 1;
-	}
+	lockdep_assert_held(&ci->i_ceph_lock);
 
 	held = cap->issued | cap->implemented;
 	revoking = cap->implemented & ~cap->issued;
 	retain &= ~revoking;
 
-	dout("__send_cap %p cap %p session %p %s -> %s (revoking %s)\n",
-	     inode, cap, cap->session,
+	dout("%s %p cap %p session %p %s -> %s (revoking %s)\n",
+	     __func__, inode, cap, cap->session,
 	     ceph_cap_string(held), ceph_cap_string(held & retain),
 	     ceph_cap_string(revoking));
 	BUG_ON((retain & CEPH_CAP_PIN) == 0);
@@ -1363,60 +1350,52 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	ci->i_ceph_flags &= ~CEPH_I_FLUSH;
 
 	cap->issued &= retain;  /* drop bits we don't want */
-	if (cap->implemented & ~cap->issued) {
-		/*
-		 * Wake up any waiters on wanted -> needed transition.
-		 * This is due to the weird transition from buffered
-		 * to sync IO... we need to flush dirty pages _before_
-		 * allowing sync writes to avoid reordering.
-		 */
-		wake = 1;
-	}
 	cap->implemented &= cap->issued | used;
 	cap->mds_wanted = want;
 
-	arg.session = cap->session;
-	arg.ino = ceph_vino(inode).ino;
-	arg.cid = cap->cap_id;
-	arg.follows = flushing ? ci->i_head_snapc->seq : 0;
-	arg.flush_tid = flush_tid;
-	arg.oldest_flush_tid = oldest_flush_tid;
+	arg->session = cap->session;
+	arg->ino = ceph_vino(inode).ino;
+	arg->cid = cap->cap_id;
+	arg->follows = flushing ? ci->i_head_snapc->seq : 0;
+	arg->flush_tid = flush_tid;
+	arg->oldest_flush_tid = oldest_flush_tid;
 
-	arg.size = inode->i_size;
-	ci->i_reported_size = arg.size;
-	arg.max_size = ci->i_wanted_max_size;
+	arg->size = inode->i_size;
+	ci->i_reported_size = arg->size;
+	arg->max_size = ci->i_wanted_max_size;
 	if (cap == ci->i_auth_cap)
-		ci->i_requested_max_size = arg.max_size;
+		ci->i_requested_max_size = arg->max_size;
 
 	if (flushing & CEPH_CAP_XATTR_EXCL) {
-		old_blob = __ceph_build_xattrs_blob(ci);
-		arg.xattr_version = ci->i_xattrs.version;
-		arg.xattr_buf = ci->i_xattrs.blob;
+		arg->old_xattr_buf = __ceph_build_xattrs_blob(ci);
+		arg->xattr_version = ci->i_xattrs.version;
+		arg->xattr_buf = ci->i_xattrs.blob;
 	} else {
-		arg.xattr_buf = NULL;
+		arg->xattr_buf = NULL;
+		arg->old_xattr_buf = NULL;
 	}
 
-	arg.mtime = inode->i_mtime;
-	arg.atime = inode->i_atime;
-	arg.ctime = inode->i_ctime;
-	arg.btime = ci->i_btime;
-	arg.change_attr = inode_peek_iversion_raw(inode);
+	arg->mtime = inode->i_mtime;
+	arg->atime = inode->i_atime;
+	arg->ctime = inode->i_ctime;
+	arg->btime = ci->i_btime;
+	arg->change_attr = inode_peek_iversion_raw(inode);
 
-	arg.op = op;
-	arg.caps = cap->implemented;
-	arg.wanted = want;
-	arg.dirty = flushing;
+	arg->op = op;
+	arg->caps = cap->implemented;
+	arg->wanted = want;
+	arg->dirty = flushing;
 
-	arg.seq = cap->seq;
-	arg.issue_seq = cap->issue_seq;
-	arg.mseq = cap->mseq;
-	arg.time_warp_seq = ci->i_time_warp_seq;
+	arg->seq = cap->seq;
+	arg->issue_seq = cap->issue_seq;
+	arg->mseq = cap->mseq;
+	arg->time_warp_seq = ci->i_time_warp_seq;
 
-	arg.uid = inode->i_uid;
-	arg.gid = inode->i_gid;
-	arg.mode = inode->i_mode;
+	arg->uid = inode->i_uid;
+	arg->gid = inode->i_gid;
+	arg->mode = inode->i_mode;
 
-	arg.inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
+	arg->inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
 	if (!(flags & CEPH_CLIENT_CAPS_PENDING_CAPSNAP) &&
 	    !list_empty(&ci->i_cap_snaps)) {
 		struct ceph_cap_snap *capsnap;
@@ -1429,27 +1408,42 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 			}
 		}
 	}
-	arg.flags = flags;
+	arg->flags = flags;
 
-	spin_unlock(&ci->i_ceph_lock);
+	/*
+	 * Wake up any waiters on wanted -> needed transition. This is due to
+	 * the weird transition from buffered to sync IO... we need to flush
+	 * dirty pages _before_ allowing sync writes to avoid reordering.
+	 */
+	arg->wake = cap->implemented & ~cap->issued;
+}
 
-	ceph_buffer_put(old_blob);
+/*
+ * Send a cap msg on the given inode.
+ *
+ * Caller should hold snap_rwsem (read), s_mutex.
+ */
+static void __send_cap(struct ceph_mds_client *mdsc, struct cap_msg_args *arg,
+		       struct ceph_inode_info *ci)
+{
+	struct inode *inode = &ci->vfs_inode;
+	int ret;
 
-	ret = send_cap_msg(&arg);
+	ret = send_cap_msg(arg);
 	if (ret < 0) {
 		pr_err("error sending cap msg, ino (%llx.%llx) "
 		       "flushing %s tid %llu, requeue\n",
-		       ceph_vinop(inode), ceph_cap_string(flushing),
-		       flush_tid);
+		       ceph_vinop(inode), ceph_cap_string(arg->dirty),
+		       arg->flush_tid);
 		spin_lock(&ci->i_ceph_lock);
 		__cap_delay_requeue(mdsc, ci);
 		spin_unlock(&ci->i_ceph_lock);
 	}
 
-	if (wake)
-		wake_up_all(&ci->i_cap_wq);
+	ceph_buffer_put(arg->old_xattr_buf);
 
-	return ret;
+	if (arg->wake)
+		wake_up_all(&ci->i_cap_wq);
 }
 
 static inline int __send_flush_snap(struct inode *inode,
@@ -1470,6 +1464,7 @@ static inline int __send_flush_snap(struct inode *inode,
 	arg.max_size = 0;
 	arg.xattr_version = capsnap->xattr_version;
 	arg.xattr_buf = capsnap->xattr_blob;
+	arg.old_xattr_buf = NULL;
 
 	arg.atime = capsnap->atime;
 	arg.mtime = capsnap->mtime;
@@ -1493,6 +1488,7 @@ static inline int __send_flush_snap(struct inode *inode,
 
 	arg.inline_data = capsnap->inline_data;
 	arg.flags = 0;
+	arg.wake = false;
 
 	return send_cap_msg(&arg);
 }
@@ -1967,6 +1963,8 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 	}
 
 	for (p = rb_first(&ci->i_caps); p; p = rb_next(p)) {
+		struct cap_msg_args arg;
+
 		cap = rb_entry(p, struct ceph_cap, ci_node);
 
 		/* avoid looping forever */
@@ -2094,9 +2092,12 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 
 		mds = cap->mds;  /* remember mds, so we don't repeat */
 
-		/* __send_cap drops i_ceph_lock */
-		__send_cap(mdsc, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
+		__prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
 			   retain, flushing, flush_tid, oldest_flush_tid);
+		spin_unlock(&ci->i_ceph_lock);
+
+		__send_cap(mdsc, &arg, ci);
+
 		goto retry; /* retake i_ceph_lock and restart our cap scan. */
 	}
 
@@ -2135,6 +2136,7 @@ static int try_flush_caps(struct inode *inode, u64 *ptid)
 retry_locked:
 	if (ci->i_dirty_caps && ci->i_auth_cap) {
 		struct ceph_cap *cap = ci->i_auth_cap;
+		struct cap_msg_args arg;
 
 		if (session != cap->session) {
 			spin_unlock(&ci->i_ceph_lock);
@@ -2162,11 +2164,13 @@ static int try_flush_caps(struct inode *inode, u64 *ptid)
 		flush_tid = __mark_caps_flushing(inode, session, true,
 						 &oldest_flush_tid);
 
-		/* __send_cap drops i_ceph_lock */
-		__send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH, CEPH_CLIENT_CAPS_SYNC,
+		__prep_cap(&arg, cap, CEPH_CAP_OP_FLUSH, CEPH_CLIENT_CAPS_SYNC,
 			   __ceph_caps_used(ci), __ceph_caps_wanted(ci),
 			   (cap->issued | cap->implemented),
 			   flushing, flush_tid, oldest_flush_tid);
+		spin_unlock(&ci->i_ceph_lock);
+
+		__send_cap(mdsc, &arg, ci);
 	} else {
 		if (!list_empty(&ci->i_cap_flush_list)) {
 			struct ceph_cap_flush *cf =
@@ -2368,15 +2372,19 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
 		first_tid = cf->tid + 1;
 
 		if (cf->caps) {
+			struct cap_msg_args arg;
+
 			dout("kick_flushing_caps %p cap %p tid %llu %s\n",
 			     inode, cap, cf->tid, ceph_cap_string(cf->caps));
-			__send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH,
+			__prep_cap(&arg, cap, CEPH_CAP_OP_FLUSH,
 					 (cf->tid < last_snap_flush ?
 					  CEPH_CLIENT_CAPS_PENDING_CAPSNAP : 0),
 					  __ceph_caps_used(ci),
 					  __ceph_caps_wanted(ci),
 					  (cap->issued | cap->implemented),
 					  cf->caps, cf->tid, oldest_flush_tid);
+			spin_unlock(&ci->i_ceph_lock);
+			__send_cap(mdsc, &arg, ci);
 		} else {
 			struct ceph_cap_snap *capsnap =
 					container_of(cf, struct ceph_cap_snap,
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2] ceph: fix potential race in ceph_check_caps
  2020-03-23 16:07 ` [PATCH 7/8] ceph: fix potential race in ceph_check_caps Jeff Layton
@ 2020-03-26 13:25   ` Jeff Layton
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2020-03-26 13:25 UTC (permalink / raw)
  To: ceph-devel; +Cc: ukernel, idryomov, sage

Nothing ensures that session will still be valid by the time we
dereference the pointer. Take a reference while i_ceph_lock is still
held and put it once the mutex has been acquired.

In principle, we should always be able to get a reference here, but
throw a warning if that's ever not the case.
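
In rough outline, the fix pins the session while i_ceph_lock is still
held and the pointer is known to be valid (a sketch condensed from the
diff below; the snap_rwsem handling is omitted):

	/* takes a reference; returns NULL if the refcount was already zero */
	session = ceph_get_mds_session(session);
	spin_unlock(&ci->i_ceph_lock);
	if (session) {
		mutex_lock(&session->s_mutex);
		ceph_put_mds_session(session);
	} else {
		WARN_ON_ONCE(true);	/* should never happen, see above */
	}
	goto retry;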

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 80bde0024615..f8b51d0c8184 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2045,12 +2045,24 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 			if (mutex_trylock(&session->s_mutex) == 0) {
 				dout("inverting session/ino locks on %p\n",
 				     session);
+				session = ceph_get_mds_session(session);
 				spin_unlock(&ci->i_ceph_lock);
 				if (took_snap_rwsem) {
 					up_read(&mdsc->snap_rwsem);
 					took_snap_rwsem = 0;
 				}
-				mutex_lock(&session->s_mutex);
+				if (session) {
+					mutex_lock(&session->s_mutex);
+					ceph_put_mds_session(session);
+				} else {
+					/*
+					 * Because we take the reference while
+					 * holding the i_ceph_lock, it should
+					 * never be NULL. Throw a warning if it
+					 * ever is.
+					 */
+					WARN_ON_ONCE(true);
+				}
 				goto retry;
 			}
 		}
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments
  2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
                   ` (8 preceding siblings ...)
  2020-03-24 15:05 ` [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Yan, Zheng
@ 2020-03-31 12:24 ` Luis Henriques
  9 siblings, 0 replies; 15+ messages in thread
From: Luis Henriques @ 2020-03-31 12:24 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, ukernel, idryomov, sage

On Mon, Mar 23, 2020 at 12:07:00PM -0400, Jeff Layton wrote:
> I've been going over the cap handling code with an aim toward
> simplifying the locking. There's one fix for a potential use-after-free
> race in here. This also eliminates a number of __acquires and __releases
> annotations by reorganizing the code, and adds some (hopefully helpful)
> comments.
> 
> There should be no behavioral changes with this set.

But a lot of clarifications!  Thanks a lot for this patchset, Jeff ;-)

Cheers,
--
Luis

> 
> Jeff Layton (8):
>   ceph: reorganize __send_cap for less spinlock abuse
>   ceph: split up __finish_cap_flush
>   ceph: add comments for handle_cap_flush_ack logic
>   ceph: don't release i_ceph_lock in handle_cap_trunc
>   ceph: don't take i_ceph_lock in handle_cap_import
>   ceph: document what protects i_dirty_item and i_flushing_item
>   ceph: fix potential race in ceph_check_caps
>   ceph: throw a warning if we destroy session with mutex still locked
> 
>  fs/ceph/caps.c       | 292 ++++++++++++++++++++++++-------------------
>  fs/ceph/mds_client.c |   1 +
>  fs/ceph/super.h      |   4 +-
>  3 files changed, 170 insertions(+), 127 deletions(-)
> 
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-03-31 12:24 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-23 16:07 [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Jeff Layton
2020-03-23 16:07 ` [PATCH 1/8] ceph: reorganize __send_cap for less spinlock abuse Jeff Layton
2020-03-24 14:48   ` Yan, Zheng
2020-03-25 10:23     ` Jeff Layton
2020-03-25 13:57   ` [PATCH v2] " Jeff Layton
2020-03-23 16:07 ` [PATCH 2/8] ceph: split up __finish_cap_flush Jeff Layton
2020-03-23 16:07 ` [PATCH 3/8] ceph: add comments for handle_cap_flush_ack logic Jeff Layton
2020-03-23 16:07 ` [PATCH 4/8] ceph: don't release i_ceph_lock in handle_cap_trunc Jeff Layton
2020-03-23 16:07 ` [PATCH 5/8] ceph: don't take i_ceph_lock in handle_cap_import Jeff Layton
2020-03-23 16:07 ` [PATCH 6/8] ceph: document what protects i_dirty_item and i_flushing_item Jeff Layton
2020-03-23 16:07 ` [PATCH 7/8] ceph: fix potential race in ceph_check_caps Jeff Layton
2020-03-26 13:25   ` [PATCH v2] " Jeff Layton
2020-03-23 16:07 ` [PATCH 8/8] ceph: throw a warning if we destroy session with mutex still locked Jeff Layton
2020-03-24 15:05 ` [PATCH 0/8] ceph: cap handling code fixes, cleanups and comments Yan, Zheng
2020-03-31 12:24 ` Luis Henriques
