* [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs
@ 2017-01-20 15:17 Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 1/7] libceph: add ceph_osdc_cancel_writes Jeff Layton
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Jeff Layton @ 2017-01-20 15:17 UTC (permalink / raw)
  To: ceph-devel; +Cc: jspray, idryomov, zyan, sage

This patchset is an updated version of the patch series originally
done by John Spray and posted here:

    http://www.spinics.net/lists/ceph-devel/msg21257.html

The patchset has undergone a number of changes since the original
submission:

- updated for final version of CAPRELEASE message changes
- protect delayed cap list with spinlock instead of mutex/rwsem
- no need to allocate a new object to track delayed cap requests
- rerunning delayed caps is now shuffled off to a workqueue
- properly handle requests that come in after the "full" map comes in
- clean out delayed cap requests on last session put

With this, xfstests seems to work as well as before, and we get timely
-ENOSPC returns under these conditions with O_DIRECT writes.

I still need to plumb in a way to throttle the dirtying of new pages in
buffered I/O when we are getting errors during writeback. Still, I
figure this is a good place to pause and post the set before I implement
that part.

Jeff Layton (7):
  libceph: add ceph_osdc_cancel_writes
  libceph: rename and export have_pool_full
  libceph: rename and export maybe_request_map
  ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE
    codepaths
  ceph: update CAPRELEASE message format
  ceph: clean out delayed caps when destroying session
  libceph: allow requests to return immediately on full conditions if
    caller wishes

 fs/ceph/addr.c                  |  14 ++++--
 fs/ceph/caps.c                  |  43 ++++++++++++++--
 fs/ceph/debugfs.c               |   3 ++
 fs/ceph/file.c                  |   8 +--
 fs/ceph/mds_client.c            | 108 ++++++++++++++++++++++++++++++++++++++++
 fs/ceph/mds_client.h            |  10 +++-
 include/linux/ceph/osd_client.h |  22 +++++++-
 include/linux/ceph/rados.h      |   1 +
 net/ceph/osd_client.c           |  88 +++++++++++++++++++++++++-------
 9 files changed, 263 insertions(+), 34 deletions(-)

-- 
2.9.3


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v1 1/7] libceph: add ceph_osdc_cancel_writes
  2017-01-20 15:17 [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
@ 2017-01-20 15:17 ` Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 2/7] libceph: rename and export have_pool_full Jeff Layton
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2017-01-20 15:17 UTC (permalink / raw)
  To: ceph-devel; +Cc: jspray, idryomov, zyan, sage

When a Ceph volume hits capacity, a flag is set in the OSD map to
indicate that, and a new map is distributed around the cluster. When the
cephfs client sees that flag, we want it to shut down any in-progress
OSD writes with an -ENOSPC error, as they'll just hang otherwise.

Add a callback to the osdc that gets called on map updates and add
a small API to register the callback.
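As a rough userspace sketch of the registration API described above (the
struct fields and function names here are illustrative stand-ins, not the
real libceph definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the map callback hook this patch adds; the real
 * struct ceph_osd_client carries many more fields. */
struct osd_client;
typedef void (*map_callback_t)(struct osd_client *, void *);

struct osd_client {
	int epoch;		/* current osdmap epoch */
	map_callback_t map_cb;	/* invoked after each map update */
	void *map_p;		/* opaque data for the callback */
};

static void register_map_cb(struct osd_client *osdc,
			    map_callback_t cb, void *data)
{
	osdc->map_cb = cb;
	osdc->map_p = data;
}

/* Called when a new osdmap arrives, mirroring the hook added to
 * ceph_osdc_handle_map() below. */
static void handle_map(struct osd_client *osdc, int new_epoch)
{
	osdc->epoch = new_epoch;
	if (osdc->map_cb)
		osdc->map_cb(osdc, osdc->map_p);
}

/* Example callback: just count how often a map update was seen. */
static void count_updates(struct osd_client *osdc, void *data)
{
	(void)osdc;
	(*(int *)data)++;
}
```

In the real patch the callback runs with osdc->lock held for write; this
model leaves locking out.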

[ jlayton: code style cleanup and adaptation to new osd msg handling ]

Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/ceph/osd_client.h | 12 ++++++++++
 net/ceph/osd_client.c           | 50 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 03a6653d329a..a5298c02bde4 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -21,6 +21,7 @@ struct ceph_osd_client;
 /*
  * completion callback for async writepages
  */
+typedef void (*ceph_osdc_map_callback_t)(struct ceph_osd_client *, void *);
 typedef void (*ceph_osdc_callback_t)(struct ceph_osd_request *);
 typedef void (*ceph_osdc_unsafe_callback_t)(struct ceph_osd_request *, bool);
 
@@ -289,6 +290,9 @@ struct ceph_osd_client {
 	struct ceph_msgpool	msgpool_op_reply;
 
 	struct workqueue_struct	*notify_wq;
+
+	ceph_osdc_map_callback_t	map_cb;
+	void			*map_p;
 };
 
 static inline bool ceph_osdmap_flag(struct ceph_osd_client *osdc, int flag)
@@ -391,6 +395,7 @@ extern void ceph_osdc_put_request(struct ceph_osd_request *req);
 extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
 				   struct ceph_osd_request *req,
 				   bool nofail);
+extern u32 ceph_osdc_complete_writes(struct ceph_osd_client *osdc, int r);
 extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
 extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
 				  struct ceph_osd_request *req);
@@ -457,5 +462,12 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
 			    struct ceph_object_locator *oloc,
 			    struct ceph_watch_item **watchers,
 			    u32 *num_watchers);
+
+static inline void ceph_osdc_register_map_cb(struct ceph_osd_client *osdc,
+        ceph_osdc_map_callback_t cb, void *data)
+{
+	osdc->map_cb = cb;
+	osdc->map_p = data;
+}
 #endif
 
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 3a2417bb6ff0..0562ea76c772 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -18,6 +18,7 @@
 #include <linux/ceph/decode.h>
 #include <linux/ceph/auth.h>
 #include <linux/ceph/pagelist.h>
+#include <linux/lockdep.h>
 
 #define OSD_OPREPLY_FRONT_LEN	512
 
@@ -1771,6 +1772,51 @@ static void complete_request(struct ceph_osd_request *req, int err)
 	ceph_osdc_put_request(req);
 }
 
+/*
+ * Drop all pending write/modify requests and complete
 + * them with `r` as the return code.
+ *
+ * Returns the highest OSD map epoch of a request that was
+ * cancelled, or 0 if none were cancelled.
+ */
+u32 ceph_osdc_complete_writes(struct ceph_osd_client *osdc, int r)
+{
+	struct ceph_osd_request *req;
+	struct ceph_osd *osd;
+	struct rb_node *m, *n;
+	u32 latest_epoch = 0;
+
+	lockdep_assert_held(&osdc->lock);
+
+	dout("enter complete_writes r=%d\n", r);
+
+	for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
+		osd = rb_entry(n, struct ceph_osd, o_node);
 +		mutex_lock(&osd->lock);
 +		m = rb_first(&osd->o_requests);
+		while (m) {
+			req = rb_entry(m, struct ceph_osd_request, r_node);
+			m = rb_next(m);
+
+			if (req->r_flags & CEPH_OSD_FLAG_WRITE &&
+			    (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
+			     pool_full(osdc, req->r_t.base_oloc.pool))) {
+				u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);
+
+				dout("%s: complete tid=%llu flags 0x%x\n", __func__, req->r_tid, req->r_flags);
+				complete_request(req, r);
+				if (cur_epoch > latest_epoch)
+					latest_epoch = cur_epoch;
+			}
+		}
+		mutex_unlock(&osd->lock);
+	}
+
+	dout("return complete_writes latest_epoch=%u\n", latest_epoch);
+	return latest_epoch;
+}
+EXPORT_SYMBOL(ceph_osdc_complete_writes);
+
 static void cancel_map_check(struct ceph_osd_request *req)
 {
 	struct ceph_osd_client *osdc = req->r_osdc;
@@ -3286,6 +3332,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 
 	ceph_monc_got_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
 			  osdc->osdmap->epoch);
+	if (osdc->map_cb)
+		osdc->map_cb(osdc, osdc->map_p);
 	up_write(&osdc->lock);
 	wake_up_all(&osdc->client->auth_wq);
 	return;
@@ -4090,6 +4138,8 @@ int ceph_osdc_init(struct ceph_osd_client *osdc, struct ceph_client *client)
 	osdc->linger_requests = RB_ROOT;
 	osdc->map_checks = RB_ROOT;
 	osdc->linger_map_checks = RB_ROOT;
+	osdc->map_cb = NULL;
+	osdc->map_p = NULL;
 	INIT_DELAYED_WORK(&osdc->timeout_work, handle_timeout);
 	INIT_DELAYED_WORK(&osdc->osds_timeout_work, handle_osds_timeout);
 
-- 
2.9.3



* [PATCH v1 2/7] libceph: rename and export have_pool_full
  2017-01-20 15:17 [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 1/7] libceph: add ceph_osdc_cancel_writes Jeff Layton
@ 2017-01-20 15:17 ` Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 3/7] libceph: rename and export maybe_request_map Jeff Layton
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2017-01-20 15:17 UTC (permalink / raw)
  To: ceph-devel; +Cc: jspray, idryomov, zyan, sage

Rename have_pool_full to ceph_osdc_have_pool_full, and export it.

Cephfs needs to be able to call this as well.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/ceph/osd_client.h | 1 +
 net/ceph/osd_client.c           | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index a5298c02bde4..35f74c86533e 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -311,6 +311,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
 				   struct ceph_msg *msg);
 extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
 				 struct ceph_msg *msg);
+extern bool ceph_osdc_have_pool_full(struct ceph_osd_client *osdc);
 
 extern void osd_req_op_init(struct ceph_osd_request *osd_req,
 			    unsigned int which, u16 opcode, u32 flags);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 0562ea76c772..290968865a41 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1259,7 +1259,7 @@ static bool __pool_full(struct ceph_pg_pool_info *pi)
 	return pi->flags & CEPH_POOL_FLAG_FULL;
 }
 
-static bool have_pool_full(struct ceph_osd_client *osdc)
+bool ceph_osdc_have_pool_full(struct ceph_osd_client *osdc)
 {
 	struct rb_node *n;
 
@@ -1273,6 +1273,7 @@ static bool have_pool_full(struct ceph_osd_client *osdc)
 
 	return false;
 }
+EXPORT_SYMBOL(ceph_osdc_have_pool_full);
 
 static bool pool_full(struct ceph_osd_client *osdc, s64 pool_id)
 {
@@ -3260,7 +3261,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 	was_pauserd = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD);
 	was_pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
 		      ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
-		      have_pool_full(osdc);
+		      ceph_osdc_have_pool_full(osdc);
 
 	/* incremental maps */
 	ceph_decode_32_safe(&p, end, nr_maps, bad);
@@ -3324,7 +3325,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 	pauserd = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD);
 	pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
 		  ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
-		  have_pool_full(osdc);
+		  ceph_osdc_have_pool_full(osdc);
 	if (was_pauserd || was_pausewr || pauserd || pausewr)
 		maybe_request_map(osdc);
 
-- 
2.9.3



* [PATCH v1 3/7] libceph: rename and export maybe_request_map
  2017-01-20 15:17 [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 1/7] libceph: add ceph_osdc_cancel_writes Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 2/7] libceph: rename and export have_pool_full Jeff Layton
@ 2017-01-20 15:17 ` Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths Jeff Layton
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2017-01-20 15:17 UTC (permalink / raw)
  To: ceph-devel; +Cc: jspray, idryomov, zyan, sage

We need to be able to call this with the osdc->lock already held, so
ceph_osdc_maybe_request_map won't do. Rename and export it as
__ceph_osdc_maybe_request_map, and turn ceph_osdc_maybe_request_map
into a static inline helper that takes the osdc->lock and then calls
__ceph_osdc_maybe_request_map.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/ceph/osd_client.h |  9 ++++++++-
 net/ceph/osd_client.c           | 25 +++++++++----------------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 35f74c86533e..b1eeb5a86657 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -403,7 +403,14 @@ extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
 extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
 
 extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
-void ceph_osdc_maybe_request_map(struct ceph_osd_client *osdc);
+void __ceph_osdc_maybe_request_map(struct ceph_osd_client *osdc);
+
+static inline void ceph_osdc_maybe_request_map(struct ceph_osd_client *osdc)
+{
+	down_read(&osdc->lock);
+	__ceph_osdc_maybe_request_map(osdc);
+	up_read(&osdc->lock);
+}
 
 int ceph_osdc_call(struct ceph_osd_client *osdc,
 		   struct ceph_object_id *oid,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 290968865a41..97c266f96708 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1608,7 +1608,7 @@ static void send_request(struct ceph_osd_request *req)
 	ceph_con_send(&osd->o_con, ceph_msg_get(req->r_request));
 }
 
-static void maybe_request_map(struct ceph_osd_client *osdc)
+void __ceph_osdc_maybe_request_map(struct ceph_osd_client *osdc)
 {
 	bool continuous = false;
 
@@ -1628,6 +1628,7 @@ static void maybe_request_map(struct ceph_osd_client *osdc)
 			       osdc->osdmap->epoch + 1, continuous))
 		ceph_monc_renew_subs(&osdc->client->monc);
 }
+EXPORT_SYMBOL(__ceph_osdc_maybe_request_map);
 
 static void send_map_check(struct ceph_osd_request *req);
 
@@ -1657,12 +1658,12 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
 		dout("req %p pausewr\n", req);
 		req->r_t.paused = true;
-		maybe_request_map(osdc);
+		__ceph_osdc_maybe_request_map(osdc);
 	} else if ((req->r_flags & CEPH_OSD_FLAG_READ) &&
 		   ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD)) {
 		dout("req %p pauserd\n", req);
 		req->r_t.paused = true;
-		maybe_request_map(osdc);
+		__ceph_osdc_maybe_request_map(osdc);
 	} else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
 		   !(req->r_flags & (CEPH_OSD_FLAG_FULL_TRY |
 				     CEPH_OSD_FLAG_FULL_FORCE)) &&
@@ -1671,11 +1672,11 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 		dout("req %p full/pool_full\n", req);
 		pr_warn_ratelimited("FULL or reached pool quota\n");
 		req->r_t.paused = true;
-		maybe_request_map(osdc);
+		__ceph_osdc_maybe_request_map(osdc);
 	} else if (!osd_homeless(osd)) {
 		need_send = true;
 	} else {
-		maybe_request_map(osdc);
+		__ceph_osdc_maybe_request_map(osdc);
 	}
 
 	mutex_lock(&osd->lock);
@@ -2587,7 +2588,7 @@ static void handle_timeout(struct work_struct *work)
 	}
 
 	if (atomic_read(&osdc->num_homeless) || !list_empty(&slow_osds))
-		maybe_request_map(osdc);
+		__ceph_osdc_maybe_request_map(osdc);
 
 	while (!list_empty(&slow_osds)) {
 		struct ceph_osd *osd = list_first_entry(&slow_osds,
@@ -3327,7 +3328,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 		  ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
 		  ceph_osdc_have_pool_full(osdc);
 	if (was_pauserd || was_pausewr || pauserd || pausewr)
-		maybe_request_map(osdc);
+		__ceph_osdc_maybe_request_map(osdc);
 
 	kick_requests(osdc, &need_resend, &need_resend_linger);
 
@@ -3391,7 +3392,7 @@ static void osd_fault(struct ceph_connection *con)
 
 	if (!reopen_osd(osd))
 		kick_osd_requests(osd);
-	maybe_request_map(osdc);
+	__ceph_osdc_maybe_request_map(osdc);
 
 out_unlock:
 	up_write(&osdc->lock);
@@ -4060,14 +4061,6 @@ void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc)
 }
 EXPORT_SYMBOL(ceph_osdc_flush_notifies);
 
-void ceph_osdc_maybe_request_map(struct ceph_osd_client *osdc)
-{
-	down_read(&osdc->lock);
-	maybe_request_map(osdc);
-	up_read(&osdc->lock);
-}
-EXPORT_SYMBOL(ceph_osdc_maybe_request_map);
-
 /*
  * Execute an OSD class method on an object.
  *
-- 
2.9.3



* [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-01-20 15:17 [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
                   ` (2 preceding siblings ...)
  2017-01-20 15:17 ` [PATCH v1 3/7] libceph: rename and export maybe_request_map Jeff Layton
@ 2017-01-20 15:17 ` Jeff Layton
  2017-01-22  9:40   ` Yan, Zheng
  2017-01-20 15:17 ` [PATCH v1 5/7] ceph: update CAPRELEASE message format Jeff Layton
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2017-01-20 15:17 UTC (permalink / raw)
  To: ceph-devel; +Cc: jspray, idryomov, zyan, sage

This patch is heavily inspired by John Spray's earlier work, but
implemented in a different way.

Create and register a new map_cb for cephfs, to allow it to handle
changes to the osdmap.

In version 5 of CLIENT_CAPS messages, a barrier field is added as an
instruction to clients that they may not use the attached capabilities
until they have a particular OSD map epoch.

When we get a message with such a field and don't have the requisite map
epoch yet, we put that message on a list in the session, to be run when
the map does come in.

When a new map update comes in, the map_cb routine first checks for an
OSD or pool full condition. If there is one, we walk the list of
in-flight OSD requests and complete any writes to full OSDs or pools
with -ENOSPC. While cancelling, we record the latest OSD map epoch seen
in each request; that epoch is used later in the CAPRELEASE messages.

Then, the routine walks the session list and queues the workqueue job
for each session that has delayed messages.
When the workqueue job runs, it walks the list of delayed caps and tries
to rerun each one. If the epoch is still not high enough, they just get
put back on the delay queue for when the map does come in.
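The barrier/delay logic described above can be modeled in a few lines of
userspace C (a toy sketch; the real code parks the actual ceph_msg on
session->s_delayed_caps under s_cap_lock):

```c
#include <assert.h>

#define MAX_DELAYED 16

/* Toy model of the barrier check in ceph_handle_caps(): a message is
 * processed only once the local osdmap epoch reaches its barrier;
 * otherwise it is parked on a per-session delay queue. */
struct session {
	unsigned delayed[MAX_DELAYED];	/* barriers of parked messages */
	int ndelayed;
	int handled;
};

static void handle_cap_msg(struct session *s, unsigned cur_epoch,
			   unsigned barrier)
{
	if (cur_epoch < barrier) {
		s->delayed[s->ndelayed++] = barrier;
		return;		/* real code also kicks a map request */
	}
	s->handled++;
}

/* Rerun the delay queue after a map update, like the workqueue job;
 * still-blocked messages go back on the queue. */
static void run_delayed(struct session *s, unsigned cur_epoch)
{
	unsigned still[MAX_DELAYED];
	int i, n = 0;

	for (i = 0; i < s->ndelayed; i++) {
		if (cur_epoch >= s->delayed[i])
			s->handled++;
		else
			still[n++] = s->delayed[i];
	}
	for (i = 0; i < n; i++)
		s->delayed[i] = still[i];
	s->ndelayed = n;
}
```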

Suggested-by: John Spray <john.spray@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
 fs/ceph/debugfs.c    |  3 +++
 fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ceph/mds_client.h |  3 +++
 4 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d941c48e8bff..f33d424b5e12 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
 	/* inline data size */
 	ceph_encode_32(&p, 0);
 	/* osd_epoch_barrier (version 5) */
-	ceph_encode_32(&p, 0);
+	ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
 	/* oldest_flush_tid (version 6) */
 	ceph_encode_64(&p, arg->oldest_flush_tid);
 
@@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 	void *snaptrace;
 	size_t snaptrace_len;
 	void *p, *end;
+	u32 epoch_barrier = 0;
 
 	dout("handle_caps from mds%d\n", mds);
 
+	WARN_ON_ONCE(!list_empty(&msg->list_head));
+
 	/* decode */
 	end = msg->front.iov_base + msg->front.iov_len;
 	tid = le64_to_cpu(msg->hdr.tid);
@@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		p += inline_len;
 	}
 
+	if (le16_to_cpu(msg->hdr.version) >= 5) {
+		struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
+
+		ceph_decode_32_safe(&p, end, epoch_barrier, bad);
+
+		/* Do lockless check first to avoid mutex if we can */
+		if (epoch_barrier > mdsc->cap_epoch_barrier) {
+			mutex_lock(&mdsc->mutex);
+			if (epoch_barrier > mdsc->cap_epoch_barrier)
+				mdsc->cap_epoch_barrier = epoch_barrier;
+			mutex_unlock(&mdsc->mutex);
+		}
+
+		down_read(&osdc->lock);
+		if (osdc->osdmap->epoch < epoch_barrier) {
+			dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
+			ceph_msg_get(msg);
+			spin_lock(&session->s_cap_lock);
+			list_add(&msg->list_head, &session->s_delayed_caps);
+			spin_unlock(&session->s_cap_lock);
+
 +			/* Kick OSD client to get the latest map */
+			__ceph_osdc_maybe_request_map(osdc);
+
+			up_read(&osdc->lock);
+			return;
+		}
+
+		dout("handle_caps barrier %d already satisfied (%d)\n", epoch_barrier, osdc->osdmap->epoch);
+		up_read(&osdc->lock);
+	}
+
+	dout("handle_caps v=%d barrier=%d\n", le16_to_cpu(msg->hdr.version), epoch_barrier);
+
 	if (le16_to_cpu(msg->hdr.version) >= 8) {
 		u64 flush_tid;
 		u32 caller_uid, caller_gid;
-		u32 osd_epoch_barrier;
 		u32 pool_ns_len;
-		/* version >= 5 */
-		ceph_decode_32_safe(&p, end, osd_epoch_barrier, bad);
+
 		/* version >= 6 */
 		ceph_decode_64_safe(&p, end, flush_tid, bad);
 		/* version >= 7 */
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 39ff678e567f..825df757fba5 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -172,6 +172,9 @@ static int mds_sessions_show(struct seq_file *s, void *ptr)
 	/* The -o name mount argument */
 	seq_printf(s, "name \"%s\"\n", opt->name ? opt->name : "");
 
+	/* The latest OSD epoch barrier known to this client */
+	seq_printf(s, "osd_epoch_barrier \"%d\"\n", mdsc->cap_epoch_barrier);
+
 	/* The list of MDS session rank+state */
 	for (mds = 0; mds < mdsc->max_sessions; mds++) {
 		struct ceph_mds_session *session =
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 176512960b14..7055b499c08b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -393,6 +393,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	dout("mdsc put_session %p %d -> %d\n", s,
 	     atomic_read(&s->s_ref), atomic_read(&s->s_ref)-1);
 	if (atomic_dec_and_test(&s->s_ref)) {
+		WARN_ON_ONCE(cancel_work_sync(&s->s_delayed_caps_work));
 		if (s->s_auth.authorizer)
 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
 		kfree(s);
@@ -432,6 +433,74 @@ static int __verify_registered_session(struct ceph_mds_client *mdsc,
 	return 0;
 }
 
+static void handle_osd_map(struct ceph_osd_client *osdc, void *p)
+{
+	struct ceph_mds_client *mdsc = (struct ceph_mds_client*)p;
+	u32 cancelled_epoch = 0;
+	int mds_id;
+
+	lockdep_assert_held(&osdc->lock);
+
+	if ((osdc->osdmap->flags & CEPH_OSDMAP_FULL) ||
+	    ceph_osdc_have_pool_full(osdc))
+		cancelled_epoch = ceph_osdc_complete_writes(osdc, -ENOSPC);
+
+	dout("handle_osd_map: epoch=%d\n", osdc->osdmap->epoch);
+
+	mutex_lock(&mdsc->mutex);
+	if (cancelled_epoch)
+		mdsc->cap_epoch_barrier = max(cancelled_epoch + 1,
+					      mdsc->cap_epoch_barrier);
+
+	/* Schedule the workqueue job for any sessions */
+	for (mds_id = 0; mds_id < mdsc->max_sessions; ++mds_id) {
+		struct ceph_mds_session *session = mdsc->sessions[mds_id];
+		bool empty;
+
+		if (session == NULL)
+			continue;
+
+		/* Any delayed messages? */
+		spin_lock(&session->s_cap_lock);
+		empty = list_empty(&session->s_delayed_caps);
+		spin_unlock(&session->s_cap_lock);
+		if (empty)
+			continue;
+
+		/* take a reference -- if we can't get one, move on */
+		if (!get_session(session))
+			continue;
+
+		/*
+		 * Try to schedule work. If it's already queued, then just
+		 * drop the session reference.
+		 */
+		if (!schedule_work(&session->s_delayed_caps_work))
+			ceph_put_mds_session(session);
+	}
+	mutex_unlock(&mdsc->mutex);
+}
+
+static void
+run_delayed_caps(struct work_struct *work)
+{
+	struct ceph_mds_session *session = container_of(work,
+			struct ceph_mds_session, s_delayed_caps_work);
+	LIST_HEAD(delayed);
+
+	spin_lock(&session->s_cap_lock);
+	list_splice_init(&session->s_delayed_caps, &delayed);
+	spin_unlock(&session->s_cap_lock);
+
+	while (!list_empty(&delayed)) {
+		struct ceph_msg *msg = list_first_entry(&delayed,
+						struct ceph_msg, list_head);
+		list_del_init(&msg->list_head);
+		ceph_handle_caps(session, msg);
+		ceph_msg_put(msg);
+	}
+}
+
 /*
  * create+register a new session for given mds.
  * called under mdsc->mutex.
@@ -469,11 +538,13 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
 	atomic_set(&s->s_ref, 1);
 	INIT_LIST_HEAD(&s->s_waiting);
 	INIT_LIST_HEAD(&s->s_unsafe);
+	INIT_LIST_HEAD(&s->s_delayed_caps);
 	s->s_num_cap_releases = 0;
 	s->s_cap_reconnect = 0;
 	s->s_cap_iterator = NULL;
 	INIT_LIST_HEAD(&s->s_cap_releases);
 	INIT_LIST_HEAD(&s->s_cap_flushing);
+	INIT_WORK(&s->s_delayed_caps_work, run_delayed_caps);
 
 	dout("register_session mds%d\n", mds);
 	if (mds >= mdsc->max_sessions) {
@@ -3480,6 +3551,10 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
 
 	ceph_caps_init(mdsc);
 	ceph_adjust_min_caps(mdsc, fsc->min_caps);
+	mdsc->cap_epoch_barrier = 0;
+
+	ceph_osdc_register_map_cb(&fsc->client->osdc,
+				  handle_osd_map, (void*)mdsc);
 
 	init_rwsem(&mdsc->pool_perm_rwsem);
 	mdsc->pool_perm_tree = RB_ROOT;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 3c6f77b7bb02..eb8144ab4995 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -159,6 +159,8 @@ struct ceph_mds_session {
 	atomic_t          s_ref;
 	struct list_head  s_waiting;  /* waiting requests */
 	struct list_head  s_unsafe;   /* unsafe requests */
+	struct list_head	s_delayed_caps;
+	struct work_struct	s_delayed_caps_work;
 };
 
 /*
@@ -331,6 +333,7 @@ struct ceph_mds_client {
 	int               num_cap_flushing; /* # caps we are flushing */
 	spinlock_t        cap_dirty_lock;   /* protects above items */
 	wait_queue_head_t cap_flushing_wq;
+	u32               cap_epoch_barrier;
 
 	/*
 	 * Cap reservations
-- 
2.9.3



* [PATCH v1 5/7] ceph: update CAPRELEASE message format
  2017-01-20 15:17 [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
                   ` (3 preceding siblings ...)
  2017-01-20 15:17 ` [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths Jeff Layton
@ 2017-01-20 15:17 ` Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 6/7] ceph: clean out delayed caps when destroying session Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 7/7] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
  6 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2017-01-20 15:17 UTC (permalink / raw)
  To: ceph-devel; +Cc: jspray, idryomov, zyan, sage

Version 2 includes the new osd epoch barrier field. This allows clients
to inform servers that their released caps may not be used until a
particular OSD map epoch.
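The v2 wire change amounts to appending a trailing __le32 barrier after
the cap items and reserving room for it when budgeting items per page.
A userspace sketch, with illustrative sizes standing in for the real
ceph_mds_cap_release/ceph_mds_cap_item structures:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ		4096
#define HDR_SZ		8	/* stand-in for struct ceph_mds_cap_release */
#define ITEM_SZ		16	/* stand-in for struct ceph_mds_cap_item */

/* Reserve room for the trailing barrier when budgeting items per page,
 * as the updated CEPH_CAPS_PER_RELEASE macro does. */
#define CAPS_PER_RELEASE \
	((PAGE_SZ - sizeof(uint32_t) - HDR_SZ) / ITEM_SZ)

/* Append the barrier after the last item and return the new length.
 * (Assumes a little-endian host; the kernel uses cpu_to_le32().) */
static size_t append_barrier(uint8_t *buf, size_t len, uint32_t barrier)
{
	memcpy(buf + len, &barrier, sizeof(barrier));
	return len + sizeof(barrier);
}
```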

[ jlayton: bring up to v4.10-rc3 ]

Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/mds_client.c | 15 +++++++++++++++
 fs/ceph/mds_client.h |  7 +++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 7055b499c08b..28c83454e9f6 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1620,6 +1620,7 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 	struct ceph_cap *cap;
 	LIST_HEAD(tmp_list);
 	int num_cap_releases;
+	__le32	*cap_barrier;
 
 	spin_lock(&session->s_cap_lock);
 again:
@@ -1637,7 +1638,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 			head = msg->front.iov_base;
 			head->num = cpu_to_le32(0);
 			msg->front.iov_len = sizeof(*head);
+
+			msg->hdr.version = cpu_to_le16(2);
+			msg->hdr.compat_version = cpu_to_le16(1);
 		}
+
 		cap = list_first_entry(&tmp_list, struct ceph_cap,
 					session_caps);
 		list_del(&cap->session_caps);
@@ -1655,6 +1660,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 		ceph_put_cap(mdsc, cap);
 
 		if (le32_to_cpu(head->num) == CEPH_CAPS_PER_RELEASE) {
 +			/* Append cap_barrier field */
+			cap_barrier = msg->front.iov_base + msg->front.iov_len;
+			*cap_barrier = cpu_to_le32(mdsc->cap_epoch_barrier);
+			msg->front.iov_len += sizeof(*cap_barrier);
+
 			msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
 			dout("send_cap_releases mds%d %p\n", session->s_mds, msg);
 			ceph_con_send(&session->s_con, msg);
@@ -1670,6 +1680,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 	spin_unlock(&session->s_cap_lock);
 
 	if (msg) {
 +		/* Append cap_barrier field */
+		cap_barrier = msg->front.iov_base + msg->front.iov_len;
+		*cap_barrier = cpu_to_le32(mdsc->cap_epoch_barrier);
+		msg->front.iov_len += sizeof(*cap_barrier);
+
 		msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
 		dout("send_cap_releases mds%d %p\n", session->s_mds, msg);
 		ceph_con_send(&session->s_con, msg);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index eb8144ab4995..5da41326a0fd 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -104,10 +104,13 @@ struct ceph_mds_reply_info_parsed {
 
 /*
  * cap releases are batched and sent to the MDS en masse.
+ *
+ * Account for per-message overhead of mds_cap_release header
+ * and __le32 for osd epoch barrier trailing field.
  */
-#define CEPH_CAPS_PER_RELEASE ((PAGE_SIZE -			\
+#define CEPH_CAPS_PER_RELEASE ((PAGE_SIZE - sizeof(u32) -		\
 				sizeof(struct ceph_mds_cap_release)) /	\
-			       sizeof(struct ceph_mds_cap_item))
+			        sizeof(struct ceph_mds_cap_item))
 
 
 /*
-- 
2.9.3



* [PATCH v1 6/7] ceph: clean out delayed caps when destroying session
  2017-01-20 15:17 [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
                   ` (4 preceding siblings ...)
  2017-01-20 15:17 ` [PATCH v1 5/7] ceph: update CAPRELEASE message format Jeff Layton
@ 2017-01-20 15:17 ` Jeff Layton
  2017-01-20 15:17 ` [PATCH v1 7/7] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
  6 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2017-01-20 15:17 UTC (permalink / raw)
  To: ceph-devel; +Cc: jspray, idryomov, zyan, sage

On the last session put, drop any delayed cap messages that haven't yet
run, without processing them.
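The splice-and-drain pattern drop_delayed_caps() uses (steal the whole
list under the lock, then free entries with the lock dropped) looks like
this in a userspace sketch with a pthread mutex in place of the
s_cap_lock spinlock:

```c
#include <pthread.h>
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

static pthread_mutex_t cap_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *delayed_head;

/* Returns the number of entries dropped. */
static int drop_delayed(void)
{
	struct node *list, *n;
	int dropped = 0;

	pthread_mutex_lock(&cap_lock);
	list = delayed_head;	/* splice: steal the whole list at once */
	delayed_head = NULL;
	pthread_mutex_unlock(&cap_lock);

	while (list) {		/* drain without holding the lock */
		n = list;
		list = n->next;
		n->next = NULL;	/* real code calls ceph_msg_put() here */
		dropped++;
	}
	return dropped;
}
```

Keeping the lock hold time to a single pointer swap is the point: the
per-entry cleanup can then run without contending with the cap-handling
paths.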

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/mds_client.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 28c83454e9f6..05ab69763308 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -361,6 +361,23 @@ static void destroy_reply_info(struct ceph_mds_reply_info_parsed *info)
 /*
  * sessions
  */
+static void
+drop_delayed_caps(struct ceph_mds_session *session)
+{
+	LIST_HEAD(delayed);
+
+	spin_lock(&session->s_cap_lock);
+	list_splice_init(&session->s_delayed_caps, &delayed);
+	spin_unlock(&session->s_cap_lock);
+
+	while (!list_empty(&delayed)) {
+		struct ceph_msg *msg = list_first_entry(&delayed,
+						struct ceph_msg, list_head);
+		list_del_init(&msg->list_head);
+		ceph_msg_put(msg);
+	}
+}
+
 const char *ceph_session_state_name(int s)
 {
 	switch (s) {
@@ -394,6 +411,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	     atomic_read(&s->s_ref), atomic_read(&s->s_ref)-1);
 	if (atomic_dec_and_test(&s->s_ref)) {
 		WARN_ON_ONCE(cancel_work_sync(&s->s_delayed_caps_work));
+		drop_delayed_caps(s);
 		if (s->s_auth.authorizer)
 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
 		kfree(s);
-- 
2.9.3



* [PATCH v1 7/7] libceph: allow requests to return immediately on full conditions if caller wishes
  2017-01-20 15:17 [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
                   ` (5 preceding siblings ...)
  2017-01-20 15:17 ` [PATCH v1 6/7] ceph: clean out delayed caps when destroying session Jeff Layton
@ 2017-01-20 15:17 ` Jeff Layton
  6 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2017-01-20 15:17 UTC (permalink / raw)
  To: ceph-devel; +Cc: jspray, idryomov, zyan, sage

Right now, cephfs will cancel any in-flight OSD write operations when a
new map comes in that shows the OSD or pool as full, but nothing
prevents new requests from stalling out after that point.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition, then allow it to set
a flag to request that behavior. Cephfs write requests will always set
that flag.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/addr.c             | 14 +++++++++-----
 fs/ceph/file.c             |  8 +++++---
 include/linux/ceph/rados.h |  1 +
 net/ceph/osd_client.c      |  6 ++++++
 4 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 4547bbf80e4f..577fe6351de1 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1019,7 +1019,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 					offset, &len, 0, num_ops,
 					CEPH_OSD_OP_WRITE,
 					CEPH_OSD_FLAG_WRITE |
-					CEPH_OSD_FLAG_ONDISK,
+					CEPH_OSD_FLAG_ONDISK |
+					CEPH_OSD_FLAG_FULL_CANCEL,
 					snapc, truncate_seq,
 					truncate_size, false);
 		if (IS_ERR(req)) {
@@ -1030,7 +1031,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 						    CEPH_OSD_SLAB_OPS),
 						CEPH_OSD_OP_WRITE,
 						CEPH_OSD_FLAG_WRITE |
-						CEPH_OSD_FLAG_ONDISK,
+						CEPH_OSD_FLAG_ONDISK |
+						CEPH_OSD_FLAG_FULL_CANCEL,
 						snapc, truncate_seq,
 						truncate_size, true);
 			BUG_ON(IS_ERR(req));
@@ -1681,7 +1683,9 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
 	req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
 				    ceph_vino(inode), 0, &len, 0, 1,
 				    CEPH_OSD_OP_CREATE,
-				    CEPH_OSD_FLAG_ONDISK | CEPH_OSD_FLAG_WRITE,
+				    CEPH_OSD_FLAG_ONDISK |
+				    CEPH_OSD_FLAG_WRITE |
+				    CEPH_OSD_FLAG_FULL_CANCEL,
 				    NULL, 0, 0, false);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -1699,7 +1703,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
 	req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
 				    ceph_vino(inode), 0, &len, 1, 3,
 				    CEPH_OSD_OP_WRITE,
-				    CEPH_OSD_FLAG_ONDISK | CEPH_OSD_FLAG_WRITE,
+				    CEPH_OSD_FLAG_ONDISK | CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_FULL_CANCEL,
 				    NULL, ci->i_truncate_seq,
 				    ci->i_truncate_size, false);
 	if (IS_ERR(req)) {
@@ -1872,7 +1876,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
 		goto out_unlock;
 	}
 
-	wr_req->r_flags = CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ACK;
+	wr_req->r_flags = CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_FULL_CANCEL;
 	osd_req_op_init(wr_req, 0, CEPH_OSD_OP_CREATE, CEPH_OSD_OP_FLAG_EXCL);
 	ceph_oloc_copy(&wr_req->r_base_oloc, &rd_req->r_base_oloc);
 	ceph_oid_copy(&wr_req->r_base_oid, &rd_req->r_base_oid);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 25e71100bdad..bc2037291e49 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -736,7 +736,7 @@ static void ceph_aio_retry_work(struct work_struct *work)
 
 	req->r_flags =	CEPH_OSD_FLAG_ORDERSNAP |
 			CEPH_OSD_FLAG_ONDISK |
-			CEPH_OSD_FLAG_WRITE;
+			CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_FULL_CANCEL;
 	ceph_oloc_copy(&req->r_base_oloc, &orig_req->r_base_oloc);
 	ceph_oid_copy(&req->r_base_oid, &orig_req->r_base_oid);
 
@@ -893,7 +893,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 
 		flags = CEPH_OSD_FLAG_ORDERSNAP |
 			CEPH_OSD_FLAG_ONDISK |
-			CEPH_OSD_FLAG_WRITE;
+			CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_FULL_CANCEL;
 	} else {
 		flags = CEPH_OSD_FLAG_READ;
 	}
@@ -1095,6 +1095,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 	flags = CEPH_OSD_FLAG_ORDERSNAP |
 		CEPH_OSD_FLAG_ONDISK |
 		CEPH_OSD_FLAG_WRITE |
+		CEPH_OSD_FLAG_FULL_CANCEL |
 		CEPH_OSD_FLAG_ACK;
 
 	while ((len = iov_iter_count(from)) > 0) {
@@ -1593,7 +1594,8 @@ static int ceph_zero_partial_object(struct inode *inode,
 					offset, length,
 					0, 1, op,
 					CEPH_OSD_FLAG_WRITE |
-					CEPH_OSD_FLAG_ONDISK,
+					CEPH_OSD_FLAG_ONDISK |
+					CEPH_OSD_FLAG_FULL_CANCEL,
 					NULL, 0, 0, false);
 	if (IS_ERR(req)) {
 		ret = PTR_ERR(req);
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 5c0da61cb763..def43570a85a 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -401,6 +401,7 @@ enum {
 	CEPH_OSD_FLAG_KNOWN_REDIR = 0x400000,  /* redirect bit is authoritative */
 	CEPH_OSD_FLAG_FULL_TRY =    0x800000,  /* try op despite full flag */
 	CEPH_OSD_FLAG_FULL_FORCE = 0x1000000,  /* force op despite full flag */
+	CEPH_OSD_FLAG_FULL_CANCEL = 0x2000000, /* cancel operation on full flag */
 };
 
 enum {
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 97c266f96708..b9fd5cfea343 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -50,6 +50,7 @@ static void link_linger(struct ceph_osd *osd,
 			struct ceph_osd_linger_request *lreq);
 static void unlink_linger(struct ceph_osd *osd,
 			  struct ceph_osd_linger_request *lreq);
+static void complete_request(struct ceph_osd_request *req, int err);
 
 #if 1
 static inline bool rwsem_is_wrlocked(struct rw_semaphore *sem)
@@ -1639,6 +1640,7 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 	enum calc_target_result ct_res;
 	bool need_send = false;
 	bool promoted = false;
+	int ret = 0;
 
 	WARN_ON(req->r_tid || req->r_got_reply);
 	dout("%s req %p wrlocked %d\n", __func__, req, wrlocked);
@@ -1673,6 +1675,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 		pr_warn_ratelimited("FULL or reached pool quota\n");
 		req->r_t.paused = true;
 		__ceph_osdc_maybe_request_map(osdc);
+		if (req->r_flags & CEPH_OSD_FLAG_FULL_CANCEL)
+			ret = -ENOSPC;
 	} else if (!osd_homeless(osd)) {
 		need_send = true;
 	} else {
@@ -1689,6 +1693,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 	link_request(osd, req);
 	if (need_send)
 		send_request(req);
+	else if (ret)
+		complete_request(req, ret);
 	mutex_unlock(&osd->lock);
 
 	if (ct_res == CALC_TARGET_POOL_DNE)
-- 
2.9.3



* Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-01-20 15:17 ` [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths Jeff Layton
@ 2017-01-22  9:40   ` Yan, Zheng
  2017-01-22 15:38     ` Jeff Layton
  2017-02-01 19:50     ` Jeff Layton
  0 siblings, 2 replies; 16+ messages in thread
From: Yan, Zheng @ 2017-01-22  9:40 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, John Spray, Ilya Dryomov, Sage Weil


> On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
> 
> This patch is heavily inspired by John Spray's earlier work, but
> implemented in a different way.
> 
> Create and register a new map_cb for cephfs, to allow it to handle
> changes to the osdmap.
> 
> In version 5 of CLIENT_CAPS messages, the barrier field is added as an
> instruction to clients that they may not use the attached capabilities
> until they have a particular OSD map epoch.
> 
> When we get a message with such a field and don't have the requisite map
> epoch yet, we put that message on a list in the session, to be run when
> the map does come in.
> 
> When we get a new map update, the map_cb routine first checks to see
> whether there may be an OSD or pool full condition. If so, then we walk
> the list of OSD calls and kill off any writes to full OSDs or pools with
> -ENOSPC.  While cancelling, we store the latest OSD epoch seen in each
> request. This will be used later in the CAPRELEASE messages.
> 
> Then, it walks the session list and queues the workqueue job for each.
> When the workqueue job runs, it walks the list of delayed caps and tries
> to rerun each one. If the epoch is still not high enough, they just get
> put back on the delay queue for when the map does come in.
> 
> Suggested-by: John Spray <john.spray@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
> fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
> fs/ceph/debugfs.c    |  3 +++
> fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ceph/mds_client.h |  3 +++
> 4 files changed, 120 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index d941c48e8bff..f33d424b5e12 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
> 	/* inline data size */
> 	ceph_encode_32(&p, 0);
> 	/* osd_epoch_barrier (version 5) */
> -	ceph_encode_32(&p, 0);
> +	ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
> 	/* oldest_flush_tid (version 6) */
> 	ceph_encode_64(&p, arg->oldest_flush_tid);
> 
> @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> 	void *snaptrace;
> 	size_t snaptrace_len;
> 	void *p, *end;
> +	u32 epoch_barrier = 0;
> 
> 	dout("handle_caps from mds%d\n", mds);
> 
> +	WARN_ON_ONCE(!list_empty(&msg->list_head));
> +
> 	/* decode */
> 	end = msg->front.iov_base + msg->front.iov_len;
> 	tid = le64_to_cpu(msg->hdr.tid);
> @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> 		p += inline_len;
> 	}
> 
> +	if (le16_to_cpu(msg->hdr.version) >= 5) {
> +		struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
> +
> +		ceph_decode_32_safe(&p, end, epoch_barrier, bad);
> +
> +		/* Do lockless check first to avoid mutex if we can */
> +		if (epoch_barrier > mdsc->cap_epoch_barrier) {
> +			mutex_lock(&mdsc->mutex);
> +			if (epoch_barrier > mdsc->cap_epoch_barrier)
> +				mdsc->cap_epoch_barrier = epoch_barrier;
> +			mutex_unlock(&mdsc->mutex);
> +		}
> +
> +		down_read(&osdc->lock);
> +		if (osdc->osdmap->epoch < epoch_barrier) {
> +			dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
> +			ceph_msg_get(msg);
> +			spin_lock(&session->s_cap_lock);
> +			list_add(&msg->list_head, &session->s_delayed_caps);
> +			spin_unlock(&session->s_cap_lock);
> +
> +			// Kick OSD client to get the latest map
> +			__ceph_osdc_maybe_request_map(osdc);
> +
> +			up_read(&osdc->lock);
> +			return;
> +		}

Cap messages need to be handled in the same order as they were sent. I’m worried that the delay breaks something or makes cap revokes slow. Why not use the same approach as the userspace client: pass the epoch_barrier to libceph and let libceph delay sending OSD requests?

Regards
Yan, Zheng


> +
> +		dout("handle_caps barrier %d already satisfied (%d)\n", epoch_barrier, osdc->osdmap->epoch);
> +		up_read(&osdc->lock);
> +	}
> +
> +	dout("handle_caps v=%d barrier=%d\n", le16_to_cpu(msg->hdr.version), epoch_barrier);
> +
> 	if (le16_to_cpu(msg->hdr.version) >= 8) {
> 		u64 flush_tid;
> 		u32 caller_uid, caller_gid;
> -		u32 osd_epoch_barrier;
> 		u32 pool_ns_len;
> -		/* version >= 5 */
> -		ceph_decode_32_safe(&p, end, osd_epoch_barrier, bad);
> +
> 		/* version >= 6 */
> 		ceph_decode_64_safe(&p, end, flush_tid, bad);
> 		/* version >= 7 */
> diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> index 39ff678e567f..825df757fba5 100644
> --- a/fs/ceph/debugfs.c
> +++ b/fs/ceph/debugfs.c
> @@ -172,6 +172,9 @@ static int mds_sessions_show(struct seq_file *s, void *ptr)
> 	/* The -o name mount argument */
> 	seq_printf(s, "name \"%s\"\n", opt->name ? opt->name : "");
> 
> +	/* The latest OSD epoch barrier known to this client */
> +	seq_printf(s, "osd_epoch_barrier \"%d\"\n", mdsc->cap_epoch_barrier);
> +
> 	/* The list of MDS session rank+state */
> 	for (mds = 0; mds < mdsc->max_sessions; mds++) {
> 		struct ceph_mds_session *session =
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 176512960b14..7055b499c08b 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -393,6 +393,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
> 	dout("mdsc put_session %p %d -> %d\n", s,
> 	     atomic_read(&s->s_ref), atomic_read(&s->s_ref)-1);
> 	if (atomic_dec_and_test(&s->s_ref)) {
> +		WARN_ON_ONCE(cancel_work_sync(&s->s_delayed_caps_work));
> 		if (s->s_auth.authorizer)
> 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
> 		kfree(s);
> @@ -432,6 +433,74 @@ static int __verify_registered_session(struct ceph_mds_client *mdsc,
> 	return 0;
> }
> 
> +static void handle_osd_map(struct ceph_osd_client *osdc, void *p)
> +{
> +	struct ceph_mds_client *mdsc = (struct ceph_mds_client*)p;
> +	u32 cancelled_epoch = 0;
> +	int mds_id;
> +
> +	lockdep_assert_held(&osdc->lock);
> +
> +	if ((osdc->osdmap->flags & CEPH_OSDMAP_FULL) ||
> +	    ceph_osdc_have_pool_full(osdc))
> +		cancelled_epoch = ceph_osdc_complete_writes(osdc, -ENOSPC);
> +
> +	dout("handle_osd_map: epoch=%d\n", osdc->osdmap->epoch);
> +
> +	mutex_lock(&mdsc->mutex);
> +	if (cancelled_epoch)
> +		mdsc->cap_epoch_barrier = max(cancelled_epoch + 1,
> +					      mdsc->cap_epoch_barrier);
> +
> +	/* Schedule the workqueue job for any sessions */
> +	for (mds_id = 0; mds_id < mdsc->max_sessions; ++mds_id) {
> +		struct ceph_mds_session *session = mdsc->sessions[mds_id];
> +		bool empty;
> +
> +		if (session == NULL)
> +			continue;
> +
> +		/* Any delayed messages? */
> +		spin_lock(&session->s_cap_lock);
> +		empty = list_empty(&session->s_delayed_caps);
> +		spin_unlock(&session->s_cap_lock);
> +		if (empty)
> +			continue;
> +
> +		/* take a reference -- if we can't get one, move on */
> +		if (!get_session(session))
> +			continue;
> +
> +		/*
> +		 * Try to schedule work. If it's already queued, then just
> +		 * drop the session reference.
> +		 */
> +		if (!schedule_work(&session->s_delayed_caps_work))
> +			ceph_put_mds_session(session);
> +	}
> +	mutex_unlock(&mdsc->mutex);
> +}
> +
> +static void
> +run_delayed_caps(struct work_struct *work)
> +{
> +	struct ceph_mds_session *session = container_of(work,
> +			struct ceph_mds_session, s_delayed_caps_work);
> +	LIST_HEAD(delayed);
> +
> +	spin_lock(&session->s_cap_lock);
> +	list_splice_init(&session->s_delayed_caps, &delayed);
> +	spin_unlock(&session->s_cap_lock);
> +
> +	while (!list_empty(&delayed)) {
> +		struct ceph_msg *msg = list_first_entry(&delayed,
> +						struct ceph_msg, list_head);
> +		list_del_init(&msg->list_head);
> +		ceph_handle_caps(session, msg);
> +		ceph_msg_put(msg);
> +	}
> +}
> +
> /*
>  * create+register a new session for given mds.
>  * called under mdsc->mutex.
> @@ -469,11 +538,13 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
> 	atomic_set(&s->s_ref, 1);
> 	INIT_LIST_HEAD(&s->s_waiting);
> 	INIT_LIST_HEAD(&s->s_unsafe);
> +	INIT_LIST_HEAD(&s->s_delayed_caps);
> 	s->s_num_cap_releases = 0;
> 	s->s_cap_reconnect = 0;
> 	s->s_cap_iterator = NULL;
> 	INIT_LIST_HEAD(&s->s_cap_releases);
> 	INIT_LIST_HEAD(&s->s_cap_flushing);
> +	INIT_WORK(&s->s_delayed_caps_work, run_delayed_caps);
> 
> 	dout("register_session mds%d\n", mds);
> 	if (mds >= mdsc->max_sessions) {
> @@ -3480,6 +3551,10 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> 
> 	ceph_caps_init(mdsc);
> 	ceph_adjust_min_caps(mdsc, fsc->min_caps);
> +	mdsc->cap_epoch_barrier = 0;
> +
> +	ceph_osdc_register_map_cb(&fsc->client->osdc,
> +				  handle_osd_map, (void*)mdsc);
> 
> 	init_rwsem(&mdsc->pool_perm_rwsem);
> 	mdsc->pool_perm_tree = RB_ROOT;
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 3c6f77b7bb02..eb8144ab4995 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -159,6 +159,8 @@ struct ceph_mds_session {
> 	atomic_t          s_ref;
> 	struct list_head  s_waiting;  /* waiting requests */
> 	struct list_head  s_unsafe;   /* unsafe requests */
> +	struct list_head	s_delayed_caps;
> +	struct work_struct	s_delayed_caps_work;
> };
> 
> /*
> @@ -331,6 +333,7 @@ struct ceph_mds_client {
> 	int               num_cap_flushing; /* # caps we are flushing */
> 	spinlock_t        cap_dirty_lock;   /* protects above items */
> 	wait_queue_head_t cap_flushing_wq;
> +	u32               cap_epoch_barrier;
> 
> 	/*
> 	 * Cap reservations
> -- 
> 2.9.3
> 



* Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-01-22  9:40   ` Yan, Zheng
@ 2017-01-22 15:38     ` Jeff Layton
  2017-01-23  1:38       ` Yan, Zheng
  2017-02-01 19:50     ` Jeff Layton
  1 sibling, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2017-01-22 15:38 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, John Spray, Ilya Dryomov, Sage Weil

On Sun, 2017-01-22 at 17:40 +0800, Yan, Zheng wrote:
> > On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > This patch is heavily inspired by John Spray's earlier work, but
> > implemented in a different way.
> > 
> > Create and register a new map_cb for cephfs, to allow it to handle
> > changes to the osdmap.
> > 
> > In version 5 of CLIENT_CAPS messages, the barrier field is added as an
> > instruction to clients that they may not use the attached capabilities
> > until they have a particular OSD map epoch.
> > 
> > When we get a message with such a field and don't have the requisite map
> > epoch yet, we put that message on a list in the session, to be run when
> > the map does come in.
> > 
> > When we get a new map update, the map_cb routine first checks to see
> > whether there may be an OSD or pool full condition. If so, then we walk
> > the list of OSD calls and kill off any writes to full OSDs or pools with
> > -ENOSPC.  While cancelling, we store the latest OSD epoch seen in each
> > request. This will be used later in the CAPRELEASE messages.
> > 
> > Then, it walks the session list and queues the workqueue job for each.
> > When the workqueue job runs, it walks the list of delayed caps and tries
> > to rerun each one. If the epoch is still not high enough, they just get
> > put back on the delay queue for when the map does come in.
> > 
> > Suggested-by: John Spray <john.spray@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
> > fs/ceph/debugfs.c    |  3 +++
> > fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/ceph/mds_client.h |  3 +++
> > 4 files changed, 120 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index d941c48e8bff..f33d424b5e12 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > 	/* inline data size */
> > 	ceph_encode_32(&p, 0);
> > 	/* osd_epoch_barrier (version 5) */
> > -	ceph_encode_32(&p, 0);
> > +	ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
> > 	/* oldest_flush_tid (version 6) */
> > 	ceph_encode_64(&p, arg->oldest_flush_tid);
> > 
> > @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > 	void *snaptrace;
> > 	size_t snaptrace_len;
> > 	void *p, *end;
> > +	u32 epoch_barrier = 0;
> > 
> > 	dout("handle_caps from mds%d\n", mds);
> > 
> > +	WARN_ON_ONCE(!list_empty(&msg->list_head));
> > +
> > 	/* decode */
> > 	end = msg->front.iov_base + msg->front.iov_len;
> > 	tid = le64_to_cpu(msg->hdr.tid);
> > @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > 		p += inline_len;
> > 	}
> > 
> > +	if (le16_to_cpu(msg->hdr.version) >= 5) {
> > +		struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
> > +
> > +		ceph_decode_32_safe(&p, end, epoch_barrier, bad);
> > +
> > +		/* Do lockless check first to avoid mutex if we can */
> > +		if (epoch_barrier > mdsc->cap_epoch_barrier) {
> > +			mutex_lock(&mdsc->mutex);
> > +			if (epoch_barrier > mdsc->cap_epoch_barrier)
> > +				mdsc->cap_epoch_barrier = epoch_barrier;
> > +			mutex_unlock(&mdsc->mutex);
> > +		}
> > +
> > +		down_read(&osdc->lock);
> > +		if (osdc->osdmap->epoch < epoch_barrier) {
> > +			dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
> > +			ceph_msg_get(msg);
> > +			spin_lock(&session->s_cap_lock);
> > +			list_add(&msg->list_head, &session->s_delayed_caps);
> > +			spin_unlock(&session->s_cap_lock);
> > +
> > +			// Kick OSD client to get the latest map
> > +			__ceph_osdc_maybe_request_map(osdc);
> > +
> > +			up_read(&osdc->lock);
> > +			return;
> > +		}
> 
> Cap messages need to be handled in the same order as they were sent. I’m worried that the delay breaks something or makes cap revokes slow. Why not use the same approach as the userspace client: pass the epoch_barrier to libceph and let libceph delay sending OSD requests?
> 
> Regards
> Yan, Zheng
> 
> 


Can you explain why that needs to be done, and how that ordering is
guaranteed now? AFAICT, the libceph code seems to drop con->mutex before
it calls ->dispatch, and I don't see anything that would prevent two cap
messages from being reordered at that point.

That said, plumbing an epoch barrier into libceph does sound better.
I'll have a look at that approach this week.

Thanks,
Jeff
 
> > +
> > +		dout("handle_caps barrier %d already satisfied (%d)\n", epoch_barrier, osdc->osdmap->epoch);
> > +		up_read(&osdc->lock);
> > +	}
> > +
> > +	dout("handle_caps v=%d barrier=%d\n", le16_to_cpu(msg->hdr.version), epoch_barrier);
> > +
> > 	if (le16_to_cpu(msg->hdr.version) >= 8) {
> > 		u64 flush_tid;
> > 		u32 caller_uid, caller_gid;
> > -		u32 osd_epoch_barrier;
> > 		u32 pool_ns_len;
> > -		/* version >= 5 */
> > -		ceph_decode_32_safe(&p, end, osd_epoch_barrier, bad);
> > +
> > 		/* version >= 6 */
> > 		ceph_decode_64_safe(&p, end, flush_tid, bad);
> > 		/* version >= 7 */
> > diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> > index 39ff678e567f..825df757fba5 100644
> > --- a/fs/ceph/debugfs.c
> > +++ b/fs/ceph/debugfs.c
> > @@ -172,6 +172,9 @@ static int mds_sessions_show(struct seq_file *s, void *ptr)
> > 	/* The -o name mount argument */
> > 	seq_printf(s, "name \"%s\"\n", opt->name ? opt->name : "");
> > 
> > +	/* The latest OSD epoch barrier known to this client */
> > +	seq_printf(s, "osd_epoch_barrier \"%d\"\n", mdsc->cap_epoch_barrier);
> > +
> > 	/* The list of MDS session rank+state */
> > 	for (mds = 0; mds < mdsc->max_sessions; mds++) {
> > 		struct ceph_mds_session *session =
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 176512960b14..7055b499c08b 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -393,6 +393,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
> > 	dout("mdsc put_session %p %d -> %d\n", s,
> > 	     atomic_read(&s->s_ref), atomic_read(&s->s_ref)-1);
> > 	if (atomic_dec_and_test(&s->s_ref)) {
> > +		WARN_ON_ONCE(cancel_work_sync(&s->s_delayed_caps_work));
> > 		if (s->s_auth.authorizer)
> > 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
> > 		kfree(s);
> > @@ -432,6 +433,74 @@ static int __verify_registered_session(struct ceph_mds_client *mdsc,
> > 	return 0;
> > }
> > 
> > +static void handle_osd_map(struct ceph_osd_client *osdc, void *p)
> > +{
> > +	struct ceph_mds_client *mdsc = (struct ceph_mds_client*)p;
> > +	u32 cancelled_epoch = 0;
> > +	int mds_id;
> > +
> > +	lockdep_assert_held(&osdc->lock);
> > +
> > +	if ((osdc->osdmap->flags & CEPH_OSDMAP_FULL) ||
> > +	    ceph_osdc_have_pool_full(osdc))
> > +		cancelled_epoch = ceph_osdc_complete_writes(osdc, -ENOSPC);
> > +
> > +	dout("handle_osd_map: epoch=%d\n", osdc->osdmap->epoch);
> > +
> > +	mutex_lock(&mdsc->mutex);
> > +	if (cancelled_epoch)
> > +		mdsc->cap_epoch_barrier = max(cancelled_epoch + 1,
> > +					      mdsc->cap_epoch_barrier);
> > +
> > +	/* Schedule the workqueue job for any sessions */
> > +	for (mds_id = 0; mds_id < mdsc->max_sessions; ++mds_id) {
> > +		struct ceph_mds_session *session = mdsc->sessions[mds_id];
> > +		bool empty;
> > +
> > +		if (session == NULL)
> > +			continue;
> > +
> > +		/* Any delayed messages? */
> > +		spin_lock(&session->s_cap_lock);
> > +		empty = list_empty(&session->s_delayed_caps);
> > +		spin_unlock(&session->s_cap_lock);
> > +		if (empty)
> > +			continue;
> > +
> > +		/* take a reference -- if we can't get one, move on */
> > +		if (!get_session(session))
> > +			continue;
> > +
> > +		/*
> > +		 * Try to schedule work. If it's already queued, then just
> > +		 * drop the session reference.
> > +		 */
> > +		if (!schedule_work(&session->s_delayed_caps_work))
> > +			ceph_put_mds_session(session);
> > +	}
> > +	mutex_unlock(&mdsc->mutex);
> > +}
> > +
> > +static void
> > +run_delayed_caps(struct work_struct *work)
> > +{
> > +	struct ceph_mds_session *session = container_of(work,
> > +			struct ceph_mds_session, s_delayed_caps_work);
> > +	LIST_HEAD(delayed);
> > +
> > +	spin_lock(&session->s_cap_lock);
> > +	list_splice_init(&session->s_delayed_caps, &delayed);
> > +	spin_unlock(&session->s_cap_lock);
> > +
> > +	while (!list_empty(&delayed)) {
> > +		struct ceph_msg *msg = list_first_entry(&delayed,
> > +						struct ceph_msg, list_head);
> > +		list_del_init(&msg->list_head);
> > +		ceph_handle_caps(session, msg);
> > +		ceph_msg_put(msg);
> > +	}
> > +}
> > +
> > /*
> >  * create+register a new session for given mds.
> >  * called under mdsc->mutex.
> > @@ -469,11 +538,13 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
> > 	atomic_set(&s->s_ref, 1);
> > 	INIT_LIST_HEAD(&s->s_waiting);
> > 	INIT_LIST_HEAD(&s->s_unsafe);
> > +	INIT_LIST_HEAD(&s->s_delayed_caps);
> > 	s->s_num_cap_releases = 0;
> > 	s->s_cap_reconnect = 0;
> > 	s->s_cap_iterator = NULL;
> > 	INIT_LIST_HEAD(&s->s_cap_releases);
> > 	INIT_LIST_HEAD(&s->s_cap_flushing);
> > +	INIT_WORK(&s->s_delayed_caps_work, run_delayed_caps);
> > 
> > 	dout("register_session mds%d\n", mds);
> > 	if (mds >= mdsc->max_sessions) {
> > @@ -3480,6 +3551,10 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> > 
> > 	ceph_caps_init(mdsc);
> > 	ceph_adjust_min_caps(mdsc, fsc->min_caps);
> > +	mdsc->cap_epoch_barrier = 0;
> > +
> > +	ceph_osdc_register_map_cb(&fsc->client->osdc,
> > +				  handle_osd_map, (void*)mdsc);
> > 
> > 	init_rwsem(&mdsc->pool_perm_rwsem);
> > 	mdsc->pool_perm_tree = RB_ROOT;
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index 3c6f77b7bb02..eb8144ab4995 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -159,6 +159,8 @@ struct ceph_mds_session {
> > 	atomic_t          s_ref;
> > 	struct list_head  s_waiting;  /* waiting requests */
> > 	struct list_head  s_unsafe;   /* unsafe requests */
> > +	struct list_head	s_delayed_caps;
> > +	struct work_struct	s_delayed_caps_work;
> > };
> > 
> > /*
> > @@ -331,6 +333,7 @@ struct ceph_mds_client {
> > 	int               num_cap_flushing; /* # caps we are flushing */
> > 	spinlock_t        cap_dirty_lock;   /* protects above items */
> > 	wait_queue_head_t cap_flushing_wq;
> > +	u32               cap_epoch_barrier;
> > 
> > 	/*
> > 	 * Cap reservations
> > -- 
> > 2.9.3
> > 
> 
> 

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-01-22 15:38     ` Jeff Layton
@ 2017-01-23  1:38       ` Yan, Zheng
  0 siblings, 0 replies; 16+ messages in thread
From: Yan, Zheng @ 2017-01-23  1:38 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, John Spray, Ilya Dryomov, Sage Weil


> On 22 Jan 2017, at 23:38, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Sun, 2017-01-22 at 17:40 +0800, Yan, Zheng wrote:
>>> On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
>>> 
>>> This patch is heavily inspired by John Spray's earlier work, but
>>> implemented in a different way.
>>> 
>>> Create and register a new map_cb for cephfs, to allow it to handle
>>> changes to the osdmap.
>>> 
>>> In version 5 of CLIENT_CAPS messages, the barrier field is added as an
>>> instruction to clients that they may not use the attached capabilities
>>> until they have a particular OSD map epoch.
>>> 
>>> When we get a message with such a field and don't have the requisite map
>>> epoch yet, we put that message on a list in the session, to be run when
>>> the map does come in.
>>> 
>>> When we get a new map update, the map_cb routine first checks to see
>>> whether there may be an OSD or pool full condition. If so, then we walk
>>> the list of OSD calls and kill off any writes to full OSDs or pools with
>>> -ENOSPC.  While cancelling, we store the latest OSD epoch seen in each
>>> request. This will be used later in the CAPRELEASE messages.
>>> 
>>> Then, it walks the session list and queues the workqueue job for each.
>>> When the workqueue job runs, it walks the list of delayed caps and tries
>>> to rerun each one. If the epoch is still not high enough, they just get
>>> put back on the delay queue for when the map does come in.
>>> 
>>> Suggested-by: John Spray <john.spray@redhat.com>
>>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>>> ---
>>> fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
>>> fs/ceph/debugfs.c    |  3 +++
>>> fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> fs/ceph/mds_client.h |  3 +++
>>> 4 files changed, 120 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>> index d941c48e8bff..f33d424b5e12 100644
>>> --- a/fs/ceph/caps.c
>>> +++ b/fs/ceph/caps.c
>>> @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
>>> 	/* inline data size */
>>> 	ceph_encode_32(&p, 0);
>>> 	/* osd_epoch_barrier (version 5) */
>>> -	ceph_encode_32(&p, 0);
>>> +	ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
>>> 	/* oldest_flush_tid (version 6) */
>>> 	ceph_encode_64(&p, arg->oldest_flush_tid);
>>> 
>>> @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>>> 	void *snaptrace;
>>> 	size_t snaptrace_len;
>>> 	void *p, *end;
>>> +	u32 epoch_barrier = 0;
>>> 
>>> 	dout("handle_caps from mds%d\n", mds);
>>> 
>>> +	WARN_ON_ONCE(!list_empty(&msg->list_head));
>>> +
>>> 	/* decode */
>>> 	end = msg->front.iov_base + msg->front.iov_len;
>>> 	tid = le64_to_cpu(msg->hdr.tid);
>>> @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>>> 		p += inline_len;
>>> 	}
>>> 
>>> +	if (le16_to_cpu(msg->hdr.version) >= 5) {
>>> +		struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
>>> +
>>> +		ceph_decode_32_safe(&p, end, epoch_barrier, bad);
>>> +
>>> +		/* Do lockless check first to avoid mutex if we can */
>>> +		if (epoch_barrier > mdsc->cap_epoch_barrier) {
>>> +			mutex_lock(&mdsc->mutex);
>>> +			if (epoch_barrier > mdsc->cap_epoch_barrier)
>>> +				mdsc->cap_epoch_barrier = epoch_barrier;
>>> +			mutex_unlock(&mdsc->mutex);
>>> +		}
>>> +
>>> +		down_read(&osdc->lock);
>>> +		if (osdc->osdmap->epoch < epoch_barrier) {
>>> +			dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
>>> +			ceph_msg_get(msg);
>>> +			spin_lock(&session->s_cap_lock);
>>> +			list_add(&msg->list_head, &session->s_delayed_caps);
>>> +			spin_unlock(&session->s_cap_lock);
>>> +
>>> +			// Kick OSD client to get the latest map
>>> +			__ceph_osdc_maybe_request_map(osdc);
>>> +
>>> +			up_read(&osdc->lock);
>>> +			return;
>>> +		}
>> 
>> Cap messages need to be handled in the same order as they were sent. I’m worried that the delay breaks something or makes cap revokes slow. Why not use the same approach as the userspace client: pass the epoch_barrier to libceph and let libceph delay sending OSD requests?
>> 
>> Regards
>> Yan, Zheng
>> 
>> 
> 
> 
> Can you explain why that needs to be done, and how that ordering is
> guaranteed now? AFAICT, the libceph code seems to drop con->mutex before
> it calls ->dispatch, and I don't see anything that would prevent two cap
> messages from being reordered at that point.

I think the ordering is guaranteed by the single dispatch thread for each
connection. One example: the MDS issues caps ABC to a client, then issues
caps AB (revoking cap C). If the two messages are processed out of order,
the client never releases cap C to the MDS, and the operation waiting on
that release hangs.
 
Regards
Yan, Zheng

> 
> That said, plumbing an epoch barrier into libceph does sound better.
> I'll have a look at that approach this week.
> 
> Thanks,
> Jeff
> 
>>> +
>>> +		dout("handle_caps barrier %d already satisfied (%d)\n", epoch_barrier, osdc->osdmap->epoch);
>>> +		up_read(&osdc->lock);
>>> +	}
>>> +
>>> +	dout("handle_caps v=%d barrier=%d\n", le16_to_cpu(msg->hdr.version), epoch_barrier);
>>> +
>>> 	if (le16_to_cpu(msg->hdr.version) >= 8) {
>>> 		u64 flush_tid;
>>> 		u32 caller_uid, caller_gid;
>>> -		u32 osd_epoch_barrier;
>>> 		u32 pool_ns_len;
>>> -		/* version >= 5 */
>>> -		ceph_decode_32_safe(&p, end, osd_epoch_barrier, bad);
>>> +
>>> 		/* version >= 6 */
>>> 		ceph_decode_64_safe(&p, end, flush_tid, bad);
>>> 		/* version >= 7 */
>>> diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
>>> index 39ff678e567f..825df757fba5 100644
>>> --- a/fs/ceph/debugfs.c
>>> +++ b/fs/ceph/debugfs.c
>>> @@ -172,6 +172,9 @@ static int mds_sessions_show(struct seq_file *s, void *ptr)
>>> 	/* The -o name mount argument */
>>> 	seq_printf(s, "name \"%s\"\n", opt->name ? opt->name : "");
>>> 
>>> +	/* The latest OSD epoch barrier known to this client */
>>> +	seq_printf(s, "osd_epoch_barrier \"%d\"\n", mdsc->cap_epoch_barrier);
>>> +
>>> 	/* The list of MDS session rank+state */
>>> 	for (mds = 0; mds < mdsc->max_sessions; mds++) {
>>> 		struct ceph_mds_session *session =
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index 176512960b14..7055b499c08b 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -393,6 +393,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
>>> 	dout("mdsc put_session %p %d -> %d\n", s,
>>> 	     atomic_read(&s->s_ref), atomic_read(&s->s_ref)-1);
>>> 	if (atomic_dec_and_test(&s->s_ref)) {
>>> +		WARN_ON_ONCE(cancel_work_sync(&s->s_delayed_caps_work));
>>> 		if (s->s_auth.authorizer)
>>> 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
>>> 		kfree(s);
>>> @@ -432,6 +433,74 @@ static int __verify_registered_session(struct ceph_mds_client *mdsc,
>>> 	return 0;
>>> }
>>> 
>>> +static void handle_osd_map(struct ceph_osd_client *osdc, void *p)
>>> +{
>>> +	struct ceph_mds_client *mdsc = (struct ceph_mds_client*)p;
>>> +	u32 cancelled_epoch = 0;
>>> +	int mds_id;
>>> +
>>> +	lockdep_assert_held(&osdc->lock);
>>> +
>>> +	if ((osdc->osdmap->flags & CEPH_OSDMAP_FULL) ||
>>> +	    ceph_osdc_have_pool_full(osdc))
>>> +		cancelled_epoch = ceph_osdc_complete_writes(osdc, -ENOSPC);
>>> +
>>> +	dout("handle_osd_map: epoch=%d\n", osdc->osdmap->epoch);
>>> +
>>> +	mutex_lock(&mdsc->mutex);
>>> +	if (cancelled_epoch)
>>> +		mdsc->cap_epoch_barrier = max(cancelled_epoch + 1,
>>> +					      mdsc->cap_epoch_barrier);
>>> +
>>> +	/* Schedule the workqueue job for any sessions */
>>> +	for (mds_id = 0; mds_id < mdsc->max_sessions; ++mds_id) {
>>> +		struct ceph_mds_session *session = mdsc->sessions[mds_id];
>>> +		bool empty;
>>> +
>>> +		if (session == NULL)
>>> +			continue;
>>> +
>>> +		/* Any delayed messages? */
>>> +		spin_lock(&session->s_cap_lock);
>>> +		empty = list_empty(&session->s_delayed_caps);
>>> +		spin_unlock(&session->s_cap_lock);
>>> +		if (empty)
>>> +			continue;
>>> +
>>> +		/* take a reference -- if we can't get one, move on */
>>> +		if (!get_session(session))
>>> +			continue;
>>> +
>>> +		/*
>>> +		 * Try to schedule work. If it's already queued, then just
>>> +		 * drop the session reference.
>>> +		 */
>>> +		if (!schedule_work(&session->s_delayed_caps_work))
>>> +			ceph_put_mds_session(session);
>>> +	}
>>> +	mutex_unlock(&mdsc->mutex);
>>> +}
>>> +
>>> +static void
>>> +run_delayed_caps(struct work_struct *work)
>>> +{
>>> +	struct ceph_mds_session *session = container_of(work,
>>> +			struct ceph_mds_session, s_delayed_caps_work);
>>> +	LIST_HEAD(delayed);
>>> +
>>> +	spin_lock(&session->s_cap_lock);
>>> +	list_splice_init(&session->s_delayed_caps, &delayed);
>>> +	spin_unlock(&session->s_cap_lock);
>>> +
>>> +	while (!list_empty(&delayed)) {
>>> +		struct ceph_msg *msg = list_first_entry(&delayed,
>>> +						struct ceph_msg, list_head);
>>> +		list_del_init(&msg->list_head);
>>> +		ceph_handle_caps(session, msg);
>>> +		ceph_msg_put(msg);
>>> +	}
>>> +}
>>> +
>>> /*
>>> * create+register a new session for given mds.
>>> * called under mdsc->mutex.
>>> @@ -469,11 +538,13 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
>>> 	atomic_set(&s->s_ref, 1);
>>> 	INIT_LIST_HEAD(&s->s_waiting);
>>> 	INIT_LIST_HEAD(&s->s_unsafe);
>>> +	INIT_LIST_HEAD(&s->s_delayed_caps);
>>> 	s->s_num_cap_releases = 0;
>>> 	s->s_cap_reconnect = 0;
>>> 	s->s_cap_iterator = NULL;
>>> 	INIT_LIST_HEAD(&s->s_cap_releases);
>>> 	INIT_LIST_HEAD(&s->s_cap_flushing);
>>> +	INIT_WORK(&s->s_delayed_caps_work, run_delayed_caps);
>>> 
>>> 	dout("register_session mds%d\n", mds);
>>> 	if (mds >= mdsc->max_sessions) {
>>> @@ -3480,6 +3551,10 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
>>> 
>>> 	ceph_caps_init(mdsc);
>>> 	ceph_adjust_min_caps(mdsc, fsc->min_caps);
>>> +	mdsc->cap_epoch_barrier = 0;
>>> +
>>> +	ceph_osdc_register_map_cb(&fsc->client->osdc,
>>> +				  handle_osd_map, (void*)mdsc);
>>> 
>>> 	init_rwsem(&mdsc->pool_perm_rwsem);
>>> 	mdsc->pool_perm_tree = RB_ROOT;
>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>> index 3c6f77b7bb02..eb8144ab4995 100644
>>> --- a/fs/ceph/mds_client.h
>>> +++ b/fs/ceph/mds_client.h
>>> @@ -159,6 +159,8 @@ struct ceph_mds_session {
>>> 	atomic_t          s_ref;
>>> 	struct list_head  s_waiting;  /* waiting requests */
>>> 	struct list_head  s_unsafe;   /* unsafe requests */
>>> +	struct list_head	s_delayed_caps;
>>> +	struct work_struct	s_delayed_caps_work;
>>> };
>>> 
>>> /*
>>> @@ -331,6 +333,7 @@ struct ceph_mds_client {
>>> 	int               num_cap_flushing; /* # caps we are flushing */
>>> 	spinlock_t        cap_dirty_lock;   /* protects above items */
>>> 	wait_queue_head_t cap_flushing_wq;
>>> +	u32               cap_epoch_barrier;
>>> 
>>> 	/*
>>> 	 * Cap reservations
>>> -- 
>>> 2.9.3
>>> 
>> 
>> 
> 
> -- 
> Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-01-22  9:40   ` Yan, Zheng
  2017-01-22 15:38     ` Jeff Layton
@ 2017-02-01 19:50     ` Jeff Layton
  2017-02-01 19:55       ` John Spray
  1 sibling, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2017-02-01 19:50 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, John Spray, Ilya Dryomov, Sage Weil

On Sun, 2017-01-22 at 17:40 +0800, Yan, Zheng wrote:
> > On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > This patch is heavily inspired by John Spray's earlier work, but
> > implemented in a different way.
> > 
> > Create and register a new map_cb for cephfs, to allow it to handle
> > changes to the osdmap.
> > 
> > In version 5 of CLIENT_CAPS messages, a barrier field is added as an
> > instruction to clients that they may not use the attached capabilities
> > until they have a particular OSD map epoch.
> > 
> > When we get a message with such a field and don't have the requisite map
> > epoch yet, we put that message on a list in the session, to be run when
> > the map does come in.
> > 
> > When we get a new map update, the map_cb routine first checks to see
> > whether there may be an OSD or pool full condition. If so, then we walk
> > the list of OSD calls and kill off any writes to full OSDs or pools with
> > -ENOSPC.  While cancelling, we store the latest OSD epoch seen in each
> > request. This will be used later in the CAPRELEASE messages.
> > 
> > Then, it walks the session list and queues the workqueue job for each.
> > When the workqueue job runs, it walks the list of delayed caps and tries
> > to rerun each one. If the epoch is still not high enough, they just get
> > put back on the delay queue for when the map does come in.
> > 
> > Suggested-by: John Spray <john.spray@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> > fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
> > fs/ceph/debugfs.c    |  3 +++
> > fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/ceph/mds_client.h |  3 +++
> > 4 files changed, 120 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index d941c48e8bff..f33d424b5e12 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > 	/* inline data size */
> > 	ceph_encode_32(&p, 0);
> > 	/* osd_epoch_barrier (version 5) */
> > -	ceph_encode_32(&p, 0);
> > +	ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
> > 	/* oldest_flush_tid (version 6) */
> > 	ceph_encode_64(&p, arg->oldest_flush_tid);
> > 
> > @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > 	void *snaptrace;
> > 	size_t snaptrace_len;
> > 	void *p, *end;
> > +	u32 epoch_barrier = 0;
> > 
> > 	dout("handle_caps from mds%d\n", mds);
> > 
> > +	WARN_ON_ONCE(!list_empty(&msg->list_head));
> > +
> > 	/* decode */
> > 	end = msg->front.iov_base + msg->front.iov_len;
> > 	tid = le64_to_cpu(msg->hdr.tid);
> > @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > 		p += inline_len;
> > 	}
> > 
> > +	if (le16_to_cpu(msg->hdr.version) >= 5) {
> > +		struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
> > +
> > +		ceph_decode_32_safe(&p, end, epoch_barrier, bad);
> > +
> > +		/* Do lockless check first to avoid mutex if we can */
> > +		if (epoch_barrier > mdsc->cap_epoch_barrier) {
> > +			mutex_lock(&mdsc->mutex);
> > +			if (epoch_barrier > mdsc->cap_epoch_barrier)
> > +				mdsc->cap_epoch_barrier = epoch_barrier;
> > +			mutex_unlock(&mdsc->mutex);
> > +		}
> > +
> > +		down_read(&osdc->lock);
> > +		if (osdc->osdmap->epoch < epoch_barrier) {
> > +			dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
> > +			ceph_msg_get(msg);
> > +			spin_lock(&session->s_cap_lock);
> > +			list_add(&msg->list_head, &session->s_delayed_caps);
> > +			spin_unlock(&session->s_cap_lock);
> > +
> > +			// Kick OSD client to get the latest map
> > +			__ceph_osdc_maybe_request_map(osdc);
> > +
> > +			up_read(&osdc->lock);
> > +			return;
> > +		}
> 
> Cap messages need to be handled in the same order as they were sent. I’m worried the delay breaks something or makes cap revoke slow. Why not use the same approach as the user space client? Pass the epoch_barrier to libceph and let libceph delay sending osd requests.
> 
> Regards
> Yan, Zheng
> 

Now that I've looked at this in more detail, I'm not sure I understand
what you're suggesting.

In this case, we have gotten a cap message from the MDS, and we can't
use any caps granted in there until the right map epoch comes in. Isn't
it wrong to do all of the stuff in handle_cap_grant (for instance)
before we've received that map epoch?

The userland client seems to just idle OSD requests in the objecter
layer until the right map comes in. Is that really sufficient here?


> > +
> > +		dout("handle_caps barrier %d already satisfied (%d)\n", epoch_barrier, osdc->osdmap->epoch);
> > +		up_read(&osdc->lock);
> > +	}
> > +
> > +	dout("handle_caps v=%d barrier=%d\n", le16_to_cpu(msg->hdr.version), epoch_barrier);
> > +
> > 	if (le16_to_cpu(msg->hdr.version) >= 8) {
> > 		u64 flush_tid;
> > 		u32 caller_uid, caller_gid;
> > -		u32 osd_epoch_barrier;
> > 		u32 pool_ns_len;
> > -		/* version >= 5 */
> > -		ceph_decode_32_safe(&p, end, osd_epoch_barrier, bad);
> > +
> > 		/* version >= 6 */
> > 		ceph_decode_64_safe(&p, end, flush_tid, bad);
> > 		/* version >= 7 */
> > diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> > index 39ff678e567f..825df757fba5 100644
> > --- a/fs/ceph/debugfs.c
> > +++ b/fs/ceph/debugfs.c
> > @@ -172,6 +172,9 @@ static int mds_sessions_show(struct seq_file *s, void *ptr)
> > 	/* The -o name mount argument */
> > 	seq_printf(s, "name \"%s\"\n", opt->name ? opt->name : "");
> > 
> > +	/* The latest OSD epoch barrier known to this client */
> > +	seq_printf(s, "osd_epoch_barrier \"%d\"\n", mdsc->cap_epoch_barrier);
> > +
> > 	/* The list of MDS session rank+state */
> > 	for (mds = 0; mds < mdsc->max_sessions; mds++) {
> > 		struct ceph_mds_session *session =
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 176512960b14..7055b499c08b 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -393,6 +393,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
> > 	dout("mdsc put_session %p %d -> %d\n", s,
> > 	     atomic_read(&s->s_ref), atomic_read(&s->s_ref)-1);
> > 	if (atomic_dec_and_test(&s->s_ref)) {
> > +		WARN_ON_ONCE(cancel_work_sync(&s->s_delayed_caps_work));
> > 		if (s->s_auth.authorizer)
> > 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
> > 		kfree(s);
> > @@ -432,6 +433,74 @@ static int __verify_registered_session(struct ceph_mds_client *mdsc,
> > 	return 0;
> > }
> > 
> > +static void handle_osd_map(struct ceph_osd_client *osdc, void *p)
> > +{
> > +	struct ceph_mds_client *mdsc = (struct ceph_mds_client*)p;
> > +	u32 cancelled_epoch = 0;
> > +	int mds_id;
> > +
> > +	lockdep_assert_held(&osdc->lock);
> > +
> > +	if ((osdc->osdmap->flags & CEPH_OSDMAP_FULL) ||
> > +	    ceph_osdc_have_pool_full(osdc))
> > +		cancelled_epoch = ceph_osdc_complete_writes(osdc, -ENOSPC);
> > +
> > +	dout("handle_osd_map: epoch=%d\n", osdc->osdmap->epoch);
> > +
> > +	mutex_lock(&mdsc->mutex);
> > +	if (cancelled_epoch)
> > +		mdsc->cap_epoch_barrier = max(cancelled_epoch + 1,
> > +					      mdsc->cap_epoch_barrier);
> > +
> > +	/* Schedule the workqueue job for any sessions */
> > +	for (mds_id = 0; mds_id < mdsc->max_sessions; ++mds_id) {
> > +		struct ceph_mds_session *session = mdsc->sessions[mds_id];
> > +		bool empty;
> > +
> > +		if (session == NULL)
> > +			continue;
> > +
> > +		/* Any delayed messages? */
> > +		spin_lock(&session->s_cap_lock);
> > +		empty = list_empty(&session->s_delayed_caps);
> > +		spin_unlock(&session->s_cap_lock);
> > +		if (empty)
> > +			continue;
> > +
> > +		/* take a reference -- if we can't get one, move on */
> > +		if (!get_session(session))
> > +			continue;
> > +
> > +		/*
> > +		 * Try to schedule work. If it's already queued, then just
> > +		 * drop the session reference.
> > +		 */
> > +		if (!schedule_work(&session->s_delayed_caps_work))
> > +			ceph_put_mds_session(session);
> > +	}
> > +	mutex_unlock(&mdsc->mutex);
> > +}
> > +
> > +static void
> > +run_delayed_caps(struct work_struct *work)
> > +{
> > +	struct ceph_mds_session *session = container_of(work,
> > +			struct ceph_mds_session, s_delayed_caps_work);
> > +	LIST_HEAD(delayed);
> > +
> > +	spin_lock(&session->s_cap_lock);
> > +	list_splice_init(&session->s_delayed_caps, &delayed);
> > +	spin_unlock(&session->s_cap_lock);
> > +
> > +	while (!list_empty(&delayed)) {
> > +		struct ceph_msg *msg = list_first_entry(&delayed,
> > +						struct ceph_msg, list_head);
> > +		list_del_init(&msg->list_head);
> > +		ceph_handle_caps(session, msg);
> > +		ceph_msg_put(msg);
> > +	}
> > +}
> > +
> > /*
> >  * create+register a new session for given mds.
> >  * called under mdsc->mutex.
> > @@ -469,11 +538,13 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
> > 	atomic_set(&s->s_ref, 1);
> > 	INIT_LIST_HEAD(&s->s_waiting);
> > 	INIT_LIST_HEAD(&s->s_unsafe);
> > +	INIT_LIST_HEAD(&s->s_delayed_caps);
> > 	s->s_num_cap_releases = 0;
> > 	s->s_cap_reconnect = 0;
> > 	s->s_cap_iterator = NULL;
> > 	INIT_LIST_HEAD(&s->s_cap_releases);
> > 	INIT_LIST_HEAD(&s->s_cap_flushing);
> > +	INIT_WORK(&s->s_delayed_caps_work, run_delayed_caps);
> > 
> > 	dout("register_session mds%d\n", mds);
> > 	if (mds >= mdsc->max_sessions) {
> > @@ -3480,6 +3551,10 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> > 
> > 	ceph_caps_init(mdsc);
> > 	ceph_adjust_min_caps(mdsc, fsc->min_caps);
> > +	mdsc->cap_epoch_barrier = 0;
> > +
> > +	ceph_osdc_register_map_cb(&fsc->client->osdc,
> > +				  handle_osd_map, (void*)mdsc);
> > 
> > 	init_rwsem(&mdsc->pool_perm_rwsem);
> > 	mdsc->pool_perm_tree = RB_ROOT;
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index 3c6f77b7bb02..eb8144ab4995 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -159,6 +159,8 @@ struct ceph_mds_session {
> > 	atomic_t          s_ref;
> > 	struct list_head  s_waiting;  /* waiting requests */
> > 	struct list_head  s_unsafe;   /* unsafe requests */
> > +	struct list_head	s_delayed_caps;
> > +	struct work_struct	s_delayed_caps_work;
> > };
> > 
> > /*
> > @@ -331,6 +333,7 @@ struct ceph_mds_client {
> > 	int               num_cap_flushing; /* # caps we are flushing */
> > 	spinlock_t        cap_dirty_lock;   /* protects above items */
> > 	wait_queue_head_t cap_flushing_wq;
> > +	u32               cap_epoch_barrier;
> > 
> > 	/*
> > 	 * Cap reservations
> > -- 
> > 2.9.3
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-02-01 19:50     ` Jeff Layton
@ 2017-02-01 19:55       ` John Spray
  2017-02-01 20:55         ` Jeff Layton
  2017-02-02 16:07         ` Jeff Layton
  0 siblings, 2 replies; 16+ messages in thread
From: John Spray @ 2017-02-01 19:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Sage Weil

On Wed, Feb 1, 2017 at 7:50 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Sun, 2017-01-22 at 17:40 +0800, Yan, Zheng wrote:
>> > On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
>> >
>> > This patch is heavily inspired by John Spray's earlier work, but
>> > implemented in a different way.
>> >
>> > Create and register a new map_cb for cephfs, to allow it to handle
>> > changes to the osdmap.
>> >
>> > In version 5 of CLIENT_CAPS messages, a barrier field is added as an
>> > instruction to clients that they may not use the attached capabilities
>> > until they have a particular OSD map epoch.
>> >
>> > When we get a message with such a field and don't have the requisite map
>> > epoch yet, we put that message on a list in the session, to be run when
>> > the map does come in.
>> >
>> > When we get a new map update, the map_cb routine first checks to see
>> > whether there may be an OSD or pool full condition. If so, then we walk
>> > the list of OSD calls and kill off any writes to full OSDs or pools with
>> > -ENOSPC.  While cancelling, we store the latest OSD epoch seen in each
>> > request. This will be used later in the CAPRELEASE messages.
>> >
>> > Then, it walks the session list and queues the workqueue job for each.
>> > When the workqueue job runs, it walks the list of delayed caps and tries
>> > to rerun each one. If the epoch is still not high enough, they just get
>> > put back on the delay queue for when the map does come in.
>> >
>> > Suggested-by: John Spray <john.spray@redhat.com>
>> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > ---
>> > fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
>> > fs/ceph/debugfs.c    |  3 +++
>> > fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > fs/ceph/mds_client.h |  3 +++
>> > 4 files changed, 120 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> > index d941c48e8bff..f33d424b5e12 100644
>> > --- a/fs/ceph/caps.c
>> > +++ b/fs/ceph/caps.c
>> > @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
>> >     /* inline data size */
>> >     ceph_encode_32(&p, 0);
>> >     /* osd_epoch_barrier (version 5) */
>> > -   ceph_encode_32(&p, 0);
>> > +   ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
>> >     /* oldest_flush_tid (version 6) */
>> >     ceph_encode_64(&p, arg->oldest_flush_tid);
>> >
>> > @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>> >     void *snaptrace;
>> >     size_t snaptrace_len;
>> >     void *p, *end;
>> > +   u32 epoch_barrier = 0;
>> >
>> >     dout("handle_caps from mds%d\n", mds);
>> >
>> > +   WARN_ON_ONCE(!list_empty(&msg->list_head));
>> > +
>> >     /* decode */
>> >     end = msg->front.iov_base + msg->front.iov_len;
>> >     tid = le64_to_cpu(msg->hdr.tid);
>> > @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>> >             p += inline_len;
>> >     }
>> >
>> > +   if (le16_to_cpu(msg->hdr.version) >= 5) {
>> > +           struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
>> > +
>> > +           ceph_decode_32_safe(&p, end, epoch_barrier, bad);
>> > +
>> > +           /* Do lockless check first to avoid mutex if we can */
>> > +           if (epoch_barrier > mdsc->cap_epoch_barrier) {
>> > +                   mutex_lock(&mdsc->mutex);
>> > +                   if (epoch_barrier > mdsc->cap_epoch_barrier)
>> > +                           mdsc->cap_epoch_barrier = epoch_barrier;
>> > +                   mutex_unlock(&mdsc->mutex);
>> > +           }
>> > +
>> > +           down_read(&osdc->lock);
>> > +           if (osdc->osdmap->epoch < epoch_barrier) {
>> > +                   dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
>> > +                   ceph_msg_get(msg);
>> > +                   spin_lock(&session->s_cap_lock);
>> > +                   list_add(&msg->list_head, &session->s_delayed_caps);
>> > +                   spin_unlock(&session->s_cap_lock);
>> > +
>> > +                   // Kick OSD client to get the latest map
>> > +                   __ceph_osdc_maybe_request_map(osdc);
>> > +
>> > +                   up_read(&osdc->lock);
>> > +                   return;
>> > +           }
>>
>> Cap messages need to be handled in the same order as they were sent. I’m worried the delay breaks something or makes cap revoke slow. Why not use the same approach as the user space client? Pass the epoch_barrier to libceph and let libceph delay sending osd requests.
>>
>> Regards
>> Yan, Zheng
>>
>
> Now that I've looked at this in more detail, I'm not sure I understand
> what you're suggesting.
>
> In this case, we have gotten a cap message from the MDS, and we can't
> use any caps granted in there until the right map epoch comes in. Isn't
> it wrong to do all of the stuff in handle_cap_grant (for instance)
> before we've received that map epoch?
>
> The userland client seems to just idle OSD requests in the objecter
> layer until the right map comes in. Is that really sufficient here?

Yes -- the key thing is that the client sees the effects of *another*
client's blacklisting (the client that might have previously held caps
on the file we're about to write/read).  So the client receiving the
barrier message is completely OK and entitled to the caps, he just
needs to make sure his OSD ops don't go out until he's seen the epoch
so that his operations are properly ordered with respect to the
blacklisting of this notional other client.

John

>
>
>> > +
>> > +           dout("handle_caps barrier %d already satisfied (%d)\n", epoch_barrier, osdc->osdmap->epoch);
>> > +           up_read(&osdc->lock);
>> > +   }
>> > +
>> > +   dout("handle_caps v=%d barrier=%d\n", le16_to_cpu(msg->hdr.version), epoch_barrier);
>> > +
>> >     if (le16_to_cpu(msg->hdr.version) >= 8) {
>> >             u64 flush_tid;
>> >             u32 caller_uid, caller_gid;
>> > -           u32 osd_epoch_barrier;
>> >             u32 pool_ns_len;
>> > -           /* version >= 5 */
>> > -           ceph_decode_32_safe(&p, end, osd_epoch_barrier, bad);
>> > +
>> >             /* version >= 6 */
>> >             ceph_decode_64_safe(&p, end, flush_tid, bad);
>> >             /* version >= 7 */
>> > diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
>> > index 39ff678e567f..825df757fba5 100644
>> > --- a/fs/ceph/debugfs.c
>> > +++ b/fs/ceph/debugfs.c
>> > @@ -172,6 +172,9 @@ static int mds_sessions_show(struct seq_file *s, void *ptr)
>> >     /* The -o name mount argument */
>> >     seq_printf(s, "name \"%s\"\n", opt->name ? opt->name : "");
>> >
>> > +   /* The latest OSD epoch barrier known to this client */
>> > +   seq_printf(s, "osd_epoch_barrier \"%d\"\n", mdsc->cap_epoch_barrier);
>> > +
>> >     /* The list of MDS session rank+state */
>> >     for (mds = 0; mds < mdsc->max_sessions; mds++) {
>> >             struct ceph_mds_session *session =
>> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> > index 176512960b14..7055b499c08b 100644
>> > --- a/fs/ceph/mds_client.c
>> > +++ b/fs/ceph/mds_client.c
>> > @@ -393,6 +393,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
>> >     dout("mdsc put_session %p %d -> %d\n", s,
>> >          atomic_read(&s->s_ref), atomic_read(&s->s_ref)-1);
>> >     if (atomic_dec_and_test(&s->s_ref)) {
>> > +           WARN_ON_ONCE(cancel_work_sync(&s->s_delayed_caps_work));
>> >             if (s->s_auth.authorizer)
>> >                     ceph_auth_destroy_authorizer(s->s_auth.authorizer);
>> >             kfree(s);
>> > @@ -432,6 +433,74 @@ static int __verify_registered_session(struct ceph_mds_client *mdsc,
>> >     return 0;
>> > }
>> >
>> > +static void handle_osd_map(struct ceph_osd_client *osdc, void *p)
>> > +{
>> > +   struct ceph_mds_client *mdsc = (struct ceph_mds_client*)p;
>> > +   u32 cancelled_epoch = 0;
>> > +   int mds_id;
>> > +
>> > +   lockdep_assert_held(&osdc->lock);
>> > +
>> > +   if ((osdc->osdmap->flags & CEPH_OSDMAP_FULL) ||
>> > +       ceph_osdc_have_pool_full(osdc))
>> > +           cancelled_epoch = ceph_osdc_complete_writes(osdc, -ENOSPC);
>> > +
>> > +   dout("handle_osd_map: epoch=%d\n", osdc->osdmap->epoch);
>> > +
>> > +   mutex_lock(&mdsc->mutex);
>> > +   if (cancelled_epoch)
>> > +           mdsc->cap_epoch_barrier = max(cancelled_epoch + 1,
>> > +                                         mdsc->cap_epoch_barrier);
>> > +
>> > +   /* Schedule the workqueue job for any sessions */
>> > +   for (mds_id = 0; mds_id < mdsc->max_sessions; ++mds_id) {
>> > +           struct ceph_mds_session *session = mdsc->sessions[mds_id];
>> > +           bool empty;
>> > +
>> > +           if (session == NULL)
>> > +                   continue;
>> > +
>> > +           /* Any delayed messages? */
>> > +           spin_lock(&session->s_cap_lock);
>> > +           empty = list_empty(&session->s_delayed_caps);
>> > +           spin_unlock(&session->s_cap_lock);
>> > +           if (empty)
>> > +                   continue;
>> > +
>> > +           /* take a reference -- if we can't get one, move on */
>> > +           if (!get_session(session))
>> > +                   continue;
>> > +
>> > +           /*
>> > +            * Try to schedule work. If it's already queued, then just
>> > +            * drop the session reference.
>> > +            */
>> > +           if (!schedule_work(&session->s_delayed_caps_work))
>> > +                   ceph_put_mds_session(session);
>> > +   }
>> > +   mutex_unlock(&mdsc->mutex);
>> > +}
>> > +
>> > +static void
>> > +run_delayed_caps(struct work_struct *work)
>> > +{
>> > +   struct ceph_mds_session *session = container_of(work,
>> > +                   struct ceph_mds_session, s_delayed_caps_work);
>> > +   LIST_HEAD(delayed);
>> > +
>> > +   spin_lock(&session->s_cap_lock);
>> > +   list_splice_init(&session->s_delayed_caps, &delayed);
>> > +   spin_unlock(&session->s_cap_lock);
>> > +
>> > +   while (!list_empty(&delayed)) {
>> > +           struct ceph_msg *msg = list_first_entry(&delayed,
>> > +                                           struct ceph_msg, list_head);
>> > +           list_del_init(&msg->list_head);
>> > +           ceph_handle_caps(session, msg);
>> > +           ceph_msg_put(msg);
>> > +   }
>> > +}
>> > +
>> > /*
>> >  * create+register a new session for given mds.
>> >  * called under mdsc->mutex.
>> > @@ -469,11 +538,13 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
>> >     atomic_set(&s->s_ref, 1);
>> >     INIT_LIST_HEAD(&s->s_waiting);
>> >     INIT_LIST_HEAD(&s->s_unsafe);
>> > +   INIT_LIST_HEAD(&s->s_delayed_caps);
>> >     s->s_num_cap_releases = 0;
>> >     s->s_cap_reconnect = 0;
>> >     s->s_cap_iterator = NULL;
>> >     INIT_LIST_HEAD(&s->s_cap_releases);
>> >     INIT_LIST_HEAD(&s->s_cap_flushing);
>> > +   INIT_WORK(&s->s_delayed_caps_work, run_delayed_caps);
>> >
>> >     dout("register_session mds%d\n", mds);
>> >     if (mds >= mdsc->max_sessions) {
>> > @@ -3480,6 +3551,10 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
>> >
>> >     ceph_caps_init(mdsc);
>> >     ceph_adjust_min_caps(mdsc, fsc->min_caps);
>> > +   mdsc->cap_epoch_barrier = 0;
>> > +
>> > +   ceph_osdc_register_map_cb(&fsc->client->osdc,
>> > +                             handle_osd_map, (void*)mdsc);
>> >
>> >     init_rwsem(&mdsc->pool_perm_rwsem);
>> >     mdsc->pool_perm_tree = RB_ROOT;
>> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>> > index 3c6f77b7bb02..eb8144ab4995 100644
>> > --- a/fs/ceph/mds_client.h
>> > +++ b/fs/ceph/mds_client.h
>> > @@ -159,6 +159,8 @@ struct ceph_mds_session {
>> >     atomic_t          s_ref;
>> >     struct list_head  s_waiting;  /* waiting requests */
>> >     struct list_head  s_unsafe;   /* unsafe requests */
>> > +   struct list_head        s_delayed_caps;
>> > +   struct work_struct      s_delayed_caps_work;
>> > };
>> >
>> > /*
>> > @@ -331,6 +333,7 @@ struct ceph_mds_client {
>> >     int               num_cap_flushing; /* # caps we are flushing */
>> >     spinlock_t        cap_dirty_lock;   /* protects above items */
>> >     wait_queue_head_t cap_flushing_wq;
>> > +   u32               cap_epoch_barrier;
>> >
>> >     /*
>> >      * Cap reservations
>> > --
>> > 2.9.3
>> >
>>
>
> --
> Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-02-01 19:55       ` John Spray
@ 2017-02-01 20:55         ` Jeff Layton
  2017-02-02 16:07         ` Jeff Layton
  1 sibling, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2017-02-01 20:55 UTC (permalink / raw)
  To: John Spray; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Sage Weil

On Wed, 2017-02-01 at 19:55 +0000, John Spray wrote:
> On Wed, Feb 1, 2017 at 7:50 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Sun, 2017-01-22 at 17:40 +0800, Yan, Zheng wrote:
> > > > On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > This patch is heavily inspired by John Spray's earlier work, but
> > > > implemented in a different way.
> > > > 
> > > > Create and register a new map_cb for cephfs, to allow it to handle
> > > > changes to the osdmap.
> > > > 
> > > > In version 5 of CLIENT_CAPS messages, a barrier field is added as an
> > > > instruction to clients that they may not use the attached capabilities
> > > > until they have a particular OSD map epoch.
> > > > 
> > > > When we get a message with such a field and don't have the requisite map
> > > > epoch yet, we put that message on a list in the session, to be run when
> > > > the map does come in.
> > > > 
> > > > When we get a new map update, the map_cb routine first checks to see
> > > > whether there may be an OSD or pool full condition. If so, then we walk
> > > > the list of OSD calls and kill off any writes to full OSDs or pools with
> > > > -ENOSPC.  While cancelling, we store the latest OSD epoch seen in each
> > > > request. This will be used later in the CAPRELEASE messages.
> > > > 
> > > > Then, it walks the session list and queues the workqueue job for each.
> > > > When the workqueue job runs, it walks the list of delayed caps and tries
> > > > to rerun each one. If the epoch is still not high enough, they just get
> > > > put back on the delay queue for when the map does come in.
> > > > 
> > > > Suggested-by: John Spray <john.spray@redhat.com>
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > > fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
> > > > fs/ceph/debugfs.c    |  3 +++
> > > > fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > fs/ceph/mds_client.h |  3 +++
> > > > 4 files changed, 120 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index d941c48e8bff..f33d424b5e12 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > >     /* inline data size */
> > > >     ceph_encode_32(&p, 0);
> > > >     /* osd_epoch_barrier (version 5) */
> > > > -   ceph_encode_32(&p, 0);
> > > > +   ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
> > > >     /* oldest_flush_tid (version 6) */
> > > >     ceph_encode_64(&p, arg->oldest_flush_tid);
> > > > 
> > > > @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > > >     void *snaptrace;
> > > >     size_t snaptrace_len;
> > > >     void *p, *end;
> > > > +   u32 epoch_barrier = 0;
> > > > 
> > > >     dout("handle_caps from mds%d\n", mds);
> > > > 
> > > > +   WARN_ON_ONCE(!list_empty(&msg->list_head));
> > > > +
> > > >     /* decode */
> > > >     end = msg->front.iov_base + msg->front.iov_len;
> > > >     tid = le64_to_cpu(msg->hdr.tid);
> > > > @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > > >             p += inline_len;
> > > >     }
> > > > 
> > > > +   if (le16_to_cpu(msg->hdr.version) >= 5) {
> > > > +           struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
> > > > +
> > > > +           ceph_decode_32_safe(&p, end, epoch_barrier, bad);
> > > > +
> > > > +           /* Do lockless check first to avoid mutex if we can */
> > > > +           if (epoch_barrier > mdsc->cap_epoch_barrier) {
> > > > +                   mutex_lock(&mdsc->mutex);
> > > > +                   if (epoch_barrier > mdsc->cap_epoch_barrier)
> > > > +                           mdsc->cap_epoch_barrier = epoch_barrier;
> > > > +                   mutex_unlock(&mdsc->mutex);
> > > > +           }
> > > > +
> > > > +           down_read(&osdc->lock);
> > > > +           if (osdc->osdmap->epoch < epoch_barrier) {
> > > > +                   dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
> > > > +                   ceph_msg_get(msg);
> > > > +                   spin_lock(&session->s_cap_lock);
> > > > +                   list_add(&msg->list_head, &session->s_delayed_caps);
> > > > +                   spin_unlock(&session->s_cap_lock);
> > > > +
> > > > +                   // Kick OSD client to get the latest map
> > > > +                   __ceph_osdc_maybe_request_map(osdc);
> > > > +
> > > > +                   up_read(&osdc->lock);
> > > > +                   return;
> > > > +           }
> > > 
> > > Cap messages need to be handled in the same order as they were sent. I'm worried that the delay breaks something or makes cap revokes slow. Why not use the same approach as the userspace client: pass the epoch_barrier to libceph and let libceph delay sending OSD requests?
> > > 
> > > Regards
> > > Yan, Zheng
> > > 
> > 
> > Now that I've looked at this in more detail, I'm not sure I understand
> > what you're suggesting.
> > 
> > In this case, we have gotten a cap message from the MDS, and we can't
> > use any caps granted in there until the right map epoch comes in. Isn't
> > it wrong to do all of the stuff in handle_cap_grant (for instance)
> > before we've received that map epoch?
> > 
> > The userland client seems to just idle OSD requests in the
> > objecter layer until the right map comes in. Is that really sufficient
> > here?
> 
> Yes -- the key thing is that the client sees the effects of *another*
> client's blacklisting (the client that might have previously held caps
> on the file we're about to write/read).  So the client receiving the
> barrier message is completely OK and entitled to the caps, he just
> needs to make sure his OSD ops don't go out until he's seen the epoch
> so that his operations are properly ordered with respect to the
> blacklisting of this notional other client.
> 

Ok, got it! I think that makes sense.

I notice that there is already some plumbing in the kernel for pausing
requests until certain flags are set in the osdmap, so we might be able
to reuse a lot of that to handle the barrier. I'll look and see what can
be done there.

Many thanks!
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-02-01 19:55       ` John Spray
  2017-02-01 20:55         ` Jeff Layton
@ 2017-02-02 16:07         ` Jeff Layton
  2017-02-02 16:35           ` John Spray
  1 sibling, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2017-02-02 16:07 UTC (permalink / raw)
  To: John Spray; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Sage Weil

On Wed, 2017-02-01 at 19:55 +0000, John Spray wrote:
> On Wed, Feb 1, 2017 at 7:50 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Sun, 2017-01-22 at 17:40 +0800, Yan, Zheng wrote:
> > > > On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > This patch is heavily inspired by John Spray's earlier work, but
> > > > implemented in a different way.
> > > > 
> > > > Create and register a new map_cb for cephfs, to allow it to handle
> > > > changes to the osdmap.
> > > > 
> > > > In version 5 of the CLIENT_CAPS message, the barrier field is added as an
> > > > instruction to clients that they may not use the attached capabilities
> > > > until they have a particular OSD map epoch.
> > > > 
> > > > When we get a message with such a field and don't have the requisite map
> > > > epoch yet, we put that message on a list in the session, to be run when
> > > > the map does come in.
> > > > 
> > > > When we get a new map update, the map_cb routine first checks to see
> > > > whether there may be an OSD or pool full condition. If so, then we walk
> > > > the list of OSD calls and kill off any writes to full OSDs or pools with
> > > > -ENOSPC.  While cancelling, we store the latest OSD epoch seen in each
> > > > request. This will be used later in the CAPRELEASE messages.
> > > > 
> > > > Then, it walks the session list and queues the workqueue job for each.
> > > > When the workqueue job runs, it walks the list of delayed caps and tries
> > > > to rerun each one. If the epoch is still not high enough, they just get
> > > > put back on the delay queue for when the map does come in.
> > > > 
> > > > Suggested-by: John Spray <john.spray@redhat.com>
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > > fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
> > > > fs/ceph/debugfs.c    |  3 +++
> > > > fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > fs/ceph/mds_client.h |  3 +++
> > > > 4 files changed, 120 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index d941c48e8bff..f33d424b5e12 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > >     /* inline data size */
> > > >     ceph_encode_32(&p, 0);
> > > >     /* osd_epoch_barrier (version 5) */
> > > > -   ceph_encode_32(&p, 0);
> > > > +   ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
> > > >     /* oldest_flush_tid (version 6) */
> > > >     ceph_encode_64(&p, arg->oldest_flush_tid);
> > > > 
> > > > @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > > >     void *snaptrace;
> > > >     size_t snaptrace_len;
> > > >     void *p, *end;
> > > > +   u32 epoch_barrier = 0;
> > > > 
> > > >     dout("handle_caps from mds%d\n", mds);
> > > > 
> > > > +   WARN_ON_ONCE(!list_empty(&msg->list_head));
> > > > +
> > > >     /* decode */
> > > >     end = msg->front.iov_base + msg->front.iov_len;
> > > >     tid = le64_to_cpu(msg->hdr.tid);
> > > > @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
> > > >             p += inline_len;
> > > >     }
> > > > 
> > > > +   if (le16_to_cpu(msg->hdr.version) >= 5) {
> > > > +           struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
> > > > +
> > > > +           ceph_decode_32_safe(&p, end, epoch_barrier, bad);
> > > > +
> > > > +           /* Do lockless check first to avoid mutex if we can */
> > > > +           if (epoch_barrier > mdsc->cap_epoch_barrier) {
> > > > +                   mutex_lock(&mdsc->mutex);
> > > > +                   if (epoch_barrier > mdsc->cap_epoch_barrier)
> > > > +                           mdsc->cap_epoch_barrier = epoch_barrier;
> > > > +                   mutex_unlock(&mdsc->mutex);
> > > > +           }
> > > > +
> > > > +           down_read(&osdc->lock);
> > > > +           if (osdc->osdmap->epoch < epoch_barrier) {
> > > > +                   dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
> > > > +                   ceph_msg_get(msg);
> > > > +                   spin_lock(&session->s_cap_lock);
> > > > +                   list_add(&msg->list_head, &session->s_delayed_caps);
> > > > +                   spin_unlock(&session->s_cap_lock);
> > > > +
> > > > +                   // Kick OSD client to get the latest map
> > > > +                   __ceph_osdc_maybe_request_map(osdc);
> > > > +
> > > > +                   up_read(&osdc->lock);
> > > > +                   return;
> > > > +           }
> > > 
> > > Cap messages need to be handled in the same order as they were sent. I'm worried that the delay breaks something or makes cap revokes slow. Why not use the same approach as the userspace client: pass the epoch_barrier to libceph and let libceph delay sending OSD requests?
> > > 
> > > Regards
> > > Yan, Zheng
> > > 
> > 
> > Now that I've looked at this in more detail, I'm not sure I understand
> > what you're suggesting.
> > 
> > In this case, we have gotten a cap message from the MDS, and we can't
> > use any caps granted in there until the right map epoch comes in. Isn't
> > it wrong to do all of the stuff in handle_cap_grant (for instance)
> > before we've received that map epoch?
> > 
> > The userland client seems to just idle OSD requests in the
> > objecter layer until the right map comes in. Is that really sufficient
> > here?
> 
> Yes -- the key thing is that the client sees the effects of *another*
> client's blacklisting (the client that might have previously held caps
> on the file we're about to write/read).  So the client receiving the
> barrier message is completely OK and entitled to the caps, he just
> needs to make sure his OSD ops don't go out until he's seen the epoch
> so that his operations are properly ordered with respect to the
> blacklisting of this notional other client.
> 
> John

One more question...how is epoch_t wraparound handled?

The number is 32 bits so it seems like it could happen with enough
clients flapping, and 0 has a bit of a special meaning here, AFAICS.
Most of the code seems to assume that it can never occur. Could it?

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths
  2017-02-02 16:07         ` Jeff Layton
@ 2017-02-02 16:35           ` John Spray
  0 siblings, 0 replies; 16+ messages in thread
From: John Spray @ 2017-02-02 16:35 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Sage Weil

On Thu, Feb 2, 2017 at 4:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Wed, 2017-02-01 at 19:55 +0000, John Spray wrote:
>> On Wed, Feb 1, 2017 at 7:50 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > On Sun, 2017-01-22 at 17:40 +0800, Yan, Zheng wrote:
>> > > > On 20 Jan 2017, at 23:17, Jeff Layton <jlayton@redhat.com> wrote:
>> > > >
>> > > > This patch is heavily inspired by John Spray's earlier work, but
>> > > > implemented in a different way.
>> > > >
>> > > > Create and register a new map_cb for cephfs, to allow it to handle
>> > > > changes to the osdmap.
>> > > >
>> > > > In version 5 of the CLIENT_CAPS message, the barrier field is added as an
>> > > > instruction to clients that they may not use the attached capabilities
>> > > > until they have a particular OSD map epoch.
>> > > >
>> > > > When we get a message with such a field and don't have the requisite map
>> > > > epoch yet, we put that message on a list in the session, to be run when
>> > > > the map does come in.
>> > > >
>> > > > When we get a new map update, the map_cb routine first checks to see
>> > > > whether there may be an OSD or pool full condition. If so, then we walk
>> > > > the list of OSD calls and kill off any writes to full OSDs or pools with
>> > > > -ENOSPC.  While cancelling, we store the latest OSD epoch seen in each
>> > > > request. This will be used later in the CAPRELEASE messages.
>> > > >
>> > > > Then, it walks the session list and queues the workqueue job for each.
>> > > > When the workqueue job runs, it walks the list of delayed caps and tries
>> > > > to rerun each one. If the epoch is still not high enough, they just get
>> > > > put back on the delay queue for when the map does come in.
>> > > >
>> > > > Suggested-by: John Spray <john.spray@redhat.com>
>> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > ---
>> > > > fs/ceph/caps.c       | 43 +++++++++++++++++++++++++++---
>> > > > fs/ceph/debugfs.c    |  3 +++
>> > > > fs/ceph/mds_client.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > > > fs/ceph/mds_client.h |  3 +++
>> > > > 4 files changed, 120 insertions(+), 4 deletions(-)
>> > > >
>> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> > > > index d941c48e8bff..f33d424b5e12 100644
>> > > > --- a/fs/ceph/caps.c
>> > > > +++ b/fs/ceph/caps.c
>> > > > @@ -1077,7 +1077,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
>> > > >     /* inline data size */
>> > > >     ceph_encode_32(&p, 0);
>> > > >     /* osd_epoch_barrier (version 5) */
>> > > > -   ceph_encode_32(&p, 0);
>> > > > +   ceph_encode_32(&p, arg->session->s_mdsc->cap_epoch_barrier);
>> > > >     /* oldest_flush_tid (version 6) */
>> > > >     ceph_encode_64(&p, arg->oldest_flush_tid);
>> > > >
>> > > > @@ -3577,9 +3577,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>> > > >     void *snaptrace;
>> > > >     size_t snaptrace_len;
>> > > >     void *p, *end;
>> > > > +   u32 epoch_barrier = 0;
>> > > >
>> > > >     dout("handle_caps from mds%d\n", mds);
>> > > >
>> > > > +   WARN_ON_ONCE(!list_empty(&msg->list_head));
>> > > > +
>> > > >     /* decode */
>> > > >     end = msg->front.iov_base + msg->front.iov_len;
>> > > >     tid = le64_to_cpu(msg->hdr.tid);
>> > > > @@ -3625,13 +3628,45 @@ void ceph_handle_caps(struct ceph_mds_session *session,
>> > > >             p += inline_len;
>> > > >     }
>> > > >
>> > > > +   if (le16_to_cpu(msg->hdr.version) >= 5) {
>> > > > +           struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
>> > > > +
>> > > > +           ceph_decode_32_safe(&p, end, epoch_barrier, bad);
>> > > > +
>> > > > +           /* Do lockless check first to avoid mutex if we can */
>> > > > +           if (epoch_barrier > mdsc->cap_epoch_barrier) {
>> > > > +                   mutex_lock(&mdsc->mutex);
>> > > > +                   if (epoch_barrier > mdsc->cap_epoch_barrier)
>> > > > +                           mdsc->cap_epoch_barrier = epoch_barrier;
>> > > > +                   mutex_unlock(&mdsc->mutex);
>> > > > +           }
>> > > > +
>> > > > +           down_read(&osdc->lock);
>> > > > +           if (osdc->osdmap->epoch < epoch_barrier) {
>> > > > +                   dout("handle_caps delaying message until OSD epoch %d\n", epoch_barrier);
>> > > > +                   ceph_msg_get(msg);
>> > > > +                   spin_lock(&session->s_cap_lock);
>> > > > +                   list_add(&msg->list_head, &session->s_delayed_caps);
>> > > > +                   spin_unlock(&session->s_cap_lock);
>> > > > +
>> > > > +                   // Kick OSD client to get the latest map
>> > > > +                   __ceph_osdc_maybe_request_map(osdc);
>> > > > +
>> > > > +                   up_read(&osdc->lock);
>> > > > +                   return;
>> > > > +           }
>> > >
>> > > Cap messages need to be handled in the same order as they were sent. I'm worried that the delay breaks something or makes cap revokes slow. Why not use the same approach as the userspace client: pass the epoch_barrier to libceph and let libceph delay sending OSD requests?
>> > >
>> > > Regards
>> > > Yan, Zheng
>> > >
>> >
>> > Now that I've looked at this in more detail, I'm not sure I understand
>> > what you're suggesting.
>> >
>> > In this case, we have gotten a cap message from the MDS, and we can't
>> > use any caps granted in there until the right map epoch comes in. Isn't
>> > it wrong to do all of the stuff in handle_cap_grant (for instance)
>> > before we've received that map epoch?
>> >
>> > The userland client seems to just idle OSD requests in the
>> > objecter layer until the right map comes in. Is that really sufficient
>> > here?
>>
>> Yes -- the key thing is that the client sees the effects of *another*
>> client's blacklisting (the client that might have previously held caps
>> on the file we're about to write/read).  So the client receiving the
>> barrier message is completely OK and entitled to the caps, he just
>> needs to make sure his OSD ops don't go out until he's seen the epoch
>> so that his operations are properly ordered with respect to the
>> blacklisting of this notional other client.
>>
>> John
>
> One more question...how is epoch_t wraparound handled?
>
> The number is 32 bits so it seems like it could happen with enough
> clients flapping, and 0 has a bit of a special meaning here, AFAICS.
> Most of the code seems to assume that it can never occur. Could it?

I had to check this, but I think we're good - the mon coalesces rapid
updates according to paxos_propose_interval, which is 1s by default,
so that bounds the rate at which we can consume epochs.

John


> --
> Jeff Layton <jlayton@redhat.com>


end of thread, other threads:[~2017-02-02 16:35 UTC | newest]

Thread overview: 16+ messages
2017-01-20 15:17 [PATCH v1 0/7] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
2017-01-20 15:17 ` [PATCH v1 1/7] libceph: add ceph_osdc_cancel_writes Jeff Layton
2017-01-20 15:17 ` [PATCH v1 2/7] libceph: rename and export have_pool_full Jeff Layton
2017-01-20 15:17 ` [PATCH v1 3/7] libceph: rename and export maybe_request_map Jeff Layton
2017-01-20 15:17 ` [PATCH v1 4/7] ceph: handle new osdmap epoch updates in CLIENT_CAPS and WRITE codepaths Jeff Layton
2017-01-22  9:40   ` Yan, Zheng
2017-01-22 15:38     ` Jeff Layton
2017-01-23  1:38       ` Yan, Zheng
2017-02-01 19:50     ` Jeff Layton
2017-02-01 19:55       ` John Spray
2017-02-01 20:55         ` Jeff Layton
2017-02-02 16:07         ` Jeff Layton
2017-02-02 16:35           ` John Spray
2017-01-20 15:17 ` [PATCH v1 5/7] ceph: update CAPRELEASE message format Jeff Layton
2017-01-20 15:17 ` [PATCH v1 6/7] ceph: clean out delayed caps when destroying session Jeff Layton
2017-01-20 15:17 ` [PATCH v1 7/7] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
