* [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs
@ 2017-02-25 17:43 Jeff Layton
  2017-02-25 17:43 ` [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Jeff Layton @ 2017-02-25 17:43 UTC (permalink / raw)
  To: zyan, sage, idryomov; +Cc: jspray, ceph-devel

v5: rebase onto ACK vs. commit changes

v4: eliminate map_cb and just call ceph_osdc_abort_on_full directly.
Revert the earlier patch flagging individual pages with an error on
writeback failure. Add a mechanism to force synchronous writes when
writes start failing, and to re-allow async writes when they succeed.

v3: track "abort_on_full" behavior with a new bool in osd request
instead of a protocol flag. Remove some extraneous arguments from
various functions. Don't export have_pool_full, call it from the
abort_on_full callback instead.

v2: teach libceph how to hold on to requests until the right map
epoch appears, instead of delaying cap handling in the cephfs layer.

Ok, with this set, I think we have proper -ENOSPC handling for all of
the different write types (direct, sync, async buffered, etc.).

This patchset is an updated version of the patch series originally
done by John Spray and posted here:

    http://www.spinics.net/lists/ceph-devel/msg21257.html

The only real change from the last posting was to rebase it on top
of Ilya's changes to remove the ack vs. commit behavior. That rebase
was fairly simple and turns out to simplify the changes somewhat.

In the last series Zheng also mentioned removing the other SetPageError
sites in fs/ceph. That may make sense, but I've left that out for now,
as I'd like to look over PG_error handling in the kernel at large.

Jeff Layton (6):
  libceph: allow requests to return immediately on full conditions if
    caller wishes
  libceph: abort already submitted but abortable requests when map or
    pool goes full
  libceph: add an epoch_barrier field to struct ceph_osd_client
  ceph: handle epoch barriers in cap messages
  Revert "ceph: SetPageError() for writeback pages if writepages fails"
  ceph: when seeing write errors on an inode, switch to sync writes

 fs/ceph/addr.c                  | 10 +++--
 fs/ceph/caps.c                  | 17 ++++++--
 fs/ceph/file.c                  | 32 ++++++++------
 fs/ceph/mds_client.c            | 20 +++++++++
 fs/ceph/mds_client.h            |  7 ++-
 fs/ceph/super.h                 | 26 +++++++++++
 include/linux/ceph/osd_client.h |  3 ++
 net/ceph/osd_client.c           | 96 +++++++++++++++++++++++++++++++++++++----
 8 files changed, 180 insertions(+), 31 deletions(-)

-- 
2.9.3


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes
  2017-02-25 17:43 [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
@ 2017-02-25 17:43 ` Jeff Layton
  2017-03-28 14:22   ` Ilya Dryomov
  2017-02-25 17:43 ` [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full Jeff Layton
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-02-25 17:43 UTC (permalink / raw)
  To: zyan, sage, idryomov; +Cc: jspray, ceph-devel

Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition, then allow it to set
a flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/addr.c                  | 1 +
 fs/ceph/file.c                  | 1 +
 include/linux/ceph/osd_client.h | 1 +
 net/ceph/osd_client.c           | 7 +++++++
 4 files changed, 10 insertions(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6ecb920602ed..4b9da6ee5c7f 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1889,6 +1889,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
 	err = ceph_osdc_start_request(&fsc->client->osdc, rd_req, false);
 
 	wr_req->r_mtime = ci->vfs_inode.i_mtime;
+	wr_req->r_abort_on_full = true;
 	err2 = ceph_osdc_start_request(&fsc->client->osdc, wr_req, false);
 
 	if (!err)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5a7134ef13d3..9dad32d0ed0b 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -712,6 +712,7 @@ static void ceph_aio_retry_work(struct work_struct *work)
 	req->r_callback = ceph_aio_complete_req;
 	req->r_inode = inode;
 	req->r_priv = aio_req;
+	req->r_abort_on_full = true;
 
 	ret = ceph_osdc_start_request(req->r_osdc, req, false);
 out:
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index f2ce9cd5ede6..55dcff2455e0 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -169,6 +169,7 @@ struct ceph_osd_request {
 	unsigned int		r_num_ops;
 
 	int               r_result;
+	bool		  r_abort_on_full; /* return ENOSPC when full */
 
 	struct ceph_osd_client *r_osdc;
 	struct kref       r_kref;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 5c0938ddddf6..8fc0ccc7126f 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -49,6 +49,7 @@ static void link_linger(struct ceph_osd *osd,
 			struct ceph_osd_linger_request *lreq);
 static void unlink_linger(struct ceph_osd *osd,
 			  struct ceph_osd_linger_request *lreq);
+static void complete_request(struct ceph_osd_request *req, int err);
 
 #if 1
 static inline bool rwsem_is_wrlocked(struct rw_semaphore *sem)
@@ -961,6 +962,7 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
 				       truncate_size, truncate_seq);
 	}
 
+	req->r_abort_on_full = true;
 	req->r_flags = flags;
 	req->r_base_oloc.pool = layout->pool_id;
 	req->r_base_oloc.pool_ns = ceph_try_get_string(layout->pool_ns);
@@ -1635,6 +1637,7 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 	enum calc_target_result ct_res;
 	bool need_send = false;
 	bool promoted = false;
+	int ret = 0;
 
 	WARN_ON(req->r_tid);
 	dout("%s req %p wrlocked %d\n", __func__, req, wrlocked);
@@ -1669,6 +1672,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 		pr_warn_ratelimited("FULL or reached pool quota\n");
 		req->r_t.paused = true;
 		maybe_request_map(osdc);
+		if (req->r_abort_on_full)
+			ret = -ENOSPC;
 	} else if (!osd_homeless(osd)) {
 		need_send = true;
 	} else {
@@ -1685,6 +1690,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 	link_request(osd, req);
 	if (need_send)
 		send_request(req);
+	else if (ret)
+		complete_request(req, ret);
 	mutex_unlock(&osd->lock);
 
 	if (ct_res == CALC_TARGET_POOL_DNE)
-- 
2.9.3



* [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-02-25 17:43 [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
  2017-02-25 17:43 ` [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
@ 2017-02-25 17:43 ` Jeff Layton
  2017-03-28 14:34   ` Ilya Dryomov
  2017-02-25 17:43 ` [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client Jeff Layton
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-02-25 17:43 UTC (permalink / raw)
  To: zyan, sage, idryomov; +Cc: jspray, ceph-devel

When a Ceph volume hits capacity, a flag is set in the OSD map to
indicate that, and a new map is sprayed around the cluster. With cephfs,
we want to abort any in-progress abortable requests with -ENOSPC, as
they would otherwise just hang.

Add a new ceph_osdc_abort_on_full helper function to handle this. It
first checks whether there is an out-of-space condition in the cluster,
then walks the tree and aborts any request that has r_abort_on_full set,
completing it with -ENOSPC. Call this new function directly whenever we
get a new OSD map.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 net/ceph/osd_client.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 8fc0ccc7126f..b286ff6dec29 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1770,6 +1770,47 @@ static void complete_request(struct ceph_osd_request *req, int err)
 	ceph_osdc_put_request(req);
 }
 
+/*
+ * Drop all pending requests that are stalled waiting on a full condition to
+ * clear, and complete them with ENOSPC as the return code.
+ */
+static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
+{
+	struct ceph_osd_request *req;
+	struct ceph_osd *osd;
+	struct rb_node *m, *n;
+	u32 latest_epoch = 0;
+	bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);
+
+	dout("enter abort_on_full\n");
+
+	if (!osdmap_full && !have_pool_full(osdc))
+		goto out;
+
+	for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
+		osd = rb_entry(n, struct ceph_osd, o_node);
+		mutex_lock(&osd->lock);
+		m = rb_first(&osd->o_requests);
+		while (m) {
+			req = rb_entry(m, struct ceph_osd_request, r_node);
+			m = rb_next(m);
+
+			if (req->r_abort_on_full &&
+			    (osdmap_full || pool_full(osdc, req->r_t.base_oloc.pool))) {
+				u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);
+
+				dout("%s: abort tid=%llu flags 0x%x\n", __func__, req->r_tid, req->r_flags);
+				complete_request(req, -ENOSPC);
+				if (cur_epoch > latest_epoch)
+					latest_epoch = cur_epoch;
+			}
+		}
+		mutex_unlock(&osd->lock);
+	}
+out:
+	dout("return abort_on_full latest_epoch=%u\n", latest_epoch);
+}
+
 static void cancel_map_check(struct ceph_osd_request *req)
 {
 	struct ceph_osd_client *osdc = req->r_osdc;
@@ -3232,6 +3273,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 
 	ceph_monc_got_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
 			  osdc->osdmap->epoch);
+	ceph_osdc_abort_on_full(osdc);
 	up_write(&osdc->lock);
 	wake_up_all(&osdc->client->auth_wq);
 	return;
-- 
2.9.3



* [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-02-25 17:43 [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
  2017-02-25 17:43 ` [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
  2017-02-25 17:43 ` [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full Jeff Layton
@ 2017-02-25 17:43 ` Jeff Layton
  2017-03-28 14:54   ` Ilya Dryomov
  2017-02-25 17:43 ` [PATCH v5 4/6] ceph: handle epoch barriers in cap messages Jeff Layton
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-02-25 17:43 UTC (permalink / raw)
  To: zyan, sage, idryomov; +Cc: jspray, ceph-devel

Cephfs can get cap update requests that contain a new epoch barrier.
When that happens, we want to pause all OSD traffic until the right map
epoch arrives.

Add an epoch_barrier field to ceph_osd_client that is protected by the
osdc->lock rwsem. When the barrier is set, and the current OSD map
epoch is below that, pause the request target when submitting the
request or when revisiting it. Add a way for upper layers (cephfs)
to update the epoch_barrier as well.

If we get a new map, compare the new epoch against the barrier before
kicking requests and request another map if the map epoch is still lower
than the one we want.

If we end up cancelling requests because of a new map showing a full OSD
or pool condition, then set the barrier higher than the highest replay
epoch of all the cancelled requests.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/ceph/osd_client.h |  2 ++
 net/ceph/osd_client.c           | 51 +++++++++++++++++++++++++++++++++--------
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 55dcff2455e0..833942226560 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -266,6 +266,7 @@ struct ceph_osd_client {
 	struct rb_root         osds;          /* osds */
 	struct list_head       osd_lru;       /* idle osds */
 	spinlock_t             osd_lru_lock;
+	u32		       epoch_barrier;
 	struct ceph_osd        homeless_osd;
 	atomic64_t             last_tid;      /* tid of last request */
 	u64                    last_linger_id;
@@ -304,6 +305,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
 				   struct ceph_msg *msg);
 extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
 				 struct ceph_msg *msg);
+void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
 
 extern void osd_req_op_init(struct ceph_osd_request *osd_req,
 			    unsigned int which, u16 opcode, u32 flags);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b286ff6dec29..2e9b6211814a 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1299,8 +1299,10 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
 		       __pool_full(pi);
 
 	WARN_ON(pi->id != t->base_oloc.pool);
-	return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
-	       (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
+	return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
+	       ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
+	       (osdc->epoch_barrier &&
+		osdc->osdmap->epoch < osdc->epoch_barrier);
 }
 
 enum calc_target_result {
@@ -1610,21 +1612,24 @@ static void send_request(struct ceph_osd_request *req)
 static void maybe_request_map(struct ceph_osd_client *osdc)
 {
 	bool continuous = false;
+	u32 epoch = osdc->osdmap->epoch;
 
 	verify_osdc_locked(osdc);
-	WARN_ON(!osdc->osdmap->epoch);
+	WARN_ON_ONCE(epoch == 0);
 
 	if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
 	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
-	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
+	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
+	    (osdc->epoch_barrier && epoch < osdc->epoch_barrier)) {
 		dout("%s osdc %p continuous\n", __func__, osdc);
 		continuous = true;
 	} else {
 		dout("%s osdc %p onetime\n", __func__, osdc);
 	}
 
+	++epoch;
 	if (ceph_monc_want_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
-			       osdc->osdmap->epoch + 1, continuous))
+			       epoch, continuous))
 		ceph_monc_renew_subs(&osdc->client->monc);
 }
 
@@ -1653,8 +1658,14 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 		goto promote;
 	}
 
-	if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
-	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
+	if (osdc->epoch_barrier &&
+	    osdc->osdmap->epoch < osdc->epoch_barrier) {
+		dout("req %p epoch %u barrier %u\n", req, osdc->osdmap->epoch,
+		     osdc->epoch_barrier);
+		req->r_t.paused = true;
+		maybe_request_map(osdc);
+	} else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
+		   ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
 		dout("req %p pausewr\n", req);
 		req->r_t.paused = true;
 		maybe_request_map(osdc);
@@ -1772,7 +1783,8 @@ static void complete_request(struct ceph_osd_request *req, int err)
 
 /*
  * Drop all pending requests that are stalled waiting on a full condition to
- * clear, and complete them with ENOSPC as the return code.
+ * clear, and complete them with ENOSPC as the return code. Set the
+ * osdc->epoch_barrier to the latest replay version epoch that was aborted.
  */
 static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
 {
@@ -1808,7 +1820,11 @@ static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
 		mutex_unlock(&osd->lock);
 	}
 out:
-	dout("return abort_on_full latest_epoch=%u\n", latest_epoch);
+	if (latest_epoch)
+		osdc->epoch_barrier = max(latest_epoch + 1,
+					  osdc->epoch_barrier);
+	dout("return abort_on_full latest_epoch=%u barrier=%u\n", latest_epoch,
+					osdc->epoch_barrier);
 }
 
 static void cancel_map_check(struct ceph_osd_request *req)
@@ -3266,7 +3282,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 	pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
 		  ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
 		  have_pool_full(osdc);
-	if (was_pauserd || was_pausewr || pauserd || pausewr)
+	if (was_pauserd || was_pausewr || pauserd || pausewr ||
+	    (osdc->epoch_barrier && osdc->osdmap->epoch < osdc->epoch_barrier))
 		maybe_request_map(osdc);
 
 	kick_requests(osdc, &need_resend, &need_resend_linger);
@@ -3284,6 +3301,20 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 	up_write(&osdc->lock);
 }
 
+void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
+{
+	down_read(&osdc->lock);
+	if (unlikely(eb > osdc->epoch_barrier)) {
+		up_read(&osdc->lock);
+		down_write(&osdc->lock);
+		osdc->epoch_barrier = max(eb, osdc->epoch_barrier);
+		up_write(&osdc->lock);
+	} else {
+		up_read(&osdc->lock);
+	}
+}
+EXPORT_SYMBOL(ceph_osdc_update_epoch_barrier);
+
 /*
  * Resubmit requests pending on the given osd.
  */
-- 
2.9.3



* [PATCH v5 4/6] ceph: handle epoch barriers in cap messages
  2017-02-25 17:43 [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
                   ` (2 preceding siblings ...)
  2017-02-25 17:43 ` [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client Jeff Layton
@ 2017-02-25 17:43 ` Jeff Layton
  2017-02-25 17:43 ` [PATCH v5 5/6] Revert "ceph: SetPageError() for writeback pages if writepages fails" Jeff Layton
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2017-02-25 17:43 UTC (permalink / raw)
  To: zyan, sage, idryomov; +Cc: jspray, ceph-devel

Have the client store and update the osdc epoch_barrier when a cap
message comes in with one.

When sending cap messages, send the epoch barrier as well. This allows
clients to inform servers that their released caps may not be used until
a particular OSD map epoch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c       | 17 +++++++++++++----
 fs/ceph/mds_client.c | 20 ++++++++++++++++++++
 fs/ceph/mds_client.h |  7 +++++--
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index cd966f276a8d..e5e50e673a80 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1015,6 +1015,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
 	void *p;
 	size_t extra_len;
 	struct timespec zerotime = {0};
+	struct ceph_osd_client *osdc = &arg->session->s_mdsc->fsc->client->osdc;
 
 	dout("send_cap_msg %s %llx %llx caps %s wanted %s dirty %s"
 	     " seq %u/%u tid %llu/%llu mseq %u follows %lld size %llu/%llu"
@@ -1077,7 +1078,9 @@ static int send_cap_msg(struct cap_msg_args *arg)
 	/* inline data size */
 	ceph_encode_32(&p, 0);
 	/* osd_epoch_barrier (version 5) */
-	ceph_encode_32(&p, 0);
+	down_read(&osdc->lock);
+	ceph_encode_32(&p, osdc->epoch_barrier);
+	up_read(&osdc->lock);
 	/* oldest_flush_tid (version 6) */
 	ceph_encode_64(&p, arg->oldest_flush_tid);
 
@@ -3633,13 +3636,19 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		p += inline_len;
 	}
 
+	if (le16_to_cpu(msg->hdr.version) >= 5) {
+		struct ceph_osd_client	*osdc = &mdsc->fsc->client->osdc;
+		u32			epoch_barrier;
+
+		ceph_decode_32_safe(&p, end, epoch_barrier, bad);
+		ceph_osdc_update_epoch_barrier(osdc, epoch_barrier);
+	}
+
 	if (le16_to_cpu(msg->hdr.version) >= 8) {
 		u64 flush_tid;
 		u32 caller_uid, caller_gid;
-		u32 osd_epoch_barrier;
 		u32 pool_ns_len;
-		/* version >= 5 */
-		ceph_decode_32_safe(&p, end, osd_epoch_barrier, bad);
+
 		/* version >= 6 */
 		ceph_decode_64_safe(&p, end, flush_tid, bad);
 		/* version >= 7 */
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index c681762d76e6..842b18b6d20e 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1551,9 +1551,15 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 	struct ceph_msg *msg = NULL;
 	struct ceph_mds_cap_release *head;
 	struct ceph_mds_cap_item *item;
+	struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
 	struct ceph_cap *cap;
 	LIST_HEAD(tmp_list);
 	int num_cap_releases;
+	__le32	barrier, *cap_barrier;
+
+	down_read(&osdc->lock);
+	barrier = cpu_to_le32(osdc->epoch_barrier);
+	up_read(&osdc->lock);
 
 	spin_lock(&session->s_cap_lock);
 again:
@@ -1571,7 +1577,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 			head = msg->front.iov_base;
 			head->num = cpu_to_le32(0);
 			msg->front.iov_len = sizeof(*head);
+
+			msg->hdr.version = cpu_to_le16(2);
+			msg->hdr.compat_version = cpu_to_le16(1);
 		}
+
 		cap = list_first_entry(&tmp_list, struct ceph_cap,
 					session_caps);
 		list_del(&cap->session_caps);
@@ -1589,6 +1599,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 		ceph_put_cap(mdsc, cap);
 
 		if (le32_to_cpu(head->num) == CEPH_CAPS_PER_RELEASE) {
+			// Append cap_barrier field
+			cap_barrier = msg->front.iov_base + msg->front.iov_len;
+			*cap_barrier = barrier;
+			msg->front.iov_len += sizeof(*cap_barrier);
+
 			msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
 			dout("send_cap_releases mds%d %p\n", session->s_mds, msg);
 			ceph_con_send(&session->s_con, msg);
@@ -1604,6 +1619,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 	spin_unlock(&session->s_cap_lock);
 
 	if (msg) {
+		// Append cap_barrier field
+		cap_barrier = msg->front.iov_base + msg->front.iov_len;
+		*cap_barrier = barrier;
+		msg->front.iov_len += sizeof(*cap_barrier);
+
 		msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
 		dout("send_cap_releases mds%d %p\n", session->s_mds, msg);
 		ceph_con_send(&session->s_con, msg);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index ac0475a2daa7..517684c7c5f0 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -104,10 +104,13 @@ struct ceph_mds_reply_info_parsed {
 
 /*
  * cap releases are batched and sent to the MDS en masse.
+ *
+ * Account for per-message overhead of mds_cap_release header
+ * and __le32 for osd epoch barrier trailing field.
  */
-#define CEPH_CAPS_PER_RELEASE ((PAGE_SIZE -			\
+#define CEPH_CAPS_PER_RELEASE ((PAGE_SIZE - sizeof(u32) -		\
 				sizeof(struct ceph_mds_cap_release)) /	\
-			       sizeof(struct ceph_mds_cap_item))
+			        sizeof(struct ceph_mds_cap_item))
 
 
 /*
-- 
2.9.3



* [PATCH v5 5/6] Revert "ceph: SetPageError() for writeback pages if writepages fails"
  2017-02-25 17:43 [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
                   ` (3 preceding siblings ...)
  2017-02-25 17:43 ` [PATCH v5 4/6] ceph: handle epoch barriers in cap messages Jeff Layton
@ 2017-02-25 17:43 ` Jeff Layton
  2017-02-25 17:43 ` [PATCH v5 6/6] ceph: when seeing write errors on an inode, switch to sync writes Jeff Layton
  2017-02-27  2:43 ` [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Yan, Zheng
  6 siblings, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2017-02-25 17:43 UTC (permalink / raw)
  To: zyan, sage, idryomov; +Cc: jspray, ceph-devel

This reverts commit b109eec6f4332bd517e2f41e207037c4b9065094.

If I'm filling up a filesystem with this sort of command:

    $ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync

...then I'll eventually get back EIO on a write. Further calls
will give us ENOSPC.

I'm not sure what prompted this change, but I don't think it's what we
want to do. If writepages failed, we will have already set the mapping
error appropriately, and that's what gets reported by fsync() or
close().

__filemap_fdatawait_range, however, does this:

	wait_on_page_writeback(page);
	if (TestClearPageError(page))
		ret = -EIO;

...and that -EIO ends up trumping the mapping's error if one exists.

When writepages fails, we only want to set the error in the mapping,
and not flag the individual pages.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/addr.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 4b9da6ee5c7f..be3aa84188a5 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -702,9 +702,6 @@ static void writepages_finish(struct ceph_osd_request *req)
 				clear_bdi_congested(&fsc->backing_dev_info,
 						    BLK_RW_ASYNC);
 
-			if (rc < 0)
-				SetPageError(page);
-
 			ceph_put_snap_context(page_snap_context(page));
 			page->private = 0;
 			ClearPagePrivate(page);
-- 
2.9.3



* [PATCH v5 6/6] ceph: when seeing write errors on an inode, switch to sync writes
  2017-02-25 17:43 [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
                   ` (4 preceding siblings ...)
  2017-02-25 17:43 ` [PATCH v5 5/6] Revert "ceph: SetPageError() for writeback pages if writepages fails" Jeff Layton
@ 2017-02-25 17:43 ` Jeff Layton
  2017-02-27  2:43 ` [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Yan, Zheng
  6 siblings, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2017-02-25 17:43 UTC (permalink / raw)
  To: zyan, sage, idryomov; +Cc: jspray, ceph-devel

Currently, we don't have a real feedback mechanism in place for when we
start seeing buffered writeback errors. If writeback is failing, there
is nothing that prevents an application from continuing to dirty pages
that aren't being cleaned.

In the event that we're seeing write errors of any sort occur on an
inode, have the callback set a flag to force further writes to be
synchronous. When the next write succeeds, clear the flag to allow
buffered writeback to continue.

Since this is just a hint to the write submission mechanism, we only
take the i_ceph_lock when a lockless check shows that the flag needs to
be changed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/addr.c  |  6 +++++-
 fs/ceph/file.c  | 31 ++++++++++++++++++-------------
 fs/ceph/super.h | 26 ++++++++++++++++++++++++++
 3 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index be3aa84188a5..ef26541247c2 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -669,8 +669,12 @@ static void writepages_finish(struct ceph_osd_request *req)
 	bool remove_page;
 
 	dout("writepages_finish %p rc %d\n", inode, rc);
-	if (rc < 0)
+	if (rc < 0) {
 		mapping_set_error(mapping, rc);
+		ceph_set_error_write(ci);
+	} else {
+		ceph_clear_error_write(ci);
+	}
 
 	/*
 	 * We lost the cache cap, need to truncate the page before
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 9dad32d0ed0b..579a16ca0b8e 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1020,19 +1020,22 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 
 out:
 		ceph_osdc_put_request(req);
-		if (ret == 0) {
-			pos += len;
-			written += len;
-
-			if (pos > i_size_read(inode)) {
-				check_caps = ceph_inode_set_size(inode, pos);
-				if (check_caps)
-					ceph_check_caps(ceph_inode(inode),
-							CHECK_CAPS_AUTHONLY,
-							NULL);
-			}
-		} else
+		if (ret != 0) {
+			ceph_set_error_write(ci);
 			break;
+		}
+
+		ceph_clear_error_write(ci);
+		pos += len;
+		written += len;
+		if (pos > i_size_read(inode)) {
+			check_caps = ceph_inode_set_size(inode, pos);
+			if (check_caps)
+				ceph_check_caps(ceph_inode(inode),
+						CHECK_CAPS_AUTHONLY,
+						NULL);
+		}
+
 	}
 
 	if (ret != -EOLDSNAPC && written > 0) {
@@ -1238,6 +1241,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 retry_snap:
+	/* FIXME: not complete since it doesn't account for being at quota */
 	if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL)) {
 		err = -ENOSPC;
 		goto out;
@@ -1259,7 +1263,8 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	     inode, ceph_vinop(inode), pos, count, ceph_cap_string(got));
 
 	if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
-	    (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC)) {
+	    (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
+	    (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
 		struct ceph_snap_context *snapc;
 		struct iov_iter data;
 		inode_unlock(inode);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index e9410bcf4113..b766823d2bb3 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -473,6 +473,32 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 #define CEPH_I_CAP_DROPPED	(1 << 8)  /* caps were forcibly dropped */
 #define CEPH_I_KICK_FLUSH	(1 << 9)  /* kick flushing caps */
 #define CEPH_I_FLUSH_SNAPS	(1 << 10) /* need flush snapss */
+#define CEPH_I_ERROR_WRITE	(1 << 11) /* have seen write errors */
+
+/*
+ * We set the ERROR_WRITE bit when we start seeing write errors on an inode
+ * and then clear it when they start succeeding. Note that we do a lockless
+ * check first, and only take the lock if it looks like it needs to be changed.
+ * The write submission code just takes this as a hint, so we're not too
+ * worried if a few slip through in either direction.
+ */
+static inline void ceph_set_error_write(struct ceph_inode_info *ci)
+{
+	if (!(ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+		spin_lock(&ci->i_ceph_lock);
+		ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
+		spin_unlock(&ci->i_ceph_lock);
+	}
+}
+
+static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
+{
+	if (ci->i_ceph_flags & CEPH_I_ERROR_WRITE) {
+		spin_lock(&ci->i_ceph_lock);
+		ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
+		spin_unlock(&ci->i_ceph_lock);
+	}
+}
 
 static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
 					   long long release_count,
-- 
2.9.3



* Re: [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs
  2017-02-25 17:43 [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
                   ` (5 preceding siblings ...)
  2017-02-25 17:43 ` [PATCH v5 6/6] ceph: when seeing write errors on an inode, switch to sync writes Jeff Layton
@ 2017-02-27  2:43 ` Yan, Zheng
  6 siblings, 0 replies; 23+ messages in thread
From: Yan, Zheng @ 2017-02-27  2:43 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Sage Weil, Ilya Dryomov, John Spray, ceph-devel


> On 26 Feb 2017, at 01:43, Jeff Layton <jlayton@redhat.com> wrote:
> 
> v5: rebase onto ACK vs. commit changes
> 
> v4: eliminate map_cb and just call ceph_osdc_abort_on_full directly
> Revert earlier patch flagging individual pages with error on writeback
> failure. Add mechanism to force synchronous writes when writes start
> failing, and reallowing async writes when they succeed.
> 
> v3: track "abort_on_full" behavior with a new bool in osd request
> instead of a protocol flag. Remove some extraneous arguments from
> various functions. Don't export have_pool_full, call it from the
> abort_on_full callback instead.
> 
> v2: teach libcephfs how to hold on to requests until the right map
> epoch appears, instead of delaying cap handling in the cephfs layer.
> 
> Ok, with this set, I think we have proper -ENOSPC handling for all of
> the different write types (direct, sync, async buffered, etc.).
> 
> This patchset is an updated version of the patch series originally
> done by John Spray and posted here:
> 
>    http://www.spinics.net/lists/ceph-devel/msg21257.html
> 
> The only real change from the last posting was to rebase it on top
> of Ilya's changes to remove the ack vs. commit behavior. That rebase
> was fairly simple and turns out to simplify the changes somewhat.
> 
> In the last series Zheng also mentioned removing the other SetPageError
> sites in fs/ceph. That may make sense, but I've left that out for now,
> as I'd like to look over PG_error handling in the kernel at large.
> 
> Jeff Layton (6):
>  libceph: allow requests to return immediately on full conditions if
>    caller wishes
>  libceph: abort already submitted but abortable requests when map or
>    pool goes full
>  libceph: add an epoch_barrier field to struct ceph_osd_client
>  ceph: handle epoch barriers in cap messages
>  Revert "ceph: SetPageError() for writeback pages if writepages fails"
>  ceph: when seeing write errors on an inode, switch to sync writes
> 
> fs/ceph/addr.c                  | 10 +++--
> fs/ceph/caps.c                  | 17 ++++++--
> fs/ceph/file.c                  | 32 ++++++++------
> fs/ceph/mds_client.c            | 20 +++++++++
> fs/ceph/mds_client.h            |  7 ++-
> fs/ceph/super.h                 | 26 +++++++++++
> include/linux/ceph/osd_client.h |  3 ++
> net/ceph/osd_client.c           | 96 +++++++++++++++++++++++++++++++++++++----
> 8 files changed, 180 insertions(+), 31 deletions(-)

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>

> 
> -- 
> 2.9.3
> 



* Re: [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes
  2017-02-25 17:43 ` [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
@ 2017-03-28 14:22   ` Ilya Dryomov
  2017-03-28 20:12     ` Jeff Layton
  0 siblings, 1 reply; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-28 14:22 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> Usually, when the osd map is flagged as full or the pool is at quota,
> write requests just hang. This is not what we want for cephfs, where
> it would be better to simply report -ENOSPC back to userland instead
> of stalling.
>
> If the caller knows that it will want an immediate error return instead
> of blocking on a full or at-quota error condition then allow it to set a
> flag to request that behavior.
>
> Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
> and on any other write request from ceph.ko.
>
> A later patch will deal with requests that were submitted before the new
> map showing the full condition came in.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/ceph/addr.c                  | 1 +
>  fs/ceph/file.c                  | 1 +
>  include/linux/ceph/osd_client.h | 1 +
>  net/ceph/osd_client.c           | 7 +++++++
>  4 files changed, 10 insertions(+)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 6ecb920602ed..4b9da6ee5c7f 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1889,6 +1889,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
>         err = ceph_osdc_start_request(&fsc->client->osdc, rd_req, false);
>
>         wr_req->r_mtime = ci->vfs_inode.i_mtime;
> +       wr_req->r_abort_on_full = true;
>         err2 = ceph_osdc_start_request(&fsc->client->osdc, wr_req, false);
>
>         if (!err)
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 5a7134ef13d3..9dad32d0ed0b 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -712,6 +712,7 @@ static void ceph_aio_retry_work(struct work_struct *work)
>         req->r_callback = ceph_aio_complete_req;
>         req->r_inode = inode;
>         req->r_priv = aio_req;
> +       req->r_abort_on_full = true;
>
>         ret = ceph_osdc_start_request(req->r_osdc, req, false);
>  out:
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index f2ce9cd5ede6..55dcff2455e0 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -169,6 +169,7 @@ struct ceph_osd_request {
>         unsigned int            r_num_ops;
>
>         int               r_result;
> +       bool              r_abort_on_full; /* return ENOSPC when full */

Move this under "set by submitter", after r_linger?

>
>         struct ceph_osd_client *r_osdc;
>         struct kref       r_kref;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 5c0938ddddf6..8fc0ccc7126f 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -49,6 +49,7 @@ static void link_linger(struct ceph_osd *osd,
>                         struct ceph_osd_linger_request *lreq);
>  static void unlink_linger(struct ceph_osd *osd,
>                           struct ceph_osd_linger_request *lreq);
> +static void complete_request(struct ceph_osd_request *req, int err);

Move this closer to __submit_request(), before send_map_check() forward
declaration?

>
>  #if 1
>  static inline bool rwsem_is_wrlocked(struct rw_semaphore *sem)
> @@ -961,6 +962,7 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
>                                        truncate_size, truncate_seq);
>         }
>
> +       req->r_abort_on_full = true;
>         req->r_flags = flags;
>         req->r_base_oloc.pool = layout->pool_id;
>         req->r_base_oloc.pool_ns = ceph_try_get_string(layout->pool_ns);
> @@ -1635,6 +1637,7 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>         enum calc_target_result ct_res;
>         bool need_send = false;

bool need_abort here instead of ret?  This is a filesystem specific
hack, so it's always going to be ENOSPC.

>         bool promoted = false;
> +       int ret = 0;
>
>         WARN_ON(req->r_tid);
>         dout("%s req %p wrlocked %d\n", __func__, req, wrlocked);
> @@ -1669,6 +1672,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>                 pr_warn_ratelimited("FULL or reached pool quota\n");
>                 req->r_t.paused = true;
>                 maybe_request_map(osdc);
> +               if (req->r_abort_on_full)
> +                       ret = -ENOSPC;
>         } else if (!osd_homeless(osd)) {
>                 need_send = true;
>         } else {
> @@ -1685,6 +1690,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>         link_request(osd, req);
>         if (need_send)
>                 send_request(req);
> +       else if (ret)
> +               complete_request(req, ret);

I think this is the first complete_request() on the submit thread, so
I'm a little worried about possible deadlocks and/or other weirdness.
Did you look through all start/handle_reply sites in the filesystem?

Thanks,

                Ilya

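The submit-path change reviewed above boils down to: on a full map or pool, pause the request as before, but if the submitter set `r_abort_on_full`, complete it immediately with -ENOSPC instead. A simplified userspace sketch of that decision (all names illustrative; the real code also links the request and requests a new map):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct request {
	bool abort_on_full;	/* sketch of r_abort_on_full */
	bool paused;
	bool completed;
	int result;
};

static void submit(struct request *req, bool map_or_pool_full)
{
	if (map_or_pool_full) {
		/* Historical behavior: stall until a map without the
		 * full condition arrives. */
		req->paused = true;
		if (req->abort_on_full) {
			/* cephfs-style caller asked to fail fast. */
			req->paused = false;
			req->result = -ENOSPC;
			req->completed = true;
		}
	} else {
		/* Request goes out to the OSD. */
		req->result = 0;
		req->completed = true;
	}
}
```

This keeps block-device users (which rely on the stalling behavior) unchanged while letting cephfs return -ENOSPC to userland.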

* Re: [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-02-25 17:43 ` [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full Jeff Layton
@ 2017-03-28 14:34   ` Ilya Dryomov
  2017-03-28 20:44     ` Jeff Layton
  2017-03-29 14:01     ` Jeff Layton
  0 siblings, 2 replies; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-28 14:34 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> When a Ceph volume hits capacity, a flag is set in the OSD map to
> indicate that, and a new map is sprayed around the cluster. With cephfs
> we want it to shut down any abortable requests that are in progress with
> an -ENOSPC error as they'd just hang otherwise.
>
> Add a new ceph_osdc_abort_on_full helper function to handle this. It
> will first check whether there is an out-of-space condition in the
> cluster. It will then walk the tree and abort any request that has
> r_abort_on_full set with an ENOSPC error. Call this new function
> directly whenever we get a new OSD map.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  net/ceph/osd_client.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
>
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 8fc0ccc7126f..b286ff6dec29 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1770,6 +1770,47 @@ static void complete_request(struct ceph_osd_request *req, int err)
>         ceph_osdc_put_request(req);
>  }
>
> +/*
> + * Drop all pending requests that are stalled waiting on a full condition to
> + * clear, and complete them with ENOSPC as the return code.
> + */
> +static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> +{
> +       struct ceph_osd_request *req;
> +       struct ceph_osd *osd;

These variables could have narrower scope.

> +       struct rb_node *m, *n;
> +       u32 latest_epoch = 0;
> +       bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);

This variable is redundant.

> +
> +       dout("enter abort_on_full\n");
> +
> +       if (!osdmap_full && !have_pool_full(osdc))
> +               goto out;
> +
> +       for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
> +               osd = rb_entry(n, struct ceph_osd, o_node);
> +               mutex_lock(&osd->lock);

No need to take osd->lock here as osdc->lock is held for write.

> +               m = rb_first(&osd->o_requests);
> +               while (m) {
> +                       req = rb_entry(m, struct ceph_osd_request, r_node);
> +                       m = rb_next(m);
> +
> +                       if (req->r_abort_on_full &&
> +                           (osdmap_full || pool_full(osdc, req->r_t.base_oloc.pool))) {

Over 80 lines and I think we need target_oloc instead of base_oloc
here.

> +                               u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);

Is replay_version used this way in the userspace client?  It is an ack
vs commit leftover and AFAICT it is never set (i.e. always 0'0) in newer
Objecter, so something is wrong here.

> +
> +                               dout("%s: abort tid=%llu flags 0x%x\n", __func__, req->r_tid, req->r_flags);

Over 80 lines and probably unnecessary -- complete_request() has
a similar dout.

> +                               complete_request(req, -ENOSPC);

This should be abort_request() (newly introduced in 4.11).

> +                               if (cur_epoch > latest_epoch)
> +                                       latest_epoch = cur_epoch;
> +                       }
> +               }
> +               mutex_unlock(&osd->lock);
> +       }
> +out:
> +       dout("return abort_on_full latest_epoch=%u\n", latest_epoch);
> +}
> +
>  static void cancel_map_check(struct ceph_osd_request *req)
>  {
>         struct ceph_osd_client *osdc = req->r_osdc;
> @@ -3232,6 +3273,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>
>         ceph_monc_got_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
>                           osdc->osdmap->epoch);
> +       ceph_osdc_abort_on_full(osdc);

Nit: swap ceph_osdc_abort_on_full() and ceph_monc_got_map() so that
ceph_osdc_abort_on_full() follows kick_requests().  No real reason
other than for "I processed this map" notification to go last, after
the map is fully processed.

Thanks,

                Ilya

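One subtle point in the `ceph_osdc_abort_on_full()` walk reviewed above is the `m = rb_next(m)` before the abort: completing a request unlinks it from the tree, so the iterator must grab the next node first. A minimal C sketch of that iterate-while-erasing pattern, with a singly linked list standing in for the kernel rbtree (all names illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct req {
	struct req *next;
	bool abort_on_full;	/* sketch of r_abort_on_full */
	bool aborted;
};

static void abort_on_full(struct req **head, bool full)
{
	if (!full)
		return;

	struct req **prev = head;
	struct req *r = *head;

	while (r) {
		/* Save the successor BEFORE possibly unlinking r,
		 * mirroring m = rb_next(m) in the patch. */
		struct req *next = r->next;

		if (r->abort_on_full) {
			*prev = next;	/* unlink, as complete_request() would */
			r->aborted = true;
		} else {
			prev = &r->next;
		}
		r = next;
	}
}
```

Fetching the successor after the unlink instead would dereference a node that is no longer in the structure.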

* Re: [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-02-25 17:43 ` [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client Jeff Layton
@ 2017-03-28 14:54   ` Ilya Dryomov
  2017-03-29 11:54     ` Jeff Layton
  0 siblings, 1 reply; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-28 14:54 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> Cephfs can get cap update requests that contain a new epoch barrier in
> them. When that happens we want to pause all OSD traffic until the right
> map epoch arrives.
>
> Add an epoch_barrier field to ceph_osd_client that is protected by the
> osdc->lock rwsem. When the barrier is set, and the current OSD map
> epoch is below that, pause the request target when submitting the
> request or when revisiting it. Add a way for upper layers (cephfs)
> to update the epoch_barrier as well.
>
> If we get a new map, compare the new epoch against the barrier before
> kicking requests and request another map if the map epoch is still lower
> than the one we want.
>
> If we end up cancelling requests because of a new map showing a full OSD
> or pool condition, then set the barrier higher than the highest replay
> epoch of all the cancelled requests.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  include/linux/ceph/osd_client.h |  2 ++
>  net/ceph/osd_client.c           | 51 +++++++++++++++++++++++++++++++++--------
>  2 files changed, 43 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 55dcff2455e0..833942226560 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -266,6 +266,7 @@ struct ceph_osd_client {
>         struct rb_root         osds;          /* osds */
>         struct list_head       osd_lru;       /* idle osds */
>         spinlock_t             osd_lru_lock;
> +       u32                    epoch_barrier;

It would be good to include it in debugfs -- osdmap_show().

>         struct ceph_osd        homeless_osd;
>         atomic64_t             last_tid;      /* tid of last request */
>         u64                    last_linger_id;
> @@ -304,6 +305,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
>                                    struct ceph_msg *msg);
>  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
>                                  struct ceph_msg *msg);
> +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
>
>  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
>                             unsigned int which, u16 opcode, u32 flags);
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index b286ff6dec29..2e9b6211814a 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1299,8 +1299,10 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
>                        __pool_full(pi);
>
>         WARN_ON(pi->id != t->base_oloc.pool);
> -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
> -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
> +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
> +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
> +              (osdc->epoch_barrier &&
> +               osdc->osdmap->epoch < osdc->epoch_barrier);

Why the osdc->epoch_barrier != 0 check, here and everywhere else?

>  }
>
>  enum calc_target_result {
> @@ -1610,21 +1612,24 @@ static void send_request(struct ceph_osd_request *req)
>  static void maybe_request_map(struct ceph_osd_client *osdc)
>  {
>         bool continuous = false;
> +       u32 epoch = osdc->osdmap->epoch;
>
>         verify_osdc_locked(osdc);
> -       WARN_ON(!osdc->osdmap->epoch);
> +       WARN_ON_ONCE(epoch == 0);
>
>         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
> -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
> +           (osdc->epoch_barrier && epoch < osdc->epoch_barrier)) {
>                 dout("%s osdc %p continuous\n", __func__, osdc);
>                 continuous = true;
>         } else {
>                 dout("%s osdc %p onetime\n", __func__, osdc);
>         }
>
> +       ++epoch;
>         if (ceph_monc_want_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
> -                              osdc->osdmap->epoch + 1, continuous))
> +                              epoch, continuous))
>                 ceph_monc_renew_subs(&osdc->client->monc);
>  }

Why was this change needed?  Wouldn't the existing call to unmodified
maybe_request_map() from ceph_osdc_handle_map() be sufficient?

>
> @@ -1653,8 +1658,14 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>                 goto promote;
>         }
>
> -       if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> +       if (osdc->epoch_barrier &&
> +           osdc->osdmap->epoch < osdc->epoch_barrier) {
> +               dout("req %p epoch %u barrier %u\n", req, osdc->osdmap->epoch,
> +                    osdc->epoch_barrier);
> +               req->r_t.paused = true;
> +               maybe_request_map(osdc);
> +       } else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> +                  ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
>                 dout("req %p pausewr\n", req);
>                 req->r_t.paused = true;
>                 maybe_request_map(osdc);
> @@ -1772,7 +1783,8 @@ static void complete_request(struct ceph_osd_request *req, int err)
>
>  /*
>   * Drop all pending requests that are stalled waiting on a full condition to
> - * clear, and complete them with ENOSPC as the return code.
> + * clear, and complete them with ENOSPC as the return code. Set the
> + * osdc->epoch_barrier to the latest replay version epoch that was aborted.
>   */
>  static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>  {
> @@ -1808,7 +1820,11 @@ static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>                 mutex_unlock(&osd->lock);
>         }
>  out:
> -       dout("return abort_on_full latest_epoch=%u\n", latest_epoch);
> +       if (latest_epoch)
> +               osdc->epoch_barrier = max(latest_epoch + 1,
> +                                         osdc->epoch_barrier);
> +       dout("return abort_on_full latest_epoch=%u barrier=%u\n", latest_epoch,
> +                                       osdc->epoch_barrier);
>  }
>
>  static void cancel_map_check(struct ceph_osd_request *req)
> @@ -3266,7 +3282,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>         pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
>                   ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>                   have_pool_full(osdc);
> -       if (was_pauserd || was_pausewr || pauserd || pausewr)
> +       if (was_pauserd || was_pausewr || pauserd || pausewr ||
> +           (osdc->epoch_barrier && osdc->osdmap->epoch < osdc->epoch_barrier))
>                 maybe_request_map(osdc);
>
>         kick_requests(osdc, &need_resend, &need_resend_linger);
> @@ -3284,6 +3301,20 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>         up_write(&osdc->lock);
>  }
>
> +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
> +{
> +       down_read(&osdc->lock);
> +       if (unlikely(eb > osdc->epoch_barrier)) {
> +               up_read(&osdc->lock);
> +               down_write(&osdc->lock);
> +               osdc->epoch_barrier = max(eb, osdc->epoch_barrier);

I'd replace this with an if and add a dout for when the epoch_barrier
is actually updated.  We should probably do maybe_request_map() in that
branch is well.

Thanks,

                Ilya

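The pause predicate discussed in this review extends the existing PAUSERD/PAUSEWR checks with one more condition: hold the request while the current osdmap epoch is behind the barrier pushed down from the MDS. A compact sketch of the resulting logic (illustrative names; the nonzero-barrier test that Ilya asks about keeps the default, barrier-unset case behaving exactly as before):

```c
#include <assert.h>
#include <stdbool.h>

struct osdc_state {
	unsigned int epoch;		/* current osdmap epoch */
	unsigned int epoch_barrier;	/* 0 = no barrier set */
	bool pauserd;
	bool pausewr;
};

static bool should_pause(const struct osdc_state *o,
			 bool is_read, bool is_write)
{
	return (is_read && o->pauserd) ||
	       (is_write && o->pausewr) ||
	       (o->epoch_barrier && o->epoch < o->epoch_barrier);
}
```

Once a map with epoch >= epoch_barrier arrives, the third clause goes false and the paused requests can be kicked and resent.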

* Re: [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes
  2017-03-28 14:22   ` Ilya Dryomov
@ 2017-03-28 20:12     ` Jeff Layton
  2017-03-29 14:47       ` Ilya Dryomov
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-03-28 20:12 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, 2017-03-28 at 16:22 +0200, Ilya Dryomov wrote:
> On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > Usually, when the osd map is flagged as full or the pool is at quota,
> > write requests just hang. This is not what we want for cephfs, where
> > it would be better to simply report -ENOSPC back to userland instead
> > of stalling.
> > 
> > If the caller knows that it will want an immediate error return instead
> > of blocking on a full or at-quota error condition then allow it to set a
> > flag to request that behavior.
> > 
> > Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
> > and on any other write request from ceph.ko.
> > 
> > A later patch will deal with requests that were submitted before the new
> > map showing the full condition came in.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  fs/ceph/addr.c                  | 1 +
> >  fs/ceph/file.c                  | 1 +
> >  include/linux/ceph/osd_client.h | 1 +
> >  net/ceph/osd_client.c           | 7 +++++++
> >  4 files changed, 10 insertions(+)
> > 
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 6ecb920602ed..4b9da6ee5c7f 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1889,6 +1889,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
> >         err = ceph_osdc_start_request(&fsc->client->osdc, rd_req, false);
> > 
> >         wr_req->r_mtime = ci->vfs_inode.i_mtime;
> > +       wr_req->r_abort_on_full = true;
> >         err2 = ceph_osdc_start_request(&fsc->client->osdc, wr_req, false);
> > 
> >         if (!err)
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 5a7134ef13d3..9dad32d0ed0b 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -712,6 +712,7 @@ static void ceph_aio_retry_work(struct work_struct *work)
> >         req->r_callback = ceph_aio_complete_req;
> >         req->r_inode = inode;
> >         req->r_priv = aio_req;
> > +       req->r_abort_on_full = true;
> > 
> >         ret = ceph_osdc_start_request(req->r_osdc, req, false);
> >  out:
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index f2ce9cd5ede6..55dcff2455e0 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -169,6 +169,7 @@ struct ceph_osd_request {
> >         unsigned int            r_num_ops;
> > 
> >         int               r_result;
> > +       bool              r_abort_on_full; /* return ENOSPC when full */
> 
> Move this under "set by submitter", after r_linger?
> 
> > 
> >         struct ceph_osd_client *r_osdc;
> >         struct kref       r_kref;
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index 5c0938ddddf6..8fc0ccc7126f 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -49,6 +49,7 @@ static void link_linger(struct ceph_osd *osd,
> >                         struct ceph_osd_linger_request *lreq);
> >  static void unlink_linger(struct ceph_osd *osd,
> >                           struct ceph_osd_linger_request *lreq);
> > +static void complete_request(struct ceph_osd_request *req, int err);
> 
> Move this closer to __submit_request(), before send_map_check() forward
> declaration?
> 
> > 
> >  #if 1
> >  static inline bool rwsem_is_wrlocked(struct rw_semaphore *sem)
> > @@ -961,6 +962,7 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
> >                                        truncate_size, truncate_seq);
> >         }
> > 
> > +       req->r_abort_on_full = true;
> >         req->r_flags = flags;
> >         req->r_base_oloc.pool = layout->pool_id;
> >         req->r_base_oloc.pool_ns = ceph_try_get_string(layout->pool_ns);
> > @@ -1635,6 +1637,7 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
> >         enum calc_target_result ct_res;
> >         bool need_send = false;
> 
> bool need_abort here instead of ret?  This is a filesystem specific
> hack, so it's always going to be ENOSPC.
> 

Fixing all of the above for the next iteration.

> >         bool promoted = false;
> > +       int ret = 0;
> > 
> >         WARN_ON(req->r_tid);
> >         dout("%s req %p wrlocked %d\n", __func__, req, wrlocked);
> > @@ -1669,6 +1672,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
> >                 pr_warn_ratelimited("FULL or reached pool quota\n");
> >                 req->r_t.paused = true;
> >                 maybe_request_map(osdc);
> > +               if (req->r_abort_on_full)
> > +                       ret = -ENOSPC;
> >         } else if (!osd_homeless(osd)) {
> >                 need_send = true;
> >         } else {
> > @@ -1685,6 +1690,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
> >         link_request(osd, req);
> >         if (need_send)
> >                 send_request(req);
> > +       else if (ret)
> > +               complete_request(req, ret);
> 
> I think this is the first complete_request() on the submit thread, so
> I'm a little worried about possible deadlocks and/or other weirdness.
> Did you look through all start/handle_reply sites in the filesystem?
> 

I did. That's not to say I didn't miss anything, but as far as I can
tell, we expect to call complete_request with the osd locked, and it is
at this point in the __submit_request codepath.

Thanks for the comments so far!
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-03-28 14:34   ` Ilya Dryomov
@ 2017-03-28 20:44     ` Jeff Layton
  2017-03-29 16:52       ` Ilya Dryomov
  2017-03-29 14:01     ` Jeff Layton
  1 sibling, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-03-28 20:44 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, 2017-03-28 at 16:34 +0200, Ilya Dryomov wrote:
> On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > When a Ceph volume hits capacity, a flag is set in the OSD map to
> > indicate that, and a new map is sprayed around the cluster. With cephfs
> > we want it to shut down any abortable requests that are in progress with
> > an -ENOSPC error as they'd just hang otherwise.
> > 
> > Add a new ceph_osdc_abort_on_full helper function to handle this. It
> > will first check whether there is an out-of-space condition in the
> > cluster. It will then walk the tree and abort any request that has
> > r_abort_on_full set with an ENOSPC error. Call this new function
> > directly whenever we get a new OSD map.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  net/ceph/osd_client.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> > 
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index 8fc0ccc7126f..b286ff6dec29 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -1770,6 +1770,47 @@ static void complete_request(struct ceph_osd_request *req, int err)
> >         ceph_osdc_put_request(req);
> >  }
> > 
> > +/*
> > + * Drop all pending requests that are stalled waiting on a full condition to
> > + * clear, and complete them with ENOSPC as the return code.
> > + */
> > +static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> > +{
> > +       struct ceph_osd_request *req;
> > +       struct ceph_osd *osd;
> 
> These variables could have narrower scope.
> 
> > +       struct rb_node *m, *n;
> > +       u32 latest_epoch = 0;
> > +       bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);
> 
> This variable is redundant.
> 
> > +
> > +       dout("enter abort_on_full\n");
> > +
> > +       if (!osdmap_full && !have_pool_full(osdc))
> > +               goto out;
> > +
> > +       for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
> > +               osd = rb_entry(n, struct ceph_osd, o_node);
> > +               mutex_lock(&osd->lock);
> 
> No need to take osd->lock here as osdc->lock is held for write.
> 

Could you explain how this locking works?

It appeared to me that o_requests was protected by osd->lock based on
when link_request is called in __submit_request. If the osdc->lock is
sufficient, then why take the osd->lock before grabbing the tid and
linking it into the tree?

> > +               m = rb_first(&osd->o_requests);
> > +               while (m) {
> > +                       req = rb_entry(m, struct ceph_osd_request, r_node);
> > +                       m = rb_next(m);
> > +
> > +                       if (req->r_abort_on_full &&
> > +                           (osdmap_full || pool_full(osdc, req->r_t.base_oloc.pool))) {
> 
> Over 80 lines and I think we need target_oloc instead of base_oloc
> here.
> 
> > +                               u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);
> 
> Is replay_version used this way in the userspace client?  It is an ack
> vs commit leftover and AFAICT it is never set (i.e. always 0'0) in newer
> Objecter, so something is wrong here.
> 
> > +
> > +                               dout("%s: abort tid=%llu flags 0x%x\n", __func__, req->r_tid, req->r_flags);
> 
> Over 80 lines and probably unnecessary -- complete_request() has
> a similar dout.
> 
> > +                               complete_request(req, -ENOSPC);
> 
> This should be abort_request() (newly introduced in 4.11).
> 

Fixed most of the above, but could you explain this bit?

abort_request() just calls cancel_map_check() and then does a
complete_request(). Why do we want to cancel the map check here?

> > +                               if (cur_epoch > latest_epoch)
> > +                                       latest_epoch = cur_epoch;
> > +                       }
> > +               }
> > +               mutex_unlock(&osd->lock);
> > +       }
> > +out:
> > +       dout("return abort_on_full latest_epoch=%u\n", latest_epoch);
> > +}
> > +
> >  static void cancel_map_check(struct ceph_osd_request *req)
> >  {
> >         struct ceph_osd_client *osdc = req->r_osdc;
> > @@ -3232,6 +3273,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
> > 
> >         ceph_monc_got_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
> >                           osdc->osdmap->epoch);
> > +       ceph_osdc_abort_on_full(osdc);
> 
> Nit: swap ceph_osdc_abort_on_full() and ceph_monc_got_map() so that
> ceph_osdc_abort_on_full() follows kick_requests().  No real reason
> other than for "I processed this map" notification to go last, after
> the map is fully processed.
> 

Sure, np.

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-03-28 14:54   ` Ilya Dryomov
@ 2017-03-29 11:54     ` Jeff Layton
  2017-03-29 16:52       ` Ilya Dryomov
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-03-29 11:54 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, 2017-03-28 at 16:54 +0200, Ilya Dryomov wrote:
> On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > Cephfs can get cap update requests that contain a new epoch barrier in
> > them. When that happens we want to pause all OSD traffic until the right
> > map epoch arrives.
> > 
> > Add an epoch_barrier field to ceph_osd_client that is protected by the
> > osdc->lock rwsem. When the barrier is set, and the current OSD map
> > epoch is below that, pause the request target when submitting the
> > request or when revisiting it. Add a way for upper layers (cephfs)
> > to update the epoch_barrier as well.
> > 
> > If we get a new map, compare the new epoch against the barrier before
> > kicking requests and request another map if the map epoch is still lower
> > than the one we want.
> > 
> > If we end up cancelling requests because of a new map showing a full OSD
> > or pool condition, then set the barrier higher than the highest replay
> > epoch of all the cancelled requests.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  include/linux/ceph/osd_client.h |  2 ++
> >  net/ceph/osd_client.c           | 51 +++++++++++++++++++++++++++++++++--------
> >  2 files changed, 43 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index 55dcff2455e0..833942226560 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -266,6 +266,7 @@ struct ceph_osd_client {
> >         struct rb_root         osds;          /* osds */
> >         struct list_head       osd_lru;       /* idle osds */
> >         spinlock_t             osd_lru_lock;
> > +       u32                    epoch_barrier;
> 
> It would be good to include it in debugfs -- osdmap_show().
> 

Ok, I'll plan to add it in there.

> >         struct ceph_osd        homeless_osd;
> >         atomic64_t             last_tid;      /* tid of last request */
> >         u64                    last_linger_id;
> > @@ -304,6 +305,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
> >                                    struct ceph_msg *msg);
> >  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
> >                                  struct ceph_msg *msg);
> > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
> > 
> >  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
> >                             unsigned int which, u16 opcode, u32 flags);
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index b286ff6dec29..2e9b6211814a 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -1299,8 +1299,10 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
> >                        __pool_full(pi);
> > 
> >         WARN_ON(pi->id != t->base_oloc.pool);
> > -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
> > -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
> > +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
> > +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
> > +              (osdc->epoch_barrier &&
> > +               osdc->osdmap->epoch < osdc->epoch_barrier);
> 
> Why the osdc->epoch_barrier != 0 check, here and everywhere else?
> 

My understanding is that we have to deal with clusters that predate the
addition of this value. The client sets this to 0 when it's not set.
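To make the zero-means-unset semantics concrete, here is a minimal
userspace sketch of the pause predicate (the names are illustrative
stand-ins, not the kernel structures):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of the check above: a barrier of 0 means the
 * barrier was never set (e.g. a cluster that predates the field),
 * so it must never cause a pause on its own.
 */
bool epoch_barrier_blocks(uint32_t map_epoch, uint32_t epoch_barrier)
{
	return epoch_barrier && map_epoch < epoch_barrier;
}
```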

> >  }
> > 
> >  enum calc_target_result {
> > @@ -1610,21 +1612,24 @@ static void send_request(struct ceph_osd_request *req)
> >  static void maybe_request_map(struct ceph_osd_client *osdc)
> >  {
> >         bool continuous = false;
> > +       u32 epoch = osdc->osdmap->epoch;
> > 
> >         verify_osdc_locked(osdc);
> > -       WARN_ON(!osdc->osdmap->epoch);
> > +       WARN_ON_ONCE(epoch == 0);
> > 
> >         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
> >             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
> > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> > +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
> > +           (osdc->epoch_barrier && epoch < osdc->epoch_barrier)) {
> >                 dout("%s osdc %p continuous\n", __func__, osdc);
> >                 continuous = true;
> >         } else {
> >                 dout("%s osdc %p onetime\n", __func__, osdc);
> >         }
> > 
> > +       ++epoch;
> >         if (ceph_monc_want_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
> > -                              osdc->osdmap->epoch + 1, continuous))
> > +                              epoch, continuous))
> >                 ceph_monc_renew_subs(&osdc->client->monc);
> >  }
> 
> Why was this change needed?  Wouldn't the existing call to unmodified
> maybe_request_map() from ceph_osdc_handle_map() be sufficient?
> 

It's not strictly required, but it's more efficient to fetch the epoch
at the beginning of the function like this and then work with the value
on the stack than to potentially have to refetch it. It also looks
cleaner.

> > 
> > @@ -1653,8 +1658,14 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
> >                 goto promote;
> >         }
> > 
> > -       if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> > +       if (osdc->epoch_barrier &&
> > +           osdc->osdmap->epoch < osdc->epoch_barrier) {
> > +               dout("req %p epoch %u barrier %u\n", req, osdc->osdmap->epoch,
> > +                    osdc->epoch_barrier);
> > +               req->r_t.paused = true;
> > +               maybe_request_map(osdc);
> > +       } else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> > +                  ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> >                 dout("req %p pausewr\n", req);
> >                 req->r_t.paused = true;
> >                 maybe_request_map(osdc);
> > @@ -1772,7 +1783,8 @@ static void complete_request(struct ceph_osd_request *req, int err)
> > 
> >  /*
> >   * Drop all pending requests that are stalled waiting on a full condition to
> > - * clear, and complete them with ENOSPC as the return code.
> > + * clear, and complete them with ENOSPC as the return code. Set the
> > + * osdc->epoch_barrier to the latest replay version epoch that was aborted.
> >   */
> >  static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> >  {
> > @@ -1808,7 +1820,11 @@ static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> >                 mutex_unlock(&osd->lock);
> >         }
> >  out:
> > -       dout("return abort_on_full latest_epoch=%u\n", latest_epoch);
> > +       if (latest_epoch)
> > +               osdc->epoch_barrier = max(latest_epoch + 1,
> > +                                         osdc->epoch_barrier);
> > +       dout("return abort_on_full latest_epoch=%u barrier=%u\n", latest_epoch,
> > +                                       osdc->epoch_barrier);
> >  }
> > 
> >  static void cancel_map_check(struct ceph_osd_request *req)
> > @@ -3266,7 +3282,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
> >         pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
> >                   ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
> >                   have_pool_full(osdc);
> > -       if (was_pauserd || was_pausewr || pauserd || pausewr)
> > +       if (was_pauserd || was_pausewr || pauserd || pausewr ||
> > +           (osdc->epoch_barrier && osdc->osdmap->epoch < osdc->epoch_barrier))
> >                 maybe_request_map(osdc);
> > 
> >         kick_requests(osdc, &need_resend, &need_resend_linger);
> > @@ -3284,6 +3301,20 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
> >         up_write(&osdc->lock);
> >  }
> > 
> > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
> > +{
> > +       down_read(&osdc->lock);
> > +       if (unlikely(eb > osdc->epoch_barrier)) {
> > +               up_read(&osdc->lock);
> > +               down_write(&osdc->lock);
> > +               osdc->epoch_barrier = max(eb, osdc->epoch_barrier);
> 
> I'd replace this with an if and add a dout for when the epoch_barrier
> is actually updated.  We should probably do maybe_request_map() in that
> branch is well.
> 

Ok, thanks. I'll clean that up.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-03-28 14:34   ` Ilya Dryomov
  2017-03-28 20:44     ` Jeff Layton
@ 2017-03-29 14:01     ` Jeff Layton
  2017-03-29 14:14       ` Ilya Dryomov
  1 sibling, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-03-29 14:01 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, 2017-03-28 at 16:34 +0200, Ilya Dryomov wrote:
> On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > When a Ceph volume hits capacity, a flag is set in the OSD map to
> > indicate that, and a new map is sprayed around the cluster. With cephfs
> > we want it to shut down any abortable requests that are in progress with
> > an -ENOSPC error as they'd just hang otherwise.
> > 
> > Add a new ceph_osdc_abort_on_full helper function to handle this. It
> > will first check whether there is an out-of-space condition in the
> > cluster. It will then walk the tree and abort any request that has
> > r_abort_on_full set with an ENOSPC error. Call this new function
> > directly whenever we get a new OSD map.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  net/ceph/osd_client.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> > 
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index 8fc0ccc7126f..b286ff6dec29 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -1770,6 +1770,47 @@ static void complete_request(struct ceph_osd_request *req, int err)
> >         ceph_osdc_put_request(req);
> >  }
> > 
> > +/*
> > + * Drop all pending requests that are stalled waiting on a full condition to
> > + * clear, and complete them with ENOSPC as the return code.
> > + */
> > +static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> > +{
> > +       struct ceph_osd_request *req;
> > +       struct ceph_osd *osd;
> 
> These variables could have narrower scope.
> 
> > +       struct rb_node *m, *n;
> > +       u32 latest_epoch = 0;
> > +       bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);
> 
> This variable is redundant.
> 
> > +
> > +       dout("enter abort_on_full\n");
> > +
> > +       if (!osdmap_full && !have_pool_full(osdc))
> > +               goto out;
> > +
> > +       for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
> > +               osd = rb_entry(n, struct ceph_osd, o_node);
> > +               mutex_lock(&osd->lock);
> 
> No need to take osd->lock here as osdc->lock is held for write.
> 
> > +               m = rb_first(&osd->o_requests);
> > +               while (m) {
> > +                       req = rb_entry(m, struct ceph_osd_request, r_node);
> > +                       m = rb_next(m);
> > +
> > +                       if (req->r_abort_on_full &&
> > +                           (osdmap_full || pool_full(osdc, req->r_t.base_oloc.pool))) {
> 
> Over 80 lines and I think we need target_oloc instead of base_oloc
> here.
> 
> > +                               u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);
> 
> Is replay_version used this way in the userspace client?  It is an ack
> vs commit leftover and AFAICT it is never set (i.e. always 0'0) in newer
> Objecter, so something is wrong here.
> 

Yeah I see what you mean now. This does look to be entirely vestigial
in the current code.

Basically, what we want is to scan the list of outstanding requests for
this OSD, and abort any that predate the current map epoch. I suppose
this means that we need to record the current map epoch in the request
somewhere. ISTR looking and seeing that it was recorded here, but I
don't think that's the case anymore.

Since that field is never set to anything now, what I'll probably do is
just remove it, and replace it with an epoch field that we record at
the time that the request is marshaled.
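As a sketch of that idea (the field and helper names here are
hypothetical, not necessarily what the next iteration will use): record
the map epoch when the request is marshaled, then take the max over
whatever gets aborted:

```c
#include <stdint.h>

struct fake_req {
	uint32_t r_map_epoch;	/* hypothetical: epoch recorded at marshal time */
	int r_abort_on_full;
};

/* Return the newest recorded epoch among requests that would be aborted. */
uint32_t latest_aborted_epoch(const struct fake_req *reqs, int n)
{
	uint32_t latest = 0;

	for (int i = 0; i < n; i++)
		if (reqs[i].r_abort_on_full && reqs[i].r_map_epoch > latest)
			latest = reqs[i].r_map_epoch;
	return latest;
}
```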

> > +
> > +                               dout("%s: abort tid=%llu flags 0x%x\n", __func__, req->r_tid, req->r_flags);
> 
> Over 80 lines and probably unnecessary -- complete_request() has
> a similar dout.
> 
> > +                               complete_request(req, -ENOSPC);
> 
> This should be abort_request() (newly introduced in 4.11).
> 
> > +                               if (cur_epoch > latest_epoch)
> > +                                       latest_epoch = cur_epoch;
> > +                       }
> > +               }
> > +               mutex_unlock(&osd->lock);
> > +       }
> > +out:
> > +       dout("return abort_on_full latest_epoch=%u\n", latest_epoch);
> > +}
> > +
> >  static void cancel_map_check(struct ceph_osd_request *req)
> >  {
> >         struct ceph_osd_client *osdc = req->r_osdc;
> > @@ -3232,6 +3273,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
> > 
> >         ceph_monc_got_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
> >                           osdc->osdmap->epoch);
> > +       ceph_osdc_abort_on_full(osdc);
> 
> Nit: swap ceph_osdc_abort_on_full() and ceph_monc_got_map() so that
> ceph_osdc_abort_on_full() follows kick_requests().  No real reason
> other than for "I processed this map" notification to go last, after
> the map is fully processed.
> 
> Thanks,
> 
>                 Ilya

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-03-29 14:01     ` Jeff Layton
@ 2017-03-29 14:14       ` Ilya Dryomov
  2017-03-29 16:39         ` Jeff Layton
  0 siblings, 1 reply; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-29 14:14 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Wed, Mar 29, 2017 at 4:01 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 2017-03-28 at 16:34 +0200, Ilya Dryomov wrote:
>> On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > When a Ceph volume hits capacity, a flag is set in the OSD map to
>> > indicate that, and a new map is sprayed around the cluster. With cephfs
>> > we want it to shut down any abortable requests that are in progress with
>> > an -ENOSPC error as they'd just hang otherwise.
>> >
>> > Add a new ceph_osdc_abort_on_full helper function to handle this. It
>> > will first check whether there is an out-of-space condition in the
>> > cluster. It will then walk the tree and abort any request that has
>> > r_abort_on_full set with an ENOSPC error. Call this new function
>> > directly whenever we get a new OSD map.
>> >
>> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > ---
>> >  net/ceph/osd_client.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 42 insertions(+)
>> >
>> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > index 8fc0ccc7126f..b286ff6dec29 100644
>> > --- a/net/ceph/osd_client.c
>> > +++ b/net/ceph/osd_client.c
>> > @@ -1770,6 +1770,47 @@ static void complete_request(struct ceph_osd_request *req, int err)
>> >         ceph_osdc_put_request(req);
>> >  }
>> >
>> > +/*
>> > + * Drop all pending requests that are stalled waiting on a full condition to
>> > + * clear, and complete them with ENOSPC as the return code.
>> > + */
>> > +static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>> > +{
>> > +       struct ceph_osd_request *req;
>> > +       struct ceph_osd *osd;
>>
>> These variables could have narrower scope.
>>
>> > +       struct rb_node *m, *n;
>> > +       u32 latest_epoch = 0;
>> > +       bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);
>>
>> This variable is redundant.
>>
>> > +
>> > +       dout("enter abort_on_full\n");
>> > +
>> > +       if (!osdmap_full && !have_pool_full(osdc))
>> > +               goto out;
>> > +
>> > +       for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
>> > +               osd = rb_entry(n, struct ceph_osd, o_node);
>> > +               mutex_lock(&osd->lock);
>>
>> No need to take osd->lock here as osdc->lock is held for write.
>>
>> > +               m = rb_first(&osd->o_requests);
>> > +               while (m) {
>> > +                       req = rb_entry(m, struct ceph_osd_request, r_node);
>> > +                       m = rb_next(m);
>> > +
>> > +                       if (req->r_abort_on_full &&
>> > +                           (osdmap_full || pool_full(osdc, req->r_t.base_oloc.pool))) {
>>
>> Over 80 lines and I think we need target_oloc instead of base_oloc
>> here.
>>
>> > +                               u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);
>>
>> Is replay_version used this way in the userspace client?  It is an ack
>> vs commit leftover and AFAICT it is never set (i.e. always 0'0) in newer
>> Objecter, so something is wrong here.
>>
>
> Yeah I see what you mean now. This does look to be entirely vestigial
> in the current code.
>
> Basically, what we want is to scan the list of outstanding requests for
> this OSD, and abort any that predate the current map epoch. I suppose
> this means that we need to record the current map epoch in the request
> somewhere. ISTR looking and seeing that it was recorded here, but I
> don't think that's the case anymore.

Why only those that predate the current epoch and not simply all of
those marked as r_abort_on_full?  I mean, this is triggered by a pool
or an entire cluster going full, right?

Thanks,

                Ilya


* Re: [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes
  2017-03-28 20:12     ` Jeff Layton
@ 2017-03-29 14:47       ` Ilya Dryomov
  0 siblings, 0 replies; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-29 14:47 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, Mar 28, 2017 at 10:12 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 2017-03-28 at 16:22 +0200, Ilya Dryomov wrote:
>> On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > Usually, when the osd map is flagged as full or the pool is at quota,
>> > write requests just hang. This is not what we want for cephfs, where
>> > it would be better to simply report -ENOSPC back to userland instead
>> > of stalling.
>> >
>> > If the caller knows that it will want an immediate error return instead
>> > of blocking on a full or at-quota error condition then allow it to set a
>> > flag to request that behavior.
>> >
>> > Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
>> > and on any other write request from ceph.ko.
>> >
>> > A later patch will deal with requests that were submitted before the new
>> > map showing the full condition came in.
>> >
>> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > ---
>> >  fs/ceph/addr.c                  | 1 +
>> >  fs/ceph/file.c                  | 1 +
>> >  include/linux/ceph/osd_client.h | 1 +
>> >  net/ceph/osd_client.c           | 7 +++++++
>> >  4 files changed, 10 insertions(+)
>> >
>> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> > index 6ecb920602ed..4b9da6ee5c7f 100644
>> > --- a/fs/ceph/addr.c
>> > +++ b/fs/ceph/addr.c
>> > @@ -1889,6 +1889,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
>> >         err = ceph_osdc_start_request(&fsc->client->osdc, rd_req, false);
>> >
>> >         wr_req->r_mtime = ci->vfs_inode.i_mtime;
>> > +       wr_req->r_abort_on_full = true;
>> >         err2 = ceph_osdc_start_request(&fsc->client->osdc, wr_req, false);
>> >
>> >         if (!err)
>> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> > index 5a7134ef13d3..9dad32d0ed0b 100644
>> > --- a/fs/ceph/file.c
>> > +++ b/fs/ceph/file.c
>> > @@ -712,6 +712,7 @@ static void ceph_aio_retry_work(struct work_struct *work)
>> >         req->r_callback = ceph_aio_complete_req;
>> >         req->r_inode = inode;
>> >         req->r_priv = aio_req;
>> > +       req->r_abort_on_full = true;
>> >
>> >         ret = ceph_osdc_start_request(req->r_osdc, req, false);
>> >  out:
>> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
>> > index f2ce9cd5ede6..55dcff2455e0 100644
>> > --- a/include/linux/ceph/osd_client.h
>> > +++ b/include/linux/ceph/osd_client.h
>> > @@ -169,6 +169,7 @@ struct ceph_osd_request {
>> >         unsigned int            r_num_ops;
>> >
>> >         int               r_result;
>> > +       bool              r_abort_on_full; /* return ENOSPC when full */
>>
>> Move this under "set by submitter", after r_linger?
>>
>> >
>> >         struct ceph_osd_client *r_osdc;
>> >         struct kref       r_kref;
>> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > index 5c0938ddddf6..8fc0ccc7126f 100644
>> > --- a/net/ceph/osd_client.c
>> > +++ b/net/ceph/osd_client.c
>> > @@ -49,6 +49,7 @@ static void link_linger(struct ceph_osd *osd,
>> >                         struct ceph_osd_linger_request *lreq);
>> >  static void unlink_linger(struct ceph_osd *osd,
>> >                           struct ceph_osd_linger_request *lreq);
>> > +static void complete_request(struct ceph_osd_request *req, int err);
>>
>> Move this closer to __submit_request(), before send_map_check() forward
>> declaration?
>>
>> >
>> >  #if 1
>> >  static inline bool rwsem_is_wrlocked(struct rw_semaphore *sem)
>> > @@ -961,6 +962,7 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
>> >                                        truncate_size, truncate_seq);
>> >         }
>> >
>> > +       req->r_abort_on_full = true;
>> >         req->r_flags = flags;
>> >         req->r_base_oloc.pool = layout->pool_id;
>> >         req->r_base_oloc.pool_ns = ceph_try_get_string(layout->pool_ns);
>> > @@ -1635,6 +1637,7 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>> >         enum calc_target_result ct_res;
>> >         bool need_send = false;
>>
>> bool need_abort here instead of ret?  This is a filesystem specific
>> hack, so it's always going to be ENOSPC.
>>
>
> Fixing all of the above for the next iteration.
>
>> >         bool promoted = false;
>> > +       int ret = 0;
>> >
>> >         WARN_ON(req->r_tid);
>> >         dout("%s req %p wrlocked %d\n", __func__, req, wrlocked);
>> > @@ -1669,6 +1672,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>> >                 pr_warn_ratelimited("FULL or reached pool quota\n");
>> >                 req->r_t.paused = true;
>> >                 maybe_request_map(osdc);
>> > +               if (req->r_abort_on_full)
>> > +                       ret = -ENOSPC;
>> >         } else if (!osd_homeless(osd)) {
>> >                 need_send = true;
>> >         } else {
>> > @@ -1685,6 +1690,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>> >         link_request(osd, req);
>> >         if (need_send)
>> >                 send_request(req);
>> > +       else if (ret)
>> > +               complete_request(req, ret);
>>
>> I think this is the first complete_request() on the submit thread, so
>> I'm a little worried about possible deadlocks and/or other weirdness.
>> Did you look through all start/handle_reply sites in the filesystem?
>>
>
> I did. That's not to say I didn't miss anything, but as far as I can
> tell, we expect to call complete_request with the osd locked, and it is
> at this point in the __submit_request codepath.

It's not the osd->lock, more the surrounding filesystem locks/context.
I looked briefly earlier and didn't notice anything alarming either.
If it blows up, blame it on teuthology test coverage ;)

Speaking of teuthology, we need to look into enabling
qa/tasks/cephfs/test_full.py tests for the kernel client once these
patches are finalized.

Thanks,

                Ilya


* Re: [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-03-29 14:14       ` Ilya Dryomov
@ 2017-03-29 16:39         ` Jeff Layton
  2017-03-29 16:56           ` Ilya Dryomov
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-03-29 16:39 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Wed, 2017-03-29 at 16:14 +0200, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 4:01 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Tue, 2017-03-28 at 16:34 +0200, Ilya Dryomov wrote:
> > > On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > When a Ceph volume hits capacity, a flag is set in the OSD map to
> > > > indicate that, and a new map is sprayed around the cluster. With cephfs
> > > > we want it to shut down any abortable requests that are in progress with
> > > > an -ENOSPC error as they'd just hang otherwise.
> > > > 
> > > > Add a new ceph_osdc_abort_on_full helper function to handle this. It
> > > > will first check whether there is an out-of-space condition in the
> > > > cluster. It will then walk the tree and abort any request that has
> > > > r_abort_on_full set with an ENOSPC error. Call this new function
> > > > directly whenever we get a new OSD map.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > >  net/ceph/osd_client.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 42 insertions(+)
> > > > 
> > > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > > > index 8fc0ccc7126f..b286ff6dec29 100644
> > > > --- a/net/ceph/osd_client.c
> > > > +++ b/net/ceph/osd_client.c
> > > > @@ -1770,6 +1770,47 @@ static void complete_request(struct ceph_osd_request *req, int err)
> > > >         ceph_osdc_put_request(req);
> > > >  }
> > > > 
> > > > +/*
> > > > + * Drop all pending requests that are stalled waiting on a full condition to
> > > > + * clear, and complete them with ENOSPC as the return code.
> > > > + */
> > > > +static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> > > > +{
> > > > +       struct ceph_osd_request *req;
> > > > +       struct ceph_osd *osd;
> > > 
> > > These variables could have narrower scope.
> > > 
> > > > +       struct rb_node *m, *n;
> > > > +       u32 latest_epoch = 0;
> > > > +       bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);
> > > 
> > > This variable is redundant.
> > > 
> > > > +
> > > > +       dout("enter abort_on_full\n");
> > > > +
> > > > +       if (!osdmap_full && !have_pool_full(osdc))
> > > > +               goto out;
> > > > +
> > > > +       for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
> > > > +               osd = rb_entry(n, struct ceph_osd, o_node);
> > > > +               mutex_lock(&osd->lock);
> > > 
> > > No need to take osd->lock here as osdc->lock is held for write.
> > > 
> > > > +               m = rb_first(&osd->o_requests);
> > > > +               while (m) {
> > > > +                       req = rb_entry(m, struct ceph_osd_request, r_node);
> > > > +                       m = rb_next(m);
> > > > +
> > > > +                       if (req->r_abort_on_full &&
> > > > +                           (osdmap_full || pool_full(osdc, req->r_t.base_oloc.pool))) {
> > > 
> > > Over 80 lines and I think we need target_oloc instead of base_oloc
> > > here.
> > > 
> > > > +                               u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);
> > > 
> > > Is replay_version used this way in the userspace client?  It is an ack
> > > vs commit leftover and AFAICT it is never set (i.e. always 0'0) in newer
> > > Objecter, so something is wrong here.
> > > 
> > 
> > Yeah I see what you mean now. This does look to be entirely vestigial
> > in the current code.
> > 
> > Basically, what we want is to scan the list of outstanding requests for
> > this OSD, and abort any that predate the current map epoch. I suppose
> > this means that we need to record the current map epoch in the request
> > somewhere. ISTR looking and seeing that it was recorded here, but I
> > don't think that's the case anymore.
> 
> Why only those that predate the current epoch and not simply all of
> those marked as r_abort_on_full?  I mean, this is triggered by a pool
> or an entire cluster going full, right?
> 
> 

My mistake, you're right. We cancel based on r_abort_on_full and the
pool/OSD full flags.

We do need to determine the latest cancelled map epoch though, so that
we can update the epoch_barrier properly when this occurs. What I'm
doing now is just recording the current map epoch before we link a
request into the tree and using that to determine this.
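The barrier update from the v5 hunk (set it just past the newest aborted
epoch, and never move it backwards) reduces to a small pure function; a
sketch of just the arithmetic, not kernel code:

```c
#include <stdint.h>

/*
 * latest_epoch == 0 means nothing was aborted, so leave the barrier
 * alone; otherwise bump the barrier past the newest aborted epoch,
 * but never lower it.
 */
uint32_t updated_epoch_barrier(uint32_t latest_epoch, uint32_t cur_barrier)
{
	if (!latest_epoch)
		return cur_barrier;
	return latest_epoch + 1 > cur_barrier ? latest_epoch + 1 : cur_barrier;
}
```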
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-03-28 20:44     ` Jeff Layton
@ 2017-03-29 16:52       ` Ilya Dryomov
  0 siblings, 0 replies; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-29 16:52 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, Mar 28, 2017 at 10:44 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 2017-03-28 at 16:34 +0200, Ilya Dryomov wrote:
>> On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > When a Ceph volume hits capacity, a flag is set in the OSD map to
>> > indicate that, and a new map is sprayed around the cluster. With cephfs
>> > we want it to shut down any abortable requests that are in progress with
>> > an -ENOSPC error as they'd just hang otherwise.
>> >
>> > Add a new ceph_osdc_abort_on_full helper function to handle this. It
>> > will first check whether there is an out-of-space condition in the
>> > cluster. It will then walk the tree and abort any request that has
>> > r_abort_on_full set with an ENOSPC error. Call this new function
>> > directly whenever we get a new OSD map.
>> >
>> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > ---
>> >  net/ceph/osd_client.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 42 insertions(+)
>> >
>> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > index 8fc0ccc7126f..b286ff6dec29 100644
>> > --- a/net/ceph/osd_client.c
>> > +++ b/net/ceph/osd_client.c
>> > @@ -1770,6 +1770,47 @@ static void complete_request(struct ceph_osd_request *req, int err)
>> >         ceph_osdc_put_request(req);
>> >  }
>> >
>> > +/*
>> > + * Drop all pending requests that are stalled waiting on a full condition to
>> > + * clear, and complete them with ENOSPC as the return code.
>> > + */
>> > +static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>> > +{
>> > +       struct ceph_osd_request *req;
>> > +       struct ceph_osd *osd;
>>
>> These variables could have narrower scope.
>>
>> > +       struct rb_node *m, *n;
>> > +       u32 latest_epoch = 0;
>> > +       bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);
>>
>> This variable is redundant.
>>
>> > +
>> > +       dout("enter abort_on_full\n");
>> > +
>> > +       if (!osdmap_full && !have_pool_full(osdc))
>> > +               goto out;
>> > +
>> > +       for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
>> > +               osd = rb_entry(n, struct ceph_osd, o_node);
>> > +               mutex_lock(&osd->lock);
>>
>> No need to take osd->lock here as osdc->lock is held for write.
>>
>
> Could you explain how this locking works?
>
> It appeared to me that o_requests was protected by osd->lock based on
> when link_request is called in __submit_request. If the osdc->lock is
> sufficient, then why take the osd->lock before grabbing the tid and
> linking it into the tree?

osd->lock protects the individual ceph_osd trees and is always taken
along with osdc->lock.  osdc->lock is the per-ceph_osd_client lock --
if you have it down for write, osd->lock becomes unnecessary.

__submit_request() is called with osdc->lock down for read.

>
>> > +               m = rb_first(&osd->o_requests);
>> > +               while (m) {
>> > +                       req = rb_entry(m, struct ceph_osd_request, r_node);
>> > +                       m = rb_next(m);
>> > +
>> > +                       if (req->r_abort_on_full &&
>> > +                           (osdmap_full || pool_full(osdc, req->r_t.base_oloc.pool))) {
>>
>> Over 80 lines and I think we need target_oloc instead of base_oloc
>> here.
>>
>> > +                               u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);
>>
>> Is replay_version used this way in the userspace client?  It is an ack
>> vs commit leftover and AFAICT it is never set (i.e. always 0'0) in newer
>> Objecter, so something is wrong here.
>>
>> > +
>> > +                               dout("%s: abort tid=%llu flags 0x%x\n", __func__, req->r_tid, req->r_flags);
>>
>> Over 80 lines and probably unnecessary -- complete_request() has
>> a similar dout.
>>
>> > +                               complete_request(req, -ENOSPC);
>>
>> This should be abort_request() (newly introduced in 4.11).
>>
>
> Fixed most of the above, but could you explain this bit?
>
> abort_request() just calls cancel_map_check() and then does a
> complete_request(). Why do we want to cancel the map check here?

Because the request in question could be in the map check tree.
ceph_osdc_abort_on_full() is asynchronous with respect to the rest of
request processing code, much like the recently merged timeout stuff.

complete_request() is for handle_reply() and map check code itself.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-03-29 11:54     ` Jeff Layton
@ 2017-03-29 16:52       ` Ilya Dryomov
  2017-03-29 17:43         ` Jeff Layton
  0 siblings, 1 reply; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-29 16:52 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Wed, Mar 29, 2017 at 1:54 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 2017-03-28 at 16:54 +0200, Ilya Dryomov wrote:
>> On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > Cephfs can get cap update requests that contain a new epoch barrier in
>> > them. When that happens we want to pause all OSD traffic until the right
>> > map epoch arrives.
>> >
>> > Add an epoch_barrier field to ceph_osd_client that is protected by the
>> > osdc->lock rwsem. When the barrier is set, and the current OSD map
>> > epoch is below that, pause the request target when submitting the
>> > request or when revisiting it. Add a way for upper layers (cephfs)
>> > to update the epoch_barrier as well.
>> >
>> > If we get a new map, compare the new epoch against the barrier before
>> > kicking requests and request another map if the map epoch is still lower
>> > than the one we want.
>> >
>> > If we end up cancelling requests because of a new map showing a full OSD
>> > or pool condition, then set the barrier higher than the highest replay
>> > epoch of all the cancelled requests.
>> >
>> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > ---
>> >  include/linux/ceph/osd_client.h |  2 ++
>> >  net/ceph/osd_client.c           | 51 +++++++++++++++++++++++++++++++++--------
>> >  2 files changed, 43 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
>> > index 55dcff2455e0..833942226560 100644
>> > --- a/include/linux/ceph/osd_client.h
>> > +++ b/include/linux/ceph/osd_client.h
>> > @@ -266,6 +266,7 @@ struct ceph_osd_client {
>> >         struct rb_root         osds;          /* osds */
>> >         struct list_head       osd_lru;       /* idle osds */
>> >         spinlock_t             osd_lru_lock;
>> > +       u32                    epoch_barrier;
>>
>> It would be good to include it in debugfs -- osdmap_show().
>>
>
> Ok, I'll plan to add it in there.
>
>> >         struct ceph_osd        homeless_osd;
>> >         atomic64_t             last_tid;      /* tid of last request */
>> >         u64                    last_linger_id;
>> > @@ -304,6 +305,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
>> >                                    struct ceph_msg *msg);
>> >  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
>> >                                  struct ceph_msg *msg);
>> > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
>> >
>> >  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
>> >                             unsigned int which, u16 opcode, u32 flags);
>> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > index b286ff6dec29..2e9b6211814a 100644
>> > --- a/net/ceph/osd_client.c
>> > +++ b/net/ceph/osd_client.c
>> > @@ -1299,8 +1299,10 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
>> >                        __pool_full(pi);
>> >
>> >         WARN_ON(pi->id != t->base_oloc.pool);
>> > -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
>> > -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
>> > +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
>> > +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
>> > +              (osdc->epoch_barrier &&
>> > +               osdc->osdmap->epoch < osdc->epoch_barrier);
>>
>> Why the osdc->epoch_barrier != 0 check, here and everywhere else?
>>
>
> My understanding is that we have to deal with clusters that predate the
> addition of this value. The client sets this to 0 when it's not set.

That and initialization to 0 in ceph_create_client(), so how does this
!= 0 check affect anything?  Won't we always return false in that case,
thus preserving the old behavior?

>
>> >  }
>> >
>> >  enum calc_target_result {
>> > @@ -1610,21 +1612,24 @@ static void send_request(struct ceph_osd_request *req)
>> >  static void maybe_request_map(struct ceph_osd_client *osdc)
>> >  {
>> >         bool continuous = false;
>> > +       u32 epoch = osdc->osdmap->epoch;
>> >
>> >         verify_osdc_locked(osdc);
>> > -       WARN_ON(!osdc->osdmap->epoch);
>> > +       WARN_ON_ONCE(epoch == 0);
>> >
>> >         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>> >             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
>> > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
>> > +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
>> > +           (osdc->epoch_barrier && epoch < osdc->epoch_barrier)) {
>> >                 dout("%s osdc %p continuous\n", __func__, osdc);
>> >                 continuous = true;
>> >         } else {
>> >                 dout("%s osdc %p onetime\n", __func__, osdc);
>> >         }
>> >
>> > +       ++epoch;
>> >         if (ceph_monc_want_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
>> > -                              osdc->osdmap->epoch + 1, continuous))
>> > +                              epoch, continuous))
>> >                 ceph_monc_renew_subs(&osdc->client->monc);
>> >  }
>>
>> Why was this change needed?  Wouldn't the existing call to unmodified
>> maybe_request_map() from ceph_osdc_handle_map() be sufficient?
>>
>
> It's not strictly required, but it's more efficient to fetch the epoch
> at the beginning of the function like this and then work with the value
> on the stack rather than potentially having to refetch it.  It also
> looks cleaner.

It was more about the if condition change.  I don't see a reason for it
and it's not present in the Objecter counterpart, so I'd rather we drop
the entire hunk.

Thanks,

                Ilya


* Re: [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-03-29 16:39         ` Jeff Layton
@ 2017-03-29 16:56           ` Ilya Dryomov
  0 siblings, 0 replies; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-29 16:56 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Wed, Mar 29, 2017 at 6:39 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Wed, 2017-03-29 at 16:14 +0200, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 4:01 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > On Tue, 2017-03-28 at 16:34 +0200, Ilya Dryomov wrote:
>> > > On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > > > When a Ceph volume hits capacity, a flag is set in the OSD map to
>> > > > indicate that, and a new map is sprayed around the cluster. With cephfs
>> > > > we want it to shut down any abortable requests that are in progress with
>> > > > an -ENOSPC error as they'd just hang otherwise.
>> > > >
>> > > > Add a new ceph_osdc_abort_on_full helper function to handle this. It
>> > > > will first check whether there is an out-of-space condition in the
>> > > > cluster. It will then walk the tree and abort any request that has
>> > > > r_abort_on_full set with an ENOSPC error. Call this new function
>> > > > directly whenever we get a new OSD map.
>> > > >
>> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > ---
>> > > >  net/ceph/osd_client.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> > > >  1 file changed, 42 insertions(+)
>> > > >
>> > > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > > > index 8fc0ccc7126f..b286ff6dec29 100644
>> > > > --- a/net/ceph/osd_client.c
>> > > > +++ b/net/ceph/osd_client.c
>> > > > @@ -1770,6 +1770,47 @@ static void complete_request(struct ceph_osd_request *req, int err)
>> > > >         ceph_osdc_put_request(req);
>> > > >  }
>> > > >
>> > > > +/*
>> > > > + * Drop all pending requests that are stalled waiting on a full condition to
>> > > > + * clear, and complete them with ENOSPC as the return code.
>> > > > + */
>> > > > +static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>> > > > +{
>> > > > +       struct ceph_osd_request *req;
>> > > > +       struct ceph_osd *osd;
>> > >
>> > > These variables could have narrower scope.
>> > >
>> > > > +       struct rb_node *m, *n;
>> > > > +       u32 latest_epoch = 0;
>> > > > +       bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);
>> > >
>> > > This variable is redundant.
>> > >
>> > > > +
>> > > > +       dout("enter abort_on_full\n");
>> > > > +
>> > > > +       if (!osdmap_full && !have_pool_full(osdc))
>> > > > +               goto out;
>> > > > +
>> > > > +       for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
>> > > > +               osd = rb_entry(n, struct ceph_osd, o_node);
>> > > > +               mutex_lock(&osd->lock);
>> > >
>> > > No need to take osd->lock here as osdc->lock is held for write.
>> > >
>> > > > +               m = rb_first(&osd->o_requests);
>> > > > +               while (m) {
>> > > > +                       req = rb_entry(m, struct ceph_osd_request, r_node);
>> > > > +                       m = rb_next(m);
>> > > > +
>> > > > +                       if (req->r_abort_on_full &&
>> > > > +                           (osdmap_full || pool_full(osdc, req->r_t.base_oloc.pool))) {
>> > >
>> > > Over 80 lines and I think we need target_oloc instead of base_oloc
>> > > here.
>> > >
>> > > > +                               u32 cur_epoch = le32_to_cpu(req->r_replay_version.epoch);
>> > >
>> > > Is replay_version used this way in the userspace client?  It is an ack
>> > > vs commit leftover and AFAICT it is never set (i.e. always 0'0) in newer
>> > > Objecter, so something is wrong here.
>> > >
>> >
>> > Yeah I see what you mean now. This does look to be entirely vestigial
>> > in the current code.
>> >
>> > Basically, what we want is to scan the list of outstanding requests for
>> > this OSD, and abort any that predate the current map epoch. I suppose
>> > this means that we need to record the current map epoch in the request
>> > somewhere. ISTR looking and seeing that it was recorded here, but I
>> > don't think that's the case anymore.
>>
>> Why only those that predate the current epoch and not simply all of
>> those marked as r_abort_on_full?  I mean, this is triggered by a pool
>> or an entire cluster going full, right?
>>
>>
>
> My mistake, you're right. We cancel based on r_abort_on_full and the
> pool/OSD full flags.
>
> We do need to determine the latest cancelled map epoch though, so that
> we can update the epoch_barrier properly when this occurs. What I'm
> doing now is just recording the current map epoch before we link a
> request into the tree and using that to determine this.

I think you can get away with returning the current epoch.  It's always
going to be the latest, by definition ;)

Thanks,

                Ilya


* Re: [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-03-29 16:52       ` Ilya Dryomov
@ 2017-03-29 17:43         ` Jeff Layton
  2017-03-29 17:56           ` Ilya Dryomov
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Layton @ 2017-03-29 17:43 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Wed, 2017-03-29 at 18:52 +0200, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 1:54 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Tue, 2017-03-28 at 16:54 +0200, Ilya Dryomov wrote:
> > > On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > Cephfs can get cap update requests that contain a new epoch barrier in
> > > > them. When that happens we want to pause all OSD traffic until the right
> > > > map epoch arrives.
> > > > 
> > > > Add an epoch_barrier field to ceph_osd_client that is protected by the
> > > > osdc->lock rwsem. When the barrier is set, and the current OSD map
> > > > epoch is below that, pause the request target when submitting the
> > > > request or when revisiting it. Add a way for upper layers (cephfs)
> > > > to update the epoch_barrier as well.
> > > > 
> > > > If we get a new map, compare the new epoch against the barrier before
> > > > kicking requests and request another map if the map epoch is still lower
> > > > than the one we want.
> > > > 
> > > > If we end up cancelling requests because of a new map showing a full OSD
> > > > or pool condition, then set the barrier higher than the highest replay
> > > > epoch of all the cancelled requests.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > >  include/linux/ceph/osd_client.h |  2 ++
> > > >  net/ceph/osd_client.c           | 51 +++++++++++++++++++++++++++++++++--------
> > > >  2 files changed, 43 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > > > index 55dcff2455e0..833942226560 100644
> > > > --- a/include/linux/ceph/osd_client.h
> > > > +++ b/include/linux/ceph/osd_client.h
> > > > @@ -266,6 +266,7 @@ struct ceph_osd_client {
> > > >         struct rb_root         osds;          /* osds */
> > > >         struct list_head       osd_lru;       /* idle osds */
> > > >         spinlock_t             osd_lru_lock;
> > > > +       u32                    epoch_barrier;
> > > 
> > > It would be good to include it in debugfs -- osdmap_show().
> > > 
> > 
> > Ok, I'll plan to add it in there.
> > 
> > > >         struct ceph_osd        homeless_osd;
> > > >         atomic64_t             last_tid;      /* tid of last request */
> > > >         u64                    last_linger_id;
> > > > @@ -304,6 +305,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
> > > >                                    struct ceph_msg *msg);
> > > >  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
> > > >                                  struct ceph_msg *msg);
> > > > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
> > > > 
> > > >  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
> > > >                             unsigned int which, u16 opcode, u32 flags);
> > > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > > > index b286ff6dec29..2e9b6211814a 100644
> > > > --- a/net/ceph/osd_client.c
> > > > +++ b/net/ceph/osd_client.c
> > > > @@ -1299,8 +1299,10 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
> > > >                        __pool_full(pi);
> > > > 
> > > >         WARN_ON(pi->id != t->base_oloc.pool);
> > > > -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
> > > > -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
> > > > +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
> > > > +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
> > > > +              (osdc->epoch_barrier &&
> > > > +               osdc->osdmap->epoch < osdc->epoch_barrier);
> > > 
> > > Why the osdc->epoch_barrier != 0 check, here and everywhere else?
> > > 
> > 
> > My understanding is that we have to deal with clusters that predate the
> > addition of this value. The client sets this to 0 when it's not set.
> 
> That and initialization to 0 in ceph_create_client(), so how does this
> != 0 check affect anything?  Won't we always return false in that case,
> thus preserving the old behavior?
> 
> > 
> > > >  }
> > > > 
> > > >  enum calc_target_result {
> > > > @@ -1610,21 +1612,24 @@ static void send_request(struct ceph_osd_request *req)
> > > >  static void maybe_request_map(struct ceph_osd_client *osdc)
> > > >  {
> > > >         bool continuous = false;
> > > > +       u32 epoch = osdc->osdmap->epoch;
> > > > 
> > > >         verify_osdc_locked(osdc);
> > > > -       WARN_ON(!osdc->osdmap->epoch);
> > > > +       WARN_ON_ONCE(epoch == 0);
> > > > 
> > > >         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
> > > >             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
> > > > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> > > > +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
> > > > +           (osdc->epoch_barrier && epoch < osdc->epoch_barrier)) {
> > > >                 dout("%s osdc %p continuous\n", __func__, osdc);
> > > >                 continuous = true;
> > > >         } else {
> > > >                 dout("%s osdc %p onetime\n", __func__, osdc);
> > > >         }
> > > > 
> > > > +       ++epoch;
> > > >         if (ceph_monc_want_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
> > > > -                              osdc->osdmap->epoch + 1, continuous))
> > > > +                              epoch, continuous))
> > > >                 ceph_monc_renew_subs(&osdc->client->monc);
> > > >  }
> > > 
> > > Why was this change needed?  Wouldn't the existing call to unmodified
> > > maybe_request_map() from ceph_osdc_handle_map() be sufficient?
> > > 
> > 
> > It's not strictly required, but it's more efficient to fetch the epoch
> > at the beginning of the function like this and then work with the value
> > on the stack rather than potentially having to refetch it.  It also
> > looks cleaner.
> 
> It was more about the if condition change.  I don't see a reason for it
> and it's not present in the Objecter counterpart, so I'd rather we drop
> the entire hunk.
> 

To be clear, there is no functional difference here.

epoch == osdc->osdmap->epoch, and we don't use "epoch" after that
point. I'll leave the code as-is per your wish, but I don't think it
makes any substantive difference here.
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-03-29 17:43         ` Jeff Layton
@ 2017-03-29 17:56           ` Ilya Dryomov
  0 siblings, 0 replies; 23+ messages in thread
From: Ilya Dryomov @ 2017-03-29 17:56 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Wed, Mar 29, 2017 at 7:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Wed, 2017-03-29 at 18:52 +0200, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 1:54 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > On Tue, 2017-03-28 at 16:54 +0200, Ilya Dryomov wrote:
>> > > On Sat, Feb 25, 2017 at 6:43 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > > > Cephfs can get cap update requests that contain a new epoch barrier in
>> > > > them. When that happens we want to pause all OSD traffic until the right
>> > > > map epoch arrives.
>> > > >
>> > > > Add an epoch_barrier field to ceph_osd_client that is protected by the
>> > > > osdc->lock rwsem. When the barrier is set, and the current OSD map
>> > > > epoch is below that, pause the request target when submitting the
>> > > > request or when revisiting it. Add a way for upper layers (cephfs)
>> > > > to update the epoch_barrier as well.
>> > > >
>> > > > If we get a new map, compare the new epoch against the barrier before
>> > > > kicking requests and request another map if the map epoch is still lower
>> > > > than the one we want.
>> > > >
>> > > > If we end up cancelling requests because of a new map showing a full OSD
>> > > > or pool condition, then set the barrier higher than the highest replay
>> > > > epoch of all the cancelled requests.
>> > > >
>> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > ---
>> > > >  include/linux/ceph/osd_client.h |  2 ++
>> > > >  net/ceph/osd_client.c           | 51 +++++++++++++++++++++++++++++++++--------
>> > > >  2 files changed, 43 insertions(+), 10 deletions(-)
>> > > >
>> > > > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
>> > > > index 55dcff2455e0..833942226560 100644
>> > > > --- a/include/linux/ceph/osd_client.h
>> > > > +++ b/include/linux/ceph/osd_client.h
>> > > > @@ -266,6 +266,7 @@ struct ceph_osd_client {
>> > > >         struct rb_root         osds;          /* osds */
>> > > >         struct list_head       osd_lru;       /* idle osds */
>> > > >         spinlock_t             osd_lru_lock;
>> > > > +       u32                    epoch_barrier;
>> > >
>> > > It would be good to include it in debugfs -- osdmap_show().
>> > >
>> >
>> > Ok, I'll plan to add it in there.
>> >
>> > > >         struct ceph_osd        homeless_osd;
>> > > >         atomic64_t             last_tid;      /* tid of last request */
>> > > >         u64                    last_linger_id;
>> > > > @@ -304,6 +305,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
>> > > >                                    struct ceph_msg *msg);
>> > > >  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
>> > > >                                  struct ceph_msg *msg);
>> > > > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
>> > > >
>> > > >  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
>> > > >                             unsigned int which, u16 opcode, u32 flags);
>> > > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > > > index b286ff6dec29..2e9b6211814a 100644
>> > > > --- a/net/ceph/osd_client.c
>> > > > +++ b/net/ceph/osd_client.c
>> > > > @@ -1299,8 +1299,10 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
>> > > >                        __pool_full(pi);
>> > > >
>> > > >         WARN_ON(pi->id != t->base_oloc.pool);
>> > > > -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
>> > > > -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
>> > > > +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
>> > > > +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
>> > > > +              (osdc->epoch_barrier &&
>> > > > +               osdc->osdmap->epoch < osdc->epoch_barrier);
>> > >
>> > > Why the osdc->epoch_barrier != 0 check, here and everywhere else?
>> > >
>> >
>> > My understanding is that we have to deal with clusters that predate the
>> > addition of this value. The client sets this to 0 when it's not set.
>>
>> That and initialization to 0 in ceph_create_client(), so how does this
>> != 0 check affect anything?  Won't we always return false in that case,
>> thus preserving the old behavior?
>>
>> >
>> > > >  }
>> > > >
>> > > >  enum calc_target_result {
>> > > > @@ -1610,21 +1612,24 @@ static void send_request(struct ceph_osd_request *req)
>> > > >  static void maybe_request_map(struct ceph_osd_client *osdc)
>> > > >  {
>> > > >         bool continuous = false;
>> > > > +       u32 epoch = osdc->osdmap->epoch;
>> > > >
>> > > >         verify_osdc_locked(osdc);
>> > > > -       WARN_ON(!osdc->osdmap->epoch);
>> > > > +       WARN_ON_ONCE(epoch == 0);
>> > > >
>> > > >         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>> > > >             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
>> > > > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
>> > > > +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
>> > > > +           (osdc->epoch_barrier && epoch < osdc->epoch_barrier)) {
>> > > >                 dout("%s osdc %p continuous\n", __func__, osdc);
>> > > >                 continuous = true;
>> > > >         } else {
>> > > >                 dout("%s osdc %p onetime\n", __func__, osdc);
>> > > >         }
>> > > >
>> > > > +       ++epoch;
>> > > >         if (ceph_monc_want_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
>> > > > -                              osdc->osdmap->epoch + 1, continuous))
>> > > > +                              epoch, continuous))
>> > > >                 ceph_monc_renew_subs(&osdc->client->monc);
>> > > >  }
>> > >
>> > > Why was this change needed?  Wouldn't the existing call to unmodified
>> > > maybe_request_map() from ceph_osdc_handle_map() be sufficient?
>> > >
>> >
>> > It's not strictly required, but it's more efficient to fetch the epoch
>> > at the beginning of the function like this and then work with the value
>> > on the stack rather than potentially having to refetch it.  It also
>> > looks cleaner.
>>
>> It was more about the if condition change.  I don't see a reason for it
>> and it's not present in the Objecter counterpart, so I'd rather we drop
>> the entire hunk.
>>
>
> To be clear, there is no functional difference here.
>
> epoch == osdc->osdmap->epoch, and we don't use "epoch" after that
> point. I'll leave the code as-is per your wish, but I don't think it
> makes any substantive difference here.

I was talking about this bit:

> -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
> +           (osdc->epoch_barrier && epoch < osdc->epoch_barrier)) {

It is a functional change, although definitely not substantive.

Thanks,

                Ilya


end of thread (newest: 2017-03-29 17:56 UTC)

Thread overview: 23+ messages
2017-02-25 17:43 [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Jeff Layton
2017-02-25 17:43 ` [PATCH v5 1/6] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
2017-03-28 14:22   ` Ilya Dryomov
2017-03-28 20:12     ` Jeff Layton
2017-03-29 14:47       ` Ilya Dryomov
2017-02-25 17:43 ` [PATCH v5 2/6] libceph: abort already submitted but abortable requests when map or pool goes full Jeff Layton
2017-03-28 14:34   ` Ilya Dryomov
2017-03-28 20:44     ` Jeff Layton
2017-03-29 16:52       ` Ilya Dryomov
2017-03-29 14:01     ` Jeff Layton
2017-03-29 14:14       ` Ilya Dryomov
2017-03-29 16:39         ` Jeff Layton
2017-03-29 16:56           ` Ilya Dryomov
2017-02-25 17:43 ` [PATCH v5 3/6] libceph: add an epoch_barrier field to struct ceph_osd_client Jeff Layton
2017-03-28 14:54   ` Ilya Dryomov
2017-03-29 11:54     ` Jeff Layton
2017-03-29 16:52       ` Ilya Dryomov
2017-03-29 17:43         ` Jeff Layton
2017-03-29 17:56           ` Ilya Dryomov
2017-02-25 17:43 ` [PATCH v5 4/6] ceph: handle epoch barriers in cap messages Jeff Layton
2017-02-25 17:43 ` [PATCH v5 5/6] Revert "ceph: SetPageError() for writeback pages if writepages fails" Jeff Layton
2017-02-25 17:43 ` [PATCH v5 6/6] ceph: when seeing write errors on an inode, switch to sync writes Jeff Layton
2017-02-27  2:43 ` [PATCH v5 0/6] ceph: implement new-style ENOSPC handling in kcephfs Yan, Zheng
