* [PATCH v6 0/7] implement -ENOSPC handling in cephfs
@ 2017-03-30 18:05 Jeff Layton
  2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2017-03-30 18:05 UTC (permalink / raw)
  To: idryomov, zyan, sage; +Cc: jspray, ceph-devel

v6: reset barrier to current epoch when receiving map with full pool or
    at quota condition. Show epoch barrier in debugfs. Don't take osd->lock
    unnecessarily. Remove req->r_replay_version. Other cleanups and
    fixes suggested by Ilya.

v5: rebase onto ACK vs. commit changes

v4: eliminate map_cb and just call ceph_osdc_abort_on_full directly
Revert earlier patch flagging individual pages with an error on writeback
failure. Add a mechanism to force synchronous writes when writes start
failing, and to re-allow async writes when they succeed.

v3: track "abort_on_full" behavior with a new bool in osd request
instead of a protocol flag. Remove some extraneous arguments from
various functions. Don't export have_pool_full, call it from the
abort_on_full callback instead.

v2: teach libceph how to hold on to requests until the right map
epoch appears, instead of delaying cap handling in the cephfs layer.

Ok, with this set, I think we have proper handling of -ENOSPC
conditions for all of the different write types (direct, sync, async
buffered, etc.).

This patchset is an updated version of the patch series originally
done by John Spray and posted here:

    http://www.spinics.net/lists/ceph-devel/msg21257.html

The only real change from the last posting was to rebase it on top
of Ilya's changes to remove the ack vs. commit behavior. That rebase
was fairly simple and turns out to simplify the changes somewhat.

In the last series Zheng also mentioned removing the other SetPageError
sites in fs/ceph. That may make sense, but I've left that out for now,
as I'd like to look over PG_error handling in the kernel at large.

Jeff Layton (7):
  libceph: remove req->r_replay_version
  libceph: allow requests to return immediately on full conditions if
    caller wishes
  libceph: abort already submitted but abortable requests when map or
    pool goes full
  libceph: add an epoch_barrier field to struct ceph_osd_client
  ceph: handle epoch barriers in cap messages
  Revert "ceph: SetPageError() for writeback pages if writepages fails"
  ceph: when seeing write errors on an inode, switch to sync writes

 fs/ceph/addr.c                  | 10 +++--
 fs/ceph/caps.c                  | 17 ++++++--
 fs/ceph/file.c                  | 32 ++++++++------
 fs/ceph/mds_client.c            | 20 +++++++++
 fs/ceph/mds_client.h            |  7 +++-
 fs/ceph/super.h                 | 26 ++++++++++++
 include/linux/ceph/osd_client.h |  4 +-
 net/ceph/debugfs.c              |  7 ++--
 net/ceph/osd_client.c           | 92 ++++++++++++++++++++++++++++++++++++-----
 9 files changed, 177 insertions(+), 38 deletions(-)

-- 
2.9.3



* [PATCH v6 1/7] libceph: remove req->r_replay_version
  2017-03-30 18:05 [PATCH v6 0/7] implement -ENOSPC handling in cephfs Jeff Layton
@ 2017-03-30 18:07 ` Jeff Layton
  2017-03-30 18:07   ` [PATCH v6 2/7] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
                     ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Jeff Layton @ 2017-03-30 18:07 UTC (permalink / raw)
  To: idryomov, zyan, sage; +Cc: jspray, ceph-devel

Nothing uses this anymore with the removal of the ack vs. commit code.
Remove the field and just encode zeroes in its place in the request
encoding.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/ceph/osd_client.h | 1 -
 net/ceph/debugfs.c              | 4 +---
 net/ceph/osd_client.c           | 6 +++---
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index d6a625e75040..3fc9e7754a9b 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -192,7 +192,6 @@ struct ceph_osd_request {
 	unsigned long r_stamp;                /* jiffies, send or check time */
 	unsigned long r_start_stamp;          /* jiffies */
 	int r_attempts;
-	struct ceph_eversion r_replay_version; /* aka reassert_version */
 	u32 r_last_force_resend;
 	u32 r_map_dne_bound;
 
diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index c62b2b029a6e..d7e63a4f5578 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -177,9 +177,7 @@ static void dump_request(struct seq_file *s, struct ceph_osd_request *req)
 	seq_printf(s, "%llu\t", req->r_tid);
 	dump_target(s, &req->r_t);
 
-	seq_printf(s, "\t%d\t%u'%llu", req->r_attempts,
-		   le32_to_cpu(req->r_replay_version.epoch),
-		   le64_to_cpu(req->r_replay_version.version));
+	seq_printf(s, "\t%d", req->r_attempts);
 
 	for (i = 0; i < req->r_num_ops; i++) {
 		struct ceph_osd_req_op *op = &req->r_ops[i];
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b4500a8ab8b3..27f14ae69eb7 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1503,9 +1503,9 @@ static void encode_request(struct ceph_osd_request *req, struct ceph_msg *msg)
 	ceph_encode_32(&p, req->r_flags);
 	ceph_encode_timespec(p, &req->r_mtime);
 	p += sizeof(struct ceph_timespec);
-	/* aka reassert_version */
-	memcpy(p, &req->r_replay_version, sizeof(req->r_replay_version));
-	p += sizeof(req->r_replay_version);
+	/* replay version field */
+	memset(p, 0, sizeof(struct ceph_eversion));
+	p += sizeof(struct ceph_eversion);
 
 	/* oloc */
 	ceph_start_encoding(&p, 5, 4,
-- 
2.9.3



* [PATCH v6 2/7] libceph: allow requests to return immediately on full conditions if caller wishes
  2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
@ 2017-03-30 18:07   ` Jeff Layton
  2017-04-04 14:55     ` Ilya Dryomov
  2017-03-30 18:07   ` [PATCH v6 3/7] libceph: abort already submitted but abortable requests when map or pool goes full Jeff Layton
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2017-03-30 18:07 UTC (permalink / raw)
  To: idryomov, zyan, sage; +Cc: jspray, ceph-devel

Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition, then allow it to set
a flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/addr.c                  | 1 +
 fs/ceph/file.c                  | 1 +
 include/linux/ceph/osd_client.h | 1 +
 net/ceph/osd_client.c           | 7 +++++++
 4 files changed, 10 insertions(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 1a3e1b40799a..7e3fae334620 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1892,6 +1892,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
 	err = ceph_osdc_start_request(&fsc->client->osdc, rd_req, false);
 
 	wr_req->r_mtime = ci->vfs_inode.i_mtime;
+	wr_req->r_abort_on_full = true;
 	err2 = ceph_osdc_start_request(&fsc->client->osdc, wr_req, false);
 
 	if (!err)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 356b7c76a2f1..cff35a1ff53c 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -712,6 +712,7 @@ static void ceph_aio_retry_work(struct work_struct *work)
 	req->r_callback = ceph_aio_complete_req;
 	req->r_inode = inode;
 	req->r_priv = aio_req;
+	req->r_abort_on_full = true;
 
 	ret = ceph_osdc_start_request(req->r_osdc, req, false);
 out:
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 3fc9e7754a9b..8cf644197b1a 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -187,6 +187,7 @@ struct ceph_osd_request {
 	struct timespec r_mtime;              /* ditto */
 	u64 r_data_offset;                    /* ditto */
 	bool r_linger;                        /* don't resend on failure */
+	bool r_abort_on_full;		      /* return ENOSPC when full */
 
 	/* internal */
 	unsigned long r_stamp;                /* jiffies, send or check time */
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 27f14ae69eb7..781048990599 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -961,6 +961,7 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
 				       truncate_size, truncate_seq);
 	}
 
+	req->r_abort_on_full = true;
 	req->r_flags = flags;
 	req->r_base_oloc.pool = layout->pool_id;
 	req->r_base_oloc.pool_ns = ceph_try_get_string(layout->pool_ns);
@@ -1626,6 +1627,7 @@ static void maybe_request_map(struct ceph_osd_client *osdc)
 		ceph_monc_renew_subs(&osdc->client->monc);
 }
 
+static void complete_request(struct ceph_osd_request *req, int err);
 static void send_map_check(struct ceph_osd_request *req);
 
 static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
@@ -1635,6 +1637,7 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 	enum calc_target_result ct_res;
 	bool need_send = false;
 	bool promoted = false;
+	bool need_abort = false;
 
 	WARN_ON(req->r_tid);
 	dout("%s req %p wrlocked %d\n", __func__, req, wrlocked);
@@ -1669,6 +1672,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 		pr_warn_ratelimited("FULL or reached pool quota\n");
 		req->r_t.paused = true;
 		maybe_request_map(osdc);
+		if (req->r_abort_on_full)
+			need_abort = true;
 	} else if (!osd_homeless(osd)) {
 		need_send = true;
 	} else {
@@ -1685,6 +1690,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 	link_request(osd, req);
 	if (need_send)
 		send_request(req);
+	else if (need_abort)
+		complete_request(req, -ENOSPC);
 	mutex_unlock(&osd->lock);
 
 	if (ct_res == CALC_TARGET_POOL_DNE)
-- 
2.9.3



* [PATCH v6 3/7] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
  2017-03-30 18:07   ` [PATCH v6 2/7] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
@ 2017-03-30 18:07   ` Jeff Layton
  2017-04-04 14:57     ` Ilya Dryomov
  2017-03-30 18:07   ` [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client Jeff Layton
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2017-03-30 18:07 UTC (permalink / raw)
  To: idryomov, zyan, sage; +Cc: jspray, ceph-devel

When a Ceph volume hits capacity, a flag is set in the OSD map to
indicate that, and a new map is sprayed around the cluster. With cephfs
we want to shut down any abortable requests that are in progress with
an -ENOSPC error, as they'd just hang otherwise.

Add a new ceph_osdc_abort_on_full helper function to handle this. It
will first check whether there is an out-of-space condition in the
cluster and then walk the tree and abort any request that has
r_abort_on_full set with a -ENOSPC error. Call this new function
directly whenever we get a new OSD map.

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 net/ceph/osd_client.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 781048990599..4e56cd1ec265 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1806,6 +1806,40 @@ static void abort_request(struct ceph_osd_request *req, int err)
 	complete_request(req, err);
 }
 
+/*
+ * Drop all pending requests that are stalled waiting on a full condition to
+ * clear, and complete them with ENOSPC as the return code.
+ */
+static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
+{
+	struct rb_node *n;
+	bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);
+
+	dout("enter abort_on_full\n");
+
+	if (!osdmap_full && !have_pool_full(osdc))
+		goto out;
+
+	for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
+		struct ceph_osd *osd = rb_entry(n, struct ceph_osd, o_node);
+		struct rb_node *m;
+
+		m = rb_first(&osd->o_requests);
+		while (m) {
+			struct ceph_osd_request *req = rb_entry(m,
+					struct ceph_osd_request, r_node);
+			m = rb_next(m);
+
+			if (req->r_abort_on_full &&
+			    (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
+			     pool_full(osdc, req->r_t.target_oloc.pool)))
+				abort_request(req, -ENOSPC);
+		}
+	}
+out:
+	dout("return abort_on_full\n");
+}
+
 static void check_pool_dne(struct ceph_osd_request *req)
 {
 	struct ceph_osd_client *osdc = req->r_osdc;
@@ -3264,6 +3298,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 
 	kick_requests(osdc, &need_resend, &need_resend_linger);
 
+	ceph_osdc_abort_on_full(osdc);
 	ceph_monc_got_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
 			  osdc->osdmap->epoch);
 	up_write(&osdc->lock);
-- 
2.9.3



* [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
  2017-03-30 18:07   ` [PATCH v6 2/7] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
  2017-03-30 18:07   ` [PATCH v6 3/7] libceph: abort already submitted but abortable requests when map or pool goes full Jeff Layton
@ 2017-03-30 18:07   ` Jeff Layton
  2017-04-04 15:00     ` Ilya Dryomov
  2017-03-30 18:07   ` [PATCH v6 5/7] ceph: handle epoch barriers in cap messages Jeff Layton
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2017-03-30 18:07 UTC (permalink / raw)
  To: idryomov, zyan, sage; +Cc: jspray, ceph-devel

Cephfs can get cap update requests that contain a new epoch barrier in
them. When that happens we want to pause all OSD traffic until the right
map epoch arrives.

Add an epoch_barrier field to ceph_osd_client that is protected by the
osdc->lock rwsem. When the barrier is set, and the current OSD map
epoch is below that, pause the request target when submitting the
request or when revisiting it. Add a way for upper layers (cephfs)
to update the epoch_barrier as well.

If we get a new map, compare the new epoch against the barrier before
kicking requests and request another map if the map epoch is still lower
than the one we want.

If we get a map with a full pool, or at quota condition, then set the
barrier to the current epoch value.

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/ceph/osd_client.h |  2 ++
 net/ceph/debugfs.c              |  3 ++-
 net/ceph/osd_client.c           | 48 +++++++++++++++++++++++++++++++++--------
 3 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 8cf644197b1a..85650b415e73 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -267,6 +267,7 @@ struct ceph_osd_client {
 	struct rb_root         osds;          /* osds */
 	struct list_head       osd_lru;       /* idle osds */
 	spinlock_t             osd_lru_lock;
+	u32		       epoch_barrier;
 	struct ceph_osd        homeless_osd;
 	atomic64_t             last_tid;      /* tid of last request */
 	u64                    last_linger_id;
@@ -305,6 +306,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
 				   struct ceph_msg *msg);
 extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
 				 struct ceph_msg *msg);
+void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
 
 extern void osd_req_op_init(struct ceph_osd_request *osd_req,
 			    unsigned int which, u16 opcode, u32 flags);
diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index d7e63a4f5578..71ba13927b3d 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -62,7 +62,8 @@ static int osdmap_show(struct seq_file *s, void *p)
 		return 0;
 
 	down_read(&osdc->lock);
-	seq_printf(s, "epoch %d flags 0x%x\n", map->epoch, map->flags);
+	seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
+			osdc->epoch_barrier, map->flags);
 
 	for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
 		struct ceph_pg_pool_info *pi =
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 4e56cd1ec265..3a94e8a1c7ff 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1298,8 +1298,9 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
 		       __pool_full(pi);
 
 	WARN_ON(pi->id != t->base_oloc.pool);
-	return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
-	       (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
+	return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
+	       ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
+	       (osdc->osdmap->epoch < osdc->epoch_barrier);
 }
 
 enum calc_target_result {
@@ -1609,13 +1610,15 @@ static void send_request(struct ceph_osd_request *req)
 static void maybe_request_map(struct ceph_osd_client *osdc)
 {
 	bool continuous = false;
+	u32 epoch = osdc->osdmap->epoch;
 
 	verify_osdc_locked(osdc);
-	WARN_ON(!osdc->osdmap->epoch);
+	WARN_ON_ONCE(epoch == 0);
 
 	if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
 	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
-	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
+	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
+	    epoch < osdc->epoch_barrier) {
 		dout("%s osdc %p continuous\n", __func__, osdc);
 		continuous = true;
 	} else {
@@ -1653,8 +1656,13 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 		goto promote;
 	}
 
-	if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
-	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
+	if (osdc->osdmap->epoch < osdc->epoch_barrier) {
+		dout("req %p epoch %u barrier %u\n", req, osdc->osdmap->epoch,
+		     osdc->epoch_barrier);
+		req->r_t.paused = true;
+		maybe_request_map(osdc);
+	} else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
+		   ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
 		dout("req %p pausewr\n", req);
 		req->r_t.paused = true;
 		maybe_request_map(osdc);
@@ -1808,7 +1816,8 @@ static void abort_request(struct ceph_osd_request *req, int err)
 
 /*
  * Drop all pending requests that are stalled waiting on a full condition to
- * clear, and complete them with ENOSPC as the return code.
+ * clear, and complete them with ENOSPC as the return code. Set the
+ * osdc->epoch_barrier to the latest replay version epoch that was aborted.
  */
 static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
 {
@@ -1836,8 +1845,10 @@ static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
 				abort_request(req, -ENOSPC);
 		}
 	}
+	/* Update the epoch barrier to current epoch */
+	osdc->epoch_barrier = osdc->osdmap->epoch;
 out:
-	dout("return abort_on_full\n");
+	dout("return abort_on_full barrier=%u\n", osdc->epoch_barrier);
 }
 
 static void check_pool_dne(struct ceph_osd_request *req)
@@ -3293,7 +3304,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 	pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
 		  ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
 		  have_pool_full(osdc);
-	if (was_pauserd || was_pausewr || pauserd || pausewr)
+	if (was_pauserd || was_pausewr || pauserd || pausewr ||
+	    osdc->osdmap->epoch < osdc->epoch_barrier)
 		maybe_request_map(osdc);
 
 	kick_requests(osdc, &need_resend, &need_resend_linger);
@@ -3311,6 +3323,24 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 	up_write(&osdc->lock);
 }
 
+void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
+{
+	down_read(&osdc->lock);
+	if (unlikely(eb > osdc->epoch_barrier)) {
+		up_read(&osdc->lock);
+		down_write(&osdc->lock);
+		if (osdc->epoch_barrier < eb) {
+			dout("updating epoch_barrier from %u to %u\n",
+					osdc->epoch_barrier, eb);
+			osdc->epoch_barrier = eb;
+		}
+		up_write(&osdc->lock);
+	} else {
+		up_read(&osdc->lock);
+	}
+}
+EXPORT_SYMBOL(ceph_osdc_update_epoch_barrier);
+
 /*
  * Resubmit requests pending on the given osd.
  */
-- 
2.9.3



* [PATCH v6 5/7] ceph: handle epoch barriers in cap messages
  2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
                     ` (2 preceding siblings ...)
  2017-03-30 18:07   ` [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client Jeff Layton
@ 2017-03-30 18:07   ` Jeff Layton
  2017-03-30 18:07   ` [PATCH v6 6/7] Revert "ceph: SetPageError() for writeback pages if writepages fails" Jeff Layton
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Jeff Layton @ 2017-03-30 18:07 UTC (permalink / raw)
  To: idryomov, zyan, sage; +Cc: jspray, ceph-devel

Have the client store and update the osdc epoch_barrier when a cap
message comes in with one.

When sending cap messages, send the epoch barrier as well. This allows
clients to inform servers that their released caps may not be used until
a particular OSD map epoch.

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c       | 17 +++++++++++++----
 fs/ceph/mds_client.c | 20 ++++++++++++++++++++
 fs/ceph/mds_client.h |  7 +++++--
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 60185434162a..f2df84a8f460 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1015,6 +1015,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
 	void *p;
 	size_t extra_len;
 	struct timespec zerotime = {0};
+	struct ceph_osd_client *osdc = &arg->session->s_mdsc->fsc->client->osdc;
 
 	dout("send_cap_msg %s %llx %llx caps %s wanted %s dirty %s"
 	     " seq %u/%u tid %llu/%llu mseq %u follows %lld size %llu/%llu"
@@ -1077,7 +1078,9 @@ static int send_cap_msg(struct cap_msg_args *arg)
 	/* inline data size */
 	ceph_encode_32(&p, 0);
 	/* osd_epoch_barrier (version 5) */
-	ceph_encode_32(&p, 0);
+	down_read(&osdc->lock);
+	ceph_encode_32(&p, osdc->epoch_barrier);
+	up_read(&osdc->lock);
 	/* oldest_flush_tid (version 6) */
 	ceph_encode_64(&p, arg->oldest_flush_tid);
 
@@ -3633,13 +3636,19 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		p += inline_len;
 	}
 
+	if (le16_to_cpu(msg->hdr.version) >= 5) {
+		struct ceph_osd_client	*osdc = &mdsc->fsc->client->osdc;
+		u32			epoch_barrier;
+
+		ceph_decode_32_safe(&p, end, epoch_barrier, bad);
+		ceph_osdc_update_epoch_barrier(osdc, epoch_barrier);
+	}
+
 	if (le16_to_cpu(msg->hdr.version) >= 8) {
 		u64 flush_tid;
 		u32 caller_uid, caller_gid;
-		u32 osd_epoch_barrier;
 		u32 pool_ns_len;
-		/* version >= 5 */
-		ceph_decode_32_safe(&p, end, osd_epoch_barrier, bad);
+
 		/* version >= 6 */
 		ceph_decode_64_safe(&p, end, flush_tid, bad);
 		/* version >= 7 */
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b16f1cf552a8..820bf0fb7745 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1551,9 +1551,15 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 	struct ceph_msg *msg = NULL;
 	struct ceph_mds_cap_release *head;
 	struct ceph_mds_cap_item *item;
+	struct ceph_osd_client *osdc = &mdsc->fsc->client->osdc;
 	struct ceph_cap *cap;
 	LIST_HEAD(tmp_list);
 	int num_cap_releases;
+	__le32	barrier, *cap_barrier;
+
+	down_read(&osdc->lock);
+	barrier = cpu_to_le32(osdc->epoch_barrier);
+	up_read(&osdc->lock);
 
 	spin_lock(&session->s_cap_lock);
 again:
@@ -1571,7 +1577,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 			head = msg->front.iov_base;
 			head->num = cpu_to_le32(0);
 			msg->front.iov_len = sizeof(*head);
+
+			msg->hdr.version = cpu_to_le16(2);
+			msg->hdr.compat_version = cpu_to_le16(1);
 		}
+
 		cap = list_first_entry(&tmp_list, struct ceph_cap,
 					session_caps);
 		list_del(&cap->session_caps);
@@ -1589,6 +1599,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 		ceph_put_cap(mdsc, cap);
 
 		if (le32_to_cpu(head->num) == CEPH_CAPS_PER_RELEASE) {
+			// Append cap_barrier field
+			cap_barrier = msg->front.iov_base + msg->front.iov_len;
+			*cap_barrier = barrier;
+			msg->front.iov_len += sizeof(*cap_barrier);
+
 			msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
 			dout("send_cap_releases mds%d %p\n", session->s_mds, msg);
 			ceph_con_send(&session->s_con, msg);
@@ -1604,6 +1619,11 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 	spin_unlock(&session->s_cap_lock);
 
 	if (msg) {
+		// Append cap_barrier field
+		cap_barrier = msg->front.iov_base + msg->front.iov_len;
+		*cap_barrier = barrier;
+		msg->front.iov_len += sizeof(*cap_barrier);
+
 		msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
 		dout("send_cap_releases mds%d %p\n", session->s_mds, msg);
 		ceph_con_send(&session->s_con, msg);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index bbebcd55d79e..54166758093f 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -105,10 +105,13 @@ struct ceph_mds_reply_info_parsed {
 
 /*
  * cap releases are batched and sent to the MDS en masse.
+ *
+ * Account for per-message overhead of mds_cap_release header
+ * and __le32 for osd epoch barrier trailing field.
  */
-#define CEPH_CAPS_PER_RELEASE ((PAGE_SIZE -			\
+#define CEPH_CAPS_PER_RELEASE ((PAGE_SIZE - sizeof(u32) -		\
 				sizeof(struct ceph_mds_cap_release)) /	\
-			       sizeof(struct ceph_mds_cap_item))
+			        sizeof(struct ceph_mds_cap_item))
 
 
 /*
-- 
2.9.3



* [PATCH v6 6/7] Revert "ceph: SetPageError() for writeback pages if writepages fails"
  2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
                     ` (3 preceding siblings ...)
  2017-03-30 18:07   ` [PATCH v6 5/7] ceph: handle epoch barriers in cap messages Jeff Layton
@ 2017-03-30 18:07   ` Jeff Layton
  2017-03-30 18:07   ` [PATCH v6 7/7] ceph: when seeing write errors on an inode, switch to sync writes Jeff Layton
  2017-04-04 14:55   ` [PATCH v6 1/7] libceph: remove req->r_replay_version Ilya Dryomov
  6 siblings, 0 replies; 18+ messages in thread
From: Jeff Layton @ 2017-03-30 18:07 UTC (permalink / raw)
  To: idryomov, zyan, sage; +Cc: jspray, ceph-devel

This reverts commit b109eec6f4332bd517e2f41e207037c4b9065094.

If I'm filling up a filesystem with this sort of command:

    $ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync

...then I'll eventually get back EIO on a write. Further calls
will give us ENOSPC.

I'm not sure what prompted this change, but I don't think it's what we
want to do. If writepages failed, we will have already set the mapping
error appropriately, and that's what gets reported by fsync() or
close().

__filemap_fdatawait_range however, does this:

	wait_on_page_writeback(page);
	if (TestClearPageError(page))
		ret = -EIO;

...and that -EIO ends up trumping the mapping's error if one exists.

When writepages fails, we only want to set the error in the mapping,
and not flag the individual pages.

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/addr.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 7e3fae334620..6cdf94459ac4 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -703,9 +703,6 @@ static void writepages_finish(struct ceph_osd_request *req)
 				clear_bdi_congested(&fsc->backing_dev_info,
 						    BLK_RW_ASYNC);
 
-			if (rc < 0)
-				SetPageError(page);
-
 			ceph_put_snap_context(page_snap_context(page));
 			page->private = 0;
 			ClearPagePrivate(page);
-- 
2.9.3



* [PATCH v6 7/7] ceph: when seeing write errors on an inode, switch to sync writes
  2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
                     ` (4 preceding siblings ...)
  2017-03-30 18:07   ` [PATCH v6 6/7] Revert "ceph: SetPageError() for writeback pages if writepages fails" Jeff Layton
@ 2017-03-30 18:07   ` Jeff Layton
  2017-04-04 14:55   ` [PATCH v6 1/7] libceph: remove req->r_replay_version Ilya Dryomov
  6 siblings, 0 replies; 18+ messages in thread
From: Jeff Layton @ 2017-03-30 18:07 UTC (permalink / raw)
  To: idryomov, zyan, sage; +Cc: jspray, ceph-devel

Currently, we don't have a real feedback mechanism in place for when we
start seeing buffered writeback errors. If writeback is failing, there
is nothing that prevents an application from continuing to dirty pages
that aren't being cleaned.

In the event that we're seeing write errors of any sort occur on an
inode, have the callback set a flag to force further writes to be
synchronous. When the next write succeeds, clear the flag to allow
buffered writeback to continue.

Since this is just a hint to the write submission mechanism, we only
take the i_ceph_lock when a lockless check shows that the flag needs to
be changed.

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/addr.c  |  6 +++++-
 fs/ceph/file.c  | 31 ++++++++++++++++++-------------
 fs/ceph/super.h | 26 ++++++++++++++++++++++++++
 3 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6cdf94459ac4..e253102b43cd 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -670,8 +670,12 @@ static void writepages_finish(struct ceph_osd_request *req)
 	bool remove_page;
 
 	dout("writepages_finish %p rc %d\n", inode, rc);
-	if (rc < 0)
+	if (rc < 0) {
 		mapping_set_error(mapping, rc);
+		ceph_set_error_write(ci);
+	} else {
+		ceph_clear_error_write(ci);
+	}
 
 	/*
 	 * We lost the cache cap, need to truncate the page before
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index cff35a1ff53c..0480492aa349 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1020,19 +1020,22 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 
 out:
 		ceph_osdc_put_request(req);
-		if (ret == 0) {
-			pos += len;
-			written += len;
-
-			if (pos > i_size_read(inode)) {
-				check_caps = ceph_inode_set_size(inode, pos);
-				if (check_caps)
-					ceph_check_caps(ceph_inode(inode),
-							CHECK_CAPS_AUTHONLY,
-							NULL);
-			}
-		} else
+		if (ret != 0) {
+			ceph_set_error_write(ci);
 			break;
+		}
+
+		ceph_clear_error_write(ci);
+		pos += len;
+		written += len;
+		if (pos > i_size_read(inode)) {
+			check_caps = ceph_inode_set_size(inode, pos);
+			if (check_caps)
+				ceph_check_caps(ceph_inode(inode),
+						CHECK_CAPS_AUTHONLY,
+						NULL);
+		}
+
 	}
 
 	if (ret != -EOLDSNAPC && written > 0) {
@@ -1238,6 +1241,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 retry_snap:
+	/* FIXME: not complete since it doesn't account for being at quota */
 	if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL)) {
 		err = -ENOSPC;
 		goto out;
@@ -1259,7 +1263,8 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	     inode, ceph_vinop(inode), pos, count, ceph_cap_string(got));
 
 	if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
-	    (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC)) {
+	    (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
+	    (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
 		struct ceph_snap_context *snapc;
 		struct iov_iter data;
 		inode_unlock(inode);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index c68e6a045fb9..193dc61abc4d 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -474,6 +474,32 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 #define CEPH_I_CAP_DROPPED	(1 << 8)  /* caps were forcibly dropped */
 #define CEPH_I_KICK_FLUSH	(1 << 9)  /* kick flushing caps */
 #define CEPH_I_FLUSH_SNAPS	(1 << 10) /* need flush snapss */
+#define CEPH_I_ERROR_WRITE	(1 << 11) /* have seen write errors */
+
+/*
+ * We set the ERROR_WRITE bit when we start seeing write errors on an inode
+ * and then clear it when they start succeeding. Note that we do a lockless
+ * check first, and only take the lock if it looks like it needs to be changed.
+ * The write submission code just takes this as a hint, so we're not too
+ * worried if a few slip through in either direction.
+ */
+static inline void ceph_set_error_write(struct ceph_inode_info *ci)
+{
+	if (!(ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+		spin_lock(&ci->i_ceph_lock);
+		ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
+		spin_unlock(&ci->i_ceph_lock);
+	}
+}
+
+static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
+{
+	if (ci->i_ceph_flags & CEPH_I_ERROR_WRITE) {
+		spin_lock(&ci->i_ceph_lock);
+		ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
+		spin_unlock(&ci->i_ceph_lock);
+	}
+}
 
 static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
 					   long long release_count,
-- 
2.9.3


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 1/7] libceph: remove req->r_replay_version
  2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
                     ` (5 preceding siblings ...)
  2017-03-30 18:07   ` [PATCH v6 7/7] ceph: when seeing write errors on an inode, switch to sync writes Jeff Layton
@ 2017-04-04 14:55   ` Ilya Dryomov
  6 siblings, 0 replies; 18+ messages in thread
From: Ilya Dryomov @ 2017-04-04 14:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
> Nothing uses this anymore with the removal of the ack vs. commit code.
> Remove the field and just encode zeroes into place in the request
> encoding.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  include/linux/ceph/osd_client.h | 1 -
>  net/ceph/debugfs.c              | 4 +---
>  net/ceph/osd_client.c           | 6 +++---
>  3 files changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index d6a625e75040..3fc9e7754a9b 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -192,7 +192,6 @@ struct ceph_osd_request {
>         unsigned long r_stamp;                /* jiffies, send or check time */
>         unsigned long r_start_stamp;          /* jiffies */
>         int r_attempts;
> -       struct ceph_eversion r_replay_version; /* aka reassert_version */
>         u32 r_last_force_resend;
>         u32 r_map_dne_bound;
>
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index c62b2b029a6e..d7e63a4f5578 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -177,9 +177,7 @@ static void dump_request(struct seq_file *s, struct ceph_osd_request *req)
>         seq_printf(s, "%llu\t", req->r_tid);
>         dump_target(s, &req->r_t);
>
> -       seq_printf(s, "\t%d\t%u'%llu", req->r_attempts,
> -                  le32_to_cpu(req->r_replay_version.epoch),
> -                  le64_to_cpu(req->r_replay_version.version));
> +       seq_printf(s, "\t%d", req->r_attempts);
>
>         for (i = 0; i < req->r_num_ops; i++) {
>                 struct ceph_osd_req_op *op = &req->r_ops[i];
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index b4500a8ab8b3..27f14ae69eb7 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1503,9 +1503,9 @@ static void encode_request(struct ceph_osd_request *req, struct ceph_msg *msg)
>         ceph_encode_32(&p, req->r_flags);
>         ceph_encode_timespec(p, &req->r_mtime);
>         p += sizeof(struct ceph_timespec);
> -       /* aka reassert_version */
> -       memcpy(p, &req->r_replay_version, sizeof(req->r_replay_version));
> -       p += sizeof(req->r_replay_version);
> +       /* replay version field */
> +       memset(p, 0, sizeof(struct ceph_eversion));
> +       p += sizeof(struct ceph_eversion);

It's called reassert_version in userspace.  Don't change the comment,
just drop the "aka":

    /* reassert_version */
    memset(p, 0, sizeof(struct ceph_eversion));
    p += sizeof(struct ceph_eversion);

With that,

Reviewed-by: Ilya Dryomov <idryomov@gmail.com>

Thanks,

                Ilya
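For context on the hunk under discussion: dropping r_replay_version must not change the wire format, so the encoder still emits the bytes the field used to occupy, just zeroed. A userspace sketch of that idiom (the struct here is a packed stand-in assumed to match the on-wire 32-bit epoch + 64-bit version layout, not the real ceph_eversion definition):

```c
#include <string.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ceph_eversion on the wire:
 * a 32-bit epoch followed by a 64-bit version, packed (12 bytes). */
struct eversion_wire {
	uint32_t epoch;
	uint64_t version;
} __attribute__((packed));

/* Zero the field's slot in the message buffer and advance the
 * cursor, exactly as if the value were still being tracked. */
static void *encode_zero_eversion(void *p)
{
	memset(p, 0, sizeof(struct eversion_wire));
	return (char *)p + sizeof(struct eversion_wire);
}
```

Peers decoding the request still find a well-formed (all-zero) reassert_version at the expected offset.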

* Re: [PATCH v6 2/7] libceph: allow requests to return immediately on full conditions if caller wishes
  2017-03-30 18:07   ` [PATCH v6 2/7] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
@ 2017-04-04 14:55     ` Ilya Dryomov
  0 siblings, 0 replies; 18+ messages in thread
From: Ilya Dryomov @ 2017-04-04 14:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
> Usually, when the osd map is flagged as full or the pool is at quota,
> write requests just hang. This is not what we want for cephfs, where
> it would be better to simply report -ENOSPC back to userland instead
> of stalling.
>
> If the caller knows that it will want an immediate error return instead
> of blocking on a full or at-quota error condition then allow it to set a
> flag to request that behavior.
>
> Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
> and on any other write request from ceph.ko.
>
> A later patch will deal with requests that were submitted before the new
> map showing the full condition came in.
>
> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/ceph/addr.c                  | 1 +
>  fs/ceph/file.c                  | 1 +
>  include/linux/ceph/osd_client.h | 1 +
>  net/ceph/osd_client.c           | 7 +++++++
>  4 files changed, 10 insertions(+)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 1a3e1b40799a..7e3fae334620 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1892,6 +1892,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
>         err = ceph_osdc_start_request(&fsc->client->osdc, rd_req, false);
>
>         wr_req->r_mtime = ci->vfs_inode.i_mtime;
> +       wr_req->r_abort_on_full = true;
>         err2 = ceph_osdc_start_request(&fsc->client->osdc, wr_req, false);
>
>         if (!err)
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 356b7c76a2f1..cff35a1ff53c 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -712,6 +712,7 @@ static void ceph_aio_retry_work(struct work_struct *work)
>         req->r_callback = ceph_aio_complete_req;
>         req->r_inode = inode;
>         req->r_priv = aio_req;
> +       req->r_abort_on_full = true;
>
>         ret = ceph_osdc_start_request(req->r_osdc, req, false);
>  out:
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 3fc9e7754a9b..8cf644197b1a 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -187,6 +187,7 @@ struct ceph_osd_request {
>         struct timespec r_mtime;              /* ditto */
>         u64 r_data_offset;                    /* ditto */
>         bool r_linger;                        /* don't resend on failure */
> +       bool r_abort_on_full;                 /* return ENOSPC when full */
>
>         /* internal */
>         unsigned long r_stamp;                /* jiffies, send or check time */
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 27f14ae69eb7..781048990599 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -961,6 +961,7 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
>                                        truncate_size, truncate_seq);
>         }
>
> +       req->r_abort_on_full = true;
>         req->r_flags = flags;
>         req->r_base_oloc.pool = layout->pool_id;
>         req->r_base_oloc.pool_ns = ceph_try_get_string(layout->pool_ns);
> @@ -1626,6 +1627,7 @@ static void maybe_request_map(struct ceph_osd_client *osdc)
>                 ceph_monc_renew_subs(&osdc->client->monc);
>  }
>
> +static void complete_request(struct ceph_osd_request *req, int err);
>  static void send_map_check(struct ceph_osd_request *req);
>
>  static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
> @@ -1635,6 +1637,7 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>         enum calc_target_result ct_res;
>         bool need_send = false;
>         bool promoted = false;
> +       bool need_abort = false;
>
>         WARN_ON(req->r_tid);
>         dout("%s req %p wrlocked %d\n", __func__, req, wrlocked);
> @@ -1669,6 +1672,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>                 pr_warn_ratelimited("FULL or reached pool quota\n");
>                 req->r_t.paused = true;
>                 maybe_request_map(osdc);
> +               if (req->r_abort_on_full)
> +                       need_abort = true;
>         } else if (!osd_homeless(osd)) {
>                 need_send = true;
>         } else {
> @@ -1685,6 +1690,8 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>         link_request(osd, req);
>         if (need_send)
>                 send_request(req);
> +       else if (need_abort)
> +               complete_request(req, -ENOSPC);
>         mutex_unlock(&osd->lock);
>
>         if (ct_res == CALC_TARGET_POOL_DNE)

Reviewed-by: Ilya Dryomov <idryomov@gmail.com>

Thanks,

                Ilya
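The effect of r_abort_on_full in the __submit_request() hunk above condenses to a small decision: a write on a full cluster is normally paused to wait for a new map, but if the caller opted in, it is completed immediately with -ENOSPC instead. A sketch of that logic (function and enum names hypothetical, not kernel code):

```c
#include <stdbool.h>

enum submit_action {
	SUBMIT_SEND,		/* target healthy: send the request */
	SUBMIT_PAUSE,		/* full: park it and wait for a new map */
	SUBMIT_ABORT_ENOSPC,	/* full + r_abort_on_full: fail now */
};

/* Hypothetical condensation of the submit-time decision for a write
 * request when the osdmap FULL flag or a pool quota is hit. */
static enum submit_action classify_write(bool cluster_or_pool_full,
					 bool abort_on_full)
{
	if (!cluster_or_pool_full)
		return SUBMIT_SEND;
	return abort_on_full ? SUBMIT_ABORT_ENOSPC : SUBMIT_PAUSE;
}
```

Since ceph_osdc_new_request() sets r_abort_on_full unconditionally, cephfs-originated writes take the -ENOSPC branch while other users (e.g. rbd) keep the historical blocking behavior.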

* Re: [PATCH v6 3/7] libceph: abort already submitted but abortable requests when map or pool goes full
  2017-03-30 18:07   ` [PATCH v6 3/7] libceph: abort already submitted but abortable requests when map or pool goes full Jeff Layton
@ 2017-04-04 14:57     ` Ilya Dryomov
  0 siblings, 0 replies; 18+ messages in thread
From: Ilya Dryomov @ 2017-04-04 14:57 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
> When a Ceph volume hits capacity, a flag is set in the OSD map to
> indicate that, and a new map is sprayed around the cluster. With cephfs
> we want it to shut down any abortable requests that are in progress with
> an -ENOSPC error as they'd just hang otherwise.
>
> Add a new ceph_osdc_abort_on_full helper function to handle this. It
> will first check whether there is an out-of-space condition in the
> cluster and then walk the tree and abort any request that has
> r_abort_on_full set with a -ENOSPC error. Call this new function
> directly whenever we get a new OSD map.
>
> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  net/ceph/osd_client.c | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
>
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 781048990599..4e56cd1ec265 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1806,6 +1806,40 @@ static void abort_request(struct ceph_osd_request *req, int err)
>         complete_request(req, err);
>  }
>
> +/*
> + * Drop all pending requests that are stalled waiting on a full condition to
> + * clear, and complete them with ENOSPC as the return code.
> + */
> +static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> +{
> +       struct rb_node *n;
> +       bool osdmap_full = ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL);

This variable is still redundant ;)

> +
> +       dout("enter abort_on_full\n");
> +
> +       if (!osdmap_full && !have_pool_full(osdc))
> +               goto out;
> +
> +       for (n = rb_first(&osdc->osds); n; n = rb_next(n)) {
> +               struct ceph_osd *osd = rb_entry(n, struct ceph_osd, o_node);
> +               struct rb_node *m;
> +
> +               m = rb_first(&osd->o_requests);
> +               while (m) {
> +                       struct ceph_osd_request *req = rb_entry(m,
> +                                       struct ceph_osd_request, r_node);
> +                       m = rb_next(m);
> +
> +                       if (req->r_abort_on_full &&
> +                           (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
> +                            pool_full(osdc, req->r_t.target_oloc.pool)))
> +                               abort_request(req, -ENOSPC);
> +               }
> +       }
> +out:
> +       dout("return abort_on_full\n");
> +}
> +
>  static void check_pool_dne(struct ceph_osd_request *req)
>  {
>         struct ceph_osd_client *osdc = req->r_osdc;
> @@ -3264,6 +3298,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>
>         kick_requests(osdc, &need_resend, &need_resend_linger);
>
> +       ceph_osdc_abort_on_full(osdc);
>         ceph_monc_got_map(&osdc->client->monc, CEPH_SUB_OSDMAP,
>                           osdc->osdmap->epoch);
>         up_write(&osdc->lock);

With osdmap_full dropped,

Reviewed-by: Ilya Dryomov <idryomov@gmail.com>

Thanks,

                Ilya

* Re: [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-03-30 18:07   ` [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client Jeff Layton
@ 2017-04-04 15:00     ` Ilya Dryomov
  2017-04-04 16:34       ` Jeff Layton
  0 siblings, 1 reply; 18+ messages in thread
From: Ilya Dryomov @ 2017-04-04 15:00 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
> Cephfs can get cap update requests that contain a new epoch barrier in
> them. When that happens we want to pause all OSD traffic until the right
> map epoch arrives.
>
> Add an epoch_barrier field to ceph_osd_client that is protected by the
> osdc->lock rwsem. When the barrier is set, and the current OSD map
> epoch is below that, pause the request target when submitting the
> request or when revisiting it. Add a way for upper layers (cephfs)
> to update the epoch_barrier as well.
>
> If we get a new map, compare the new epoch against the barrier before
> kicking requests and request another map if the map epoch is still lower
> than the one we want.
>
> If we get a map with a full pool, or at quota condition, then set the
> barrier to the current epoch value.
>
> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  include/linux/ceph/osd_client.h |  2 ++
>  net/ceph/debugfs.c              |  3 ++-
>  net/ceph/osd_client.c           | 48 +++++++++++++++++++++++++++++++++--------
>  3 files changed, 43 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 8cf644197b1a..85650b415e73 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -267,6 +267,7 @@ struct ceph_osd_client {
>         struct rb_root         osds;          /* osds */
>         struct list_head       osd_lru;       /* idle osds */
>         spinlock_t             osd_lru_lock;
> +       u32                    epoch_barrier;
>         struct ceph_osd        homeless_osd;
>         atomic64_t             last_tid;      /* tid of last request */
>         u64                    last_linger_id;
> @@ -305,6 +306,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
>                                    struct ceph_msg *msg);
>  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
>                                  struct ceph_msg *msg);
> +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
>
>  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
>                             unsigned int which, u16 opcode, u32 flags);
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index d7e63a4f5578..71ba13927b3d 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -62,7 +62,8 @@ static int osdmap_show(struct seq_file *s, void *p)
>                 return 0;
>
>         down_read(&osdc->lock);
> -       seq_printf(s, "epoch %d flags 0x%x\n", map->epoch, map->flags);
> +       seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
> +                       osdc->epoch_barrier, map->flags);
>
>         for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
>                 struct ceph_pg_pool_info *pi =
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 4e56cd1ec265..3a94e8a1c7ff 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1298,8 +1298,9 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
>                        __pool_full(pi);
>
>         WARN_ON(pi->id != t->base_oloc.pool);
> -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
> -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
> +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
> +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
> +              (osdc->osdmap->epoch < osdc->epoch_barrier);
>  }
>
>  enum calc_target_result {
> @@ -1609,13 +1610,15 @@ static void send_request(struct ceph_osd_request *req)
>  static void maybe_request_map(struct ceph_osd_client *osdc)
>  {
>         bool continuous = false;
> +       u32 epoch = osdc->osdmap->epoch;
>
>         verify_osdc_locked(osdc);
> -       WARN_ON(!osdc->osdmap->epoch);
> +       WARN_ON_ONCE(epoch == 0);
>
>         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
> -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
> +           epoch < osdc->epoch_barrier) {
>                 dout("%s osdc %p continuous\n", __func__, osdc);
>                 continuous = true;
>         } else {

Looks like this hunk is smaller now, but I thought we agreed to drop it
entirely?  "epoch < osdc->epoch_barrier" isn't there in Objecter.

> @@ -1653,8 +1656,13 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>                 goto promote;
>         }
>
> -       if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> +       if (osdc->osdmap->epoch < osdc->epoch_barrier) {
> +               dout("req %p epoch %u barrier %u\n", req, osdc->osdmap->epoch,
> +                    osdc->epoch_barrier);
> +               req->r_t.paused = true;
> +               maybe_request_map(osdc);
> +       } else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> +                  ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
>                 dout("req %p pausewr\n", req);
>                 req->r_t.paused = true;
>                 maybe_request_map(osdc);
> @@ -1808,7 +1816,8 @@ static void abort_request(struct ceph_osd_request *req, int err)
>
>  /*
>   * Drop all pending requests that are stalled waiting on a full condition to
> - * clear, and complete them with ENOSPC as the return code.
> + * clear, and complete them with ENOSPC as the return code. Set the
> + * osdc->epoch_barrier to the latest replay version epoch that was aborted.

This comment needs an update -- replay version is gone...

>   */
>  static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>  {
> @@ -1836,8 +1845,10 @@ static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>                                 abort_request(req, -ENOSPC);
>                 }
>         }
> +       /* Update the epoch barrier to current epoch */
> +       osdc->epoch_barrier = osdc->osdmap->epoch;

How important is it to update the epoch barrier only if something was
aborted?  Being here doesn't mean that something was actually aborted.

>  out:
> -       dout("return abort_on_full\n");
> +       dout("return abort_on_full barrier=%u\n", osdc->epoch_barrier);
>  }
>
>  static void check_pool_dne(struct ceph_osd_request *req)
> @@ -3293,7 +3304,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>         pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
>                   ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>                   have_pool_full(osdc);
> -       if (was_pauserd || was_pausewr || pauserd || pausewr)
> +       if (was_pauserd || was_pausewr || pauserd || pausewr ||
> +           osdc->osdmap->epoch < osdc->epoch_barrier)
>                 maybe_request_map(osdc);
>
>         kick_requests(osdc, &need_resend, &need_resend_linger);
> @@ -3311,6 +3323,24 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>         up_write(&osdc->lock);
>  }
>
> +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
> +{
> +       down_read(&osdc->lock);
> +       if (unlikely(eb > osdc->epoch_barrier)) {
> +               up_read(&osdc->lock);
> +               down_write(&osdc->lock);
> +               if (osdc->epoch_barrier < eb) {

Nit: make it "eb > osdc->epoch_barrier" so it matches the unlikely
condition.

Thanks,

                Ilya

* Re: [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-04-04 15:00     ` Ilya Dryomov
@ 2017-04-04 16:34       ` Jeff Layton
  2017-04-04 19:47         ` Ilya Dryomov
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2017-04-04 16:34 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, 2017-04-04 at 17:00 +0200, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > Cephfs can get cap update requests that contain a new epoch barrier in
> > them. When that happens we want to pause all OSD traffic until the right
> > map epoch arrives.
> > 
> > Add an epoch_barrier field to ceph_osd_client that is protected by the
> > osdc->lock rwsem. When the barrier is set, and the current OSD map
> > epoch is below that, pause the request target when submitting the
> > request or when revisiting it. Add a way for upper layers (cephfs)
> > to update the epoch_barrier as well.
> > 
> > If we get a new map, compare the new epoch against the barrier before
> > kicking requests and request another map if the map epoch is still lower
> > than the one we want.
> > 
> > If we get a map with a full pool, or at quota condition, then set the
> > barrier to the current epoch value.
> > 
> > Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  include/linux/ceph/osd_client.h |  2 ++
> >  net/ceph/debugfs.c              |  3 ++-
> >  net/ceph/osd_client.c           | 48 +++++++++++++++++++++++++++++++++--------
> >  3 files changed, 43 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index 8cf644197b1a..85650b415e73 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -267,6 +267,7 @@ struct ceph_osd_client {
> >         struct rb_root         osds;          /* osds */
> >         struct list_head       osd_lru;       /* idle osds */
> >         spinlock_t             osd_lru_lock;
> > +       u32                    epoch_barrier;
> >         struct ceph_osd        homeless_osd;
> >         atomic64_t             last_tid;      /* tid of last request */
> >         u64                    last_linger_id;
> > @@ -305,6 +306,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
> >                                    struct ceph_msg *msg);
> >  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
> >                                  struct ceph_msg *msg);
> > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
> > 
> >  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
> >                             unsigned int which, u16 opcode, u32 flags);
> > diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> > index d7e63a4f5578..71ba13927b3d 100644
> > --- a/net/ceph/debugfs.c
> > +++ b/net/ceph/debugfs.c
> > @@ -62,7 +62,8 @@ static int osdmap_show(struct seq_file *s, void *p)
> >                 return 0;
> > 
> >         down_read(&osdc->lock);
> > -       seq_printf(s, "epoch %d flags 0x%x\n", map->epoch, map->flags);
> > +       seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
> > +                       osdc->epoch_barrier, map->flags);
> > 
> >         for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
> >                 struct ceph_pg_pool_info *pi =
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index 4e56cd1ec265..3a94e8a1c7ff 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -1298,8 +1298,9 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
> >                        __pool_full(pi);
> > 
> >         WARN_ON(pi->id != t->base_oloc.pool);
> > -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
> > -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
> > +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
> > +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
> > +              (osdc->osdmap->epoch < osdc->epoch_barrier);
> >  }
> > 
> >  enum calc_target_result {
> > @@ -1609,13 +1610,15 @@ static void send_request(struct ceph_osd_request *req)
> >  static void maybe_request_map(struct ceph_osd_client *osdc)
> >  {
> >         bool continuous = false;
> > +       u32 epoch = osdc->osdmap->epoch;
> > 
> >         verify_osdc_locked(osdc);
> > -       WARN_ON(!osdc->osdmap->epoch);
> > +       WARN_ON_ONCE(epoch == 0);
> > 
> >         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
> >             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
> > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> > +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
> > +           epoch < osdc->epoch_barrier) {
> >                 dout("%s osdc %p continuous\n", __func__, osdc);
> >                 continuous = true;
> >         } else {
> 
> Looks like this hunk is smaller now, but I thought we agreed to drop it
> entirely?  "epoch < osdc->epoch_barrier" isn't there in Objecter.
> 

I still think if the current map is behind the current barrier value,
then you really do want to request a map. I'm not sure that the other
flags will necessarily be set in that case, will they?

> > @@ -1653,8 +1656,13 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
> >                 goto promote;
> >         }
> > 
> > -       if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> > +       if (osdc->osdmap->epoch < osdc->epoch_barrier) {
> > +               dout("req %p epoch %u barrier %u\n", req, osdc->osdmap->epoch,
> > +                    osdc->epoch_barrier);
> > +               req->r_t.paused = true;
> > +               maybe_request_map(osdc);
> > +       } else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> > +                  ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> >                 dout("req %p pausewr\n", req);
> >                 req->r_t.paused = true;
> >                 maybe_request_map(osdc);
> > @@ -1808,7 +1816,8 @@ static void abort_request(struct ceph_osd_request *req, int err)
> > 
> >  /*
> >   * Drop all pending requests that are stalled waiting on a full condition to
> > - * clear, and complete them with ENOSPC as the return code.
> > + * clear, and complete them with ENOSPC as the return code. Set the
> > + * osdc->epoch_barrier to the latest replay version epoch that was aborted.
> 
> This comment needs an update -- replay version is gone...
> 
> >   */
> >  static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> >  {
> > @@ -1836,8 +1845,10 @@ static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
> >                                 abort_request(req, -ENOSPC);
> >                 }
> >         }
> > +       /* Update the epoch barrier to current epoch */
> > +       osdc->epoch_barrier = osdc->osdmap->epoch;
> 
> How important is it to update the epoch barrier only if something was
> aborted?  Being here doesn't mean that something was actually aborted.
> 

That, I'm not sure of. The epoch_barrier here is really all about cap
releases, AFAICT. i.e., we want to ensure that when we release caps
after receiving a new map that the MDS doesn't try to use them until it
sees the right map.

I could certainly be wrong here. John, do you have any thoughts on
this? You seem to have a better grasp of the epoch barrier handling
than I do.

> >  out:
> > -       dout("return abort_on_full\n");
> > +       dout("return abort_on_full barrier=%u\n", osdc->epoch_barrier);
> >  }
> > 
> >  static void check_pool_dne(struct ceph_osd_request *req)
> > @@ -3293,7 +3304,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
> >         pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
> >                   ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
> >                   have_pool_full(osdc);
> > -       if (was_pauserd || was_pausewr || pauserd || pausewr)
> > +       if (was_pauserd || was_pausewr || pauserd || pausewr ||
> > +           osdc->osdmap->epoch < osdc->epoch_barrier)
> >                 maybe_request_map(osdc);
> > 
> >         kick_requests(osdc, &need_resend, &need_resend_linger);
> > @@ -3311,6 +3323,24 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
> >         up_write(&osdc->lock);
> >  }
> > 
> > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
> > +{
> > +       down_read(&osdc->lock);
> > +       if (unlikely(eb > osdc->epoch_barrier)) {
> > +               up_read(&osdc->lock);
> > +               down_write(&osdc->lock);
> > +               if (osdc->epoch_barrier < eb) {
> 
> Nit: make it "eb > osdc->epoch_barrier" so it matches the unlikely
> condition.
> 

Will do, and I'll fix up the other nits that you pointed out in this
and earlier replies.
-- 
Jeff Layton <jlayton@redhat.com>

* Re: [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-04-04 16:34       ` Jeff Layton
@ 2017-04-04 19:47         ` Ilya Dryomov
  2017-04-04 21:12           ` Jeff Layton
  0 siblings, 1 reply; 18+ messages in thread
From: Ilya Dryomov @ 2017-04-04 19:47 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, Apr 4, 2017 at 6:34 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 2017-04-04 at 17:00 +0200, Ilya Dryomov wrote:
>> On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > Cephfs can get cap update requests that contain a new epoch barrier in
>> > them. When that happens we want to pause all OSD traffic until the right
>> > map epoch arrives.
>> >
>> > Add an epoch_barrier field to ceph_osd_client that is protected by the
>> > osdc->lock rwsem. When the barrier is set, and the current OSD map
>> > epoch is below that, pause the request target when submitting the
>> > request or when revisiting it. Add a way for upper layers (cephfs)
>> > to update the epoch_barrier as well.
>> >
>> > If we get a new map, compare the new epoch against the barrier before
>> > kicking requests and request another map if the map epoch is still lower
>> > than the one we want.
>> >
>> > If we get a map with a full pool, or at quota condition, then set the
>> > barrier to the current epoch value.
>> >
>> > Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
>> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > ---
>> >  include/linux/ceph/osd_client.h |  2 ++
>> >  net/ceph/debugfs.c              |  3 ++-
>> >  net/ceph/osd_client.c           | 48 +++++++++++++++++++++++++++++++++--------
>> >  3 files changed, 43 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
>> > index 8cf644197b1a..85650b415e73 100644
>> > --- a/include/linux/ceph/osd_client.h
>> > +++ b/include/linux/ceph/osd_client.h
>> > @@ -267,6 +267,7 @@ struct ceph_osd_client {
>> >         struct rb_root         osds;          /* osds */
>> >         struct list_head       osd_lru;       /* idle osds */
>> >         spinlock_t             osd_lru_lock;
>> > +       u32                    epoch_barrier;
>> >         struct ceph_osd        homeless_osd;
>> >         atomic64_t             last_tid;      /* tid of last request */
>> >         u64                    last_linger_id;
>> > @@ -305,6 +306,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
>> >                                    struct ceph_msg *msg);
>> >  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
>> >                                  struct ceph_msg *msg);
>> > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
>> >
>> >  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
>> >                             unsigned int which, u16 opcode, u32 flags);
>> > diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
>> > index d7e63a4f5578..71ba13927b3d 100644
>> > --- a/net/ceph/debugfs.c
>> > +++ b/net/ceph/debugfs.c
>> > @@ -62,7 +62,8 @@ static int osdmap_show(struct seq_file *s, void *p)
>> >                 return 0;
>> >
>> >         down_read(&osdc->lock);
>> > -       seq_printf(s, "epoch %d flags 0x%x\n", map->epoch, map->flags);
>> > +       seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
>> > +                       osdc->epoch_barrier, map->flags);
>> >
>> >         for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
>> >                 struct ceph_pg_pool_info *pi =
>> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > index 4e56cd1ec265..3a94e8a1c7ff 100644
>> > --- a/net/ceph/osd_client.c
>> > +++ b/net/ceph/osd_client.c
>> > @@ -1298,8 +1298,9 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
>> >                        __pool_full(pi);
>> >
>> >         WARN_ON(pi->id != t->base_oloc.pool);
>> > -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
>> > -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
>> > +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
>> > +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
>> > +              (osdc->osdmap->epoch < osdc->epoch_barrier);
>> >  }
>> >
>> >  enum calc_target_result {
>> > @@ -1609,13 +1610,15 @@ static void send_request(struct ceph_osd_request *req)
>> >  static void maybe_request_map(struct ceph_osd_client *osdc)
>> >  {
>> >         bool continuous = false;
>> > +       u32 epoch = osdc->osdmap->epoch;
>> >
>> >         verify_osdc_locked(osdc);
>> > -       WARN_ON(!osdc->osdmap->epoch);
>> > +       WARN_ON_ONCE(epoch == 0);
>> >
>> >         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>> >             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
>> > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
>> > +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
>> > +           epoch < osdc->epoch_barrier) {
>> >                 dout("%s osdc %p continuous\n", __func__, osdc);
>> >                 continuous = true;
>> >         } else {
>>
>> Looks like this hunk is smaller now, but I thought we agreed to drop it
>> entirely?  "epoch < osdc->epoch_barrier" isn't there in Objecter.
>>
>
> I still think if the current map is behind the current barrier value,
> then you really do want to request a map. I'm not sure that the other
> flags will necessarily be set in that case, will they?

We do that from ceph_osdc_handle_map(), on every new map.  That should
be good enough -- I'm not sure that continuous sub in the FULL, PAUSERD
and PAUSEWR cases buys us anything at all.

Thanks,

                Ilya


* Re: [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-04-04 19:47         ` Ilya Dryomov
@ 2017-04-04 21:12           ` Jeff Layton
  2017-04-05  9:22             ` Ilya Dryomov
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2017-04-04 21:12 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, 2017-04-04 at 21:47 +0200, Ilya Dryomov wrote:
> On Tue, Apr 4, 2017 at 6:34 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Tue, 2017-04-04 at 17:00 +0200, Ilya Dryomov wrote:
> > > On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > [... patch snipped ...]
> > > 
> > > Looks like this hunk is smaller now, but I thought we agreed to drop it
> > > entirely?  "epoch < osdc->epoch_barrier" isn't there in Objecter.
> > > 
> > 
> > I still think if the current map is behind the current barrier value,
> > then you really do want to request a map. I'm not sure that the other
> > flags will necessarily be set in that case, will they?
> 
> We do that from ceph_osdc_handle_map(), on every new map.  That should
> be good enough -- I'm not sure if that continuous sub in FULL, PAUSERD
> and PAUSEWR cases buys us anything at all.
> 

Ahh ok, I see what you're saying now. Fair enough, we probably don't
need a continuous sub to handle an epoch_barrier that we don't have the
map for yet.

That said...should maybe_request_map be calling ceph_monc_want_map with
this as the epoch argument?

     max(epoch+1, osdc->epoch_barrier)

It seems like if the barrier is more than one greater than the one we
currently have then we should request enough to get us to the barrier.

Thoughts? 
-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-04-04 21:12           ` Jeff Layton
@ 2017-04-05  9:22             ` Ilya Dryomov
  2017-04-05 13:29               ` Jeff Layton
  0 siblings, 1 reply; 18+ messages in thread
From: Ilya Dryomov @ 2017-04-05  9:22 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Tue, Apr 4, 2017 at 11:12 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 2017-04-04 at 21:47 +0200, Ilya Dryomov wrote:
>> On Tue, Apr 4, 2017 at 6:34 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > On Tue, 2017-04-04 at 17:00 +0200, Ilya Dryomov wrote:
>> > > On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > > > [... patch snipped ...]
>> > >
>> > > Looks like this hunk is smaller now, but I thought we agreed to drop it
>> > > entirely?  "epoch < osdc->epoch_barrier" isn't there in Objecter.
>> > >
>> >
>> > I still think if the current map is behind the current barrier value,
>> > then you really do want to request a map. I'm not sure that the other
>> > flags will necessarily be set in that case, will they?
>>
>> We do that from ceph_osdc_handle_map(), on every new map.  That should
>> be good enough -- I'm not sure if that continuous sub in FULL, PAUSERD
>> and PAUSEWR cases buys us anything at all.
>>
>
> Ahh ok, I see what you're saying now. Fair enough, we probably don't
> need a continuous sub to handle an epoch_barrier that we don't have the
> map for yet.
>
> That said...should maybe_request_map be calling ceph_monc_want_map with
>   this as the epoch argument?
>
>      max(epoch+1, osdc->epoch_barrier)
>
> It seems like if the barrier is more than one greater than the one we
> currently have then we should request enough to get us to the barrier.

No.  If the osdc->epoch_barrier is more than one greater, that would
request maps with epochs >= osdc->epoch_barrier, leaving the [epoch + 1,
osdc->epoch_barrier) gap.

We are checking osdc->epoch_barrier in ceph_osdc_handle_map() on every
incoming map and requesting more maps if needed, so eventually we will
get to the barrier.

Thanks,

                Ilya


* Re: [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-04-05  9:22             ` Ilya Dryomov
@ 2017-04-05 13:29               ` Jeff Layton
  2017-04-06  9:17                 ` Ilya Dryomov
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2017-04-05 13:29 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Wed, 2017-04-05 at 11:22 +0200, Ilya Dryomov wrote:
> On Tue, Apr 4, 2017 at 11:12 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Tue, 2017-04-04 at 21:47 +0200, Ilya Dryomov wrote:
> > > On Tue, Apr 4, 2017 at 6:34 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > On Tue, 2017-04-04 at 17:00 +0200, Ilya Dryomov wrote:
> > > > > On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > [... patch snipped ...]
> > > > > 
> > > > > Looks like this hunk is smaller now, but I thought we agreed to drop it
> > > > > entirely?  "epoch < osdc->epoch_barrier" isn't there in Objecter.
> > > > > 
> > > > 
> > > > I still think if the current map is behind the current barrier value,
> > > > then you really do want to request a map. I'm not sure that the other
> > > > flags will necessarily be set in that case, will they?
> > > 
> > > We do that from ceph_osdc_handle_map(), on every new map.  That should
> > > be good enough -- I'm not sure if that continuous sub in FULL, PAUSERD
> > > and PAUSEWR cases buys us anything at all.
> > > 
> > 
> > Ahh ok, I see what you're saying now. Fair enough, we probably don't
> > need a continuous sub to handle an epoch_barrier that we don't have the
> > map for yet.
> > 
> > That said...should maybe_request_map be calling ceph_monc_want_map with
> >   this as the epoch argument?
> > 
> >      max(epoch+1, osdc->epoch_barrier)
> > 
> > It seems like if the barrier is more than one greater than the one we
> > currently have then we should request enough to get us to the barrier.
> 
> No.  If the osdc->epoch_barrier is more than one greater, that would
> request maps with epochs >= osdc->epoch_barrier, leaving the [epoch + 1,
> osdc->epoch_barrier) gap.
> 
> We are checking osdc->epoch_barrier in ceph_osdc_handle_map() on every
> incoming map and requesting more maps if needed, so eventually we will
> get to the barrier.
> 

Ok, got it...does this patch look OK?

--------------------------8<----------------------------------

[PATCH] libceph: add an epoch_barrier field to struct ceph_osd_client

Cephfs can get cap update requests that contain a new epoch barrier in
them. When that happens we want to pause all OSD traffic until the right
map epoch arrives.

Add an epoch_barrier field to ceph_osd_client that is protected by the
osdc->lock rwsem. When the barrier is set, and the current OSD map
epoch is below that, pause the request target when submitting the
request or when revisiting it. Add a way for upper layers (cephfs)
to update the epoch_barrier as well.

If we get a new map, compare the new epoch against the barrier before
kicking requests and request another map if the map epoch is still lower
than the one we want.

If we get a map with a full pool, or at quota condition, then set the
barrier to the current epoch value.

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
---
 include/linux/ceph/osd_client.h |  2 ++
 net/ceph/debugfs.c              |  3 ++-
 net/ceph/osd_client.c           | 50 ++++++++++++++++++++++++++++++++++-------
 3 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 8cf644197b1a..85650b415e73 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -267,6 +267,7 @@ struct ceph_osd_client {
 	struct rb_root         osds;          /* osds */
 	struct list_head       osd_lru;       /* idle osds */
 	spinlock_t             osd_lru_lock;
+	u32		       epoch_barrier;
 	struct ceph_osd        homeless_osd;
 	atomic64_t             last_tid;      /* tid of last request */
 	u64                    last_linger_id;
@@ -305,6 +306,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
 				   struct ceph_msg *msg);
 extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
 				 struct ceph_msg *msg);
+void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
 
 extern void osd_req_op_init(struct ceph_osd_request *osd_req,
 			    unsigned int which, u16 opcode, u32 flags);
diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index d7e63a4f5578..71ba13927b3d 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -62,7 +62,8 @@ static int osdmap_show(struct seq_file *s, void *p)
 		return 0;
 
 	down_read(&osdc->lock);
-	seq_printf(s, "epoch %d flags 0x%x\n", map->epoch, map->flags);
+	seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
+			osdc->epoch_barrier, map->flags);
 
 	for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
 		struct ceph_pg_pool_info *pi =
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 55b7585ccefd..fb35adae7fbf 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1298,8 +1298,9 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
 		       __pool_full(pi);
 
 	WARN_ON(pi->id != t->base_oloc.pool);
-	return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
-	       (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
+	return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
+	       ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
+	       (osdc->osdmap->epoch < osdc->epoch_barrier);
 }
 
 enum calc_target_result {
@@ -1654,8 +1655,13 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
 		goto promote;
 	}
 
-	if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
-	    ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
+	if (osdc->osdmap->epoch < osdc->epoch_barrier) {
+		dout("req %p epoch %u barrier %u\n", req, osdc->osdmap->epoch,
+		     osdc->epoch_barrier);
+		req->r_t.paused = true;
+		maybe_request_map(osdc);
+	} else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
+		   ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
 		dout("req %p pausewr\n", req);
 		req->r_t.paused = true;
 		maybe_request_map(osdc);
@@ -1809,11 +1815,14 @@ static void abort_request(struct ceph_osd_request *req, int err)
 
 /*
  * Drop all pending requests that are stalled waiting on a full condition to
- * clear, and complete them with ENOSPC as the return code.
+ * clear, and complete them with ENOSPC as the return code. Set the
+ * osdc->epoch_barrier to the latest map epoch that we've seen if any were
+ * cancelled.
  */
 static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
 {
 	struct rb_node *n;
+	bool set_barrier = false;
 
 	dout("enter abort_on_full\n");
 
@@ -1832,12 +1841,18 @@ static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
 
 			if (req->r_abort_on_full &&
 			    (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
-			     pool_full(osdc, req->r_t.target_oloc.pool)))
+			     pool_full(osdc, req->r_t.target_oloc.pool))) {
 				abort_request(req, -ENOSPC);
+				set_barrier = true;
+			}
 		}
 	}
+
+	/* Update the epoch barrier to current epoch if a call was aborted */
+	if (set_barrier)
+		osdc->epoch_barrier = osdc->osdmap->epoch;
 out:
-	dout("return abort_on_full\n");
+	dout("return abort_on_full barrier=%u\n", osdc->epoch_barrier);
 }
 
 static void check_pool_dne(struct ceph_osd_request *req)
@@ -3293,7 +3308,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 	pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
 		  ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
 		  have_pool_full(osdc);
-	if (was_pauserd || was_pausewr || pauserd || pausewr)
+	if (was_pauserd || was_pausewr || pauserd || pausewr ||
+	    osdc->osdmap->epoch < osdc->epoch_barrier)
 		maybe_request_map(osdc);
 
 	kick_requests(osdc, &need_resend, &need_resend_linger);
@@ -3311,6 +3327,24 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 	up_write(&osdc->lock);
 }
 
+void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
+{
+	down_read(&osdc->lock);
+	if (unlikely(eb > osdc->epoch_barrier)) {
+		up_read(&osdc->lock);
+		down_write(&osdc->lock);
+		if (likely(eb > osdc->epoch_barrier)) {
+			dout("updating epoch_barrier from %u to %u\n",
+					osdc->epoch_barrier, eb);
+			osdc->epoch_barrier = eb;
+		}
+		up_write(&osdc->lock);
+	} else {
+		up_read(&osdc->lock);
+	}
+}
+EXPORT_SYMBOL(ceph_osdc_update_epoch_barrier);
+
 /*
  * Resubmit requests pending on the given osd.
  */
-- 
2.9.3




* Re: [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client
  2017-04-05 13:29               ` Jeff Layton
@ 2017-04-06  9:17                 ` Ilya Dryomov
  0 siblings, 0 replies; 18+ messages in thread
From: Ilya Dryomov @ 2017-04-06  9:17 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, Sage Weil, John Spray, Ceph Development

On Wed, Apr 5, 2017 at 3:29 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Wed, 2017-04-05 at 11:22 +0200, Ilya Dryomov wrote:
>> On Tue, Apr 4, 2017 at 11:12 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > On Tue, 2017-04-04 at 21:47 +0200, Ilya Dryomov wrote:
>> > > On Tue, Apr 4, 2017 at 6:34 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > > > On Tue, 2017-04-04 at 17:00 +0200, Ilya Dryomov wrote:
>> > > > > On Thu, Mar 30, 2017 at 8:07 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > > > > > Cephfs can get cap update requests that contain a new epoch barrier in
>> > > > > > them. When that happens we want to pause all OSD traffic until the right
>> > > > > > map epoch arrives.
>> > > > > >
>> > > > > > Add an epoch_barrier field to ceph_osd_client that is protected by the
>> > > > > > osdc->lock rwsem. When the barrier is set, and the current OSD map
>> > > > > > epoch is below that, pause the request target when submitting the
>> > > > > > request or when revisiting it. Add a way for upper layers (cephfs)
>> > > > > > to update the epoch_barrier as well.
>> > > > > >
>> > > > > > If we get a new map, compare the new epoch against the barrier before
>> > > > > > kicking requests and request another map if the map epoch is still lower
>> > > > > > than the one we want.
>> > > > > >
>> > > > > > If we get a map with a full pool, or at quota condition, then set the
>> > > > > > barrier to the current epoch value.
>> > > > > >
>> > > > > > Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
>> > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > > > ---
>> > > > > >  include/linux/ceph/osd_client.h |  2 ++
>> > > > > >  net/ceph/debugfs.c              |  3 ++-
>> > > > > >  net/ceph/osd_client.c           | 48 +++++++++++++++++++++++++++++++++--------
>> > > > > >  3 files changed, 43 insertions(+), 10 deletions(-)
>> > > > > >
>> > > > > > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
>> > > > > > index 8cf644197b1a..85650b415e73 100644
>> > > > > > --- a/include/linux/ceph/osd_client.h
>> > > > > > +++ b/include/linux/ceph/osd_client.h
>> > > > > > @@ -267,6 +267,7 @@ struct ceph_osd_client {
>> > > > > >         struct rb_root         osds;          /* osds */
>> > > > > >         struct list_head       osd_lru;       /* idle osds */
>> > > > > >         spinlock_t             osd_lru_lock;
>> > > > > > +       u32                    epoch_barrier;
>> > > > > >         struct ceph_osd        homeless_osd;
>> > > > > >         atomic64_t             last_tid;      /* tid of last request */
>> > > > > >         u64                    last_linger_id;
>> > > > > > @@ -305,6 +306,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
>> > > > > >                                    struct ceph_msg *msg);
>> > > > > >  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
>> > > > > >                                  struct ceph_msg *msg);
>> > > > > > +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
>> > > > > >
>> > > > > >  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
>> > > > > >                             unsigned int which, u16 opcode, u32 flags);
>> > > > > > diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
>> > > > > > index d7e63a4f5578..71ba13927b3d 100644
>> > > > > > --- a/net/ceph/debugfs.c
>> > > > > > +++ b/net/ceph/debugfs.c
>> > > > > > @@ -62,7 +62,8 @@ static int osdmap_show(struct seq_file *s, void *p)
>> > > > > >                 return 0;
>> > > > > >
>> > > > > >         down_read(&osdc->lock);
>> > > > > > -       seq_printf(s, "epoch %d flags 0x%x\n", map->epoch, map->flags);
>> > > > > > +       seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
>> > > > > > +                       osdc->epoch_barrier, map->flags);
>> > > > > >
>> > > > > >         for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
>> > > > > >                 struct ceph_pg_pool_info *pi =
>> > > > > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> > > > > > index 4e56cd1ec265..3a94e8a1c7ff 100644
>> > > > > > --- a/net/ceph/osd_client.c
>> > > > > > +++ b/net/ceph/osd_client.c
>> > > > > > @@ -1298,8 +1298,9 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
>> > > > > >                        __pool_full(pi);
>> > > > > >
>> > > > > >         WARN_ON(pi->id != t->base_oloc.pool);
>> > > > > > -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
>> > > > > > -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
>> > > > > > +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
>> > > > > > +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
>> > > > > > +              (osdc->osdmap->epoch < osdc->epoch_barrier);
>> > > > > >  }
>> > > > > >
>> > > > > >  enum calc_target_result {
>> > > > > > @@ -1609,13 +1610,15 @@ static void send_request(struct ceph_osd_request *req)
>> > > > > >  static void maybe_request_map(struct ceph_osd_client *osdc)
>> > > > > >  {
>> > > > > >         bool continuous = false;
>> > > > > > +       u32 epoch = osdc->osdmap->epoch;
>> > > > > >
>> > > > > >         verify_osdc_locked(osdc);
>> > > > > > -       WARN_ON(!osdc->osdmap->epoch);
>> > > > > > +       WARN_ON_ONCE(epoch == 0);
>> > > > > >
>> > > > > >         if (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>> > > > > >             ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSERD) ||
>> > > > > > -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
>> > > > > > +           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
>> > > > > > +           epoch < osdc->epoch_barrier) {
>> > > > > >                 dout("%s osdc %p continuous\n", __func__, osdc);
>> > > > > >                 continuous = true;
>> > > > > >         } else {
>> > > > >
>> > > > > Looks like this hunk is smaller now, but I thought we agreed to drop it
>> > > > > entirely?  "epoch < osdc->epoch_barrier" isn't there in Objecter.
>> > > > >
>> > > >
>> > > > I still think if the current map is behind the current barrier value,
>> > > > then you really do want to request a map. I'm not sure that the other
>> > > > flags will necessarily be set in that case, will they?
>> > >
>> > > We do that from ceph_osdc_handle_map(), on every new map.  That should
>> > > be good enough -- I'm not sure if that continuous sub in FULL, PAUSERD
>> > > and PAUSEWR cases buys us anything at all.
>> > >
>> >
>> > Ahh ok, I see what you're saying now. Fair enough, we probably don't
>> > need a continuous sub to handle an epoch_barrier that we don't have the
>> > map for yet.
>> >
>> > That said... should maybe_request_map be calling ceph_monc_want_map
>> > with this as the epoch argument?
>> >
>> >      max(epoch+1, osdc->epoch_barrier)
>> >
>> > It seems like if the barrier is more than one greater than the one we
>> > currently have then we should request enough to get us to the barrier.
>>
>> No.  If the osdc->epoch_barrier is more than one greater, that would
>> request maps with epochs >= osdc->epoch_barrier, leaving the [epoch + 1,
>> osdc->epoch_barrier) gap.
>>
>> We are checking osdc->epoch_barrier in ceph_osdc_handle_map() on every
>> incoming map and requesting more maps if needed, so eventually we will
>> get to the barrier.
>>
>
> Ok, got it...does this patch look OK?
>
> --------------------------8<----------------------------------
>
> [PATCH] libceph: add an epoch_barrier field to struct ceph_osd_client
>
> Cephfs can get cap update requests that contain a new epoch barrier in
> them. When that happens we want to pause all OSD traffic until the right
> map epoch arrives.
>
> Add an epoch_barrier field to ceph_osd_client that is protected by the
> osdc->lock rwsem. When the barrier is set, and the current OSD map
> epoch is below that, pause the request target when submitting the
> request or when revisiting it. Add a way for upper layers (cephfs)
> to update the epoch_barrier as well.
>
> If we get a new map, compare the new epoch against the barrier before
> kicking requests and request another map if the map epoch is still lower
> than the one we want.
>
> If we get a map with a full pool, or at quota condition, then set the
> barrier to the current epoch value.
>
> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
> ---
>  include/linux/ceph/osd_client.h |  2 ++
>  net/ceph/debugfs.c              |  3 ++-
>  net/ceph/osd_client.c           | 50 ++++++++++++++++++++++++++++++++++-------
>  3 files changed, 46 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 8cf644197b1a..85650b415e73 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -267,6 +267,7 @@ struct ceph_osd_client {
>         struct rb_root         osds;          /* osds */
>         struct list_head       osd_lru;       /* idle osds */
>         spinlock_t             osd_lru_lock;
> +       u32                    epoch_barrier;
>         struct ceph_osd        homeless_osd;
>         atomic64_t             last_tid;      /* tid of last request */
>         u64                    last_linger_id;
> @@ -305,6 +306,7 @@ extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
>                                    struct ceph_msg *msg);
>  extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
>                                  struct ceph_msg *msg);
> +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb);
>
>  extern void osd_req_op_init(struct ceph_osd_request *osd_req,
>                             unsigned int which, u16 opcode, u32 flags);
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index d7e63a4f5578..71ba13927b3d 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -62,7 +62,8 @@ static int osdmap_show(struct seq_file *s, void *p)
>                 return 0;
>
>         down_read(&osdc->lock);
> -       seq_printf(s, "epoch %d flags 0x%x\n", map->epoch, map->flags);
> +       seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
> +                       osdc->epoch_barrier, map->flags);
>
>         for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
>                 struct ceph_pg_pool_info *pi =
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 55b7585ccefd..fb35adae7fbf 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1298,8 +1298,9 @@ static bool target_should_be_paused(struct ceph_osd_client *osdc,
>                        __pool_full(pi);
>
>         WARN_ON(pi->id != t->base_oloc.pool);
> -       return (t->flags & CEPH_OSD_FLAG_READ && pauserd) ||
> -              (t->flags & CEPH_OSD_FLAG_WRITE && pausewr);
> +       return ((t->flags & CEPH_OSD_FLAG_READ) && pauserd) ||
> +              ((t->flags & CEPH_OSD_FLAG_WRITE) && pausewr) ||
> +              (osdc->osdmap->epoch < osdc->epoch_barrier);
>  }
>
>  enum calc_target_result {
> @@ -1654,8 +1655,13 @@ static void __submit_request(struct ceph_osd_request *req, bool wrlocked)
>                 goto promote;
>         }
>
> -       if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> -           ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
> +       if (osdc->osdmap->epoch < osdc->epoch_barrier) {
> +               dout("req %p epoch %u barrier %u\n", req, osdc->osdmap->epoch,
> +                    osdc->epoch_barrier);
> +               req->r_t.paused = true;
> +               maybe_request_map(osdc);
> +       } else if ((req->r_flags & CEPH_OSD_FLAG_WRITE) &&
> +                  ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR)) {
>                 dout("req %p pausewr\n", req);
>                 req->r_t.paused = true;
>                 maybe_request_map(osdc);
> @@ -1809,11 +1815,14 @@ static void abort_request(struct ceph_osd_request *req, int err)
>
>  /*
>   * Drop all pending requests that are stalled waiting on a full condition to
> - * clear, and complete them with ENOSPC as the return code.
> + * clear, and complete them with ENOSPC as the return code. Set the
> + * osdc->epoch_barrier to the latest map epoch that we've seen if any were
> + * cancelled.
>   */
>  static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>  {
>         struct rb_node *n;
> +       bool set_barrier = false;
>
>         dout("enter abort_on_full\n");
>
> @@ -1832,12 +1841,18 @@ static void ceph_osdc_abort_on_full(struct ceph_osd_client *osdc)
>
>                         if (req->r_abort_on_full &&
>                             (ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
> -                            pool_full(osdc, req->r_t.target_oloc.pool)))
> +                            pool_full(osdc, req->r_t.target_oloc.pool))) {
>                                 abort_request(req, -ENOSPC);
> +                               set_barrier = true;
> +                       }
>                 }
>         }
> +
> +       /* Update the epoch barrier to current epoch if a call was aborted */
> +       if (set_barrier)
> +               osdc->epoch_barrier = osdc->osdmap->epoch;
>  out:
> -       dout("return abort_on_full\n");
> +       dout("return abort_on_full barrier=%u\n", osdc->epoch_barrier);
>  }
>
>  static void check_pool_dne(struct ceph_osd_request *req)
> @@ -3293,7 +3308,8 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>         pausewr = ceph_osdmap_flag(osdc, CEPH_OSDMAP_PAUSEWR) ||
>                   ceph_osdmap_flag(osdc, CEPH_OSDMAP_FULL) ||
>                   have_pool_full(osdc);
> -       if (was_pauserd || was_pausewr || pauserd || pausewr)
> +       if (was_pauserd || was_pausewr || pauserd || pausewr ||
> +           osdc->osdmap->epoch < osdc->epoch_barrier)
>                 maybe_request_map(osdc);
>
>         kick_requests(osdc, &need_resend, &need_resend_linger);
> @@ -3311,6 +3327,24 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>         up_write(&osdc->lock);
>  }
>
> +void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb)
> +{
> +       down_read(&osdc->lock);
> +       if (unlikely(eb > osdc->epoch_barrier)) {
> +               up_read(&osdc->lock);
> +               down_write(&osdc->lock);
> +               if (likely(eb > osdc->epoch_barrier)) {
> +                       dout("updating epoch_barrier from %u to %u\n",
> +                                       osdc->epoch_barrier, eb);
> +                       osdc->epoch_barrier = eb;
> +               }
> +               up_write(&osdc->lock);
> +       } else {
> +               up_read(&osdc->lock);
> +       }
> +}
> +EXPORT_SYMBOL(ceph_osdc_update_epoch_barrier);
> +
>  /*
>   * Resubmit requests pending on the given osd.
>   */

LGTM

Thanks,

                Ilya



Thread overview: 18+ messages
2017-03-30 18:05 [PATCH v6 0/7] implement -ENOSPC handling in cephfs Jeff Layton
2017-03-30 18:07 ` [PATCH v6 1/7] libceph: remove req->r_replay_version Jeff Layton
2017-03-30 18:07   ` [PATCH v6 2/7] libceph: allow requests to return immediately on full conditions if caller wishes Jeff Layton
2017-04-04 14:55     ` Ilya Dryomov
2017-03-30 18:07   ` [PATCH v6 3/7] libceph: abort already submitted but abortable requests when map or pool goes full Jeff Layton
2017-04-04 14:57     ` Ilya Dryomov
2017-03-30 18:07   ` [PATCH v6 4/7] libceph: add an epoch_barrier field to struct ceph_osd_client Jeff Layton
2017-04-04 15:00     ` Ilya Dryomov
2017-04-04 16:34       ` Jeff Layton
2017-04-04 19:47         ` Ilya Dryomov
2017-04-04 21:12           ` Jeff Layton
2017-04-05  9:22             ` Ilya Dryomov
2017-04-05 13:29               ` Jeff Layton
2017-04-06  9:17                 ` Ilya Dryomov
2017-03-30 18:07   ` [PATCH v6 5/7] ceph: handle epoch barriers in cap messages Jeff Layton
2017-03-30 18:07   ` [PATCH v6 6/7] Revert "ceph: SetPageError() for writeback pages if writepages fails" Jeff Layton
2017-03-30 18:07   ` [PATCH v6 7/7] ceph: when seeing write errors on an inode, switch to sync writes Jeff Layton
2017-04-04 14:55   ` [PATCH v6 1/7] libceph: remove req->r_replay_version Ilya Dryomov
