* [PATCH 0/3] ceph: support for sparse read in msgr2 crc path
@ 2022-03-09 12:33 Jeff Layton
  2022-03-09 12:33 ` [PATCH 1/3] libceph: add sparse read support to msgr2 crc state machine Jeff Layton
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Jeff Layton @ 2022-03-09 12:33 UTC
  To: ceph-devel, idryomov

This patchset is a revised version of the one I sent a couple of weeks
ago. This adds support for sparse reads to libceph, and changes cephfs
over to use them instead of non-sparse reads. The sparse read codepath
is a drop-in replacement for regular reads, so the upper layers should
be able to use it interchangeably.

This is necessary for the (ongoing) fscrypt work. We need to know which
regions in a file are actually sparse so that we can avoid decrypting
them.

The next step is to add the same support to the msgr2 secure codepath.
Currently that code sets up a scatterlist with the final destination
data pages in it and passes that to the decrypt routine so that the
decrypted data is written directly to the destination.

My thinking here is to change that to decrypt the data in-place for
sparse reads, and then we'll just parse the decrypted buffer by
calling sparse_read and copying the data into the right places.
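
Concretely, the copy loop after an in-place decrypt could look
something like this. This is a minimal sketch only: copy_to_cursor()
is a hypothetical helper that copies into the cursor's pages and
advances it, and the sparse_read contract is the one from patch 1.

static int parse_decrypted_sparse_read(struct ceph_connection *con,
				       char *p, char *end)
{
	struct ceph_msg_data_cursor *cursor = &con->v2.in_cursor;
	char *buf = NULL;
	u64 len = 0;
	int ret;

	for (;;) {
		ret = con->ops->sparse_read(con, cursor, &len, &buf);
		if (ret <= 0)
			return ret;	/* 0 == done, <0 == error */
		if (p + len > end)
			return -EBADMSG;
		if (buf)
			/* extent count/map/datalen go to driver buffer */
			memcpy(buf, p, len);
		else
			/* extent data is copied into the msg cursor */
			copy_to_cursor(cursor, p, len);
		p += len;
		buf = NULL;
	}
}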

Ilya, does that sound sane? Is it OK to pass gcm_crypt two different
scatterlists with a region that overlaps?

Jeff Layton (3):
  libceph: add sparse read support to msgr2 crc state machine
  libceph: add sparse read support to OSD client
  ceph: convert to sparse reads

 fs/ceph/addr.c                  |   2 +-
 fs/ceph/file.c                  |   4 +-
 include/linux/ceph/messenger.h  |  31 +++++
 include/linux/ceph/osd_client.h |  38 ++++++
 net/ceph/messenger.c            |   1 +
 net/ceph/messenger_v2.c         | 215 ++++++++++++++++++++++++++++++--
 net/ceph/osd_client.c           | 163 ++++++++++++++++++++++--
 7 files changed, 435 insertions(+), 19 deletions(-)

-- 
2.35.1



* [PATCH 1/3] libceph: add sparse read support to msgr2 crc state machine
  2022-03-09 12:33 [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Jeff Layton
@ 2022-03-09 12:33 ` Jeff Layton
  2022-03-09 13:37   ` Jeff Layton
  2022-03-09 12:33 ` [PATCH 2/3] libceph: add sparse read support to OSD client Jeff Layton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2022-03-09 12:33 UTC
  To: ceph-devel, idryomov

Add support for a new sparse_read ceph_connection operation. The idea is
that the client driver can define this operation and use it to do special
handling for incoming reads.

The alloc_msg routine will look at the request and determine whether the
reply is expected to be sparse. If it is, then we'll dispatch to a
different set of state machine states that will repeatedly call the
driver's sparse_read op to get length and placement info for reading the
extent map, and the extents themselves.

This necessitates adding some new fields to some other structs:

- The msg gets a new bool to track whether it's a sparse_read request.

- A new field is added to the cursor to track the amount remaining in the
current extent. This is used to cap the read from the socket into the
msg_data.

- Handling a revoke with all of this is particularly difficult, so I've
added a new data_len_remain field to the v2 connection info, and then
use that to skip the remaining data on a revoke. We may want to expand
the use of that to the normal read path as well, just for consistency's
sake.
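
For reference, the contract between msgr2 and the driver works out to
roughly the following. This is a condensed sketch of the states added
below, with the RXBOUNCE and CRC bookkeeping omitted; the real code is
in prepare_sparse_read_header/cont().

static int sparse_read_step(struct ceph_connection *con)
{
	struct ceph_msg_data_cursor *cursor = &con->v2.in_cursor;
	struct bio_vec bv;
	char *buf = NULL;
	u64 len = 0;
	int ret;

	ret = con->ops->sparse_read(con, cursor, &len, &buf);
	if (ret <= 0)
		return ret;		/* done, or -errno */

	if (buf) {
		/* extent count/map/datalen: receive into driver buffer */
		add_in_kvec(con, buf, len);
	} else {
		/* extent data: receive into the cursor's current page */
		cursor->sr_resid = len;
		get_bvec_at(cursor, &bv);
		if (bv.bv_len > cursor->sr_resid)
			bv.bv_len = cursor->sr_resid;
		set_in_bvec(con, &bv);
	}
	con->v2.data_len_remain -= len;
	return 1;
}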

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/ceph/messenger.h |  31 +++++
 net/ceph/messenger.c           |   1 +
 net/ceph/messenger_v2.c        | 215 +++++++++++++++++++++++++++++++--
 3 files changed, 238 insertions(+), 9 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index e7f2fb2fc207..e9c86d6de2e6 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -17,6 +17,7 @@
 
 struct ceph_msg;
 struct ceph_connection;
+struct ceph_msg_data_cursor;
 
 /*
  * Ceph defines these callbacks for handling connection events.
@@ -70,6 +71,31 @@ struct ceph_connection_operations {
 				      int used_proto, int result,
 				      const int *allowed_protos, int proto_cnt,
 				      const int *allowed_modes, int mode_cnt);
+
+	/**
+	 * sparse_read: read sparse data
+	 * @con: connection we're reading from
+	 * @cursor: data cursor for reading extents
+	 * @len: len of the data that msgr should read
+	 * @buf: optional buffer to read into
+	 *
+	 * This should be called more than once, each time setting up to
+	 * receive an extent into the current cursor position, and zeroing
+	 * the holes between them.
+	 *
+	 * Returns 1 if there is more data to be read, 0 if reading is
+	 * complete, or -errno if there was an error.
+	 *
+	 * If @buf is set on a 1 return, then the data should be read into
+	 * the provided buffer. Otherwise, it should be read into the cursor.
+	 *
+	 * The sparse read operation is expected to initialize the cursor
+	 * with a length covering up to the end of the last extent.
+	 */
+	int (*sparse_read)(struct ceph_connection *con,
+			   struct ceph_msg_data_cursor *cursor,
+			   u64 *len, char **buf);
+
 };
 
 /* use format string %s%lld */
@@ -207,6 +233,7 @@ struct ceph_msg_data_cursor {
 
 	struct ceph_msg_data	*data;		/* current data item */
 	size_t			resid;		/* bytes not yet consumed */
+	int			sr_resid;	/* residual sparse_read len */
 	bool			last_piece;	/* current is last piece */
 	bool			need_crc;	/* crc update needed */
 	union {
@@ -252,6 +279,7 @@ struct ceph_msg {
 	struct kref kref;
 	bool more_to_follow;
 	bool needs_out_seq;
+	bool sparse_read;
 	int front_alloc_len;
 
 	struct ceph_msgpool *pool;
@@ -396,6 +424,7 @@ struct ceph_connection_v2_info {
 
 	void *conn_bufs[16];
 	int conn_buf_cnt;
+	int data_len_remain;
 
 	struct kvec in_sign_kvecs[8];
 	struct kvec out_sign_kvecs[8];
@@ -464,6 +493,8 @@ struct ceph_connection {
 	struct page *bounce_page;
 	u32 in_front_crc, in_middle_crc, in_data_crc;  /* calculated crc */
 
+	int sparse_resid;
+
 	struct timespec64 last_keepalive_ack; /* keepalive2 ack stamp */
 
 	struct delayed_work work;	    /* send|recv work */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index d3bb656308b4..bf4e7f5751ee 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1034,6 +1034,7 @@ void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor,
 
 	cursor->total_resid = length;
 	cursor->data = msg->data;
+	cursor->sr_resid = 0;
 
 	__ceph_msg_data_cursor_init(cursor);
 }
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index c6e5bfc717d5..845c2f093a02 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -52,14 +52,17 @@
 #define FRAME_LATE_STATUS_COMPLETE	0xe
 #define FRAME_LATE_STATUS_ABORTED_MASK	0xf
 
-#define IN_S_HANDLE_PREAMBLE		1
-#define IN_S_HANDLE_CONTROL		2
-#define IN_S_HANDLE_CONTROL_REMAINDER	3
-#define IN_S_PREPARE_READ_DATA		4
-#define IN_S_PREPARE_READ_DATA_CONT	5
-#define IN_S_PREPARE_READ_ENC_PAGE	6
-#define IN_S_HANDLE_EPILOGUE		7
-#define IN_S_FINISH_SKIP		8
+#define IN_S_HANDLE_PREAMBLE			1
+#define IN_S_HANDLE_CONTROL			2
+#define IN_S_HANDLE_CONTROL_REMAINDER		3
+#define IN_S_PREPARE_READ_DATA			4
+#define IN_S_PREPARE_READ_DATA_CONT		5
+#define IN_S_PREPARE_READ_ENC_PAGE		6
+#define IN_S_PREPARE_SPARSE_DATA		7
+#define IN_S_PREPARE_SPARSE_DATA_HDR		8
+#define IN_S_PREPARE_SPARSE_DATA_CONT		9
+#define IN_S_HANDLE_EPILOGUE			10
+#define IN_S_FINISH_SKIP			11
 
 #define OUT_S_QUEUE_DATA		1
 #define OUT_S_QUEUE_DATA_CONT		2
@@ -1819,6 +1822,166 @@ static void prepare_read_data_cont(struct ceph_connection *con)
 	con->v2.in_state = IN_S_HANDLE_EPILOGUE;
 }
 
+static int prepare_sparse_read_cont(struct ceph_connection *con)
+{
+	int ret;
+	struct bio_vec bv;
+	char *buf = NULL;
+	struct ceph_msg_data_cursor *cursor = &con->v2.in_cursor;
+	u64 len = 0;
+
+	if (!iov_iter_is_bvec(&con->v2.in_iter))
+		return -EIO;
+
+	if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {
+		con->in_data_crc = crc32c(con->in_data_crc,
+					  page_address(con->bounce_page),
+					  con->v2.in_bvec.bv_len);
+
+		get_bvec_at(cursor, &bv);
+		memcpy_to_page(bv.bv_page, bv.bv_offset,
+			       page_address(con->bounce_page),
+			       con->v2.in_bvec.bv_len);
+	} else {
+		con->in_data_crc = ceph_crc32c_page(con->in_data_crc,
+						    con->v2.in_bvec.bv_page,
+						    con->v2.in_bvec.bv_offset,
+						    con->v2.in_bvec.bv_len);
+	}
+
+	ceph_msg_data_advance(cursor, con->v2.in_bvec.bv_len);
+	cursor->sr_resid -= con->v2.in_bvec.bv_len;
+	dout("%s: advance by 0x%x sr_resid 0x%x\n", __func__,
+		con->v2.in_bvec.bv_len, cursor->sr_resid);
+	WARN_ON_ONCE(cursor->sr_resid > cursor->total_resid);
+	if (cursor->sr_resid) {
+		get_bvec_at(cursor, &bv);
+		if (bv.bv_len > cursor->sr_resid)
+			bv.bv_len = cursor->sr_resid;
+		if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {
+			bv.bv_page = con->bounce_page;
+			bv.bv_offset = 0;
+		}
+		set_in_bvec(con, &bv);
+		con->v2.data_len_remain -= bv.bv_len;
+		WARN_ON(con->v2.in_state != IN_S_PREPARE_SPARSE_DATA_CONT);
+		return 0;
+	}
+
+	/* get next extent */
+	ret = con->ops->sparse_read(con, cursor, &len, &buf);
+	if (ret <= 0) {
+		if (ret < 0)
+			return ret;
+
+		reset_in_kvecs(con);
+		add_in_kvec(con, con->v2.in_buf, CEPH_EPILOGUE_PLAIN_LEN);
+		con->v2.in_state = IN_S_HANDLE_EPILOGUE;
+		return 0;
+	}
+
+	cursor->sr_resid = len;
+	get_bvec_at(cursor, &bv);
+	if (bv.bv_len > cursor->sr_resid)
+		bv.bv_len = cursor->sr_resid;
+	if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {
+		if (unlikely(!con->bounce_page)) {
+			con->bounce_page = alloc_page(GFP_NOIO);
+			if (!con->bounce_page) {
+				pr_err("failed to allocate bounce page\n");
+				return -ENOMEM;
+			}
+		}
+
+		bv.bv_page = con->bounce_page;
+		bv.bv_offset = 0;
+	}
+	set_in_bvec(con, &bv);
+	con->v2.data_len_remain -= len;
+	return ret;
+}
+
+static int prepare_sparse_read_header(struct ceph_connection *con)
+{
+	int ret;
+	char *buf = NULL;
+	struct bio_vec bv;
+	struct ceph_msg_data_cursor *cursor = &con->v2.in_cursor;
+	u64 len = 0;
+
+	if (!iov_iter_is_kvec(&con->v2.in_iter))
+		return -EIO;
+
+	/* On first call, we have no kvec so don't compute crc */
+	if (con->v2.in_kvec_cnt) {
+		WARN_ON_ONCE(con->v2.in_kvec_cnt > 1);
+		con->in_data_crc = crc32c(con->in_data_crc,
+				  con->v2.in_kvecs[0].iov_base,
+				  con->v2.in_kvecs[0].iov_len);
+	}
+
+	ret = con->ops->sparse_read(con, cursor, &len, &buf);
+	if (ret < 0)
+		return ret;
+	if (ret == 0) {
+		reset_in_kvecs(con);
+		add_in_kvec(con, con->v2.in_buf, CEPH_EPILOGUE_PLAIN_LEN);
+		con->v2.in_state = IN_S_HANDLE_EPILOGUE;
+		return 0;
+	}
+
+	/* No actual data? */
+	if (WARN_ON_ONCE(!ret))
+		return -EIO;
+
+	if (!buf) {
+		cursor->sr_resid = len;
+		get_bvec_at(cursor, &bv);
+		if (bv.bv_len > cursor->sr_resid)
+			bv.bv_len = cursor->sr_resid;
+		if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {
+			if (unlikely(!con->bounce_page)) {
+				con->bounce_page = alloc_page(GFP_NOIO);
+				if (!con->bounce_page) {
+					pr_err("failed to allocate bounce page\n");
+					return -ENOMEM;
+				}
+			}
+
+			bv.bv_page = con->bounce_page;
+			bv.bv_offset = 0;
+		}
+		set_in_bvec(con, &bv);
+		con->v2.data_len_remain -= len;
+		con->v2.in_state = IN_S_PREPARE_SPARSE_DATA_CONT;
+		return ret;
+	}
+
+	WARN_ON_ONCE(con->v2.in_state != IN_S_PREPARE_SPARSE_DATA_HDR);
+	reset_in_kvecs(con);
+	add_in_kvec(con, buf, len);
+	con->v2.data_len_remain -= len;
+	return 0;
+}
+
+static int prepare_sparse_read_data(struct ceph_connection *con)
+{
+	struct ceph_msg *msg = con->in_msg;
+
+	dout("%s: starting sparse read\n", __func__);
+
+	if (WARN_ON_ONCE(!con->ops->sparse_read))
+		return -EOPNOTSUPP;
+
+	if (!con_secure(con))
+		con->in_data_crc = -1;
+
+	reset_in_kvecs(con);
+	con->v2.in_state = IN_S_PREPARE_SPARSE_DATA_HDR;
+	con->v2.data_len_remain = data_len(msg);
+	return prepare_sparse_read_header(con);
+}
+
 static int prepare_read_tail_plain(struct ceph_connection *con)
 {
 	struct ceph_msg *msg = con->in_msg;
@@ -1839,7 +2002,10 @@ static int prepare_read_tail_plain(struct ceph_connection *con)
 	}
 
 	if (data_len(msg)) {
-		con->v2.in_state = IN_S_PREPARE_READ_DATA;
+		if (msg->sparse_read)
+			con->v2.in_state = IN_S_PREPARE_SPARSE_DATA;
+		else
+			con->v2.in_state = IN_S_PREPARE_READ_DATA;
 	} else {
 		add_in_kvec(con, con->v2.in_buf, CEPH_EPILOGUE_PLAIN_LEN);
 		con->v2.in_state = IN_S_HANDLE_EPILOGUE;
@@ -2893,6 +3059,15 @@ static int populate_in_iter(struct ceph_connection *con)
 			prepare_read_enc_page(con);
 			ret = 0;
 			break;
+		case IN_S_PREPARE_SPARSE_DATA:
+			ret = prepare_sparse_read_data(con);
+			break;
+		case IN_S_PREPARE_SPARSE_DATA_HDR:
+			ret = prepare_sparse_read_header(con);
+			break;
+		case IN_S_PREPARE_SPARSE_DATA_CONT:
+			ret = prepare_sparse_read_cont(con);
+			break;
 		case IN_S_HANDLE_EPILOGUE:
 			ret = handle_epilogue(con);
 			break;
@@ -3485,6 +3660,23 @@ static void revoke_at_prepare_read_enc_page(struct ceph_connection *con)
 	con->v2.in_state = IN_S_FINISH_SKIP;
 }
 
+static void revoke_at_prepare_sparse_data(struct ceph_connection *con)
+{
+	int resid;  /* current piece of data */
+	int remaining;
+
+	WARN_ON(con_secure(con));
+	WARN_ON(!data_len(con->in_msg));
+	WARN_ON(!iov_iter_is_bvec(&con->v2.in_iter));
+	resid = iov_iter_count(&con->v2.in_iter);
+	dout("%s con %p resid %d\n", __func__, con, resid);
+
+	remaining = CEPH_EPILOGUE_PLAIN_LEN + con->v2.data_len_remain;
+	con->v2.in_iter.count -= resid;
+	set_in_skip(con, resid + remaining);
+	con->v2.in_state = IN_S_FINISH_SKIP;
+}
+
 static void revoke_at_handle_epilogue(struct ceph_connection *con)
 {
 	int resid;
@@ -3501,6 +3693,7 @@ static void revoke_at_handle_epilogue(struct ceph_connection *con)
 void ceph_con_v2_revoke_incoming(struct ceph_connection *con)
 {
 	switch (con->v2.in_state) {
+	case IN_S_PREPARE_SPARSE_DATA:
 	case IN_S_PREPARE_READ_DATA:
 		revoke_at_prepare_read_data(con);
 		break;
@@ -3510,6 +3703,10 @@ void ceph_con_v2_revoke_incoming(struct ceph_connection *con)
 	case IN_S_PREPARE_READ_ENC_PAGE:
 		revoke_at_prepare_read_enc_page(con);
 		break;
+	case IN_S_PREPARE_SPARSE_DATA_HDR:
+	case IN_S_PREPARE_SPARSE_DATA_CONT:
+		revoke_at_prepare_sparse_data(con);
+		break;
 	case IN_S_HANDLE_EPILOGUE:
 		revoke_at_handle_epilogue(con);
 		break;
-- 
2.35.1



* [PATCH 2/3] libceph: add sparse read support to OSD client
  2022-03-09 12:33 [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Jeff Layton
  2022-03-09 12:33 ` [PATCH 1/3] libceph: add sparse read support to msgr2 crc state machine Jeff Layton
@ 2022-03-09 12:33 ` Jeff Layton
  2022-03-11 17:08   ` Jeff Layton
  2022-03-09 12:33 ` [PATCH 3/3] ceph: convert to sparse reads Jeff Layton
  2022-03-15  6:23 ` [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Xiubo Li
  3 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2022-03-09 12:33 UTC
  To: ceph-devel, idryomov

Add a new sparse_read operation for the OSD client, driven by its own
state machine. The messenger can repeatedly call the sparse_read
operation, and it will pass back the necessary info to set up the read
of the next extent of data, while zeroing out the sparse regions.
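
For clarity, the reply payload that the new state machine walks has
the following shape. The struct is illustrative only (it mirrors the
parsing order in osd_sparse_read() below and is not part of the patch);
all fields are little-endian on the wire.

struct sparse_read_reply {		/* illustrative, not in the patch */
	__le32	count;				/* number of extents */
	struct ceph_sparse_extent extents[];	/* count (off, len) pairs */
	/* followed by:
	 *	__le32	data_len;	-- total bytes of extent data
	 *	u8	data[data_len];	-- extents concatenated, holes elided
	 */
};

For example, a sparse read of 0x0~0x4000 over an object with data only
at 0x1000~0x1000 comes back as count=1, one extent {0x1000, 0x1000},
data_len=0x1000 and then 0x1000 bytes of data; the client zero-fills
everything outside that extent itself.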

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/ceph/osd_client.h |  38 ++++++++
 net/ceph/osd_client.c           | 163 ++++++++++++++++++++++++++++++--
 2 files changed, 194 insertions(+), 7 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 3431011f364d..42eb1628a66d 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -29,6 +29,43 @@ typedef void (*ceph_osdc_callback_t)(struct ceph_osd_request *);
 
 #define CEPH_HOMELESS_OSD	-1
 
+enum ceph_sparse_read_state {
+	CEPH_SPARSE_READ_HDR	= 0,
+	CEPH_SPARSE_READ_EXTENTS,
+	CEPH_SPARSE_READ_DATA_LEN,
+	CEPH_SPARSE_READ_DATA,
+};
+
+/* A single extent in a SPARSE_READ reply */
+struct ceph_sparse_extent {
+	__le64	off;
+	__le64	len;
+} __attribute__((packed));
+
+/*
+ * A SPARSE_READ reply is a 32-bit count of extents, followed by an array of
+ * 64-bit offset/length pairs, and then all of the actual file data
+ * concatenated after it (sans holes).
+ *
+ * Unfortunately, we don't know how long the extent array is until we've
+ * started reading the data section of the reply, so for a real sparse read, we
+ * have to allocate the array after alloc_msg returns.
+ *
+ * For the common case of a single extent, we keep an embedded extent here so
+ * we can avoid the extra allocation.
+ */
+struct ceph_sparse_read {
+	enum ceph_sparse_read_state	sr_state;	/* state machine state */
+	u64				sr_req_off;	/* orig request offset */
+	u64				sr_req_len;	/* orig request length */
+	u64				sr_pos;		/* current pos in buffer */
+	int				sr_index;	/* current extent index */
+	__le32				sr_datalen;	/* length of actual data */
+	__le32				sr_count;	/* extent count */
+	struct ceph_sparse_extent	*sr_extent;	/* extent array */
+	struct ceph_sparse_extent	sr_emb_ext[1];	/* embedded extent */
+};
+
 /* a given osd we're communicating with */
 struct ceph_osd {
 	refcount_t o_ref;
@@ -46,6 +83,7 @@ struct ceph_osd {
 	unsigned long lru_ttl;
 	struct list_head o_keepalive_item;
 	struct mutex lock;
+	struct ceph_sparse_read	o_sparse_read;
 };
 
 #define CEPH_OSD_SLAB_OPS	2
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1c5815530e0d..f519b5727ee3 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -376,6 +376,7 @@ static void osd_req_op_data_release(struct ceph_osd_request *osd_req,
 
 	switch (op->op) {
 	case CEPH_OSD_OP_READ:
+	case CEPH_OSD_OP_SPARSE_READ:
 	case CEPH_OSD_OP_WRITE:
 	case CEPH_OSD_OP_WRITEFULL:
 		ceph_osd_data_release(&op->extent.osd_data);
@@ -706,6 +707,7 @@ static void get_num_data_items(struct ceph_osd_request *req,
 		/* reply */
 		case CEPH_OSD_OP_STAT:
 		case CEPH_OSD_OP_READ:
+		case CEPH_OSD_OP_SPARSE_READ:
 		case CEPH_OSD_OP_LIST_WATCHERS:
 			*num_reply_data_items += 1;
 			break;
@@ -775,7 +777,7 @@ void osd_req_op_extent_init(struct ceph_osd_request *osd_req,
 
 	BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
 	       opcode != CEPH_OSD_OP_WRITEFULL && opcode != CEPH_OSD_OP_ZERO &&
-	       opcode != CEPH_OSD_OP_TRUNCATE);
+	       opcode != CEPH_OSD_OP_TRUNCATE && opcode != CEPH_OSD_OP_SPARSE_READ);
 
 	op->extent.offset = offset;
 	op->extent.length = length;
@@ -984,6 +986,7 @@ static u32 osd_req_encode_op(struct ceph_osd_op *dst,
 	case CEPH_OSD_OP_STAT:
 		break;
 	case CEPH_OSD_OP_READ:
+	case CEPH_OSD_OP_SPARSE_READ:
 	case CEPH_OSD_OP_WRITE:
 	case CEPH_OSD_OP_WRITEFULL:
 	case CEPH_OSD_OP_ZERO:
@@ -1080,7 +1083,8 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
 
 	BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
 	       opcode != CEPH_OSD_OP_ZERO && opcode != CEPH_OSD_OP_TRUNCATE &&
-	       opcode != CEPH_OSD_OP_CREATE && opcode != CEPH_OSD_OP_DELETE);
+	       opcode != CEPH_OSD_OP_CREATE && opcode != CEPH_OSD_OP_DELETE &&
+	       opcode != CEPH_OSD_OP_SPARSE_READ);
 
 	req = ceph_osdc_alloc_request(osdc, snapc, num_ops, use_mempool,
 					GFP_NOFS);
@@ -2037,6 +2041,7 @@ static void setup_request_data(struct ceph_osd_request *req)
 					       &op->raw_data_in);
 			break;
 		case CEPH_OSD_OP_READ:
+		case CEPH_OSD_OP_SPARSE_READ:
 			ceph_osdc_msg_data_add(reply_msg,
 					       &op->extent.osd_data);
 			break;
@@ -2443,6 +2448,21 @@ static void submit_request(struct ceph_osd_request *req, bool wrlocked)
 	__submit_request(req, wrlocked);
 }
 
+static void ceph_init_sparse_read(struct ceph_sparse_read *sr, struct ceph_osd_req_op *op)
+{
+	if (sr->sr_extent != sr->sr_emb_ext)
+		kfree(sr->sr_extent);
+	sr->sr_state = CEPH_SPARSE_READ_HDR;
+	sr->sr_req_off = op ? op->extent.offset : 0;
+	sr->sr_req_len = op ? op->extent.length : 0;
+	sr->sr_pos = sr->sr_req_off;
+	sr->sr_index = 0;
+	sr->sr_count = 0;
+	sr->sr_extent = sr->sr_emb_ext;
+	sr->sr_extent[0].off = 0;
+	sr->sr_extent[0].len = 0;
+}
+
 static void finish_request(struct ceph_osd_request *req)
 {
 	struct ceph_osd_client *osdc = req->r_osdc;
@@ -2452,8 +2472,10 @@ static void finish_request(struct ceph_osd_request *req)
 
 	req->r_end_latency = ktime_get();
 
-	if (req->r_osd)
+	if (req->r_osd) {
+		ceph_init_sparse_read(&req->r_osd->o_sparse_read, NULL);
 		unlink_request(req->r_osd, req);
+	}
 	atomic_dec(&osdc->num_requests);
 
 	/*
@@ -3655,6 +3677,8 @@ static void handle_reply(struct ceph_osd *osd, struct ceph_msg *msg)
 	struct MOSDOpReply m;
 	u64 tid = le64_to_cpu(msg->hdr.tid);
 	u32 data_len = 0;
+	u32 result_len = 0;
+	bool sparse = false;
 	int ret;
 	int i;
 
@@ -3749,21 +3773,32 @@ static void handle_reply(struct ceph_osd *osd, struct ceph_msg *msg)
 		req->r_ops[i].rval = m.rval[i];
 		req->r_ops[i].outdata_len = m.outdata_len[i];
 		data_len += m.outdata_len[i];
+		if (req->r_ops[i].op == CEPH_OSD_OP_SPARSE_READ)
+			sparse = true;
 	}
+
+	result_len = data_len;
+	if (sparse) {
+		struct ceph_sparse_read *sr = &osd->o_sparse_read;
+
+		/* Fudge the result if this was a sparse read. */
+		result_len = sr->sr_pos - sr->sr_req_off;
+	}
+
 	if (data_len != le32_to_cpu(msg->hdr.data_len)) {
 		pr_err("sum of lens %u != %u for tid %llu\n", data_len,
 		       le32_to_cpu(msg->hdr.data_len), req->r_tid);
 		goto fail_request;
 	}
-	dout("%s req %p tid %llu result %d data_len %u\n", __func__,
-	     req, req->r_tid, m.result, data_len);
+	dout("%s req %p tid %llu result %d data_len %u result_len %u\n", __func__,
+	     req, req->r_tid, m.result, data_len, result_len);
 
 	/*
 	 * Since we only ever request ONDISK, we should only ever get
 	 * one (type of) reply back.
 	 */
 	WARN_ON(!(m.flags & CEPH_OSD_FLAG_ONDISK));
-	req->r_result = m.result ?: data_len;
+	req->r_result = m.result ?: result_len;
 	finish_request(req);
 	mutex_unlock(&osd->lock);
 	up_read(&osdc->lock);
@@ -5398,6 +5433,21 @@ static void osd_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
 	ceph_msg_put(msg);
 }
 
+static struct ceph_osd_req_op *
+sparse_read_op(struct ceph_osd_request *req)
+{
+	int i;
+
+	if (!(req->r_flags & CEPH_OSD_FLAG_READ))
+		return NULL;
+
+	for (i = 0; i < req->r_num_ops; ++i) {
+		if (req->r_ops[i].op == CEPH_OSD_OP_SPARSE_READ)
+			return &req->r_ops[i];
+	}
+	return NULL;
+}
+
 /*
  * Lookup and return message for incoming reply.  Don't try to do
  * anything about a larger than preallocated data portion of the
@@ -5414,6 +5464,7 @@ static struct ceph_msg *get_reply(struct ceph_connection *con,
 	int front_len = le32_to_cpu(hdr->front_len);
 	int data_len = le32_to_cpu(hdr->data_len);
 	u64 tid = le64_to_cpu(hdr->tid);
+	struct ceph_osd_req_op *srop;
 
 	down_read(&osdc->lock);
 	if (!osd_registered(osd)) {
@@ -5446,7 +5497,9 @@ static struct ceph_msg *get_reply(struct ceph_connection *con,
 		req->r_reply = m;
 	}
 
-	if (data_len > req->r_reply->data_length) {
+	srop = sparse_read_op(req);
+
+	if (!srop && (data_len > req->r_reply->data_length)) {
 		pr_warn("%s osd%d tid %llu data %d > preallocated %zu, skipping\n",
 			__func__, osd->o_osd, req->r_tid, data_len,
 			req->r_reply->data_length);
@@ -5456,6 +5509,10 @@ static struct ceph_msg *get_reply(struct ceph_connection *con,
 	}
 
 	m = ceph_msg_get(req->r_reply);
+	m->sparse_read = srop;
+	if (srop)
+		ceph_init_sparse_read(&osd->o_sparse_read, srop);
+
 	dout("get_reply tid %lld %p\n", tid, m);
 
 out_unlock_session:
@@ -5688,9 +5745,101 @@ static int osd_check_message_signature(struct ceph_msg *msg)
 	return ceph_auth_check_message_signature(auth, msg);
 }
 
+static void zero_len(struct ceph_msg_data_cursor *cursor, size_t len)
+{
+	while (len) {
+		struct page *page;
+		size_t poff, plen;
+		bool last = false;
+
+		page = ceph_msg_data_next(cursor, &poff, &plen, &last);
+		if (plen > len)
+			plen = len;
+		zero_user_segment(page, poff, poff + plen);
+		len -= plen;
+		ceph_msg_data_advance(cursor, plen);
+	}
+}
+
+static int osd_sparse_read(struct ceph_connection *con,
+			   struct ceph_msg_data_cursor *cursor,
+			   u64 *plen, char **pbuf)
+{
+	struct ceph_osd *o = con->private;
+	struct ceph_sparse_read *sr = &o->o_sparse_read;
+	u32 count = __le32_to_cpu(sr->sr_count);
+	u64 eoff, elen;
+
+	switch (sr->sr_state) {
+	case CEPH_SPARSE_READ_HDR:
+		dout("[%d] request to read 0x%llx~0x%llx\n", o->o_osd, sr->sr_req_off, sr->sr_req_len);
+		/* number of extents */
+		*plen = sizeof(sr->sr_count);
+		*pbuf = (char *)&sr->sr_count;
+		sr->sr_state = CEPH_SPARSE_READ_EXTENTS;
+		break;
+	case CEPH_SPARSE_READ_EXTENTS:
+		dout("[%d] got %u extents\n", o->o_osd, count);
+
+		if (count > 0) {
+			if (count > 1) {
+				/* can't use the embedded extent array */
+				sr->sr_extent = kmalloc_array(count, sizeof(*sr->sr_extent),
+							   GFP_NOIO);
+				if (!sr->sr_extent)
+					return -ENOMEM;
+			}
+			*plen = count * sizeof(*sr->sr_extent);
+			*pbuf = (char *)sr->sr_extent;
+			sr->sr_state = CEPH_SPARSE_READ_DATA_LEN;
+			break;
+		}
+		/* No extents? Fall through to reading data len */
+		fallthrough;
+	case CEPH_SPARSE_READ_DATA_LEN:
+		*plen = sizeof(sr->sr_datalen);
+		*pbuf = (char *)&sr->sr_datalen;
+		sr->sr_state = CEPH_SPARSE_READ_DATA;
+		break;
+	case CEPH_SPARSE_READ_DATA:
+		if (sr->sr_index >= count)
+			return 0;
+		if (sr->sr_index == 0) {
+			/* last extent */
+			eoff = le64_to_cpu(sr->sr_extent[count - 1].off);
+			elen = le64_to_cpu(sr->sr_extent[count - 1].len);
+
+			/* set up cursor to end of last extent */
+			ceph_msg_data_cursor_init(cursor, con->in_msg,
+						  eoff + elen - sr->sr_req_off);
+		}
+
+		eoff = le64_to_cpu(sr->sr_extent[sr->sr_index].off);
+		elen = le64_to_cpu(sr->sr_extent[sr->sr_index].len);
+
+		dout("[%d] ext %d off 0x%llx len 0x%llx\n", o->o_osd, sr->sr_index, eoff, elen);
+
+		/* zero out anything from sr_pos to start of extent */
+		if (sr->sr_pos < eoff)
+			zero_len(cursor, eoff - sr->sr_pos);
+
+		/* Set position to end of extent */
+		sr->sr_pos = eoff + elen;
+
+		/* send back the new length */
+		*plen = elen;
+
+		/* Bump the array index */
+		++sr->sr_index;
+		break;
+	}
+	return 1;
+}
+
 static const struct ceph_connection_operations osd_con_ops = {
 	.get = osd_get_con,
 	.put = osd_put_con,
+	.sparse_read = osd_sparse_read,
 	.alloc_msg = osd_alloc_msg,
 	.dispatch = osd_dispatch,
 	.fault = osd_fault,
-- 
2.35.1



* [PATCH 3/3] ceph: convert to sparse reads
  2022-03-09 12:33 [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Jeff Layton
  2022-03-09 12:33 ` [PATCH 1/3] libceph: add sparse read support to msgr2 crc state machine Jeff Layton
  2022-03-09 12:33 ` [PATCH 2/3] libceph: add sparse read support to OSD client Jeff Layton
@ 2022-03-09 12:33 ` Jeff Layton
  2022-03-14  2:22   ` Xiubo Li
  2022-03-15  6:23 ` [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Xiubo Li
  3 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2022-03-09 12:33 UTC
  To: ceph-devel, idryomov

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/addr.c | 2 +-
 fs/ceph/file.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 752c421c9922..f42440d7102b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -317,7 +317,7 @@ static void ceph_netfs_issue_op(struct netfs_read_subrequest *subreq)
 		return;
 
 	req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino, subreq->start, &len,
-			0, 1, CEPH_OSD_OP_READ,
+			0, 1, CEPH_OSD_OP_SPARSE_READ,
 			CEPH_OSD_FLAG_READ | fsc->client->osdc.client->options->read_from_replica,
 			NULL, ci->i_truncate_seq, ci->i_truncate_size, false);
 	if (IS_ERR(req)) {
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index feb75eb1cd82..d1956a20c627 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -934,7 +934,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 
 		req = ceph_osdc_new_request(osdc, &ci->i_layout,
 					ci->i_vino, off, &len, 0, 1,
-					CEPH_OSD_OP_READ, CEPH_OSD_FLAG_READ,
+					CEPH_OSD_OP_SPARSE_READ, CEPH_OSD_FLAG_READ,
 					NULL, ci->i_truncate_seq,
 					ci->i_truncate_size, false);
 		if (IS_ERR(req)) {
@@ -1291,7 +1291,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 					    vino, pos, &size, 0,
 					    1,
 					    write ? CEPH_OSD_OP_WRITE :
-						    CEPH_OSD_OP_READ,
+						    CEPH_OSD_OP_SPARSE_READ,
 					    flags, snapc,
 					    ci->i_truncate_seq,
 					    ci->i_truncate_size,
-- 
2.35.1



* Re: [PATCH 1/3] libceph: add sparse read support to msgr2 crc state machine
  2022-03-09 12:33 ` [PATCH 1/3] libceph: add sparse read support to msgr2 crc state machine Jeff Layton
@ 2022-03-09 13:37   ` Jeff Layton
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2022-03-09 13:37 UTC
  To: ceph-devel, idryomov

On Wed, 2022-03-09 at 07:33 -0500, Jeff Layton wrote:
> Add support for a new sparse_read ceph_connection operation. The idea is
> that the client driver can define this operation and use it to do special
> handling for incoming reads.
> 
> The alloc_msg routine will look at the request and determine whether the
> reply is expected to be sparse. If it is, then we'll dispatch to a
> different set of state machine states that will repeatedly call the
> driver's sparse_read op to get length and placement info for reading the
> extent map, and the extents themselves.
> 
> This necessitates adding some new fields to some other structs:
> 
> - The msg gets a new bool to track whether it's a sparse_read request.
> 
> - A new field is added to the cursor to track the amount remaining in the
> current extent. This is used to cap the read from the socket into the
> msg_data.
> 
> - Handling a revoke with all of this is particularly difficult, so I've
> added a new data_len_remain field to the v2 connection info, and then
> use that to skip the remaining data on a revoke. We may want to expand
> the use of that to the normal read path as well, just for consistency's sake.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  include/linux/ceph/messenger.h |  31 +++++
>  net/ceph/messenger.c           |   1 +
>  net/ceph/messenger_v2.c        | 215 +++++++++++++++++++++++++++++++--
>  3 files changed, 238 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index e7f2fb2fc207..e9c86d6de2e6 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -17,6 +17,7 @@
>  
>  struct ceph_msg;
>  struct ceph_connection;
> +struct ceph_msg_data_cursor;
>  
>  /*
>   * Ceph defines these callbacks for handling connection events.
> @@ -70,6 +71,31 @@ struct ceph_connection_operations {
>  				      int used_proto, int result,
>  				      const int *allowed_protos, int proto_cnt,
>  				      const int *allowed_modes, int mode_cnt);
> +
> +	/**
> +	 * sparse_read: read sparse data
> +	 * @con: connection we're reading from
> +	 * @cursor: data cursor for reading extents
> +	 * @len: len of the data that msgr should read
> +	 * @buf: optional buffer to read into
> +	 *
> +	 * This should be called more than once, each time setting up to
> +	 * receive an extent into the current cursor position, and zeroing
> +	 * the holes between them.
> +	 *
> +	 * Returns 1 if there is more data to be read, 0 if reading is
> +	 * complete, or -errno if there was an error.
> +	 *
> +	 * If @buf is set on a 1 return, then the data should be read into
> +	 * the provided buffer. Otherwise, it should be read into the cursor.
> +	 *
> +	 * The sparse read operation is expected to initialize the cursor
> +	 * with a length covering up to the end of the last extent.
> +	 */
> +	int (*sparse_read)(struct ceph_connection *con,
> +			   struct ceph_msg_data_cursor *cursor,
> +			   u64 *len, char **buf);
> +
>  };
>  
>  /* use format string %s%lld */
> @@ -207,6 +233,7 @@ struct ceph_msg_data_cursor {
>  
>  	struct ceph_msg_data	*data;		/* current data item */
>  	size_t			resid;		/* bytes not yet consumed */
> +	int			sr_resid;	/* residual sparse_read len */
>  	bool			last_piece;	/* current is last piece */
>  	bool			need_crc;	/* crc update needed */
>  	union {
> @@ -252,6 +279,7 @@ struct ceph_msg {
>  	struct kref kref;
>  	bool more_to_follow;
>  	bool needs_out_seq;
> +	bool sparse_read;
>  	int front_alloc_len;
>  
>  	struct ceph_msgpool *pool;
> @@ -396,6 +424,7 @@ struct ceph_connection_v2_info {
>  
>  	void *conn_bufs[16];
>  	int conn_buf_cnt;
> +	int data_len_remain;
>  
>  	struct kvec in_sign_kvecs[8];
>  	struct kvec out_sign_kvecs[8];
> @@ -464,6 +493,8 @@ struct ceph_connection {
>  	struct page *bounce_page;
>  	u32 in_front_crc, in_middle_crc, in_data_crc;  /* calculated crc */
>  
> +	int sparse_resid;
> +
>  	struct timespec64 last_keepalive_ack; /* keepalive2 ack stamp */
>  
>  	struct delayed_work work;	    /* send|recv work */
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index d3bb656308b4..bf4e7f5751ee 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1034,6 +1034,7 @@ void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor,
>  
>  	cursor->total_resid = length;
>  	cursor->data = msg->data;
> +	cursor->sr_resid = 0;
>  
>  	__ceph_msg_data_cursor_init(cursor);
>  }
> diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
> index c6e5bfc717d5..845c2f093a02 100644
> --- a/net/ceph/messenger_v2.c
> +++ b/net/ceph/messenger_v2.c
> @@ -52,14 +52,17 @@
>  #define FRAME_LATE_STATUS_COMPLETE	0xe
>  #define FRAME_LATE_STATUS_ABORTED_MASK	0xf
>  
> -#define IN_S_HANDLE_PREAMBLE		1
> -#define IN_S_HANDLE_CONTROL		2
> -#define IN_S_HANDLE_CONTROL_REMAINDER	3
> -#define IN_S_PREPARE_READ_DATA		4
> -#define IN_S_PREPARE_READ_DATA_CONT	5
> -#define IN_S_PREPARE_READ_ENC_PAGE	6
> -#define IN_S_HANDLE_EPILOGUE		7
> -#define IN_S_FINISH_SKIP		8
> +#define IN_S_HANDLE_PREAMBLE			1
> +#define IN_S_HANDLE_CONTROL			2
> +#define IN_S_HANDLE_CONTROL_REMAINDER		3
> +#define IN_S_PREPARE_READ_DATA			4
> +#define IN_S_PREPARE_READ_DATA_CONT		5
> +#define IN_S_PREPARE_READ_ENC_PAGE		6
> +#define IN_S_PREPARE_SPARSE_DATA		7
> +#define IN_S_PREPARE_SPARSE_DATA_HDR		8
> +#define IN_S_PREPARE_SPARSE_DATA_CONT		9
> +#define IN_S_HANDLE_EPILOGUE			10
> +#define IN_S_FINISH_SKIP			11
>  
>  #define OUT_S_QUEUE_DATA		1
>  #define OUT_S_QUEUE_DATA_CONT		2
> @@ -1819,6 +1822,166 @@ static void prepare_read_data_cont(struct ceph_connection *con)
>  	con->v2.in_state = IN_S_HANDLE_EPILOGUE;
>  }
>  
> +static int prepare_sparse_read_cont(struct ceph_connection *con)
> +{
> +	int ret;
> +	struct bio_vec bv;
> +	char *buf = NULL;
> +	struct ceph_msg_data_cursor *cursor = &con->v2.in_cursor;
> +	u64 len = 0;
> +
> +	if (!iov_iter_is_bvec(&con->v2.in_iter))
> +		return -EIO;
> +
> +	if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {
> +		con->in_data_crc = crc32c(con->in_data_crc,
> +					  page_address(con->bounce_page),
> +					  con->v2.in_bvec.bv_len);
> +
> +		get_bvec_at(cursor, &bv);
> +		memcpy_to_page(bv.bv_page, bv.bv_offset,
> +			       page_address(con->bounce_page),
> +			       con->v2.in_bvec.bv_len);
> +	} else {
> +		con->in_data_crc = ceph_crc32c_page(con->in_data_crc,
> +						    con->v2.in_bvec.bv_page,
> +						    con->v2.in_bvec.bv_offset,
> +						    con->v2.in_bvec.bv_len);
> +	}
> +
> +	ceph_msg_data_advance(cursor, con->v2.in_bvec.bv_len);
> +	cursor->sr_resid -= con->v2.in_bvec.bv_len;
> +	dout("%s: advance by 0x%x sr_resid 0x%x\n", __func__,
> +		con->v2.in_bvec.bv_len, cursor->sr_resid);
> +	WARN_ON_ONCE(cursor->sr_resid > cursor->total_resid);
> +	if (cursor->sr_resid) {
> +		get_bvec_at(cursor, &bv);
> +		if (bv.bv_len > cursor->sr_resid)
> +			bv.bv_len = cursor->sr_resid;
> +		if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {
> +			bv.bv_page = con->bounce_page;
> +			bv.bv_offset = 0;
> +		}
> +		set_in_bvec(con, &bv);
> +		con->v2.data_len_remain -= bv.bv_len;
> +		WARN_ON(con->v2.in_state != IN_S_PREPARE_SPARSE_DATA_CONT);
> +		return 0;
> +	}
> +
> +	/* get next extent */
> +	ret = con->ops->sparse_read(con, cursor, &len, &buf);
> +	if (ret <= 0) {
> +		if (ret < 0)
> +			return ret;
> +
> +		reset_in_kvecs(con);
> +		add_in_kvec(con, con->v2.in_buf, CEPH_EPILOGUE_PLAIN_LEN);
> +		con->v2.in_state = IN_S_HANDLE_EPILOGUE;
> +		return 0;
> +	}
> +
> +	cursor->sr_resid = len;
> +	get_bvec_at(cursor, &bv);
> +	if (bv.bv_len > cursor->sr_resid)
> +		bv.bv_len = cursor->sr_resid;
> +	if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {
> +		if (unlikely(!con->bounce_page)) {
> +			con->bounce_page = alloc_page(GFP_NOIO);
> +			if (!con->bounce_page) {
> +				pr_err("failed to allocate bounce page\n");
> +				return -ENOMEM;
> +			}
> +		}
> +
> +		bv.bv_page = con->bounce_page;
> +		bv.bv_offset = 0;
> +	}
> +	set_in_bvec(con, &bv);
> +	con->v2.data_len_remain -= len;
> +	return ret;
> +}
> +
> +static int prepare_sparse_read_header(struct ceph_connection *con)
> +{
> +	int ret;
> +	char *buf = NULL;
> +	struct bio_vec bv;
> +	struct ceph_msg_data_cursor *cursor = &con->v2.in_cursor;
> +	u64 len = 0;
> +
> +	if (!iov_iter_is_kvec(&con->v2.in_iter))
> +		return -EIO;
> +
> +	/* On first call, we have no kvec so don't compute crc */
> +	if (con->v2.in_kvec_cnt) {
> +		WARN_ON_ONCE(con->v2.in_kvec_cnt > 1);
> +		con->in_data_crc = crc32c(con->in_data_crc,
> +				  con->v2.in_kvecs[0].iov_base,
> +				  con->v2.in_kvecs[0].iov_len);
> +	}
> +
> +	ret = con->ops->sparse_read(con, cursor, &len, &buf);
> +	if (ret < 0)
> +		return ret;
> +	if (ret == 0) {
> +		reset_in_kvecs(con);
> +		add_in_kvec(con, con->v2.in_buf, CEPH_EPILOGUE_PLAIN_LEN);
> +		con->v2.in_state = IN_S_HANDLE_EPILOGUE;
> +		return 0;
> +	}
> +
> +	/* No actual data? */
> +	if (WARN_ON_ONCE(!ret))
> +		return -EIO;
> +
> +	if (!buf) {
> +		cursor->sr_resid = len;
> +		get_bvec_at(cursor, &bv);
> +		if (bv.bv_len > cursor->sr_resid)
> +			bv.bv_len = cursor->sr_resid;
> +		if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {
> +			if (unlikely(!con->bounce_page)) {
> +				con->bounce_page = alloc_page(GFP_NOIO);
> +				if (!con->bounce_page) {
> +					pr_err("failed to allocate bounce page\n");
> +					return -ENOMEM;
> +				}
> +			}
> +
> +			bv.bv_page = con->bounce_page;
> +			bv.bv_offset = 0;
> +		}
> +		set_in_bvec(con, &bv);
> +		con->v2.data_len_remain -= len;
> +		con->v2.in_state = IN_S_PREPARE_SPARSE_DATA_CONT;
> +		return ret;
> +	}
> +
> +	WARN_ON_ONCE(con->v2.in_state != IN_S_PREPARE_SPARSE_DATA_HDR);
> +	reset_in_kvecs(con);
> +	add_in_kvec(con, buf, len);
> +	con->v2.data_len_remain -= len;
> +	return 0;
> +}
> +
> +static int prepare_sparse_read_data(struct ceph_connection *con)
> +{
> +	struct ceph_msg *msg = con->in_msg;
> +
> +	dout("%s: starting sparse read\n", __func__);
> +
> +	if (WARN_ON_ONCE(!con->ops->sparse_read))
> +		return -EOPNOTSUPP;
> +
> +	if (!con_secure(con))
> +		con->in_data_crc = -1;
> +
> +	reset_in_kvecs(con);
> +	con->v2.in_state = IN_S_PREPARE_SPARSE_DATA_HDR;
> +	con->v2.data_len_remain = data_len(msg);
> +	return prepare_sparse_read_header(con);
> +}
> +
>  static int prepare_read_tail_plain(struct ceph_connection *con)
>  {
>  	struct ceph_msg *msg = con->in_msg;
> @@ -1839,7 +2002,10 @@ static int prepare_read_tail_plain(struct ceph_connection *con)
>  	}
>  
>  	if (data_len(msg)) {
> -		con->v2.in_state = IN_S_PREPARE_READ_DATA;
> +		if (msg->sparse_read)
> +			con->v2.in_state = IN_S_PREPARE_SPARSE_DATA;
> +		else
> +			con->v2.in_state = IN_S_PREPARE_READ_DATA;
>  	} else {
>  		add_in_kvec(con, con->v2.in_buf, CEPH_EPILOGUE_PLAIN_LEN);
>  		con->v2.in_state = IN_S_HANDLE_EPILOGUE;
> @@ -2893,6 +3059,15 @@ static int populate_in_iter(struct ceph_connection *con)
>  			prepare_read_enc_page(con);
>  			ret = 0;
>  			break;
> +		case IN_S_PREPARE_SPARSE_DATA:
> +			ret = prepare_sparse_read_data(con);
> +			break;
> +		case IN_S_PREPARE_SPARSE_DATA_HDR:
> +			ret = prepare_sparse_read_header(con);
> +			break;
> +		case IN_S_PREPARE_SPARSE_DATA_CONT:
> +			ret = prepare_sparse_read_cont(con);
> +			break;
>  		case IN_S_HANDLE_EPILOGUE:
>  			ret = handle_epilogue(con);
>  			break;
> @@ -3485,6 +3660,23 @@ static void revoke_at_prepare_read_enc_page(struct ceph_connection *con)
>  	con->v2.in_state = IN_S_FINISH_SKIP;
>  }
>  
> +static void revoke_at_prepare_sparse_data(struct ceph_connection *con)
> +{
> +	int resid;  /* current piece of data */
> +	int remaining;
> +
> +	WARN_ON(con_secure(con));
> +	WARN_ON(!data_len(con->in_msg));
> +	WARN_ON(!iov_iter_is_bvec(&con->v2.in_iter));
> +	resid = iov_iter_count(&con->v2.in_iter);
> +	dout("%s con %p resid %d\n", __func__, con, resid);
> +
> +	remaining = CEPH_EPILOGUE_PLAIN_LEN + con->v2.data_len_remain;
> +	con->v2.in_iter.count -= resid;
> +	set_in_skip(con, resid + remaining);
> +	con->v2.in_state = IN_S_FINISH_SKIP;
> +}
> +
>  static void revoke_at_handle_epilogue(struct ceph_connection *con)
>  {
>  	int resid;
> @@ -3501,6 +3693,7 @@ static void revoke_at_handle_epilogue(struct ceph_connection *con)
>  void ceph_con_v2_revoke_incoming(struct ceph_connection *con)
>  {
>  	switch (con->v2.in_state) {
> +	case IN_S_PREPARE_SPARSE_DATA:

Oops, the above line should have been removed from this patch. I'll fix
that in v2.

>  	case IN_S_PREPARE_READ_DATA:
>  		revoke_at_prepare_read_data(con);
>  		break;
> @@ -3510,6 +3703,10 @@ void ceph_con_v2_revoke_incoming(struct ceph_connection *con)
>  	case IN_S_PREPARE_READ_ENC_PAGE:
>  		revoke_at_prepare_read_enc_page(con);
>  		break;
> +	case IN_S_PREPARE_SPARSE_DATA_HDR:
> +	case IN_S_PREPARE_SPARSE_DATA_CONT:
> +		revoke_at_prepare_sparse_data(con);
> +		break;
>  	case IN_S_HANDLE_EPILOGUE:
>  		revoke_at_handle_epilogue(con);
>  		break;

-- 
Jeff Layton <jlayton@samba.org>


* Re: [PATCH 2/3] libceph: add sparse read support to OSD client
  2022-03-09 12:33 ` [PATCH 2/3] libceph: add sparse read support to OSD client Jeff Layton
@ 2022-03-11 17:08   ` Jeff Layton
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2022-03-11 17:08 UTC
  To: ceph-devel, idryomov

On Wed, 2022-03-09 at 07:33 -0500, Jeff Layton wrote:
> Add a new sparse_read operation for the OSD client, driven by its own
> state machine. The messenger can repeatedly call the sparse_read
> operation, and it will pass back the necessary info to set up the read
> of the next extent of data, while zeroing out the sparse regions.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  include/linux/ceph/osd_client.h |  38 ++++++++
>  net/ceph/osd_client.c           | 163 ++++++++++++++++++++++++++++++--
>  2 files changed, 194 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 3431011f364d..42eb1628a66d 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -29,6 +29,43 @@ typedef void (*ceph_osdc_callback_t)(struct ceph_osd_request *);
>  
>  #define CEPH_HOMELESS_OSD	-1
>  
> +enum ceph_sparse_read_state {
> +	CEPH_SPARSE_READ_HDR	= 0,
> +	CEPH_SPARSE_READ_EXTENTS,
> +	CEPH_SPARSE_READ_DATA_LEN,
> +	CEPH_SPARSE_READ_DATA,
> +};
> +
> +/* A single extent in a SPARSE_READ reply */
> +struct ceph_sparse_extent {
> +	__le64	off;
> +	__le64	len;
> +} __attribute__((packed));
> +
> +/*
> + * A SPARSE_READ reply is a 32-bit count of extents, followed by an array of
> + * 64-bit offset/length pairs, and then all of the actual file data
> + * concatenated after it (sans holes).
> + *
> + * Unfortunately, we don't know how long the extent array is until we've
> + * started reading the data section of the reply, so for a real sparse read, we
> + * have to allocate the array after alloc_msg returns.
> + *
> + * For the common case of a single extent, we keep an embedded extent here so
> + * we can avoid the extra allocation.
> + */
> +struct ceph_sparse_read {
> +	enum ceph_sparse_read_state	sr_state;	/* state machine state */
> +	u64				sr_req_off;	/* orig request offset */
> +	u64				sr_req_len;	/* orig request length */
> +	u64				sr_pos;		/* current pos in buffer */
> +	int				sr_index;	/* current extent index */
> +	__le32				sr_datalen;	/* length of actual data */
> +	__le32				sr_count;	/* extent count */
> +	struct ceph_sparse_extent	*sr_extent;	/* extent array */
> +	struct ceph_sparse_extent	sr_emb_ext[1];	/* embedded extent */
> +};
> +
>  /* a given osd we're communicating with */
>  struct ceph_osd {
>  	refcount_t o_ref;
> @@ -46,6 +83,7 @@ struct ceph_osd {
>  	unsigned long lru_ttl;
>  	struct list_head o_keepalive_item;
>  	struct mutex lock;
> +	struct ceph_sparse_read	o_sparse_read;
>  };
>  
>  #define CEPH_OSD_SLAB_OPS	2
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 1c5815530e0d..f519b5727ee3 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -376,6 +376,7 @@ static void osd_req_op_data_release(struct ceph_osd_request *osd_req,
>  
>  	switch (op->op) {
>  	case CEPH_OSD_OP_READ:
> +	case CEPH_OSD_OP_SPARSE_READ:
>  	case CEPH_OSD_OP_WRITE:
>  	case CEPH_OSD_OP_WRITEFULL:
>  		ceph_osd_data_release(&op->extent.osd_data);
> @@ -706,6 +707,7 @@ static void get_num_data_items(struct ceph_osd_request *req,
>  		/* reply */
>  		case CEPH_OSD_OP_STAT:
>  		case CEPH_OSD_OP_READ:
> +		case CEPH_OSD_OP_SPARSE_READ:
>  		case CEPH_OSD_OP_LIST_WATCHERS:
>  			*num_reply_data_items += 1;
>  			break;
> @@ -775,7 +777,7 @@ void osd_req_op_extent_init(struct ceph_osd_request *osd_req,
>  
>  	BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
>  	       opcode != CEPH_OSD_OP_WRITEFULL && opcode != CEPH_OSD_OP_ZERO &&
> -	       opcode != CEPH_OSD_OP_TRUNCATE);
> +	       opcode != CEPH_OSD_OP_TRUNCATE && opcode != CEPH_OSD_OP_SPARSE_READ);
>  
>  	op->extent.offset = offset;
>  	op->extent.length = length;
> @@ -984,6 +986,7 @@ static u32 osd_req_encode_op(struct ceph_osd_op *dst,
>  	case CEPH_OSD_OP_STAT:
>  		break;
>  	case CEPH_OSD_OP_READ:
> +	case CEPH_OSD_OP_SPARSE_READ:
>  	case CEPH_OSD_OP_WRITE:
>  	case CEPH_OSD_OP_WRITEFULL:
>  	case CEPH_OSD_OP_ZERO:
> @@ -1080,7 +1083,8 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
>  
>  	BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
>  	       opcode != CEPH_OSD_OP_ZERO && opcode != CEPH_OSD_OP_TRUNCATE &&
> -	       opcode != CEPH_OSD_OP_CREATE && opcode != CEPH_OSD_OP_DELETE);
> +	       opcode != CEPH_OSD_OP_CREATE && opcode != CEPH_OSD_OP_DELETE &&
> +	       opcode != CEPH_OSD_OP_SPARSE_READ);
>  
>  	req = ceph_osdc_alloc_request(osdc, snapc, num_ops, use_mempool,
>  					GFP_NOFS);
> @@ -2037,6 +2041,7 @@ static void setup_request_data(struct ceph_osd_request *req)
>  					       &op->raw_data_in);
>  			break;
>  		case CEPH_OSD_OP_READ:
> +		case CEPH_OSD_OP_SPARSE_READ:
>  			ceph_osdc_msg_data_add(reply_msg,
>  					       &op->extent.osd_data);
>  			break;
> @@ -2443,6 +2448,21 @@ static void submit_request(struct ceph_osd_request *req, bool wrlocked)
>  	__submit_request(req, wrlocked);
>  }
>  
> +static void ceph_init_sparse_read(struct ceph_sparse_read *sr, struct ceph_osd_req_op *op)
> +{
> +	if (sr->sr_extent != sr->sr_emb_ext)
> +		kfree(sr->sr_extent);
> +	sr->sr_state = CEPH_SPARSE_READ_HDR;
> +	sr->sr_req_off = op ? op->extent.offset : 0;
> +	sr->sr_req_len = op ? op->extent.length : 0;
> +	sr->sr_pos = sr->sr_req_off;
> +	sr->sr_index = 0;
> +	sr->sr_count = 0;
> +	sr->sr_extent = sr->sr_emb_ext;
> +	sr->sr_extent[0].off = 0;
> +	sr->sr_extent[0].len = 0;
> +}
> +

I think that this patch also needs to make osd_cleanup() call
ceph_init_sparse_read() as well, to ensure that we kfree the sr_extent
array (if there was one and the previous call didn't complete). Fixed
in my tree...
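
Something like the following, presumably. This is only a sketch of the
fix described above (the surrounding osd_cleanup() body is from the
current tree, and the exact placement may differ in v2):

static void osd_cleanup(struct ceph_osd *osd)
{
	WARN_ON(!RB_EMPTY_NODE(&osd->o_node));
	WARN_ON(!list_empty(&osd->o_requests));
	WARN_ON(!list_empty(&osd->o_linger_requests));

	/* frees sr_extent if it doesn't point at the embedded array */
	ceph_init_sparse_read(&osd->o_sparse_read, NULL);

	if (osd->o_auth.authorizer) {
		WARN_ON(osd_homeless(osd));
		ceph_auth_destroy_authorizer(osd->o_auth.authorizer);
	}
}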

>  static void finish_request(struct ceph_osd_request *req)
>  {
>  	struct ceph_osd_client *osdc = req->r_osdc;
> @@ -2452,8 +2472,10 @@ static void finish_request(struct ceph_osd_request *req)
>  
>  	req->r_end_latency = ktime_get();
>  
> -	if (req->r_osd)
> +	if (req->r_osd) {
> +		ceph_init_sparse_read(&req->r_osd->o_sparse_read, NULL);
>  		unlink_request(req->r_osd, req);
> +	}
>  	atomic_dec(&osdc->num_requests);
>  
>  	/*
> @@ -3655,6 +3677,8 @@ static void handle_reply(struct ceph_osd *osd, struct ceph_msg *msg)
>  	struct MOSDOpReply m;
>  	u64 tid = le64_to_cpu(msg->hdr.tid);
>  	u32 data_len = 0;
> +	u32 result_len = 0;
> +	bool sparse = false;
>  	int ret;
>  	int i;
>  
> @@ -3749,21 +3773,32 @@ static void handle_reply(struct ceph_osd *osd, struct ceph_msg *msg)
>  		req->r_ops[i].rval = m.rval[i];
>  		req->r_ops[i].outdata_len = m.outdata_len[i];
>  		data_len += m.outdata_len[i];
> +		if (req->r_ops[i].op == CEPH_OSD_OP_SPARSE_READ)
> +			sparse = true;
>  	}
> +
> +	result_len = data_len;
> +	if (sparse) {
> +		struct ceph_sparse_read *sr = &osd->o_sparse_read;
> +
> +		/* Fudge the result if this was a sparse read. */
> +		result_len = sr->sr_pos - sr->sr_req_off;
> +	}
> +
>  	if (data_len != le32_to_cpu(msg->hdr.data_len)) {
>  		pr_err("sum of lens %u != %u for tid %llu\n", data_len,
>  		       le32_to_cpu(msg->hdr.data_len), req->r_tid);
>  		goto fail_request;
>  	}
> -	dout("%s req %p tid %llu result %d data_len %u\n", __func__,
> -	     req, req->r_tid, m.result, data_len);
> +	dout("%s req %p tid %llu result %d data_len %u result_len %u\n", __func__,
> +	     req, req->r_tid, m.result, data_len, result_len);
>  
>  	/*
>  	 * Since we only ever request ONDISK, we should only ever get
>  	 * one (type of) reply back.
>  	 */
>  	WARN_ON(!(m.flags & CEPH_OSD_FLAG_ONDISK));
> -	req->r_result = m.result ?: data_len;
> +	req->r_result = m.result ?: result_len;
>  	finish_request(req);
>  	mutex_unlock(&osd->lock);
>  	up_read(&osdc->lock);
> @@ -5398,6 +5433,21 @@ static void osd_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>  	ceph_msg_put(msg);
>  }
>  
> +static struct ceph_osd_req_op *
> +sparse_read_op(struct ceph_osd_request *req)
> +{
> +	int i;
> +
> +	if (!(req->r_flags & CEPH_OSD_FLAG_READ))
> +		return NULL;
> +
> +	for (i = 0; i < req->r_num_ops; ++i) {
> +		if (req->r_ops[i].op == CEPH_OSD_OP_SPARSE_READ)
> +			return &req->r_ops[i];
> +	}
> +	return NULL;
> +}
> +
>  /*
>   * Lookup and return message for incoming reply.  Don't try to do
>   * anything about a larger than preallocated data portion of the
> @@ -5414,6 +5464,7 @@ static struct ceph_msg *get_reply(struct ceph_connection *con,
>  	int front_len = le32_to_cpu(hdr->front_len);
>  	int data_len = le32_to_cpu(hdr->data_len);
>  	u64 tid = le64_to_cpu(hdr->tid);
> +	struct ceph_osd_req_op *srop;
>  
>  	down_read(&osdc->lock);
>  	if (!osd_registered(osd)) {
> @@ -5446,7 +5497,9 @@ static struct ceph_msg *get_reply(struct ceph_connection *con,
>  		req->r_reply = m;
>  	}
>  
> -	if (data_len > req->r_reply->data_length) {
> +	srop = sparse_read_op(req);
> +
> +	if (!srop && (data_len > req->r_reply->data_length)) {
>  		pr_warn("%s osd%d tid %llu data %d > preallocated %zu, skipping\n",
>  			__func__, osd->o_osd, req->r_tid, data_len,
>  			req->r_reply->data_length);
> @@ -5456,6 +5509,10 @@ static struct ceph_msg *get_reply(struct ceph_connection *con,
>  	}
>  
>  	m = ceph_msg_get(req->r_reply);
> +	m->sparse_read = srop;
> +	if (srop)
> +		ceph_init_sparse_read(&osd->o_sparse_read, srop);
> +
>  	dout("get_reply tid %lld %p\n", tid, m);
>  
>  out_unlock_session:
> @@ -5688,9 +5745,101 @@ static int osd_check_message_signature(struct ceph_msg *msg)
>  	return ceph_auth_check_message_signature(auth, msg);
>  }
>  
> +static void zero_len(struct ceph_msg_data_cursor *cursor, size_t len)
> +{
> +	while (len) {
> +		struct page *page;
> +		size_t poff, plen;
> +		bool last = false;
> +
> +		page = ceph_msg_data_next(cursor, &poff, &plen, &last);
> +		if (plen > len)
> +			plen = len;
> +		zero_user_segment(page, poff, poff + plen);
> +		len -= plen;
> +		ceph_msg_data_advance(cursor, plen);
> +	}
> +}
> +
> +static int osd_sparse_read(struct ceph_connection *con,
> +			   struct ceph_msg_data_cursor *cursor,
> +			   u64 *plen, char **pbuf)
> +{
> +	struct ceph_osd *o = con->private;
> +	struct ceph_sparse_read *sr = &o->o_sparse_read;
> +	u32 count = __le32_to_cpu(sr->sr_count);
> +	u64 eoff, elen;
> +
> +	switch (sr->sr_state) {
> +	case CEPH_SPARSE_READ_HDR:
> +		dout("[%d] request to read 0x%llx~0x%llx\n", o->o_osd, sr->sr_req_off, sr->sr_req_len);
> +		/* number of extents */
> +		*plen = sizeof(sr->sr_count);
> +		*pbuf = (char *)&sr->sr_count;
> +		sr->sr_state = CEPH_SPARSE_READ_EXTENTS;
> +		break;
> +	case CEPH_SPARSE_READ_EXTENTS:
> +		dout("[%d] got %u extents\n", o->o_osd, count);
> +
> +		if (count > 0) {
> +			if (count > 1) {
> +				/* can't use the embedded extent array */
> +				sr->sr_extent = kmalloc_array(count, sizeof(*sr->sr_extent),
> +							   GFP_NOIO);
> +				if (!sr->sr_extent)
> +					return -ENOMEM;
> +			}
> +			*plen = count * sizeof(*sr->sr_extent);
> +			*pbuf = (char *)sr->sr_extent;
> +			sr->sr_state = CEPH_SPARSE_READ_DATA_LEN;
> +			break;
> +		}
> +		/* No extents? Fall through to reading data len */
> +		fallthrough;
> +	case CEPH_SPARSE_READ_DATA_LEN:
> +		*plen = sizeof(sr->sr_datalen);
> +		*pbuf = (char *)&sr->sr_datalen;
> +		sr->sr_state = CEPH_SPARSE_READ_DATA;
> +		break;
> +	case CEPH_SPARSE_READ_DATA:
> +		if (sr->sr_index >= count)
> +			return 0;
> +		if (sr->sr_index == 0) {
> +			/* last extent */
> +			eoff = le64_to_cpu(sr->sr_extent[count - 1].off);
> +			elen = le64_to_cpu(sr->sr_extent[count - 1].len);
> +
> +			/* set up cursor to end of last extent */
> +			ceph_msg_data_cursor_init(cursor, con->in_msg,
> +						  eoff + elen - sr->sr_req_off);
> +		}
> +
> +		eoff = le64_to_cpu(sr->sr_extent[sr->sr_index].off);
> +		elen = le64_to_cpu(sr->sr_extent[sr->sr_index].len);
> +
> +		dout("[%d] ext %d off 0x%llx len 0x%llx\n", o->o_osd, sr->sr_index, eoff, elen);
> +
> +		/* zero out anything from sr_pos to start of extent */
> +		if (sr->sr_pos < eoff)
> +			zero_len(cursor, eoff - sr->sr_pos);
> +
> +		/* Set position to end of extent */
> +		sr->sr_pos = eoff + elen;
> +
> +		/* send back the new length */
> +		*plen = elen;
> +
> +		/* Bump the array index */
> +		++sr->sr_index;
> +		break;
> +	}
> +	return 1;
> +}
> +
>  static const struct ceph_connection_operations osd_con_ops = {
>  	.get = osd_get_con,
>  	.put = osd_put_con,
> +	.sparse_read = osd_sparse_read,
>  	.alloc_msg = osd_alloc_msg,
>  	.dispatch = osd_dispatch,
>  	.fault = osd_fault,

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 3/3] ceph: convert to sparse reads
  2022-03-09 12:33 ` [PATCH 3/3] ceph: convert to sparse reads Jeff Layton
@ 2022-03-14  2:22   ` Xiubo Li
  2022-03-14 12:09     ` Jeff Layton
  0 siblings, 1 reply; 15+ messages in thread
From: Xiubo Li @ 2022-03-14  2:22 UTC
  To: Jeff Layton, ceph-devel, idryomov


On 3/9/22 8:33 PM, Jeff Layton wrote:
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/addr.c | 2 +-
>   fs/ceph/file.c | 4 ++--
>   2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 752c421c9922..f42440d7102b 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -317,7 +317,7 @@ static void ceph_netfs_issue_op(struct netfs_read_subrequest *subreq)
>   		return;
>   
>   	req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino, subreq->start, &len,
> -			0, 1, CEPH_OSD_OP_READ,
> +			0, 1, CEPH_OSD_OP_SPARSE_READ,

For this, should we possibly add an option to disable it? Just in case
we need to debug fscrypt or something else when we hit a read/write
related issue?

- Xiubo

>   			CEPH_OSD_FLAG_READ | fsc->client->osdc.client->options->read_from_replica,
>   			NULL, ci->i_truncate_seq, ci->i_truncate_size, false);
>   	if (IS_ERR(req)) {
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index feb75eb1cd82..d1956a20c627 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -934,7 +934,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>   
>   		req = ceph_osdc_new_request(osdc, &ci->i_layout,
>   					ci->i_vino, off, &len, 0, 1,
> -					CEPH_OSD_OP_READ, CEPH_OSD_FLAG_READ,
> +					CEPH_OSD_OP_SPARSE_READ, CEPH_OSD_FLAG_READ,
>   					NULL, ci->i_truncate_seq,
>   					ci->i_truncate_size, false);
>   		if (IS_ERR(req)) {
> @@ -1291,7 +1291,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>   					    vino, pos, &size, 0,
>   					    1,
>   					    write ? CEPH_OSD_OP_WRITE :
> -						    CEPH_OSD_OP_READ,
> +						    CEPH_OSD_OP_SPARSE_READ,
>   					    flags, snapc,
>   					    ci->i_truncate_seq,
>   					    ci->i_truncate_size,


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] ceph: convert to sparse reads
  2022-03-14  2:22   ` Xiubo Li
@ 2022-03-14 12:09     ` Jeff Layton
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2022-03-14 12:09 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel, idryomov

On Mon, 2022-03-14 at 10:22 +0800, Xiubo Li wrote:
> On 3/9/22 8:33 PM, Jeff Layton wrote:
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/addr.c | 2 +-
> >   fs/ceph/file.c | 4 ++--
> >   2 files changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 752c421c9922..f42440d7102b 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -317,7 +317,7 @@ static void ceph_netfs_issue_op(struct netfs_read_subrequest *subreq)
> >   		return;
> >   
> >   	req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino, subreq->start, &len,
> > -			0, 1, CEPH_OSD_OP_READ,
> > +			0, 1, CEPH_OSD_OP_SPARSE_READ,
> 
> Should we perhaps add an option to disable this? Just in case we need to 
> debug fscrypt or something else when we hit a read/write-related issue?
> 
> 

Yeah, it's probably a reasonable thing to add. I had that at one point
in development and dropped it. Let me see if I can resurrect that before
I post a v2.
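
For illustration, one way to wire up such a toggle is a module parameter
that falls back to a plain read. This is only a sketch of the idea, not
code from any posted patch; the parameter name and the helper are made up:

	#include <linux/module.h>

	/* hypothetical knob; not part of the actual series */
	static bool sparse_read = true;
	module_param(sparse_read, bool, 0644);
	MODULE_PARM_DESC(sparse_read, "Use sparse reads for file data");

	static inline int ceph_read_op(void)
	{
		return sparse_read ? CEPH_OSD_OP_SPARSE_READ :
				     CEPH_OSD_OP_READ;
	}

The ceph_osdc_new_request() callers in the hunks below would then pass
ceph_read_op() instead of hardcoding the opcode.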

> 
> >   			CEPH_OSD_FLAG_READ | fsc->client->osdc.client->options->read_from_replica,
> >   			NULL, ci->i_truncate_seq, ci->i_truncate_size, false);
> >   	if (IS_ERR(req)) {
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index feb75eb1cd82..d1956a20c627 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -934,7 +934,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
> >   
> >   		req = ceph_osdc_new_request(osdc, &ci->i_layout,
> >   					ci->i_vino, off, &len, 0, 1,
> > -					CEPH_OSD_OP_READ, CEPH_OSD_FLAG_READ,
> > +					CEPH_OSD_OP_SPARSE_READ, CEPH_OSD_FLAG_READ,
> >   					NULL, ci->i_truncate_seq,
> >   					ci->i_truncate_size, false);
> >   		if (IS_ERR(req)) {
> > @@ -1291,7 +1291,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
> >   					    vino, pos, &size, 0,
> >   					    1,
> >   					    write ? CEPH_OSD_OP_WRITE :
> > -						    CEPH_OSD_OP_READ,
> > +						    CEPH_OSD_OP_SPARSE_READ,
> >   					    flags, snapc,
> >   					    ci->i_truncate_seq,
> >   					    ci->i_truncate_size,
> 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] ceph: support for sparse read in msgr2 crc path
  2022-03-09 12:33 [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Jeff Layton
                   ` (2 preceding siblings ...)
  2022-03-09 12:33 ` [PATCH 3/3] ceph: convert to sparse reads Jeff Layton
@ 2022-03-15  6:23 ` Xiubo Li
  2022-03-15 10:12   ` Jeff Layton
  3 siblings, 1 reply; 15+ messages in thread
From: Xiubo Li @ 2022-03-15  6:23 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel, idryomov

Hi Jeff,

I hit the following crash by using the latest wip-fscrypt branch:

<5>[245348.815462] Key type ceph unregistered
<5>[245545.560567] Key type ceph registered
<6>[245545.566723] libceph: loaded (mon/osd proto 15/24)
<6>[245545.775116] ceph: loaded (mds proto 32)
<6>[245545.822200] libceph: mon2 (1)10.72.47.117:40843 session established
<6>[245545.829658] libceph: client5000 fsid 
2b9c5f33-3f43-4f89-945d-2a1b6372c5af
<4>[245583.531648] ------------[ cut here ]------------
<2>[245583.531701] kernel BUG at net/ceph/messenger.c:1032!
<4>[245583.531929] invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
<4>[245583.532030] CPU: 2 PID: 283539 Comm: kworker/2:0 Tainted: 
G            E     5.17.0-rc6+ #98
<4>[245583.532050] Hardware name: Red Hat RHEV Hypervisor, BIOS 
1.11.0-2.el7 04/01/2014
<4>[245583.532086] Workqueue: ceph-msgr ceph_con_workfn [libceph]
<4>[245583.532380] RIP: 0010:ceph_msg_data_cursor_init+0x79/0x80 [libceph]
<4>[245583.532592] Code: 8d 7b 08 e8 39 99 00 dc 48 8d 7b 18 48 89 6b 08 
e8 ec 97 00 dc c7 43 18 00 00 00 00 48 89 df 5b 5d 41 5c e9 89 c7 ff ff 
0f 0b <0f> 0b 0f 0b 0f 1f 00 0f 1f 44 00 00 41 57 41 56 41 55 49 89 cd 41
<4>[245583.532609] RSP: 0018:ffffc90018847c10 EFLAGS: 00010287
<4>[245583.532654] RAX: 0000000000000000 RBX: ffff888244ff3850 RCX: 
ffffffffc10bd850
<4>[245583.532683] RDX: dffffc0000000000 RSI: ffff888244ff37d0 RDI: 
ffff888244ff3838
<4>[245583.532705] RBP: ffff888244ff37d0 R08: 00000000000000d1 R09: 
ffffed10489fe701
<4>[245583.532726] R10: ffff888244ff3804 R11: ffffed10489fe700 R12: 
00000000000000d1
<4>[245583.532746] R13: ffff8882d1adb030 R14: ffff8882d1adb408 R15: 
0000000000000000
<4>[245583.532764] FS:  0000000000000000(0000) GS:ffff8887ccd00000(0000) 
knlGS:0000000000000000
<4>[245583.532785] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[245583.532797] CR2: 00007fb991665000 CR3: 0000000368ebe001 CR4: 
00000000007706e0
<4>[245583.532809] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
<4>[245583.532820] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
<4>[245583.532831] PKRU: 55555554
<4>[245583.532840] Call Trace:
<4>[245583.532852]  <TASK>
<4>[245583.532863]  ceph_con_v1_try_read+0xd21/0x15c0 [libceph]
<4>[245583.533124]  ? ceph_tcp_sendpage+0x100/0x100 [libceph]
<4>[245583.533348]  ? load_balance+0x1240/0x1240
<4>[245583.533655]  ? dequeue_entity+0x18b/0x6f0
<4>[245583.533690]  ? mutex_lock+0x8e/0xe0
<4>[245583.533869]  ? __mutex_lock_slowpath+0x10/0x10
<4>[245583.533887]  ? _raw_spin_unlock+0x16/0x30
<4>[245583.533906]  ? __switch_to+0x2fa/0x690
<4>[245583.534007]  ceph_con_workfn+0x545/0x940 [libceph]
<4>[245583.534234]  process_one_work+0x3c1/0x6e0
<4>[245583.534340]  worker_thread+0x57/0x580
<4>[245583.534363]  ? process_one_work+0x6e0/0x6e0
<4>[245583.534382]  kthread+0x160/0x190
<4>[245583.534421]  ? kthread_complete_and_exit+0x20/0x20
<4>[245583.534477]  ret_from_fork+0x1f/0x30
<4>[245583.534544]  </TASK>
...

<4>[245583.535413] RIP: 0010:ceph_msg_data_cursor_init+0x79/0x80 [libceph]
<4>[245583.535627] Code: 8d 7b 08 e8 39 99 00 dc 48 8d 7b 18 48 89 6b 08 
e8 ec 97 00 dc c7 43 18 00 00 00 00 48 89 df 5b 5d 41 5c e9 89 c7 ff ff 
0f 0b <0f> 0b 0f 0b 0f 1f 00 0f 1f 44 00 00 41 57 41 56 41 55 49 89 cd 41
<4>[245583.535644] RSP: 0018:ffffc90018847c10 EFLAGS: 00010287
<4>[245583.535720] RAX: 0000000000000000 RBX: ffff888244ff3850 RCX: 
ffffffffc10bd850
<4>[245583.535745] RDX: dffffc0000000000 RSI: ffff888244ff37d0 RDI: 
ffff888244ff3838
<4>[245583.535769] RBP: ffff888244ff37d0 R08: 00000000000000d1 R09: 
ffffed10489fe701

Seems caused by the parse read ?
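
As an aside, a RIP like ceph_msg_data_cursor_init+0x79/0x80 can be mapped
back to a source line with the kernel's faddr2line helper, run against a
module built with debug info (the path here is illustrative):

	./scripts/faddr2line net/ceph/libceph.ko ceph_msg_data_cursor_init+0x79/0x80

In this case the "kernel BUG at net/ceph/messenger.c:1032!" line already
names the failing BUG_ON in the cursor-setup code.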



On 3/9/22 8:33 PM, Jeff Layton wrote:
> [... full cover letter quoted; trimmed ...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] ceph: support for sparse read in msgr2 crc path
  2022-03-15  6:23 ` [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Xiubo Li
@ 2022-03-15 10:12   ` Jeff Layton
  2022-03-15 11:03     ` Xiubo Li
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2022-03-15 10:12 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel, idryomov

On Tue, 2022-03-15 at 14:23 +0800, Xiubo Li wrote:
> Hi Jeff,
> 
> I hit the following crash by using the latest wip-fscrypt branch:
> 
> [... kernel oops trimmed; identical to the trace in the original report above ...]
> 
> Seems caused by the parse read ?
> 
> 

You mean sparse read? Those patches aren't in the wip-fscrypt branch
yet, so I wouldn't think them a factor here. What commit was at the top
of your branch?

If you're testing the latest sparse read patches on top of wip-fscrypt,
then those only work with msgr2 so far. Using msgr1 with it is likely to
have problems.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] ceph: support for sparse read in msgr2 crc path
  2022-03-15 10:12   ` Jeff Layton
@ 2022-03-15 11:03     ` Xiubo Li
  2022-03-15 11:10       ` Jeff Layton
  0 siblings, 1 reply; 15+ messages in thread
From: Xiubo Li @ 2022-03-15 11:03 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel, idryomov


On 3/15/22 6:12 PM, Jeff Layton wrote:
> On Tue, 2022-03-15 at 14:23 +0800, Xiubo Li wrote:
>> Hi Jeff,
>>
>> I hit the following crash by using the latest wip-fscrypt branch:
>>
>> [... kernel oops trimmed; see the original report above ...]
>>
>> Seems caused by the parse read ?
>>
>>
> You mean sparse read? Those patches aren't in the wip-fscrypt branch
> yet, so I wouldn't think them a factor here. What commit was at the top
> of your branch?
>
> If you're testing the latest sparse read patches on top of wip-fscrypt,
> then those only work with msgr2 so far. Using msgr1 with it is likely to
> have problems.

Okay, so it seems to come from another patch series. But I am not sure; 
these are the latest commits in the wip-fscrypt branch where I hit the issue:


ce03ef535e7b SQUASH: OSD
5020f1a5464b ceph: convert to sparse reads
8614b45d3758 libceph: add sparse read support to OSD client
a11c3b2673aa libceph: add sparse read support to msgr2 crc state machine
a020a702a220 (origin/testing) ceph: allow `ceph.dir.rctime' xattr to be 
updatable


I just switched back to the branch I pulled a few days ago, and it works 
well.

Yeah, I am using '--msgr1'. Maybe that is the problem. I will try 
'--msgr2' later.

- Xiubo




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] ceph: support for sparse read in msgr2 crc path
  2022-03-15 11:03     ` Xiubo Li
@ 2022-03-15 11:10       ` Jeff Layton
  2022-03-15 11:12         ` Xiubo Li
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2022-03-15 11:10 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel, idryomov

On Tue, 2022-03-15 at 19:03 +0800, Xiubo Li wrote:
> On 3/15/22 6:12 PM, Jeff Layton wrote:
> > On Tue, 2022-03-15 at 14:23 +0800, Xiubo Li wrote:
> > > Hi Jeff,
> > > 
> > > I hit the following crash by using the latest wip-fscrypt branch:
> > > 
> > > [... kernel oops trimmed; see the original report above ...]
> > > 
> > > Seems caused by the parse read ?
> > > 
> > > 
> > You mean sparse read? Those patches aren't in the wip-fscrypt branch
> > yet, so I wouldn't think them a factor here. What commit was at the top
> > of your branch?
> > 
> > If you're testing the latest sparse read patches on top of wip-fscrypt,
> > then those only work with msgr2 so far. Using msgr1 with it is likely to
> > have problems.
> 
> Okay, so it seems to come from another patch series. But I am not sure; 
> these are the latest commits in the wip-fscrypt branch where I hit the issue:
> 
> 
> ce03ef535e7b SQUASH: OSD
> 5020f1a5464b ceph: convert to sparse reads
> 8614b45d3758 libceph: add sparse read support to OSD client
> a11c3b2673aa libceph: add sparse read support to msgr2 crc state machine
> a020a702a220 (origin/testing) ceph: allow `ceph.dir.rctime' xattr to be 
> updatable
> 
> I just switched back to the branch I pulled a few days ago, and it works 
> well.
> 
> Yeah, I am using '--msgr1'. Maybe that is the problem. I will try 
> '--msgr2' later.
> 

That makes sense. That oops is probably expected with msgr1 and the
sparse read code. If you test with "-o ms_mode=crc" then it should
(hopefully!) work.
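
For reference, ms_mode selects the msgr2 wire mode at mount time, so a
test mount forcing the v2 crc path looks roughly like this (monitor
address, port, and mountpoint are placeholders; 3300 is the default
msgr2 port):

	mount -t ceph 192.168.1.10:3300:/ /mnt/cephfs -o name=admin,ms_mode=crc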

My plan is eventually to add support to all 3 msgr codepaths (v1, v2-
crc, and v2-secure), but I'd like some feedback from Ilya on what I have
so far before I do that.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] ceph: support for sparse read in msgr2 crc path
  2022-03-15 11:10       ` Jeff Layton
@ 2022-03-15 11:12         ` Xiubo Li
  2022-03-15 13:33           ` Jeff Layton
  0 siblings, 1 reply; 15+ messages in thread
From: Xiubo Li @ 2022-03-15 11:12 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel, idryomov


On 3/15/22 7:10 PM, Jeff Layton wrote:
> On Tue, 2022-03-15 at 19:03 +0800, Xiubo Li wrote:
>> On 3/15/22 6:12 PM, Jeff Layton wrote:
>>> On Tue, 2022-03-15 at 14:23 +0800, Xiubo Li wrote:
>>>> Hi Jeff,
>>>>
>>>> I hit the following crash by using the latest wip-fscrypt branch:
>>>>
>>>> [... kernel oops trimmed; see the original report above ...]
>>>>
>>>> Seems caused by the parse read ?
>>>>
>>>>
>>> You mean sparse read? Those patches aren't in the wip-fscrypt branch
>>> yet, so I wouldn't think them a factor here. What commit was at the top
>>> of your branch?
>>>
>>> If you're testing the latest sparse read patches on top of wip-fscrypt,
>>> then those only work with msgr2 so far. Using msgr1 with it is likely to
>>> have problems.
>> Okay, so it seems to come from another patch series. But I am not sure;
>> these are the latest commits in the wip-fscrypt branch where I hit the issue:
>>
>>
>> ce03ef535e7b SQUASH: OSD
>> 5020f1a5464b ceph: convert to sparse reads
>> 8614b45d3758 libceph: add sparse read support to OSD client
>> a11c3b2673aa libceph: add sparse read support to msgr2 crc state machine
>> a020a702a220 (origin/testing) ceph: allow `ceph.dir.rctime' xattr to be
>> updatable
>>
>> I just switched back to the branch I pulled a few days ago, and it works
>> well.
>>
>> Yeah, I am using '--msgr1'. Maybe that is the problem. I will try
>> '--msgr2' later.
>>
> That makes sense. That oops is probably expected with msgr1 and the
> sparse read code. If you test with "-o ms_mode=crc" then it should
> (hopefully!) work.

Sure.


> My plan is eventually to add support to all 3 msgr codepaths (v1, v2-
> crc, and v2-secure), but I'd like some feedback from Ilya on what I have
> so far before I do that.
>
I will help test your related patches tomorrow.

- Xiubo



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] ceph: support for sparse read in msgr2 crc path
  2022-03-15 11:12         ` Xiubo Li
@ 2022-03-15 13:33           ` Jeff Layton
  2022-03-16  9:53             ` Xiubo Li
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2022-03-15 13:33 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel, idryomov

On Tue, 2022-03-15 at 19:12 +0800, Xiubo Li wrote:
> On 3/15/22 7:10 PM, Jeff Layton wrote:
> > On Tue, 2022-03-15 at 19:03 +0800, Xiubo Li wrote:
> > > On 3/15/22 6:12 PM, Jeff Layton wrote:
> > > > On Tue, 2022-03-15 at 14:23 +0800, Xiubo Li wrote:
> > > > > Hi Jeff,
> > > > > 
> > > > > I hit the following crash by using the latest wip-fscrypt branch:
> > > > > 
> > > > > [... kernel oops trimmed; see the original report above ...]
> > > > > 
> > > > > Seems caused by the parse read ?
> > > > > 
> > > > > 
> > > > You mean sparse read? Those patches aren't in the wip-fscrypt branch
> > > > yet, so I wouldn't think them a factor here. What commit was at the top
> > > > of your branch?
> > > > 
> > > > If you're testing the latest sparse read patches on top of wip-fscrypt,
> > > > then those only work with msgr2 so far. Using msgr1 with it is likely to
> > > > have problems.
> > > Okay, so it seems to come from another patch series. But I am not sure;
> > > these are the latest commits in the wip-fscrypt branch where I hit the issue:
> > > 
> > > 
> > > ce03ef535e7b SQUASH: OSD
> > > 5020f1a5464b ceph: convert to sparse reads
> > > 8614b45d3758 libceph: add sparse read support to OSD client
> > > a11c3b2673aa libceph: add sparse read support to msgr2 crc state machine
> > > a020a702a220 (origin/testing) ceph: allow `ceph.dir.rctime' xattr to be
> > > updatable
> > > 
> > > I just switched back to the branch I pulled a few days ago, and it works
> > > well.
> > > 
> > > Yeah, I am using '--msgr1'. Maybe that is the problem. I will try
> > > '--msgr2' later.
> > > 
> > That makes sense. That oops is probably expected with msgr1 and the
> > sparse read code. If you test with "-o ms_mode=crc" then it should
> > (hopefully!) work.
> 
> Sure.
> 
> 
> > My plan is eventually to add support to all 3 msgr codepaths (v1, v2-
> > crc, and v2-secure), but I'd like some feedback from Ilya on what I have
> > so far before I do that.
> > 
> I will help test your related patches tomorrow.
> 

Thanks.

I think this was my mistake. I pushed the wrong branch into wip-fscrypt
from my tree and it ended up having the sparse read patches for the last
day or so. I've dropped those from the branch for now.

We'll want them eventually, but they're still a little too bleeding-edge
at this point.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] ceph: support for sparse read in msgr2 crc path
  2022-03-15 13:33           ` Jeff Layton
@ 2022-03-16  9:53             ` Xiubo Li
  0 siblings, 0 replies; 15+ messages in thread
From: Xiubo Li @ 2022-03-16  9:53 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel, idryomov


On 3/15/22 9:33 PM, Jeff Layton wrote:
> On Tue, 2022-03-15 at 19:12 +0800, Xiubo Li wrote:
>> On 3/15/22 7:10 PM, Jeff Layton wrote:
>>> On Tue, 2022-03-15 at 19:03 +0800, Xiubo Li wrote:
>>>> On 3/15/22 6:12 PM, Jeff Layton wrote:
>>>>> On Tue, 2022-03-15 at 14:23 +0800, Xiubo Li wrote:
>>>>>> Hi Jeff,
>>>>>>
>>>>>> I hit the following crash by using the latest wip-fscrypt branch:
>>>>>>
>>>>>> [... kernel oops trimmed; see the original report above ...]
>>>>>>
>>>>>> Seems caused by the parse read ?
>>>>>>
>>>>>>
>>>>> You mean sparse read? Those patches aren't in the wip-fscrypt branch
>>>>> yet, so I wouldn't think them a factor here. What commit was at the top
>>>>> of your branch?
>>>>>
>>>>> If you're testing the latest sparse read patches on top of wip-fscrypt,
>>>>> then those only work with msgr2 so far. Using msgr1 with it is likely to
>>>>> have problems.
>>>> Okay, so it seems to come from another patch series. But I am not sure;
>>>> these are the latest commits in the wip-fscrypt branch where I hit the issue:
>>>>
>>>>
>>>> ce03ef535e7b SQUASH: OSD
>>>> 5020f1a5464b ceph: convert to sparse reads
>>>> 8614b45d3758 libceph: add sparse read support to OSD client
>>>> a11c3b2673aa libceph: add sparse read support to msgr2 crc state machine
>>>> a020a702a220 (origin/testing) ceph: allow `ceph.dir.rctime' xattr to be
>>>> updatable
>>>>
>>>> I just switched back to the branch I pulled a few days ago, and it works
>>>> well.
>>>>
>>>> Yeah, I am using '--msgr1'. Maybe that is the problem. I will try
>>>> '--msgr2' later.
>>>>
>>> That makes sense. That oops is probably expected with msgr1 and the
>>> sparse read code. If you test with "-o ms_mode=crc" then it should
>>> (hopefully!) work.
>> Sure.
>>
>>
>>> My plan is eventually to add support to all 3 msgr codepaths (v1, v2-
>>> crc, and v2-secure), but I'd like some feedback from Ilya on what I have
>>> so far before I do that.
>>>
>> I will help test your related patches tomorrow.
>>
> Thanks.
>
> I think this was my mistake. I pushed the wrong branch into wip-fscrypt
> from my tree and it ended up having the sparse read patches for the last
> day or so. I've dropped those from the branch for now.
Sure.
> We'll want them eventually, but they're still a little too bleeding-edge
> at this point.

Yeah.

Thanks


^ permalink raw reply	[flat|nested] 15+ messages in thread

Thread overview: 15+ messages
2022-03-09 12:33 [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Jeff Layton
2022-03-09 12:33 ` [PATCH 1/3] libceph: add sparse read support to msgr2 crc state machine Jeff Layton
2022-03-09 13:37   ` Jeff Layton
2022-03-09 12:33 ` [PATCH 2/3] libceph: add sparse read support to OSD client Jeff Layton
2022-03-11 17:08   ` Jeff Layton
2022-03-09 12:33 ` [PATCH 3/3] ceph: convert to sparse reads Jeff Layton
2022-03-14  2:22   ` Xiubo Li
2022-03-14 12:09     ` Jeff Layton
2022-03-15  6:23 ` [PATCH 0/3] ceph: support for sparse read in msgr2 crc path Xiubo Li
2022-03-15 10:12   ` Jeff Layton
2022-03-15 11:03     ` Xiubo Li
2022-03-15 11:10       ` Jeff Layton
2022-03-15 11:12         ` Xiubo Li
2022-03-15 13:33           ` Jeff Layton
2022-03-16  9:53             ` Xiubo Li
