* [PATCH v4 0/3] libceph: fix sparse-read failure bug
@ 2024-01-18 10:50 xiubli
  2024-01-18 10:50 ` [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket xiubli
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: xiubli @ 2024-01-18 10:50 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, jlayton, vshankar, mchangir, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

V4:
- remove the sr_resid_elen field
- improve 'read_partial_sparse_msg_data()'

V3:
- rename read_sparse_msg_XX to read_partial_sparse_msg_XX
- fix the sparse-read bug in the messenger v1 code.

V2:
- fix the sparse-read bug in the sparse-read code instead


Xiubo Li (3):
  libceph: fail the sparse-read if there still has data in socket
  libceph: rename read_sparse_msg_XX to read_partial_sparse_msg_XX
  libceph: just wait for more data to be available on the socket

 include/linux/ceph/messenger.h |  1 +
 net/ceph/messenger_v1.c        | 40 +++++++++++++++++++++-------------
 net/ceph/osd_client.c          |  9 ++++++++
 3 files changed, 35 insertions(+), 15 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket
  2024-01-18 10:50 [PATCH v4 0/3] libceph: fix sparse-read failure bug xiubli
@ 2024-01-18 10:50 ` xiubli
  2024-01-18 14:03   ` Jeff Layton
  2024-01-18 10:50 ` [PATCH v4 2/3] libceph: rename read_sparse_msg_XX to read_partial_sparse_msg_XX xiubli
  2024-01-18 10:50 ` [PATCH v4 3/3] libceph: just wait for more data to be available on the socket xiubli
  2 siblings, 1 reply; 20+ messages in thread
From: xiubli @ 2024-01-18 10:50 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, jlayton, vshankar, mchangir, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

Once this happens it means there is a bug somewhere.

URL: https://tracker.ceph.com/issues/63586
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 net/ceph/osd_client.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 9be80d01c1dc..f8029b30a3fb 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -5912,6 +5912,13 @@ static int osd_sparse_read(struct ceph_connection *con,
 		fallthrough;
 	case CEPH_SPARSE_READ_DATA:
 		if (sr->sr_index >= count) {
+			if (sr->sr_datalen) {
+				pr_warn_ratelimited("sr_datalen %u sr_index %d count %u\n",
+						    sr->sr_datalen, sr->sr_index,
+						    count);
+				return -EREMOTEIO;
+			}
+
 			sr->sr_state = CEPH_SPARSE_READ_HDR;
 			goto next_op;
 		}
@@ -5919,6 +5926,8 @@ static int osd_sparse_read(struct ceph_connection *con,
 		eoff = sr->sr_extent[sr->sr_index].off;
 		elen = sr->sr_extent[sr->sr_index].len;
 
+		sr->sr_datalen -= elen;
+
 		dout("[%d] ext %d off 0x%llx len 0x%llx\n",
 		     o->o_osd, sr->sr_index, eoff, elen);
 
-- 
2.43.0



* [PATCH v4 2/3] libceph: rename read_sparse_msg_XX to read_partial_sparse_msg_XX
  2024-01-18 10:50 [PATCH v4 0/3] libceph: fix sparse-read failure bug xiubli
  2024-01-18 10:50 ` [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket xiubli
@ 2024-01-18 10:50 ` xiubli
  2024-01-18 14:04   ` Jeff Layton
  2024-01-18 10:50 ` [PATCH v4 3/3] libceph: just wait for more data to be available on the socket xiubli
  2 siblings, 1 reply; 20+ messages in thread
From: xiubli @ 2024-01-18 10:50 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, jlayton, vshankar, mchangir, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

The read_sparse_msg_XX functions already continue reading and parsing
the socket buffer across short receives.

Rename them to read_partial_sparse_msg_XX to make that explicit.

URL: https://tracker.ceph.com/issues/63586
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 net/ceph/messenger_v1.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index f9a50d7f0d20..4cb60bacf5f5 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -991,7 +991,7 @@ static inline int read_partial_message_section(struct ceph_connection *con,
 	return read_partial_message_chunk(con, section, sec_len, crc);
 }
 
-static int read_sparse_msg_extent(struct ceph_connection *con, u32 *crc)
+static int read_partial_sparse_msg_extent(struct ceph_connection *con, u32 *crc)
 {
 	struct ceph_msg_data_cursor *cursor = &con->in_msg->cursor;
 	bool do_bounce = ceph_test_opt(from_msgr(con->msgr), RXBOUNCE);
@@ -1026,7 +1026,7 @@ static int read_sparse_msg_extent(struct ceph_connection *con, u32 *crc)
 	return 1;
 }
 
-static int read_sparse_msg_data(struct ceph_connection *con)
+static int read_partial_sparse_msg_data(struct ceph_connection *con)
 {
 	struct ceph_msg_data_cursor *cursor = &con->in_msg->cursor;
 	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
@@ -1043,7 +1043,7 @@ static int read_sparse_msg_data(struct ceph_connection *con)
 							 con->v1.in_sr_len,
 							 &crc);
 		else if (cursor->sr_resid > 0)
-			ret = read_sparse_msg_extent(con, &crc);
+			ret = read_partial_sparse_msg_extent(con, &crc);
 
 		if (ret <= 0) {
 			if (do_datacrc)
@@ -1254,7 +1254,7 @@ static int read_partial_message(struct ceph_connection *con)
 			return -EIO;
 
 		if (m->sparse_read)
-			ret = read_sparse_msg_data(con);
+			ret = read_partial_sparse_msg_data(con);
 		else if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE))
 			ret = read_partial_msg_data_bounce(con);
 		else
-- 
2.43.0



* [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-18 10:50 [PATCH v4 0/3] libceph: fix sparse-read failure bug xiubli
  2024-01-18 10:50 ` [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket xiubli
  2024-01-18 10:50 ` [PATCH v4 2/3] libceph: rename read_sparse_msg_XX to read_partial_sparse_msg_XX xiubli
@ 2024-01-18 10:50 ` xiubli
  2024-01-18 14:36   ` Jeff Layton
                     ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: xiubli @ 2024-01-18 10:50 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, jlayton, vshankar, mchangir, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

A message from ceph may be split across multiple socket packets, and
we just need to wait for all of the data to become available on the
socket.

Add 'sr_total_resid' to record the total remaining length of all data
items in a sparse-read message.

URL: https://tracker.ceph.com/issues/63586
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 include/linux/ceph/messenger.h |  1 +
 net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 2eaaabbe98cb..ca6f82abed62 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -231,6 +231,7 @@ struct ceph_msg_data {
 
 struct ceph_msg_data_cursor {
 	size_t			total_resid;	/* across all data items */
+	size_t			sr_total_resid;	/* across all data items for sparse-read */
 
 	struct ceph_msg_data	*data;		/* current data item */
 	size_t			resid;		/* bytes not yet consumed */
diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index 4cb60bacf5f5..2733da891688 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
 static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
 {
 	/* Initialize data cursor if it's not a sparse read */
-	if (!msg->sparse_read)
+	if (msg->sparse_read)
+		msg->cursor.sr_total_resid = data_len;
+	else
 		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
 }
 
@@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
 	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
 	u32 crc = 0;
 	int ret = 1;
+	int len;
 
 	if (do_datacrc)
 		crc = con->in_data_crc;
 
-	do {
-		if (con->v1.in_sr_kvec.iov_base)
+	while (cursor->sr_total_resid) {
+		len = 0;
+		if (con->v1.in_sr_kvec.iov_base) {
+			len = con->v1.in_sr_kvec.iov_len;
 			ret = read_partial_message_chunk(con,
 							 &con->v1.in_sr_kvec,
 							 con->v1.in_sr_len,
 							 &crc);
-		else if (cursor->sr_resid > 0)
+			len = con->v1.in_sr_kvec.iov_len - len;
+		} else if (cursor->sr_resid > 0) {
+			len = cursor->sr_resid;
 			ret = read_partial_sparse_msg_extent(con, &crc);
-
-		if (ret <= 0) {
-			if (do_datacrc)
-				con->in_data_crc = crc;
-			return ret;
+			len -= cursor->sr_resid;
 		}
+		cursor->sr_total_resid -= len;
+		if (ret <= 0)
+			break;
 
 		memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
 		ret = con->ops->sparse_read(con, cursor,
 				(char **)&con->v1.in_sr_kvec.iov_base);
+		if (ret <= 0) {
+			ret = ret ? : 1; /* must return > 0 to indicate success */
+			break;
+		}
 		con->v1.in_sr_len = ret;
-	} while (ret > 0);
+	}
 
 	if (do_datacrc)
 		con->in_data_crc = crc;
 
-	return ret < 0 ? ret : 1;  /* must return > 0 to indicate success */
+	return ret;
 }
 
 static int read_partial_msg_data(struct ceph_connection *con)
-- 
2.43.0



* Re: [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket
  2024-01-18 10:50 ` [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket xiubli
@ 2024-01-18 14:03   ` Jeff Layton
  2024-01-19  4:07     ` Xiubo Li
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2024-01-18 14:03 UTC (permalink / raw)
  To: xiubli, ceph-devel; +Cc: idryomov, vshankar, mchangir

On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> Once this happens it means there is a bug somewhere.
> 
> URL: https://tracker.ceph.com/issues/63586
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  net/ceph/osd_client.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 9be80d01c1dc..f8029b30a3fb 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -5912,6 +5912,13 @@ static int osd_sparse_read(struct ceph_connection *con,
>  		fallthrough;
>  	case CEPH_SPARSE_READ_DATA:
>  		if (sr->sr_index >= count) {
> +			if (sr->sr_datalen) {
> +				pr_warn_ratelimited("sr_datalen %u sr_index %d count %u\n",
> +						    sr->sr_datalen, sr->sr_index,
> +						    count);
> +				return -EREMOTEIO;
> +			}
> +

Ok, so the server has (presumably) sent us a longer value for the
sr_datalen than was in the extent map?

Why should the sparse read engine care about that? It was (presumably)
able to do its job of handling the read. Why not just advance past the
extra junk and try to do another sparse read? Do we really need to fail
the op for this?

>  			sr->sr_state = CEPH_SPARSE_READ_HDR;
>  			goto next_op;
>  		}
> @@ -5919,6 +5926,8 @@ static int osd_sparse_read(struct ceph_connection *con,
>  		eoff = sr->sr_extent[sr->sr_index].off;
>  		elen = sr->sr_extent[sr->sr_index].len;
>  
> +		sr->sr_datalen -= elen;
> +
>  		dout("[%d] ext %d off 0x%llx len 0x%llx\n",
>  		     o->o_osd, sr->sr_index, eoff, elen);
>  

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 2/3] libceph: rename read_sparse_msg_XX to read_partial_sparse_msg_XX
  2024-01-18 10:50 ` [PATCH v4 2/3] libceph: rename read_sparse_msg_XX to read_partial_sparse_msg_XX xiubli
@ 2024-01-18 14:04   ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2024-01-18 14:04 UTC (permalink / raw)
  To: xiubli, ceph-devel; +Cc: idryomov, vshankar, mchangir

On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> The read_sparse_msg_XX functions already continue reading and parsing
> the socket buffer across short receives.
> 
> Rename them to read_partial_sparse_msg_XX to make that explicit.
> 
> URL: https://tracker.ceph.com/issues/63586
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  net/ceph/messenger_v1.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> index f9a50d7f0d20..4cb60bacf5f5 100644
> --- a/net/ceph/messenger_v1.c
> +++ b/net/ceph/messenger_v1.c
> @@ -991,7 +991,7 @@ static inline int read_partial_message_section(struct ceph_connection *con,
>  	return read_partial_message_chunk(con, section, sec_len, crc);
>  }
>  
> -static int read_sparse_msg_extent(struct ceph_connection *con, u32 *crc)
> +static int read_partial_sparse_msg_extent(struct ceph_connection *con, u32 *crc)
>  {
>  	struct ceph_msg_data_cursor *cursor = &con->in_msg->cursor;
>  	bool do_bounce = ceph_test_opt(from_msgr(con->msgr), RXBOUNCE);
> @@ -1026,7 +1026,7 @@ static int read_sparse_msg_extent(struct ceph_connection *con, u32 *crc)
>  	return 1;
>  }
>  
> -static int read_sparse_msg_data(struct ceph_connection *con)
> +static int read_partial_sparse_msg_data(struct ceph_connection *con)
>  {
>  	struct ceph_msg_data_cursor *cursor = &con->in_msg->cursor;
>  	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
> @@ -1043,7 +1043,7 @@ static int read_sparse_msg_data(struct ceph_connection *con)
>  							 con->v1.in_sr_len,
>  							 &crc);
>  		else if (cursor->sr_resid > 0)
> -			ret = read_sparse_msg_extent(con, &crc);
> +			ret = read_partial_sparse_msg_extent(con, &crc);
>  
>  		if (ret <= 0) {
>  			if (do_datacrc)
> @@ -1254,7 +1254,7 @@ static int read_partial_message(struct ceph_connection *con)
>  			return -EIO;
>  
>  		if (m->sparse_read)
> -			ret = read_sparse_msg_data(con);
> +			ret = read_partial_sparse_msg_data(con);
>  		else if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE))
>  			ret = read_partial_msg_data_bounce(con);
>  		else

Meh. I'm not sure this makes things any more clear, but OK:

Acked-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-18 10:50 ` [PATCH v4 3/3] libceph: just wait for more data to be available on the socket xiubli
@ 2024-01-18 14:36   ` Jeff Layton
  2024-01-18 18:24   ` Jeff Layton
  2024-01-22 15:02   ` Jeff Layton
  2 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2024-01-18 14:36 UTC (permalink / raw)
  To: xiubli, ceph-devel; +Cc: idryomov, vshankar, mchangir

On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> A message from ceph may be split across multiple socket packets, and
> we just need to wait for all of the data to become available on the
> socket.
>
> Add 'sr_total_resid' to record the total remaining length of all data
> items in a sparse-read message.
> 

It's been a while since I was in the ceph messenger code, and my v1
memory is especially fuzzy. I really don't quite understand how tracking
yet another length field solves the stated problem.

I'd really appreciate a description of the problem that you saw and how
this solves it. The ceph bug is not very straightforward.

I get that we need to wait for a certain amount of data to be available
on the socket before we drive the sparse_read op, but I don't quite see
how that's being achieved here.

Also, why is this not a problem on messenger v2?

> URL: https://tracker.ceph.com/issues/63586
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  include/linux/ceph/messenger.h |  1 +
>  net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
>  2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index 2eaaabbe98cb..ca6f82abed62 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -231,6 +231,7 @@ struct ceph_msg_data {
>  
>  struct ceph_msg_data_cursor {
>  	size_t			total_resid;	/* across all data items */
> +	size_t			sr_total_resid;	/* across all data items for sparse-read */
>  
>  	struct ceph_msg_data	*data;		/* current data item */
>  	size_t			resid;		/* bytes not yet consumed */
> diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> index 4cb60bacf5f5..2733da891688 100644
> --- a/net/ceph/messenger_v1.c
> +++ b/net/ceph/messenger_v1.c
> @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
>  static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
>  {
>  	/* Initialize data cursor if it's not a sparse read */
> -	if (!msg->sparse_read)
> +	if (msg->sparse_read)
> +		msg->cursor.sr_total_resid = data_len;
> +	else
>  		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
>  }
>  
> @@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
>  	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
>  	u32 crc = 0;
>  	int ret = 1;
> +	int len;
>  
>  	if (do_datacrc)
>  		crc = con->in_data_crc;
>  
> -	do {
> -		if (con->v1.in_sr_kvec.iov_base)
> +	while (cursor->sr_total_resid) {
> +		len = 0;
> +		if (con->v1.in_sr_kvec.iov_base) {
> +			len = con->v1.in_sr_kvec.iov_len;
>  			ret = read_partial_message_chunk(con,
>  							 &con->v1.in_sr_kvec,
>  							 con->v1.in_sr_len,
>  							 &crc);
> -		else if (cursor->sr_resid > 0)
> +			len = con->v1.in_sr_kvec.iov_len - len;
> +		} else if (cursor->sr_resid > 0) {
> +			len = cursor->sr_resid;
>  			ret = read_partial_sparse_msg_extent(con, &crc);
> -
> -		if (ret <= 0) {
> -			if (do_datacrc)
> -				con->in_data_crc = crc;
> -			return ret;
> +			len -= cursor->sr_resid;
>  		}
> +		cursor->sr_total_resid -= len;
> +		if (ret <= 0)
> +			break;
>  
>  		memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
>  		ret = con->ops->sparse_read(con, cursor,
>  				(char **)&con->v1.in_sr_kvec.iov_base);
> +		if (ret <= 0) {
> +			ret = ret ? : 1; /* must return > 0 to indicate success */

The above is a gcc-ism that we probably shouldn't be using. Can this be:

	ret = ret ? ret : 1;

?

> +			break;
> +		}
>  		con->v1.in_sr_len = ret;
> -	} while (ret > 0);
> +	}
>  
>  	if (do_datacrc)
>  		con->in_data_crc = crc;
>  
> -	return ret < 0 ? ret : 1;  /* must return > 0 to indicate success */
> +	return ret;
>  }
>  
>  static int read_partial_msg_data(struct ceph_connection *con)

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-18 10:50 ` [PATCH v4 3/3] libceph: just wait for more data to be available on the socket xiubli
  2024-01-18 14:36   ` Jeff Layton
@ 2024-01-18 18:24   ` Jeff Layton
  2024-01-19  4:35     ` Xiubo Li
  2024-01-22 15:02   ` Jeff Layton
  2 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2024-01-18 18:24 UTC (permalink / raw)
  To: xiubli, ceph-devel; +Cc: idryomov, vshankar, mchangir

On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> A message from ceph may be split across multiple socket packets, and
> we just need to wait for all of the data to become available on the
> socket.
> 
> Add 'sr_total_resid' to record the total remaining length of all data
> items in a sparse-read message.
> 
> URL: https://tracker.ceph.com/issues/63586
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  include/linux/ceph/messenger.h |  1 +
>  net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
>  2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index 2eaaabbe98cb..ca6f82abed62 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -231,6 +231,7 @@ struct ceph_msg_data {
>  
>  struct ceph_msg_data_cursor {
>  	size_t			total_resid;	/* across all data items */
> +	size_t			sr_total_resid;	/* across all data items for sparse-read */
>  
>  	struct ceph_msg_data	*data;		/* current data item */
>  	size_t			resid;		/* bytes not yet consumed */
> diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> index 4cb60bacf5f5..2733da891688 100644
> --- a/net/ceph/messenger_v1.c
> +++ b/net/ceph/messenger_v1.c
> @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
>  static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
>  {
>  	/* Initialize data cursor if it's not a sparse read */
> -	if (!msg->sparse_read)
> +	if (msg->sparse_read)
> +		msg->cursor.sr_total_resid = data_len;
> +	else
>  		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
>  }
>  
> @@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
>  	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
>  	u32 crc = 0;
>  	int ret = 1;
> +	int len;
>  
>  	if (do_datacrc)
>  		crc = con->in_data_crc;
>  
> -	do {
> -		if (con->v1.in_sr_kvec.iov_base)
> +	while (cursor->sr_total_resid) {
> +		len = 0;
> +		if (con->v1.in_sr_kvec.iov_base) {
> +			len = con->v1.in_sr_kvec.iov_len;
>  			ret = read_partial_message_chunk(con,
>  							 &con->v1.in_sr_kvec,
>  							 con->v1.in_sr_len,
>  							 &crc);
> -		else if (cursor->sr_resid > 0)
> +			len = con->v1.in_sr_kvec.iov_len - len;
> +		} else if (cursor->sr_resid > 0) {
> +			len = cursor->sr_resid;
>  			ret = read_partial_sparse_msg_extent(con, &crc);
> -
> -		if (ret <= 0) {
> -			if (do_datacrc)
> -				con->in_data_crc = crc;
> -			return ret;
> +			len -= cursor->sr_resid;
>  		}
> +		cursor->sr_total_resid -= len;
> +		if (ret <= 0)
> +			break;
>  
>  		memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
>  		ret = con->ops->sparse_read(con, cursor,
>  				(char **)&con->v1.in_sr_kvec.iov_base);
> +		if (ret <= 0) {
> +			ret = ret ? : 1; /* must return > 0 to indicate success */
> +			break;
> +		}
>  		con->v1.in_sr_len = ret;
> -	} while (ret > 0);
> +	}
>  
>  	if (do_datacrc)
>  		con->in_data_crc = crc;
>  
> -	return ret < 0 ? ret : 1;  /* must return > 0 to indicate success */
> +	return ret;
>  }
>  
>  static int read_partial_msg_data(struct ceph_connection *con)

Looking back over this code...

The way it works today, once we determine it's a sparse read, we call
read_sparse_msg_data. At that point we call either
read_partial_message_chunk (to read into the kvec) or
read_sparse_msg_extent if sr_resid is already set (indicating that we're
receiving an extent).

read_sparse_msg_extent calls ceph_tcp_recvpage in a loop until
cursor->sr_resid bytes have been received. The exception is when
ceph_tcp_recvpage returns <= 0.

ceph_tcp_recvpage returns 0 if sock_recvmsg returns -EAGAIN (maybe also
in other cases). So it sounds like the client just timed out on a read
from the socket or caught a signal or something?

If that's correct, then do we know what ceph_tcp_recvpage returned when
the problem happened?
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket
  2024-01-18 14:03   ` Jeff Layton
@ 2024-01-19  4:07     ` Xiubo Li
  2024-01-19 11:03       ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Xiubo Li @ 2024-01-19  4:07 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: idryomov, vshankar, mchangir


On 1/18/24 22:03, Jeff Layton wrote:
> On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> Once this happens it means there is a bug somewhere.
>>
>> URL: https://tracker.ceph.com/issues/63586
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   net/ceph/osd_client.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> index 9be80d01c1dc..f8029b30a3fb 100644
>> --- a/net/ceph/osd_client.c
>> +++ b/net/ceph/osd_client.c
>> @@ -5912,6 +5912,13 @@ static int osd_sparse_read(struct ceph_connection *con,
>>   		fallthrough;
>>   	case CEPH_SPARSE_READ_DATA:
>>   		if (sr->sr_index >= count) {
>> +			if (sr->sr_datalen) {
>> +				pr_warn_ratelimited("sr_datalen %u sr_index %d count %u\n",
>> +						    sr->sr_datalen, sr->sr_index,
>> +						    count);
>> +				return -EREMOTEIO;
>> +			}
>> +
> Ok, so the server has (presumably) sent us a longer value for the
> sr_datalen than was in the extent map?
>
> Why should the sparse read engine care about that? It was (presumably)
> able to do its job of handling the read. Why not just advance past the
> extra junk and try to do another sparse read? Do we really need to fail
> the op for this?

Hi Jeff,

I saw this problem when I was first debugging the sparse-read bug, and 
the length can be very large; for more detail and the logs please see 
the tracker https://tracker.ceph.com/issues/63586:

      11741055 <4>[180940.606488] libceph: sr_datalen 251723776 sr_index 0 count 0

In this case the same request could end up being retried infinitely. 
More generally, when ceph sends an incorrect data length, what should 
we do? Should we retry, and how could we skip the data if the length is 
that large?

Thanks

- Xiubo


>
>>   			sr->sr_state = CEPH_SPARSE_READ_HDR;
>>   			goto next_op;
>>   		}
>> @@ -5919,6 +5926,8 @@ static int osd_sparse_read(struct ceph_connection *con,
>>   		eoff = sr->sr_extent[sr->sr_index].off;
>>   		elen = sr->sr_extent[sr->sr_index].len;
>>   
>> +		sr->sr_datalen -= elen;
>> +
>>   		dout("[%d] ext %d off 0x%llx len 0x%llx\n",
>>   		     o->o_osd, sr->sr_index, eoff, elen);
>>   



* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-18 18:24   ` Jeff Layton
@ 2024-01-19  4:35     ` Xiubo Li
  2024-01-19 11:09       ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Xiubo Li @ 2024-01-19  4:35 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: idryomov, vshankar, mchangir


On 1/19/24 02:24, Jeff Layton wrote:
> On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> A message from ceph may be split across multiple socket packets, and
>> we just need to wait for all of the data to become available on the
>> socket.
>>
>> Add 'sr_total_resid' to record the total remaining length of all data
>> items in a sparse-read message.
>>
>> URL: https://tracker.ceph.com/issues/63586
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   include/linux/ceph/messenger.h |  1 +
>>   net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
>> index 2eaaabbe98cb..ca6f82abed62 100644
>> --- a/include/linux/ceph/messenger.h
>> +++ b/include/linux/ceph/messenger.h
>> @@ -231,6 +231,7 @@ struct ceph_msg_data {
>>   
>>   struct ceph_msg_data_cursor {
>>   	size_t			total_resid;	/* across all data items */
>> +	size_t			sr_total_resid;	/* across all data items for sparse-read */
>>   
>>   	struct ceph_msg_data	*data;		/* current data item */
>>   	size_t			resid;		/* bytes not yet consumed */
>> diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
>> index 4cb60bacf5f5..2733da891688 100644
>> --- a/net/ceph/messenger_v1.c
>> +++ b/net/ceph/messenger_v1.c
>> @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
>>   static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
>>   {
>>   	/* Initialize data cursor if it's not a sparse read */
>> -	if (!msg->sparse_read)
>> +	if (msg->sparse_read)
>> +		msg->cursor.sr_total_resid = data_len;
>> +	else
>>   		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
>>   }
>>   
>> @@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
>>   	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
>>   	u32 crc = 0;
>>   	int ret = 1;
>> +	int len;
>>   
>>   	if (do_datacrc)
>>   		crc = con->in_data_crc;
>>   
>> -	do {
>> -		if (con->v1.in_sr_kvec.iov_base)
>> +	while (cursor->sr_total_resid) {
>> +		len = 0;
>> +		if (con->v1.in_sr_kvec.iov_base) {
>> +			len = con->v1.in_sr_kvec.iov_len;
>>   			ret = read_partial_message_chunk(con,
>>   							 &con->v1.in_sr_kvec,
>>   							 con->v1.in_sr_len,
>>   							 &crc);
>> -		else if (cursor->sr_resid > 0)
>> +			len = con->v1.in_sr_kvec.iov_len - len;
>> +		} else if (cursor->sr_resid > 0) {
>> +			len = cursor->sr_resid;
>>   			ret = read_partial_sparse_msg_extent(con, &crc);
>> -
>> -		if (ret <= 0) {
>> -			if (do_datacrc)
>> -				con->in_data_crc = crc;
>> -			return ret;
>> +			len -= cursor->sr_resid;
>>   		}
>> +		cursor->sr_total_resid -= len;
>> +		if (ret <= 0)
>> +			break;
>>   
>>   		memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
>>   		ret = con->ops->sparse_read(con, cursor,
>>   				(char **)&con->v1.in_sr_kvec.iov_base);
>> +		if (ret <= 0) {
>> +			ret = ret ? : 1; /* must return > 0 to indicate success */
>> +			break;
>> +		}
>>   		con->v1.in_sr_len = ret;
>> -	} while (ret > 0);
>> +	}
>>   
>>   	if (do_datacrc)
>>   		con->in_data_crc = crc;
>>   
>> -	return ret < 0 ? ret : 1;  /* must return > 0 to indicate success */
>> +	return ret;
>>   }
>>   
>>   static int read_partial_msg_data(struct ceph_connection *con)
> Looking back over this code...
>
> The way it works today, once we determine it's a sparse read, we call
> read_sparse_msg_data. At that point we call either
> read_partial_message_chunk (to read into the kvec) or
> read_sparse_msg_extent if sr_resid is already set (indicating that we're
> receiving an extent).
>
> read_sparse_msg_extent calls ceph_tcp_recvpage in a loop until
> cursor->sr_resid have been received. The exception there when
> ceph_tcp_recvpage returns <= 0.
>
> ceph_tcp_recvpage returns 0 if sock_recvmsg returns -EAGAIN (maybe also
> in other cases). So it sounds like the client just timed out on a read
> from the socket or caught a signal or something?
>
> If that's correct, then do we know what ceph_tcp_recvpage returned when
> the problem happened?

It should just have returned partial data, and we should continue from 
here on the next loop iteration when the rest of the data arrives.

Thanks

- Xiubo





* Re: [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket
  2024-01-19  4:07     ` Xiubo Li
@ 2024-01-19 11:03       ` Jeff Layton
  2024-01-22  3:17         ` Xiubo Li
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2024-01-19 11:03 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel; +Cc: idryomov, vshankar, mchangir

On Fri, 2024-01-19 at 12:07 +0800, Xiubo Li wrote:
> On 1/18/24 22:03, Jeff Layton wrote:
> > On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> > > From: Xiubo Li <xiubli@redhat.com>
> > > 
> > > Once this happens it means there is a bug somewhere.
> > > 
> > > URL: https://tracker.ceph.com/issues/63586
> > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > ---
> > >   net/ceph/osd_client.c | 9 +++++++++
> > >   1 file changed, 9 insertions(+)
> > > 
> > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > > index 9be80d01c1dc..f8029b30a3fb 100644
> > > --- a/net/ceph/osd_client.c
> > > +++ b/net/ceph/osd_client.c
> > > @@ -5912,6 +5912,13 @@ static int osd_sparse_read(struct ceph_connection *con,
> > >   		fallthrough;
> > >   	case CEPH_SPARSE_READ_DATA:
> > >   		if (sr->sr_index >= count) {
> > > +			if (sr->sr_datalen) {
> > > +				pr_warn_ratelimited("sr_datalen %u sr_index %d count %u\n",
> > > +						    sr->sr_datalen, sr->sr_index,
> > > +						    count);
> > > +				return -EREMOTEIO;
> > > +			}
> > > +
> > Ok, so the server has (presumably) sent us a longer value for the
> > sr_datalen than was in the extent map?
> > 
> > Why should the sparse read engine care about that? It was (presumably)
> > able to do its job of handling the read. Why not just advance past the
> > extra junk and try to do another sparse read? Do we really need to fail
> > the op for this?
> 
> Hi Jeff,
> 
> I saw this problem when I was first debugging the sparse-read bug, and 
> the length can be very large; for more detail and the logs please see 
> the tracker https://tracker.ceph.com/issues/63586:
> 
>       11741055 <4>[180940.606488] libceph: sr_datalen 251723776 sr_index 
> 0 count 0
> 
> In this case the request could cause the same request to be retried 
> infinitely. And in the general case, when ceph sends an incorrect 
> data length, what should we do? Should we retry it? How could we skip 
> it if the length is so large?
> 
> 

If the sparse_read datalen is so long that it goes beyond the length of
the entire frame, then it's clearly malformed and we have to reject it.

I do wonder though whether this is the right place to do that. It seems
like the client ought to do that sort of vetting of lengths before it
starts dealing with the read data.

IOW, maybe there should be a simple check in the
CEPH_SPARSE_READ_DATA_LEN case that validates that the sr_datalen fits
inside the "data_len" of the entire frame?

> 
> 
> > 
> > >   			sr->sr_state = CEPH_SPARSE_READ_HDR;
> > >   			goto next_op;
> > >   		}
> > > @@ -5919,6 +5926,8 @@ static int osd_sparse_read(struct ceph_connection *con,
> > >   		eoff = sr->sr_extent[sr->sr_index].off;
> > >   		elen = sr->sr_extent[sr->sr_index].len;
> > >   
> > > +		sr->sr_datalen -= elen;
> > > +
> > >   		dout("[%d] ext %d off 0x%llx len 0x%llx\n",
> > >   		     o->o_osd, sr->sr_index, eoff, elen);
> > >   
> 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-19  4:35     ` Xiubo Li
@ 2024-01-19 11:09       ` Jeff Layton
  2024-01-22  2:52         ` Xiubo Li
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2024-01-19 11:09 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel; +Cc: idryomov, vshankar, mchangir

On Fri, 2024-01-19 at 12:35 +0800, Xiubo Li wrote:
> On 1/19/24 02:24, Jeff Layton wrote:
> > On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> > > From: Xiubo Li <xiubli@redhat.com>
> > > 
> > > The messages from ceph may be split into multiple socket packets
> > > and we just need to wait for all the data to be available on the
> > > socket.
> > > 
> > > This will add 'sr_total_resid' to record the total length for all
> > > data items for sparse-read message and 'sr_resid_elen' to record
> > > the current extent total length.
> > > 
> > > URL: https://tracker.ceph.com/issues/63586
> > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > ---
> > >   include/linux/ceph/messenger.h |  1 +
> > >   net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
> > >   2 files changed, 22 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> > > index 2eaaabbe98cb..ca6f82abed62 100644
> > > --- a/include/linux/ceph/messenger.h
> > > +++ b/include/linux/ceph/messenger.h
> > > @@ -231,6 +231,7 @@ struct ceph_msg_data {
> > >   
> > >   struct ceph_msg_data_cursor {
> > >   	size_t			total_resid;	/* across all data items */
> > > +	size_t			sr_total_resid;	/* across all data items for sparse-read */
> > >   
> > >   	struct ceph_msg_data	*data;		/* current data item */
> > >   	size_t			resid;		/* bytes not yet consumed */
> > > diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> > > index 4cb60bacf5f5..2733da891688 100644
> > > --- a/net/ceph/messenger_v1.c
> > > +++ b/net/ceph/messenger_v1.c
> > > @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
> > >   static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
> > >   {
> > >   	/* Initialize data cursor if it's not a sparse read */
> > > -	if (!msg->sparse_read)
> > > +	if (msg->sparse_read)
> > > +		msg->cursor.sr_total_resid = data_len;
> > > +	else
> > >   		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
> > >   }
> > >   
> > > @@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
> > >   	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
> > >   	u32 crc = 0;
> > >   	int ret = 1;
> > > +	int len;
> > >   
> > >   	if (do_datacrc)
> > >   		crc = con->in_data_crc;
> > >   
> > > -	do {
> > > -		if (con->v1.in_sr_kvec.iov_base)
> > > +	while (cursor->sr_total_resid) {
> > > +		len = 0;
> > > +		if (con->v1.in_sr_kvec.iov_base) {
> > > +			len = con->v1.in_sr_kvec.iov_len;
> > >   			ret = read_partial_message_chunk(con,
> > >   							 &con->v1.in_sr_kvec,
> > >   							 con->v1.in_sr_len,
> > >   							 &crc);
> > > -		else if (cursor->sr_resid > 0)
> > > +			len = con->v1.in_sr_kvec.iov_len - len;
> > > +		} else if (cursor->sr_resid > 0) {
> > > +			len = cursor->sr_resid;
> > >   			ret = read_partial_sparse_msg_extent(con, &crc);
> > > -
> > > -		if (ret <= 0) {
> > > -			if (do_datacrc)
> > > -				con->in_data_crc = crc;
> > > -			return ret;
> > > +			len -= cursor->sr_resid;
> > >   		}
> > > +		cursor->sr_total_resid -= len;
> > > +		if (ret <= 0)
> > > +			break;
> > >   
> > >   		memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
> > >   		ret = con->ops->sparse_read(con, cursor,
> > >   				(char **)&con->v1.in_sr_kvec.iov_base);
> > > +		if (ret <= 0) {
> > > +			ret = ret ? : 1; /* must return > 0 to indicate success */
> > > +			break;
> > > +		}
> > >   		con->v1.in_sr_len = ret;
> > > -	} while (ret > 0);
> > > +	}
> > >   
> > >   	if (do_datacrc)
> > >   		con->in_data_crc = crc;
> > >   
> > > -	return ret < 0 ? ret : 1;  /* must return > 0 to indicate success */
> > > +	return ret;
> > >   }
> > >   
> > >   static int read_partial_msg_data(struct ceph_connection *con)
> > Looking back over this code...
> > 
> > The way it works today, once we determine it's a sparse read, we call
> > read_sparse_msg_data. At that point we call either
> > read_partial_message_chunk (to read into the kvec) or
> > read_sparse_msg_extent if sr_resid is already set (indicating that we're
> > receiving an extent).
> > 
> > read_sparse_msg_extent calls ceph_tcp_recvpage in a loop until
> > cursor->sr_resid bytes have been received. The exception is when
> > ceph_tcp_recvpage returns <= 0.
> > 
> > ceph_tcp_recvpage returns 0 if sock_recvmsg returns -EAGAIN (maybe also
> > in other cases). So it sounds like the client just timed out on a read
> > from the socket or caught a signal or something?
> > 
> > If that's correct, then do we know what ceph_tcp_recvpage returned when
> > the problem happened?
> 
> It should just return partial data, and we should continue from here in 
> the next loop when the rest of the data comes.
> 

Tracking this extra length seems like the wrong fix. We're already
looping in read_sparse_msg_extent until the sr_resid goes to 0. ISTM
that it's just that read_sparse_msg_extent is returning inappropriately
in the face of timeouts.

IOW, it does this:

                ret = ceph_tcp_recvpage(con->sock, rpage, (int)off, len);
                if (ret <= 0)
                        return ret;

...should it just not be returning there when ret == 0? Maybe it should
be retrying the recvpage instead?
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-19 11:09       ` Jeff Layton
@ 2024-01-22  2:52         ` Xiubo Li
  2024-01-22 11:44           ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Xiubo Li @ 2024-01-22  2:52 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: idryomov, vshankar, mchangir


On 1/19/24 19:09, Jeff Layton wrote:
> On Fri, 2024-01-19 at 12:35 +0800, Xiubo Li wrote:
>> On 1/19/24 02:24, Jeff Layton wrote:
>>> On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> The messages from ceph may be split into multiple socket packets
>>>> and we just need to wait for all the data to be available on the
>>>> socket.
>>>>
>>>> This will add 'sr_total_resid' to record the total length for all
>>>> data items for sparse-read message and 'sr_resid_elen' to record
>>>> the current extent total length.
>>>>
>>>> URL: https://tracker.ceph.com/issues/63586
>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>> ---
>>>>    include/linux/ceph/messenger.h |  1 +
>>>>    net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
>>>>    2 files changed, 22 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
>>>> index 2eaaabbe98cb..ca6f82abed62 100644
>>>> --- a/include/linux/ceph/messenger.h
>>>> +++ b/include/linux/ceph/messenger.h
>>>> @@ -231,6 +231,7 @@ struct ceph_msg_data {
>>>>    
>>>>    struct ceph_msg_data_cursor {
>>>>    	size_t			total_resid;	/* across all data items */
>>>> +	size_t			sr_total_resid;	/* across all data items for sparse-read */
>>>>    
>>>>    	struct ceph_msg_data	*data;		/* current data item */
>>>>    	size_t			resid;		/* bytes not yet consumed */
>>>> diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
>>>> index 4cb60bacf5f5..2733da891688 100644
>>>> --- a/net/ceph/messenger_v1.c
>>>> +++ b/net/ceph/messenger_v1.c
>>>> @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
>>>>    static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
>>>>    {
>>>>    	/* Initialize data cursor if it's not a sparse read */
>>>> -	if (!msg->sparse_read)
>>>> +	if (msg->sparse_read)
>>>> +		msg->cursor.sr_total_resid = data_len;
>>>> +	else
>>>>    		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
>>>>    }
>>>>    
>>>> @@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
>>>>    	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
>>>>    	u32 crc = 0;
>>>>    	int ret = 1;
>>>> +	int len;
>>>>    
>>>>    	if (do_datacrc)
>>>>    		crc = con->in_data_crc;
>>>>    
>>>> -	do {
>>>> -		if (con->v1.in_sr_kvec.iov_base)
>>>> +	while (cursor->sr_total_resid) {
>>>> +		len = 0;
>>>> +		if (con->v1.in_sr_kvec.iov_base) {
>>>> +			len = con->v1.in_sr_kvec.iov_len;
>>>>    			ret = read_partial_message_chunk(con,
>>>>    							 &con->v1.in_sr_kvec,
>>>>    							 con->v1.in_sr_len,
>>>>    							 &crc);
>>>> -		else if (cursor->sr_resid > 0)
>>>> +			len = con->v1.in_sr_kvec.iov_len - len;
>>>> +		} else if (cursor->sr_resid > 0) {
>>>> +			len = cursor->sr_resid;
>>>>    			ret = read_partial_sparse_msg_extent(con, &crc);
>>>> -
>>>> -		if (ret <= 0) {
>>>> -			if (do_datacrc)
>>>> -				con->in_data_crc = crc;
>>>> -			return ret;
>>>> +			len -= cursor->sr_resid;
>>>>    		}
>>>> +		cursor->sr_total_resid -= len;
>>>> +		if (ret <= 0)
>>>> +			break;
>>>>    
>>>>    		memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
>>>>    		ret = con->ops->sparse_read(con, cursor,
>>>>    				(char **)&con->v1.in_sr_kvec.iov_base);
>>>> +		if (ret <= 0) {
>>>> +			ret = ret ? : 1; /* must return > 0 to indicate success */
>>>> +			break;
>>>> +		}
>>>>    		con->v1.in_sr_len = ret;
>>>> -	} while (ret > 0);
>>>> +	}
>>>>    
>>>>    	if (do_datacrc)
>>>>    		con->in_data_crc = crc;
>>>>    
>>>> -	return ret < 0 ? ret : 1;  /* must return > 0 to indicate success */
>>>> +	return ret;
>>>>    }
>>>>    
>>>>    static int read_partial_msg_data(struct ceph_connection *con)
>>> Looking back over this code...
>>>
>>> The way it works today, once we determine it's a sparse read, we call
>>> read_sparse_msg_data. At that point we call either
>>> read_partial_message_chunk (to read into the kvec) or
>>> read_sparse_msg_extent if sr_resid is already set (indicating that we're
>>> receiving an extent).
>>>
>>> read_sparse_msg_extent calls ceph_tcp_recvpage in a loop until
>>> cursor->sr_resid bytes have been received. The exception is when
>>> ceph_tcp_recvpage returns <= 0.
>>>
>>> ceph_tcp_recvpage returns 0 if sock_recvmsg returns -EAGAIN (maybe also
>>> in other cases). So it sounds like the client just timed out on a read
>>> from the socket or caught a signal or something?
>>>
>>> If that's correct, then do we know what ceph_tcp_recvpage returned when
>>> the problem happened?
>> It should just return partial data, and we should continue from here in
>> the next loop when the rest of the data comes.
>>
> Tracking this extra length seems like the wrong fix. We're already
> looping in read_sparse_msg_extent until the sr_resid goes to 0.
Yeah, it is and it works well.
>   ISTM
> that it's just that read_sparse_msg_extent is returning inappropriately
> in the face of timeouts.
>
> IOW, it does this:
>
>                  ret = ceph_tcp_recvpage(con->sock, rpage, (int)off, len);
>                  if (ret <= 0)
>                          return ret;
>
> ...should it just not be returning there when ret == 0? Maybe it should
> be retrying the recvpage instead?

Currently ceph_tcp_recvpage() will read data without blocking. If we 
change the logic here then all the other non-sparse-read cases will 
have to be changed too.

Please note this won't fix anything for this bug.

Because currently the sparse-read code works well if in step 4 it only 
partially reads the sparse-read data or extents.

But consider the case of partially reading the 'footer' in step 5. What 
we need to guarantee is that in the second loop we skip triggering a new 
sparse-read in step 4:

1, /* header */      ===> skips and does nothing if it already read the 
'header' data in the last loop

2, /* front */       ===> skips and does nothing if it already read the 
'front' data in the last loop

3, /* middle */      ===> skips and does nothing if it already read the 
'middle' data in the last loop

4, /* (page) data */ ===> sparse-read here; it should likewise skip and 
do nothing if it already read the whole 'sparse-read' data in the last 
loop, but it won't. This is the ROOT CAUSE of this bug.

5, /* footer */      ===> 'read_partial()' will only read partial 
'footer' data and then needs to loop again from /* header */ when the 
rest of the data comes

My patch guarantees that the sparse-read code will do nothing in that 
case, while currently the code triggers a new sparse-read from the 
beginning again, which is incorrect.

Jeff, please let me know if you have better approaches.  The last one 
Ilya mentioned didn't work.

Thanks

- Xiubo



* Re: [PATCH v4 1/3] libceph: fail the sparse-read if there still has data in socket
  2024-01-19 11:03       ` Jeff Layton
@ 2024-01-22  3:17         ` Xiubo Li
  0 siblings, 0 replies; 20+ messages in thread
From: Xiubo Li @ 2024-01-22  3:17 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: idryomov, vshankar, mchangir


On 1/19/24 19:03, Jeff Layton wrote:
> On Fri, 2024-01-19 at 12:07 +0800, Xiubo Li wrote:
>> On 1/18/24 22:03, Jeff Layton wrote:
>>> On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> Once this happens it means there are bugs.
>>>>
>>>> URL: https://tracker.ceph.com/issues/63586
>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>> ---
>>>>    net/ceph/osd_client.c | 9 +++++++++
>>>>    1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>>>> index 9be80d01c1dc..f8029b30a3fb 100644
>>>> --- a/net/ceph/osd_client.c
>>>> +++ b/net/ceph/osd_client.c
>>>> @@ -5912,6 +5912,13 @@ static int osd_sparse_read(struct ceph_connection *con,
>>>>    		fallthrough;
>>>>    	case CEPH_SPARSE_READ_DATA:
>>>>    		if (sr->sr_index >= count) {
>>>> +			if (sr->sr_datalen) {
>>>> +				pr_warn_ratelimited("sr_datalen %u sr_index %d count %u\n",
>>>> +						    sr->sr_datalen, sr->sr_index,
>>>> +						    count);
>>>> +				return -EREMOTEIO;
>>>> +			}
>>>> +
>>> Ok, so the server has (presumably) sent us a longer value for the
>>> sr_datalen than was in the extent map?
>>>
>>> Why should the sparse read engine care about that? It was (presumably)
>>> able to do its job of handling the read. Why not just advance past the
>>> extra junk and try to do another sparse read? Do we really need to fail
>>> the op for this?
>> Hi Jeff,
>>
>> I saw this problem when I was first debugging the sparse-read bug, and
>> the length can be very large; for more detail and the logs please see
>> the tracker https://tracker.ceph.com/issues/63586:
>>
>>        11741055 <4>[180940.606488] libceph: sr_datalen 251723776 sr_index
>> 0 count 0
>>
>> In this case the request could cause the same request to be retried
>> infinitely. And in the general case, when ceph sends an incorrect
>> data length, what should we do? Should we retry it? How could we skip
>> it if the length is so large?
>>
>>
> If the sparse_read datalen is so long that it goes beyond the length of
> the entire frame, then it's clearly malformed and we have to reject it.
>
> I do wonder though whether this is the right place to do that. It seems
> like the client ought to do that sort of vetting of lengths before it
> starts dealing with the read data.
>
> IOW, maybe there should be a simple check in the
> CEPH_SPARSE_READ_DATA_LEN case that validates that the sr_datalen fits
> inside the "data_len" of the entire frame?

Well it should be in CEPH_SPARSE_READ_DATA_PRE instead.

I can improve this.

Thanks

- Xiubo

>>
>>>>    			sr->sr_state = CEPH_SPARSE_READ_HDR;
>>>>    			goto next_op;
>>>>    		}
>>>> @@ -5919,6 +5926,8 @@ static int osd_sparse_read(struct ceph_connection *con,
>>>>    		eoff = sr->sr_extent[sr->sr_index].off;
>>>>    		elen = sr->sr_extent[sr->sr_index].len;
>>>>    
>>>> +		sr->sr_datalen -= elen;
>>>> +
>>>>    		dout("[%d] ext %d off 0x%llx len 0x%llx\n",
>>>>    		     o->o_osd, sr->sr_index, eoff, elen);
>>>>    



* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-22  2:52         ` Xiubo Li
@ 2024-01-22 11:44           ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2024-01-22 11:44 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel; +Cc: idryomov, vshankar, mchangir

On Mon, 2024-01-22 at 10:52 +0800, Xiubo Li wrote:
> On 1/19/24 19:09, Jeff Layton wrote:
> > On Fri, 2024-01-19 at 12:35 +0800, Xiubo Li wrote:
> > > On 1/19/24 02:24, Jeff Layton wrote:
> > > > On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> > > > > From: Xiubo Li <xiubli@redhat.com>
> > > > > 
> > > > > The messages from ceph may be split into multiple socket packets
> > > > > and we just need to wait for all the data to be available on the
> > > > > socket.
> > > > > 
> > > > > This will add 'sr_total_resid' to record the total length for all
> > > > > data items for sparse-read message and 'sr_resid_elen' to record
> > > > > the current extent total length.
> > > > > 
> > > > > URL: https://tracker.ceph.com/issues/63586
> > > > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > > > ---
> > > > >    include/linux/ceph/messenger.h |  1 +
> > > > >    net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
> > > > >    2 files changed, 22 insertions(+), 11 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> > > > > index 2eaaabbe98cb..ca6f82abed62 100644
> > > > > --- a/include/linux/ceph/messenger.h
> > > > > +++ b/include/linux/ceph/messenger.h
> > > > > @@ -231,6 +231,7 @@ struct ceph_msg_data {
> > > > >    
> > > > >    struct ceph_msg_data_cursor {
> > > > >    	size_t			total_resid;	/* across all data items */
> > > > > +	size_t			sr_total_resid;	/* across all data items for sparse-read */
> > > > >    
> > > > >    	struct ceph_msg_data	*data;		/* current data item */
> > > > >    	size_t			resid;		/* bytes not yet consumed */
> > > > > diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> > > > > index 4cb60bacf5f5..2733da891688 100644
> > > > > --- a/net/ceph/messenger_v1.c
> > > > > +++ b/net/ceph/messenger_v1.c
> > > > > @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
> > > > >    static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
> > > > >    {
> > > > >    	/* Initialize data cursor if it's not a sparse read */
> > > > > -	if (!msg->sparse_read)
> > > > > +	if (msg->sparse_read)
> > > > > +		msg->cursor.sr_total_resid = data_len;
> > > > > +	else
> > > > >    		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
> > > > >    }
> > > > >    
> > > > > @@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
> > > > >    	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
> > > > >    	u32 crc = 0;
> > > > >    	int ret = 1;
> > > > > +	int len;
> > > > >    
> > > > >    	if (do_datacrc)
> > > > >    		crc = con->in_data_crc;
> > > > >    
> > > > > -	do {
> > > > > -		if (con->v1.in_sr_kvec.iov_base)
> > > > > +	while (cursor->sr_total_resid) {
> > > > > +		len = 0;
> > > > > +		if (con->v1.in_sr_kvec.iov_base) {
> > > > > +			len = con->v1.in_sr_kvec.iov_len;
> > > > >    			ret = read_partial_message_chunk(con,
> > > > >    							 &con->v1.in_sr_kvec,
> > > > >    							 con->v1.in_sr_len,
> > > > >    							 &crc);
> > > > > -		else if (cursor->sr_resid > 0)
> > > > > +			len = con->v1.in_sr_kvec.iov_len - len;
> > > > > +		} else if (cursor->sr_resid > 0) {
> > > > > +			len = cursor->sr_resid;
> > > > >    			ret = read_partial_sparse_msg_extent(con, &crc);
> > > > > -
> > > > > -		if (ret <= 0) {
> > > > > -			if (do_datacrc)
> > > > > -				con->in_data_crc = crc;
> > > > > -			return ret;
> > > > > +			len -= cursor->sr_resid;
> > > > >    		}
> > > > > +		cursor->sr_total_resid -= len;
> > > > > +		if (ret <= 0)
> > > > > +			break;
> > > > >    
> > > > >    		memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
> > > > >    		ret = con->ops->sparse_read(con, cursor,
> > > > >    				(char **)&con->v1.in_sr_kvec.iov_base);
> > > > > +		if (ret <= 0) {
> > > > > +			ret = ret ? : 1; /* must return > 0 to indicate success */
> > > > > +			break;
> > > > > +		}
> > > > >    		con->v1.in_sr_len = ret;
> > > > > -	} while (ret > 0);
> > > > > +	}
> > > > >    
> > > > >    	if (do_datacrc)
> > > > >    		con->in_data_crc = crc;
> > > > >    
> > > > > -	return ret < 0 ? ret : 1;  /* must return > 0 to indicate success */
> > > > > +	return ret;
> > > > >    }
> > > > >    
> > > > >    static int read_partial_msg_data(struct ceph_connection *con)
> > > > Looking back over this code...
> > > > 
> > > > The way it works today, once we determine it's a sparse read, we call
> > > > read_sparse_msg_data. At that point we call either
> > > > read_partial_message_chunk (to read into the kvec) or
> > > > read_sparse_msg_extent if sr_resid is already set (indicating that we're
> > > > receiving an extent).
> > > > 
> > > > read_sparse_msg_extent calls ceph_tcp_recvpage in a loop until
> > > > cursor->sr_resid bytes have been received. The exception is when
> > > > ceph_tcp_recvpage returns <= 0.
> > > > 
> > > > ceph_tcp_recvpage returns 0 if sock_recvmsg returns -EAGAIN (maybe also
> > > > in other cases). So it sounds like the client just timed out on a read
> > > > from the socket or caught a signal or something?
> > > > 
> > > > If that's correct, then do we know what ceph_tcp_recvpage returned when
> > > > the problem happened?
> > > It should just return partial data, and we should continue from here in
> > > the next loop when the rest of the data comes.
> > > 
> > Tracking this extra length seems like the wrong fix. We're already
> > looping in read_sparse_msg_extent until the sr_resid goes to 0.
> Yeah, it is and it works well.
> >   ISTM
> > that it's just that read_sparse_msg_extent is returning inappropriately
> > in the face of timeouts.
> > 
> > IOW, it does this:
> > 
> >                  ret = ceph_tcp_recvpage(con->sock, rpage, (int)off, len);
> >                  if (ret <= 0)
> >                          return ret;
> > 
> > ...should it just not be returning there when ret == 0? Maybe it should
> > be retrying the recvpage instead?
> 
> Currently ceph_tcp_recvpage() will read data without blocking. 

Yes.

> If we change the logic here then all the other non-sparse-read 
> cases will have to be changed too.
> 

No. read_sparse_msg_data is only called from the sparse-read codepath.
If we change it, only that will be affected.


> Please note this won't fix anything for this bug.
> 
> Because currently the sparse-read code works well if in step 4 it only 
> partially reads the sparse-read data or extents.
> 
> But consider the case of partially reading the 'footer' in step 5. What 
> we need to guarantee is that in the second loop we skip triggering a new 
> sparse-read in step 4:
> 
> 1, /* header */      ===> skips and does nothing if it already read the 
> 'header' data in the last loop
> 
> 2, /* front */       ===> skips and does nothing if it already read the 
> 'front' data in the last loop
> 
> 3, /* middle */      ===> skips and does nothing if it already read the 
> 'middle' data in the last loop
> 
> 4, /* (page) data */ ===> sparse-read here; it should likewise skip and 
> do nothing if it already read the whole 'sparse-read' data in the last 
> loop, but it won't. This is the ROOT CAUSE of this bug.
> 
> 5, /* footer */      ===> 'read_partial()' will only read partial 
> 'footer' data and then needs to loop again from /* header */ when the 
> rest of the data comes
> 
> My patch guarantees that the sparse-read code will do nothing in that 
> case, while currently the code triggers a new sparse-read from the 
> beginning again, which is incorrect.
> 
> Jeff, please let me know if you have better approaches.  The last one 
> Ilya mentioned didn't work.

Your patch tries to track sr_resid independently in a new variable
sr_total_resid, but I think that's unnecessary.

read_sparse_msg_data returns under two conditions:

1. It has read all of the sparse read data (i.e. sr_resid is 0), in
which case it returns 1.

...or...

2. ceph_tcp_recvpage returns a negative error code or 0.

After your patch, the only way you'd get a case where sr_total_resid
is >0 is if case #2 happens. Clearly if we receive all of the data then
sr_total_resid will also be 0.

We want to return an error if there is a hard error from
ceph_tcp_recvpage, but it looks like it also returns 0 if the socket
read returns -EAGAIN. So, it seems to me that doing something like this
would be a sufficient fix. What am I missing?

diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index f9a50d7f0d20..cf94ebdb3b34 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -1015,8 +1015,10 @@ static int read_sparse_msg_extent(struct ceph_connection *con, u32 *crc)
                /* clamp to what remains in extent */
                len = min_t(int, len, cursor->sr_resid);
                ret = ceph_tcp_recvpage(con->sock, rpage, (int)off, len);
-               if (ret <= 0)
+               if (ret < 0)
                        return ret;
+               else if (ret == 0)
+                       continue;
                *crc = ceph_crc32c_page(*crc, rpage, off, ret);
                ceph_msg_data_advance(cursor, (size_t)ret);
                cursor->sr_resid -= ret;

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-18 10:50 ` [PATCH v4 3/3] libceph: just wait for more data to be available on the socket xiubli
  2024-01-18 14:36   ` Jeff Layton
  2024-01-18 18:24   ` Jeff Layton
@ 2024-01-22 15:02   ` Jeff Layton
  2024-01-22 16:55     ` Ilya Dryomov
  2 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2024-01-22 15:02 UTC (permalink / raw)
  To: xiubli, ceph-devel; +Cc: idryomov, vshankar, mchangir

On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> The messages from ceph may be split into multiple socket packets
> and we just need to wait for all the data to be available on the
> socket.
> 
> This will add 'sr_total_resid' to record the total length for all
> data items for sparse-read message and 'sr_resid_elen' to record
> the current extent total length.
> 
> URL: https://tracker.ceph.com/issues/63586
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  include/linux/ceph/messenger.h |  1 +
>  net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
>  2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index 2eaaabbe98cb..ca6f82abed62 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -231,6 +231,7 @@ struct ceph_msg_data {
>  
>  struct ceph_msg_data_cursor {
>  	size_t			total_resid;	/* across all data items */
> +	size_t			sr_total_resid;	/* across all data items for sparse-read */
>  
>  	struct ceph_msg_data	*data;		/* current data item */
>  	size_t			resid;		/* bytes not yet consumed */
> diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> index 4cb60bacf5f5..2733da891688 100644
> --- a/net/ceph/messenger_v1.c
> +++ b/net/ceph/messenger_v1.c
> @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
>  static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
>  {
>  	/* Initialize data cursor if it's not a sparse read */
> -	if (!msg->sparse_read)
> +	if (msg->sparse_read)
> +		msg->cursor.sr_total_resid = data_len;
> +	else
>  		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
>  }
> 
> 

Sorry, disregard my last email.

I went and looked at the patch in the tree, and I better understand what
you're trying to do. We already have a value that gets set to data_len
called total_resid, but that is a nonsense value in a sparse read,
because we can advance farther through the receive buffer than the
amount of data in the socket.

So, I think you do need a separate value like this that's only used in
sparse reads, or you'll have to change how ->total_resid is handled
during a sparse read. The first way is simpler, so I agree with the
approach.

That said, if you're going to add a new value to ceph_msg_data_cursor,
then you really ought to initialize it in ceph_msg_data_cursor_init. You
don't necessarily have to use it elsewhere, but that would be cleaner
than the if statement above.

More comments below.

>  
> @@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
>  	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
>  	u32 crc = 0;
>  	int ret = 1;
> +	int len;
>  
>  	if (do_datacrc)
>  		crc = con->in_data_crc;
>  
> -	do {
> -		if (con->v1.in_sr_kvec.iov_base)
> +	while (cursor->sr_total_resid) {
> +		len = 0;
> +		if (con->v1.in_sr_kvec.iov_base) {
> +			len = con->v1.in_sr_kvec.iov_len;
>  			ret = read_partial_message_chunk(con,
>  							 &con->v1.in_sr_kvec,
>  							 con->v1.in_sr_len,
>  							 &crc);
> -		else if (cursor->sr_resid > 0)
> +			len = con->v1.in_sr_kvec.iov_len - len;

len is going to be 0 after the above statement. I'm not sure I
understand the point of messing with its value in the above block.

> +		} else if (cursor->sr_resid > 0) {
> +			len = cursor->sr_resid;
>  			ret = read_partial_sparse_msg_extent(con, &crc);
> -
> -		if (ret <= 0) {
> -			if (do_datacrc)
> -				con->in_data_crc = crc;
> -			return ret;
> +			len -= cursor->sr_resid;


Won't "len" also be 0 after the above?

>  		}
> +		cursor->sr_total_resid -= len;


...and so sr_total_resid always decrements by 0? That doesn't look like
it's doing the right thing.

> +		if (ret <= 0)
> +			break;
>  
>  		memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
>  		ret = con->ops->sparse_read(con, cursor,
>  				(char **)&con->v1.in_sr_kvec.iov_base);
> +		if (ret <= 0) {
> +			ret = ret ? : 1; /* must return > 0 to indicate success */
> +			break;
> +		}
>  		con->v1.in_sr_len = ret;
> -	} while (ret > 0);
> +	}
>  
>  	if (do_datacrc)
>  		con->in_data_crc = crc;
>  
> -	return ret < 0 ? ret : 1;  /* must return > 0 to indicate success */
> +	return ret;
>  }
>  
>  static int read_partial_msg_data(struct ceph_connection *con)


-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-22 15:02   ` Jeff Layton
@ 2024-01-22 16:55     ` Ilya Dryomov
  2024-01-22 17:14       ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Ilya Dryomov @ 2024-01-22 16:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: xiubli, ceph-devel, vshankar, mchangir

On Mon, Jan 22, 2024 at 4:02 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> > From: Xiubo Li <xiubli@redhat.com>
> >
> > The messages from ceph may be split across multiple socket packets
> > and we just need to wait for all the data to be available on the
> > socket.
> >
> > This adds 'sr_total_resid' to record the total length of all
> > data items for a sparse-read message and 'sr_resid_elen' to record
> > the current extent's total length.
> >
> > URL: https://tracker.ceph.com/issues/63586
> > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > ---
> >  include/linux/ceph/messenger.h |  1 +
> >  net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
> >  2 files changed, 22 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> > index 2eaaabbe98cb..ca6f82abed62 100644
> > --- a/include/linux/ceph/messenger.h
> > +++ b/include/linux/ceph/messenger.h
> > @@ -231,6 +231,7 @@ struct ceph_msg_data {
> >
> >  struct ceph_msg_data_cursor {
> >       size_t                  total_resid;    /* across all data items */
> > +     size_t                  sr_total_resid; /* across all data items for sparse-read */
> >
> >       struct ceph_msg_data    *data;          /* current data item */
> >       size_t                  resid;          /* bytes not yet consumed */
> > diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> > index 4cb60bacf5f5..2733da891688 100644
> > --- a/net/ceph/messenger_v1.c
> > +++ b/net/ceph/messenger_v1.c
> > @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
> >  static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
> >  {
> >       /* Initialize data cursor if it's not a sparse read */
> > -     if (!msg->sparse_read)
> > +     if (msg->sparse_read)
> > +             msg->cursor.sr_total_resid = data_len;
> > +     else
> >               ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
> >  }
> >
> >
>
> Sorry, disregard my last email.
>
> I went and looked at the patch in the tree, and I better understand what
> you're trying to do. We already have a value that gets set to data_len
> called total_resid, but that is a nonsense value in a sparse read,
> because we can advance farther through the receive buffer than the
> amount of data in the socket.

Hi Jeff,

I see that total_resid is set to sparse_data_requested(), which is
effectively the size of the receive buffer, not data_len.  (This is
ignoring the seemingly unnecessary complication of trying to support
normal reads mixed with sparse reads in the same message, which I'm
pretty sure doesn't work anyway.)

With that, total_resid should represent the amount that needs to be
filled in (advanced through) in the receive buffer.  When total_resid
reaches 0, wouldn't that mean that the amount of data in the socket is
also 0?  If not, where would the remaining data in the socket go?

Thanks,

                Ilya

* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-22 16:55     ` Ilya Dryomov
@ 2024-01-22 17:14       ` Jeff Layton
  2024-01-22 19:41         ` Ilya Dryomov
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2024-01-22 17:14 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: xiubli, ceph-devel, vshankar, mchangir

On Mon, 2024-01-22 at 17:55 +0100, Ilya Dryomov wrote:
> On Mon, Jan 22, 2024 at 4:02 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> > > From: Xiubo Li <xiubli@redhat.com>
> > > 
> > > The messages from ceph may be split across multiple socket packets
> > > and we just need to wait for all the data to be available on the
> > > socket.
> > >
> > > This adds 'sr_total_resid' to record the total length of all
> > > data items for a sparse-read message and 'sr_resid_elen' to record
> > > the current extent's total length.
> > > 
> > > URL: https://tracker.ceph.com/issues/63586
> > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > ---
> > >  include/linux/ceph/messenger.h |  1 +
> > >  net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
> > >  2 files changed, 22 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> > > index 2eaaabbe98cb..ca6f82abed62 100644
> > > --- a/include/linux/ceph/messenger.h
> > > +++ b/include/linux/ceph/messenger.h
> > > @@ -231,6 +231,7 @@ struct ceph_msg_data {
> > > 
> > >  struct ceph_msg_data_cursor {
> > >       size_t                  total_resid;    /* across all data items */
> > > +     size_t                  sr_total_resid; /* across all data items for sparse-read */
> > > 
> > >       struct ceph_msg_data    *data;          /* current data item */
> > >       size_t                  resid;          /* bytes not yet consumed */
> > > diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> > > index 4cb60bacf5f5..2733da891688 100644
> > > --- a/net/ceph/messenger_v1.c
> > > +++ b/net/ceph/messenger_v1.c
> > > @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
> > >  static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
> > >  {
> > >       /* Initialize data cursor if it's not a sparse read */
> > > -     if (!msg->sparse_read)
> > > +     if (msg->sparse_read)
> > > +             msg->cursor.sr_total_resid = data_len;
> > > +     else
> > >               ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
> > >  }
> > > 
> > > 
> > 
> > Sorry, disregard my last email.
> > 
> > I went and looked at the patch in the tree, and I better understand what
> > you're trying to do. We already have a value that gets set to data_len
> > called total_resid, but that is a nonsense value in a sparse read,
> > because we can advance farther through the receive buffer than the
> > amount of data in the socket.
> 
> Hi Jeff,
> 
> I see that total_resid is set to sparse_data_requested(), which is
> effectively the size of the receive buffer, not data_len.  (This is
> ignoring the seemingly unnecessary complication of trying to support
> normal reads mixed with sparse reads in the same message, which I'm
> pretty sure doesn't work anyway.)
> 

Oh right. I missed that bit when I was re-reviewing this. Once we're in
a sparse read we'll override that value. Ok, so maybe we don't need
sr_total_resid.

> With that, total_resid should represent the amount that needs to be
> filled in (advanced through) in the receive buffer.  When total_resid
> reaches 0, wouldn't that mean that the amount of data in the socket is
> also 0?  If not, where would the remaining data in the socket go?
> 

With a properly formed reply, then I think yes, there should be no
remaining data in the socket at the end of the receive.

At this point I think I must just be confused about the actual problem.
I think I need a detailed description of it before I can properly review
this patch.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-22 17:14       ` Jeff Layton
@ 2024-01-22 19:41         ` Ilya Dryomov
  2024-01-23  0:53           ` Xiubo Li
  0 siblings, 1 reply; 20+ messages in thread
From: Ilya Dryomov @ 2024-01-22 19:41 UTC (permalink / raw)
  To: Jeff Layton; +Cc: xiubli, ceph-devel, vshankar, mchangir

On Mon, Jan 22, 2024 at 6:14 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2024-01-22 at 17:55 +0100, Ilya Dryomov wrote:
> > On Mon, Jan 22, 2024 at 4:02 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
> > > > From: Xiubo Li <xiubli@redhat.com>
> > > >
> > > > The messages from ceph may be split across multiple socket packets
> > > > and we just need to wait for all the data to be available on the
> > > > socket.
> > > >
> > > > This adds 'sr_total_resid' to record the total length of all
> > > > data items for a sparse-read message and 'sr_resid_elen' to record
> > > > the current extent's total length.
> > > >
> > > > URL: https://tracker.ceph.com/issues/63586
> > > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > > ---
> > > >  include/linux/ceph/messenger.h |  1 +
> > > >  net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
> > > >  2 files changed, 22 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> > > > index 2eaaabbe98cb..ca6f82abed62 100644
> > > > --- a/include/linux/ceph/messenger.h
> > > > +++ b/include/linux/ceph/messenger.h
> > > > @@ -231,6 +231,7 @@ struct ceph_msg_data {
> > > >
> > > >  struct ceph_msg_data_cursor {
> > > >       size_t                  total_resid;    /* across all data items */
> > > > +     size_t                  sr_total_resid; /* across all data items for sparse-read */
> > > >
> > > >       struct ceph_msg_data    *data;          /* current data item */
> > > >       size_t                  resid;          /* bytes not yet consumed */
> > > > diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
> > > > index 4cb60bacf5f5..2733da891688 100644
> > > > --- a/net/ceph/messenger_v1.c
> > > > +++ b/net/ceph/messenger_v1.c
> > > > @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
> > > >  static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
> > > >  {
> > > >       /* Initialize data cursor if it's not a sparse read */
> > > > -     if (!msg->sparse_read)
> > > > +     if (msg->sparse_read)
> > > > +             msg->cursor.sr_total_resid = data_len;
> > > > +     else
> > > >               ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
> > > >  }
> > > >
> > > >
> > >
> > > Sorry, disregard my last email.
> > >
> > > I went and looked at the patch in the tree, and I better understand what
> > > you're trying to do. We already have a value that gets set to data_len
> > > called total_resid, but that is a nonsense value in a sparse read,
> > > because we can advance farther through the receive buffer than the
> > > amount of data in the socket.
> >
> > Hi Jeff,
> >
> > I see that total_resid is set to sparse_data_requested(), which is
> > effectively the size of the receive buffer, not data_len.  (This is
> > ignoring the seemingly unnecessary complication of trying to support
> > normal reads mixed with sparse reads in the same message, which I'm
> > pretty sure doesn't work anyway.)
> >
>
> Oh right. I missed that bit when I was re-reviewing this. Once we're in
> a sparse read we'll override that value. Ok, so maybe we don't need
> sr_total_resid.
>
> > With that, total_resid should represent the amount that needs to be
> > filled in (advanced through) in the receive buffer.  When total_resid
> > reaches 0, wouldn't that mean that the amount of data in the socket is
> > also 0?  If not, where would the remaining data in the socket go?
> >
>
> With a properly formed reply, then I think yes, there should be no
> remaining data in the socket at the end of the receive.

There would be no actual data left, but the message footer, which
follows the data section and ends the message, would still be
outstanding.

>
> At this point I think I must just be confused about the actual problem.
> I think I need a detailed description of it before I can properly review
> this patch.

AFAIU the problem is that a short read may occur while reading the
message footer from the socket.  Later, when the socket is ready for
another read, the messenger invokes all read_partial* handlers,
including read_partial_sparse_msg_data().  The contract between the
messenger and these handlers is that the handler should bail if the
area of the message it's responsible for is already processed.  So,
in this case, it's expected that read_sparse_msg_data() would bail,
allowing the messenger to invoke read_partial() for the footer and
pick up where it left off.

However read_sparse_msg_data() violates that contract and ends up
calling into the state machine in the OSD client.  The state machine
assumes that it's a new op and interprets some piece of the footer (or
even completely random memory?) as the sparse-read header and returns
bogus extent length, etc.

(BTW it's why I suggested the rename from read_sparse_msg_data() to
read_partial_sparse_msg_data() in another patch -- to highlight that
it's one of the "partial" handlers and the respective behavior.)
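
That contract can be sketched in a self-contained way (names echo the thread's sr_total_resid, but this is an illustration, not the real kernel API): a partial handler whose section is already consumed must bail with success immediately, without touching the socket.

```c
#include <stddef.h>

/* Hypothetical cursor state for the sketch. */
struct sr_cursor {
    size_t sr_total_resid;  /* data bytes still expected for the sparse read */
};

/* A partial handler: returns 1 when its section is complete, 0 on a
 * short read (wait for the socket to become readable again). */
static int partial_sparse_msg_data(struct sr_cursor *cursor, size_t available)
{
    if (cursor->sr_total_resid == 0)
        return 1;            /* data section done: bail, leave the footer
                              * bytes for read_partial() to pick up */

    size_t n = available < cursor->sr_total_resid
                   ? available : cursor->sr_total_resid;
    cursor->sr_total_resid -= n;       /* consume what the socket has */
    return cursor->sr_total_resid ? 0  /* short read: more data needed */
                                  : 1; /* all data received */
}
```

Without the early bail, a re-invocation after a short read on the footer would wrongly consume footer bytes as if they were a new sparse-read header.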

Thanks,

                Ilya

* Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket
  2024-01-22 19:41         ` Ilya Dryomov
@ 2024-01-23  0:53           ` Xiubo Li
  0 siblings, 0 replies; 20+ messages in thread
From: Xiubo Li @ 2024-01-23  0:53 UTC (permalink / raw)
  To: Ilya Dryomov, Jeff Layton; +Cc: ceph-devel, vshankar, mchangir


On 1/23/24 03:41, Ilya Dryomov wrote:
> On Mon, Jan 22, 2024 at 6:14 PM Jeff Layton <jlayton@kernel.org> wrote:
>> On Mon, 2024-01-22 at 17:55 +0100, Ilya Dryomov wrote:
>>> On Mon, Jan 22, 2024 at 4:02 PM Jeff Layton <jlayton@kernel.org> wrote:
>>>> On Thu, 2024-01-18 at 18:50 +0800, xiubli@redhat.com wrote:
>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>
>>>>> The messages from ceph may be split across multiple socket packets
>>>>> and we just need to wait for all the data to be available on the
>>>>> socket.
>>>>>
>>>>> This adds 'sr_total_resid' to record the total length of all
>>>>> data items for a sparse-read message and 'sr_resid_elen' to record
>>>>> the current extent's total length.
>>>>>
>>>>> URL: https://tracker.ceph.com/issues/63586
>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>> ---
>>>>>   include/linux/ceph/messenger.h |  1 +
>>>>>   net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
>>>>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
>>>>> index 2eaaabbe98cb..ca6f82abed62 100644
>>>>> --- a/include/linux/ceph/messenger.h
>>>>> +++ b/include/linux/ceph/messenger.h
>>>>> @@ -231,6 +231,7 @@ struct ceph_msg_data {
>>>>>
>>>>>   struct ceph_msg_data_cursor {
>>>>>        size_t                  total_resid;    /* across all data items */
>>>>> +     size_t                  sr_total_resid; /* across all data items for sparse-read */
>>>>>
>>>>>        struct ceph_msg_data    *data;          /* current data item */
>>>>>        size_t                  resid;          /* bytes not yet consumed */
>>>>> diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
>>>>> index 4cb60bacf5f5..2733da891688 100644
>>>>> --- a/net/ceph/messenger_v1.c
>>>>> +++ b/net/ceph/messenger_v1.c
>>>>> @@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
>>>>>   static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
>>>>>   {
>>>>>        /* Initialize data cursor if it's not a sparse read */
>>>>> -     if (!msg->sparse_read)
>>>>> +     if (msg->sparse_read)
>>>>> +             msg->cursor.sr_total_resid = data_len;
>>>>> +     else
>>>>>                ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
>>>>>   }
>>>>>
>>>>>
>>>> Sorry, disregard my last email.
>>>>
>>>> I went and looked at the patch in the tree, and I better understand what
>>>> you're trying to do. We already have a value that gets set to data_len
>>>> called total_resid, but that is a nonsense value in a sparse read,
>>>> because we can advance farther through the receive buffer than the
>>>> amount of data in the socket.
>>> Hi Jeff,
>>>
>>> I see that total_resid is set to sparse_data_requested(), which is
>>> effectively the size of the receive buffer, not data_len.  (This is
>>> ignoring the seemingly unnecessary complication of trying to support
>>> normal reads mixed with sparse reads in the same message, which I'm
>>> pretty sure doesn't work anyway.)
>>>
>> Oh right. I missed that bit when I was re-reviewing this. Once we're in
>> a sparse read we'll override that value. Ok, so maybe we don't need
>> sr_total_resid.
>>
Oh, I get you now. Yeah we can just reuse the 'total_resid' instead of 
adding a new one.

>>> With that, total_resid should represent the amount that needs to be
>>> filled in (advanced through) in the receive buffer.  When total_resid
>>> reaches 0, wouldn't that mean that the amount of data in the socket is
>>> also 0?  If not, where would the remaining data in the socket go?
>>>
>> With a properly formed reply, then I think yes, there should be no
>> remaining data in the socket at the end of the receive.
> There would be no actual data, but a message footer which follows the
> data section and ends the message would be outstanding.

Yeah, correct.

>> At this point I think I must just be confused about the actual problem.
>> I think I need a detailed description of it before I can properly review
>> this patch.
> AFAIU the problem is that a short read may occur while reading the
> message footer from the socket.  Later, when the socket is ready for
> another read, the messenger invokes all read_partial* handlers,
> including read_partial_sparse_msg_data().  The contract between the
> messenger and these handlers is that the handler should bail if the
> area of the message it's responsible for is already processed.  So,
> in this case, it's expected that read_sparse_msg_data() would bail,
> allowing the messenger to invoke read_partial() for the footer and
> pick up where it left off.
>
> However read_sparse_msg_data() violates that contract and ends up
> calling into the state machine in the OSD client.  The state machine
> assumes that it's a new op and interprets some piece of the footer (or
> even completely random memory?) as the sparse-read header and returns
> bogus extent length, etc.
>
> (BTW it's why I suggested the rename from read_sparse_msg_data() to
> read_partial_sparse_msg_data() in another patch -- to highlight that
> it's one of the "partial" handlers and the respective behavior.)

Yeah, Ilya is correct.

Thanks

- Xiubo

> Thanks,
>
>                  Ilya
>

