Re: [PATCH v6 3/3] libceph: just wait for more data to be available on the socket

From: Xiubo Li <xiubli@redhat.com>
To: Luis Henriques <lhenriques@suse.de>
Cc: ceph-devel@vger.kernel.org, idryomov@gmail.com,
	jlayton@kernel.org, vshankar@redhat.com, mchangir@redhat.com
Subject: Re: [PATCH v6 3/3] libceph: just wait for more data to be available on the socket
Date: Wed, 28 Feb 2024 08:45:41 +0800	[thread overview]
Message-ID: <831bfb4a-6213-4e32-8c68-252be354342e@redhat.com> (raw)
In-Reply-To: <871q8x4yac.fsf@suse.de>


On 2/28/24 01:22, Luis Henriques wrote:
> Hi Xiubo!
>
> xiubli@redhat.com writes:
>
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> A short read may occur while reading the message footer from the
>> socket.  Later, when the socket is ready for another read, the
>> messenger shoudl invoke all read_partial* handlers, except the
>> read_partial_sparse_msg_data().  The contract between the messenger
>> and these handlers is that the handlers should bail if the area
>> of the message is responsible for is already processed.  So,
>> in this case, it's expected that read_sparse_msg_data() would bail,
>> allowing the messenger to invoke read_partial() for the footer and
>> pick up where it left off.
>>
>> However read_partial_sparse_msg_data() violates that contract and
>> ends up calling into the state machine in the OSD client. The
>> sparse-read state machine just assumes that it's a new op and
>> interprets some piece of the footer as the sparse-read extents/data
>> and then returns bogus extents/data length, etc.
>>
>> This will just reuse the 'total_resid' to determine whether should
>> the read_partial_sparse_msg_data() bail out or not. Because once
>> it reaches to zero that means all the extents and data have been
>> successfully received in last read, else it could break out when
>> partially reading any of the extents and data. And then the
>> osd_sparse_read() could continue where it left off.
> I'm seeing an issue with fstest generic/580, which seems to enter an
> infinite loop effectively rendering the testing VM unusable.  It's pretty
> easy to reproduce, just run the test ensuring to be using msgv2 (I'm
> mounting the filesystem with 'ms_mode=crc'), and you should see the
> following on the logs:
>
> [...]
>    libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
>    libceph: osd1 (2)192.168.155.1:6810 read processing error
>    libceph: mon0 (2)192.168.155.1:40608 session established
>    libceph: bad late_status 0x1
>    libceph: osd1 (2)192.168.155.1:6810 protocol error, bad epilogue
>    libceph: mon0 (2)192.168.155.1:40608 session established
>    libceph: prepare_sparse_read_cont: ret 0x1000 total_resid 0x0 resid 0x0
>    libceph: osd1 (2)192.168.155.1:6810 read processing error
>    libceph: mon0 (2)192.168.155.1:40608 session established
>    libceph: bad late_status 0x1
> [...]
>
> Reverting this patch (commit 8e46a2d068c9 ("libceph: just wait for more
> data to be available on the socket")) seems to fix.  I haven't
> investigated it further, but since it'll take me some time to refresh my
> memory, I thought I should report it immediately.  Maybe someone has any
> idea.

The following patch should fix it. I haven't test it yet. Will do it 
later today:

diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index a0ca5414b333..2c32ad4d9774 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -1860,10 +1860,10 @@ static int prepare_read_control_remainder(struct 
ceph_connection *con)
  static int prepare_read_data(struct ceph_connection *con)
  {
         struct bio_vec bv;
+       u64 len = con->in_msg->sparse_read_total ? : data_len(con->in_msg);

         con->in_data_crc = -1;
-       ceph_msg_data_cursor_init(&con->v2.in_cursor, con->in_msg,
-                                 data_len(con->in_msg));
+       ceph_msg_data_cursor_init(&con->v2.in_cursor, con->in_msg, len);

         get_bvec_at(&con->v2.in_cursor, &bv);
         if (ceph_test_opt(from_msgr(con->msgr), RXBOUNCE)) {


> Cheers,