* dmesg: mdsc_handle_reply got x on session mds1 not mds0
@ 2022-01-09 18:17 胡玮文
  2022-01-12  7:54 ` Venky Shankar
  0 siblings, 1 reply; 7+ messages in thread
From: 胡玮文 @ 2022-01-09 18:17 UTC (permalink / raw)
  To: ceph-devel, dev

Hi ceph developers,

Today one of our OSD hosts hung due to OOM. Some OSDs were flapping and eventually went down and out. The resulting recovery caused one OSD, which is used by both the CephFS metadata and data pools, to become full.

The strange things are:
* Many of our users reported unexpected “Permission denied” errors when creating new files.
* dmesg shows some strange errors (see examples below). During that time, there were no unusual logs on either active MDS.
* Both of the above happened BEFORE the OSD became full.

Jan 09 01:27:13 gpu027 kernel: libceph: osd9 up
Jan 09 01:27:13 gpu027 kernel: libceph: osd10 up
Jan 09 01:28:55 gpu027 kernel: libceph: osd9 down
Jan 09 01:28:55 gpu027 kernel: libceph: osd10 down
Jan 09 01:32:35 gpu027 kernel: libceph: osd6 weight 0x0 (out)
Jan 09 01:32:35 gpu027 kernel: libceph: osd16 weight 0x0 (out)
Jan 09 01:34:18 gpu027 kernel: libceph: osd1 weight 0x0 (out)
Jan 09 01:39:20 gpu027 kernel: libceph: osd9 weight 0x0 (out)
Jan 09 01:39:20 gpu027 kernel: libceph: osd10 weight 0x0 (out)
Jan 09 01:53:07 gpu027 kernel: ceph: mdsc_handle_reply got 30408991 on session mds1 not mds0
Jan 09 01:53:14 gpu027 kernel: ceph: mdsc_handle_reply got 30409829 on session mds1 not mds0
Jan 09 01:53:15 gpu027 kernel: ceph: mdsc_handle_reply got 30409925 on session mds1 not mds0
Jan 09 01:53:28 gpu027 kernel: ceph: mdsc_handle_reply got 30411416 on session mds1 not mds0
Jan 09 02:05:07 gpu027 kernel: ceph: mdsc_handle_reply got 30417742 on session mds0 not mds1
Jan 09 02:48:52 gpu027 kernel: ceph: mdsc_handle_reply got 30449177 on session mds1 not mds0
Jan 09 02:49:17 gpu027 kernel: ceph: mdsc_handle_reply got 30452750 on session mds1 not mds0

After reading the code, I see that these replies are treated as unexpected and simply dropped. Any ideas about how this could happen? And is there anything I need to worry about? (The cluster has since recovered and looks healthy.)
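
To make the client-side behavior concrete, here is a minimal user-space
sketch of the kind of check that produces this message (the real check
lives in the kernel client's fs/ceph/mds_client.c; the types and names
below are simplified stand-ins, not kernel code):

#include <cstdio>

// Simplified stand-ins for the kernel client's structures.
struct MdsSession { int mds_rank; };
struct MdsRequest {
    unsigned long long tid;   // transaction id shown in the dmesg line
    MdsSession *sent_on;      // session the request was submitted on
};

// A reply arriving on a session other than the one the request was sent
// on is logged and dropped, which is what produces
// "mdsc_handle_reply got <tid> on session mdsX not mdsY".
bool handle_reply(MdsRequest &req, MdsSession &reply_session) {
    if (req.sent_on != &reply_session) {
        std::printf("ceph: mdsc_handle_reply got %llu on session mds%d not mds%d\n",
                    req.tid, reply_session.mds_rank, req.sent_on->mds_rank);
        return false;   // dropped; the request stays pending until resent
    }
    // ... normal reply processing would happen here ...
    return true;
}

int main() {
    MdsSession mds0{0}, mds1{1};
    MdsRequest req{30408991ULL, &mds0};  // request was sent to mds0
    handle_reply(req, mds1);             // reply arrives on the mds1 session: dropped
}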

The clients are Ubuntu 20.04 with kernel 5.11.0-43-generic. The Ceph version is 16.2.7. No active MDS restarted during that time. The standby-replay MDSes did restart, which should be fixed by my PR https://github.com/ceph/ceph/pull/44501 . I don’t know whether that is related to the issue here.

Regards,
Weiwen Hu


* Re: dmesg: mdsc_handle_reply got x on session mds1 not mds0
  2022-01-09 18:17 dmesg: mdsc_handle_reply got x on session mds1 not mds0 胡玮文
@ 2022-01-12  7:54 ` Venky Shankar
  2022-01-12  9:47   ` 胡玮文
  2022-01-12 17:19   ` Gregory Farnum
  0 siblings, 2 replies; 7+ messages in thread
From: Venky Shankar @ 2022-01-12  7:54 UTC (permalink / raw)
  To: 胡玮文; +Cc: ceph-devel, dev, Gregory Farnum

+Greg

On Mon, Jan 10, 2022 at 12:08 AM 胡玮文 <sehuww@mail.scut.edu.cn> wrote:
>
> Hi ceph developers,
>
> Today one of our OSD hosts hung due to OOM. Some OSDs were flapping and eventually went down and out. The resulting recovery caused one OSD, which is used by both the CephFS metadata and data pools, to become full.
>
> The strange things are:
> * Many of our users reported unexpected “Permission denied” errors when creating new files.

That's weird. I would expect the operation to block in the worst case.

> * dmesg shows some strange errors (see examples below). During that time, there were no unusual logs on either active MDS.
> * Both of the above happened BEFORE the OSD became full.
>
> Jan 09 01:27:13 gpu027 kernel: libceph: osd9 up
> Jan 09 01:27:13 gpu027 kernel: libceph: osd10 up
> Jan 09 01:28:55 gpu027 kernel: libceph: osd9 down
> Jan 09 01:28:55 gpu027 kernel: libceph: osd10 down
> Jan 09 01:32:35 gpu027 kernel: libceph: osd6 weight 0x0 (out)
> Jan 09 01:32:35 gpu027 kernel: libceph: osd16 weight 0x0 (out)
> Jan 09 01:34:18 gpu027 kernel: libceph: osd1 weight 0x0 (out)
> Jan 09 01:39:20 gpu027 kernel: libceph: osd9 weight 0x0 (out)
> Jan 09 01:39:20 gpu027 kernel: libceph: osd10 weight 0x0 (out)
> Jan 09 01:53:07 gpu027 kernel: ceph: mdsc_handle_reply got 30408991 on session mds1 not mds0
> Jan 09 01:53:14 gpu027 kernel: ceph: mdsc_handle_reply got 30409829 on session mds1 not mds0
> Jan 09 01:53:15 gpu027 kernel: ceph: mdsc_handle_reply got 30409925 on session mds1 not mds0
> Jan 09 01:53:28 gpu027 kernel: ceph: mdsc_handle_reply got 30411416 on session mds1 not mds0
> Jan 09 02:05:07 gpu027 kernel: ceph: mdsc_handle_reply got 30417742 on session mds0 not mds1
> Jan 09 02:48:52 gpu027 kernel: ceph: mdsc_handle_reply got 30449177 on session mds1 not mds0
> Jan 09 02:49:17 gpu027 kernel: ceph: mdsc_handle_reply got 30452750 on session mds1 not mds0
>
> After reading the code, I see that these replies are treated as unexpected and simply dropped. Any ideas about how this could happen? And is there anything I need to worry about? (The cluster has since recovered and looks healthy.)

The MDS should ask the client to "forward" the operation to another
MDS if it is not the auth for an inode.
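
(As a toy illustration of that expectation, with made-up names rather
than the real MDS code: a rank that is not the auth for the inode
should forward the op instead of replying, and the client then resends
it to the auth rank.)

#include <cstdio>

// Hypothetical illustration only.
enum class Action { Reply, ForwardToAuth };

Action handle_client_op(int my_rank, int auth_rank) {
    if (my_rank != auth_rank) {
        std::printf("mds%d: not auth (auth is mds%d), forwarding\n",
                    my_rank, auth_rank);
        return Action::ForwardToAuth;   // client will resend to auth_rank
    }
    return Action::Reply;               // we own the inode: process and reply
}

int main() {
    handle_client_op(0, 1);   // rank 0 gets the op but rank 1 is the auth
}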

It would be interesting to see what "mds1" was doing around the
"01:53:07" timestamp. Could you gather that from the mds log?

>
> The clients are Ubuntu 20.04 with kernel 5.11.0-43-generic. The Ceph version is 16.2.7. No active MDS restarted during that time. The standby-replay MDSes did restart, which should be fixed by my PR https://github.com/ceph/ceph/pull/44501 . I don’t know whether that is related to the issue here.
>
> Regards,
> Weiwen Hu



-- 
Cheers,
Venky



* Re: dmesg: mdsc_handle_reply got x on session mds1 not mds0
  2022-01-12  7:54 ` Venky Shankar
@ 2022-01-12  9:47   ` 胡玮文
  2022-01-12 17:19   ` Gregory Farnum
  1 sibling, 0 replies; 7+ messages in thread
From: 胡玮文 @ 2022-01-12  9:47 UTC (permalink / raw)
  To: Venky Shankar; +Cc: ceph-devel, dev, Gregory Farnum

On Wed, Jan 12, 2022 at 01:24:20PM +0530, Venky Shankar wrote:
> It would be interesting to see what "mds1" was doing around the
> "01:53:07" timestamp. Could you gather that from the mds log?

Nothing special:

debug 2022-01-08T17:48:03.493+0000 7f3c30d91700  1 mds.cephfs.gpu024.rpfbnh Updating MDS map to version 86134 from mon.4
debug 2022-01-08T17:48:22.029+0000 7f3c30d91700  1 mds.cephfs.gpu024.rpfbnh Updating MDS map to version 86135 from mon.4
debug 2022-01-08T17:48:40.154+0000 7f3c30d91700  1 mds.cephfs.gpu024.rpfbnh Updating MDS map to version 86136 from mon.4
debug 2022-01-08T18:01:56.084+0000 7f3c30d91700  1 mds.cephfs.gpu024.rpfbnh Updating MDS map to version 86137 from mon.4
debug 2022-01-08T18:01:59.784+0000 7f3c30d91700  1 mds.cephfs.gpu024.rpfbnh Updating MDS map to version 86138 from mon.4
debug 2022-01-08T18:08:15.788+0000 7f3c30d91700  1 mds.cephfs.gpu024.rpfbnh Updating MDS map to version 86139 from mon.4

(01:53:07 in the kernel log is in timezone +08:00)

I've checked MDSMap e86137; mds.cephfs.gpu024.rpfbnh is active as rank 1.

And I just checked again: the first OSD_FULL appears in the log at
'Jan 09 01:52:43', i.e. '2022-01-08T17:52:43.726+0000'.  So the mdsc_handle_reply
dmesg messages could be related to OSD_FULL.  I was misled by the Prometheus query:

ceph_osd_stat_bytes_used / ceph_osd_stat_bytes

which reported only about 79% for that full OSD at the time OSD_FULL appeared
in the cluster log.



* Re: dmesg: mdsc_handle_reply got x on session mds1 not mds0
  2022-01-12  7:54 ` Venky Shankar
  2022-01-12  9:47   ` 胡玮文
@ 2022-01-12 17:19   ` Gregory Farnum
  2022-01-13  2:18     ` Xiubo Li
  2022-01-13  2:18     ` Xiubo Li
  1 sibling, 2 replies; 7+ messages in thread
From: Gregory Farnum @ 2022-01-12 17:19 UTC (permalink / raw)
  To: Venky Shankar; +Cc: 胡玮文, ceph-devel, dev

On Tue, Jan 11, 2022 at 11:55 PM Venky Shankar <vshankar@redhat.com> wrote:
>
> +Greg
>
> On Mon, Jan 10, 2022 at 12:08 AM 胡玮文 <sehuww@mail.scut.edu.cn> wrote:
> >
> > Hi ceph developers,
> >
> > Today one of our OSD hosts hung due to OOM. Some OSDs were flapping and eventually went down and out. The resulting recovery caused one OSD, which is used by both the CephFS metadata and data pools, to become full.
> >
> > The strange things are:
> > * Many of our users reported unexpected “Permission denied” errors when creating new files.
>
> That's weird. I would expect the operation to block in the worst case.

I think some configurations or code paths end up returning odd errors
instead of ENOSPC or EQUOTA, which is probably what the cluster was
returning.

>
> > * dmesg shows some strange errors (see examples below). During that time, there were no unusual logs on either active MDS.
> > * Both of the above happened BEFORE the OSD became full.
> >
> > Jan 09 01:27:13 gpu027 kernel: libceph: osd9 up
> > Jan 09 01:27:13 gpu027 kernel: libceph: osd10 up
> > Jan 09 01:28:55 gpu027 kernel: libceph: osd9 down
> > Jan 09 01:28:55 gpu027 kernel: libceph: osd10 down
> > Jan 09 01:32:35 gpu027 kernel: libceph: osd6 weight 0x0 (out)
> > Jan 09 01:32:35 gpu027 kernel: libceph: osd16 weight 0x0 (out)
> > Jan 09 01:34:18 gpu027 kernel: libceph: osd1 weight 0x0 (out)
> > Jan 09 01:39:20 gpu027 kernel: libceph: osd9 weight 0x0 (out)
> > Jan 09 01:39:20 gpu027 kernel: libceph: osd10 weight 0x0 (out)

All of these are just reports about changes in the OSDMap. It looks
like something pretty bad happened with your cluster (an OSD filling
up isn't good, but I didn't think it would, on its own, cause them to
get marked down+out), but this doesn't directly impact the kernel
client or filesystem...

> > Jan 09 01:53:07 gpu027 kernel: ceph: mdsc_handle_reply got 30408991 on session mds1 not mds0
> > Jan 09 01:53:14 gpu027 kernel: ceph: mdsc_handle_reply got 30409829 on session mds1 not mds0
> > Jan 09 01:53:15 gpu027 kernel: ceph: mdsc_handle_reply got 30409925 on session mds1 not mds0
> > Jan 09 01:53:28 gpu027 kernel: ceph: mdsc_handle_reply got 30411416 on session mds1 not mds0
> > Jan 09 02:05:07 gpu027 kernel: ceph: mdsc_handle_reply got 30417742 on session mds0 not mds1
> > Jan 09 02:48:52 gpu027 kernel: ceph: mdsc_handle_reply got 30449177 on session mds1 not mds0
> > Jan 09 02:49:17 gpu027 kernel: ceph: mdsc_handle_reply got 30452750 on session mds1 not mds0
> >
> > After reading the code, I see that these replies are treated as unexpected and simply dropped. Any ideas about how this could happen? And is there anything I need to worry about? (The cluster has since recovered and looks healthy.)
>
> The MDS should ask the client to "forward" the operation to another
> MDS if it is not the auth for an inode.

Yeah. Been a while since I was in this code but I think this will
generally happen if the MDS cluster is migrating metadata and the
client for some reason hasn't kept up on the changes — the
newly-responsible MDS might have gotten the request forwarded
directly, and replied to the client, but the client doesn't expect it
to be responsible for that op.
*IIRC* it all gets cleaned up when the client resubmits the operation
so this isn't an issue.
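
A toy simulation of that sequence, under the assumptions above (all
names invented for illustration, not the real client or MDS code):

#include <cstdio>

// The client remembers which rank it thinks owns each in-flight op.
struct Client {
    int expected_mds = 0;    // the op was originally sent to mds0

    void reply_from(int mds) {
        if (mds != expected_mds) {
            // Same situation as the dmesg lines: unexpected session, dropped.
            std::printf("got reply on session mds%d not mds%d -> dropped\n",
                        mds, expected_mds);
            return;
        }
        std::printf("reply from mds%d accepted\n", mds);
    }

    void resubmit_to(int mds) {   // client learns who really owns the op
        expected_mds = mds;
        std::printf("resubmitting to mds%d\n", mds);
    }
};

int main() {
    Client c;
    // Metadata migrated to mds1; mds0 forwarded the op internally and mds1
    // replied before the client learned about the change:
    c.reply_from(1);      // dropped
    // Eventually the client resubmits and everything converges:
    c.resubmit_to(1);
    c.reply_from(1);      // accepted
}
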
-Greg


>
> It would be interesting to see what "mds1" was doing around the
> "01:53:07" timestamp. Could you gather that from the mds log?
>
> >
> > The clients are Ubuntu 20.04 with kernel 5.11.0-43-generic. The Ceph version is 16.2.7. No active MDS restarted during that time. The standby-replay MDSes did restart, which should be fixed by my PR https://github.com/ceph/ceph/pull/44501 . I don’t know whether that is related to the issue here.
> >
> > Regards,
> > Weiwen Hu
>
>
>
> --
> Cheers,
> Venky
>



* Re: dmesg: mdsc_handle_reply got x on session mds1 not mds0
  2022-01-12 17:19   ` Gregory Farnum
@ 2022-01-13  2:18     ` Xiubo Li
  2022-01-13  2:18     ` Xiubo Li
  1 sibling, 0 replies; 7+ messages in thread
From: Xiubo Li @ 2022-01-13  2:18 UTC (permalink / raw)
  To: Gregory Farnum, Venky Shankar; +Cc: 胡玮文, ceph-devel, dev


On 1/13/22 1:19 AM, Gregory Farnum wrote:
> On Tue, Jan 11, 2022 at 11:55 PM Venky Shankar <vshankar@redhat.com> wrote:
>> +Greg
>>
>> On Mon, Jan 10, 2022 at 12:08 AM 胡玮文 <sehuww@mail.scut.edu.cn> wrote:
>>> Hi ceph developers,
>>>
>>> Today one of our OSD hosts hung due to OOM. Some OSDs were flapping and eventually went down and out. The resulting recovery caused one OSD, which is used by both the CephFS metadata and data pools, to become full.
>>>
>>> The strange things are:
>>> * Many of our users reported unexpected “Permission denied” errors when creating new files.
>> That's weird. I would expect the operation to block in the worst case.
> I think some configurations or code paths end up returning odd errors
> instead of ENOSPC or EQUOTA, which is probably what the cluster was
> returning.
>
>>> * dmesg shows some strange errors (see examples below). During that time, there were no unusual logs on either active MDS.
>>> * Both of the above happened BEFORE the OSD became full.
>>>
>>> Jan 09 01:27:13 gpu027 kernel: libceph: osd9 up
>>> Jan 09 01:27:13 gpu027 kernel: libceph: osd10 up
>>> Jan 09 01:28:55 gpu027 kernel: libceph: osd9 down
>>> Jan 09 01:28:55 gpu027 kernel: libceph: osd10 down
>>> Jan 09 01:32:35 gpu027 kernel: libceph: osd6 weight 0x0 (out)
>>> Jan 09 01:32:35 gpu027 kernel: libceph: osd16 weight 0x0 (out)
>>> Jan 09 01:34:18 gpu027 kernel: libceph: osd1 weight 0x0 (out)
>>> Jan 09 01:39:20 gpu027 kernel: libceph: osd9 weight 0x0 (out)
>>> Jan 09 01:39:20 gpu027 kernel: libceph: osd10 weight 0x0 (out)
> All of these are just reports about changes in the OSDMap. It looks
> like something pretty bad happened with your cluster (an OSD filling
> up isn't good, but I didn't think it would, on its own, cause them to
> get marked down+out), but this doesn't directly impact the kernel
> client or filesystem...
>
>>> Jan 09 01:53:07 gpu027 kernel: ceph: mdsc_handle_reply got 30408991 on session mds1 not mds0
>>> Jan 09 01:53:14 gpu027 kernel: ceph: mdsc_handle_reply got 30409829 on session mds1 not mds0
>>> Jan 09 01:53:15 gpu027 kernel: ceph: mdsc_handle_reply got 30409925 on session mds1 not mds0
>>> Jan 09 01:53:28 gpu027 kernel: ceph: mdsc_handle_reply got 30411416 on session mds1 not mds0
>>> Jan 09 02:05:07 gpu027 kernel: ceph: mdsc_handle_reply got 30417742 on session mds0 not mds1
>>> Jan 09 02:48:52 gpu027 kernel: ceph: mdsc_handle_reply got 30449177 on session mds1 not mds0
>>> Jan 09 02:49:17 gpu027 kernel: ceph: mdsc_handle_reply got 30452750 on session mds1 not mds0
>>>
>>> After reading the code, I see that these replies are treated as unexpected and simply dropped. Any ideas about how this could happen? And is there anything I need to worry about? (The cluster has since recovered and looks healthy.)
>> The MDS should ask the client to "forward" the operation to another
>> MDS if it is not the auth for an inode.
> Yeah. Been a while since I was in this code but I think this will
> generally happen if the MDS cluster is migrating metadata and the
> client for some reason hasn't kept up on the changes — the
> newly-responsible MDS might have gotten the request forwarded
> directly, and replied to the client, but the client doesn't expect it
> to be responsible for that op.
> *IIRC* it all gets cleaned up when the client resubmits the operation
> so this isn't an issue.
> -Greg

Actually this is a bug in the MDS, and there is a PR fixing it:

https://github.com/ceph/ceph/pull/44229

'try_get_auth_inode()' may have already sent the forward request to the
client, and the client has switched the request to the other session,
but 'dispatch_client_request()' then responds to the same request with
'-EINVAL' anyway.

The errors above should come from that second reply.
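
A simplified sketch of that flow (the function names mirror the MDS
functions mentioned above, but the bodies are invented purely for
illustration; see the linked PR for the actual fix):

#include <cstdio>

// Illustrative model only, not the real MDS sources.
struct Inode {};
struct Request { bool forwarded = false; };

// If this rank is not the auth for the inode, a forward is sent to the
// client (which re-targets the request to the other session) and
// nullptr is returned.
Inode *try_get_auth_inode(Request &req, bool i_am_auth) {
    if (!i_am_auth) {
        req.forwarded = true;
        std::printf("forwarding request to the auth MDS\n");
        return nullptr;
    }
    static Inode auth_inode;
    return &auth_inode;
}

void respond(Request &req, int err) {
    std::printf("sending reply to client (r=%d)%s\n", err,
                req.forwarded ? " -- but a forward was already sent!" : "");
}

void dispatch_client_request(Request &req, bool i_am_auth) {
    Inode *in = try_get_auth_inode(req, i_am_auth);
    if (!in) {
        // Buggy pattern: the request may already have been forwarded, yet
        // an extra -EINVAL reply still goes out.  On the client it arrives
        // on a session that is no longer expected, producing the dmesg
        // lines above.
        respond(req, -22 /* -EINVAL */);
        return;
    }
    respond(req, 0);   // normal reply from the auth rank
}

int main() {
    Request req;
    dispatch_client_request(req, /*i_am_auth=*/false);
}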

Regards

-- Xiubo

>> It would be interesting to see what "mds1" was doing around the
>> "01:53:07" timestamp. Could you gather that from the mds log?
>>
>>> The clients are Ubuntu 20.04 with kernel 5.11.0-43-generic. The Ceph version is 16.2.7. No active MDS restarted during that time. The standby-replay MDSes did restart, which should be fixed by my PR https://github.com/ceph/ceph/pull/44501 . I don’t know whether that is related to the issue here.
>>>
>>> Regards,
>>> Weiwen Hu
>>
>>
>> --
>> Cheers,
>> Venky
>>



