CephFS kclient gets stuck when getattr() on a certain file

* CephFS kclient gets stuck when getattr() on a certain file
@ 2021-04-28  4:33 Jerry Lee
  2021-05-03 21:20 ` Gregory Farnum
  0 siblings, 1 reply; 6+ messages in thread
From: Jerry Lee @ 2021-04-28  4:33 UTC (permalink / raw)
  To: ceph-devel, Patrick Donnelly, Jeff Layton, Yan, Zheng

Hi,

I deploy a 3-node Ceph cluster (v15.2.9) and the CephFS is mounted via
kclient (linux-4.14.24) on all of the 3 nodes.  All of the kclients
try to update (read/write) a certain file periodically in order to
know whether the CephFS is alive or not.  After a kclient gets evicted
due to abnormal reboot, a new kclient mounts to the CephFS when the
node comes back.  However, the newly mounted kclient gets stuck when
it tries to getattr on the common file.  Under such conditions, all of
the other kclients are affected and they cannot update the common
file, too.  From the debugfs entris, a request does get stuck:
------
[/sys/kernel/debug/ceph/1bbb7753-85e5-4d33-a860-84419fdcfd7d.client3230166]
# cat mdsc
12      mds0    getattr  #100000003ed

[/sys/kernel/debug/ceph/1bbb7753-85e5-4d33-a860-84419fdcfd7d.client3230166]
# cat osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS

[/sys/kernel/debug/ceph/1bbb7753-85e5-4d33-a860-84419fdcfd7d.client3230166]
# ceph -s
  cluster:
    id:     1bbb7753-85e5-4d33-a860-84419fdcfd7d
    health: HEALTH_WARN
            1 MDSs report slow requests

  services:
    mon: 3 daemons, quorum Jerry-ceph-n2,Jerry-x85-n1,Jerry-x85-n3 (age 23h)
    mgr: Jerry-x85-n1(active, since 25h), standbys: Jerry-ceph-n2, Jerry-x85-n3
    mds: cephfs:1 {0=qceph-mds-Jerry-ceph-n2=up:active} 1
up:standby-replay 1 up:standby
    osd: 18 osds: 18 up (since 23h), 18 in (since 23h)
------

The MDS logs (debug_mds =20) are provided:
https://drive.google.com/file/d/1aj101NOTzCsfDdC-neqVTvKpEPOd3M6Q/view?usp=sharing

Some of the logs wrt client.3230166 and ino#100000003ed are shown as below:
2021-04-27T11:57:03.467+0800 7fccbd3be700  4 mds.0.server
handle_client_request client_request(client.3230166:12 getattr
pAsLsXsFs #0x100000003ed 2021-04-27T11:57:03.469426+0800 caller_uid=0,
caller_gid=0{}) v2
2021-04-27T11:57:03.467+0800 7fccbd3be700 20 mds.0.98 get_session have
0x56130c5ce480 client.3230166 v1:192.168.92.89:0/679429733 state open
2021-04-27T11:57:03.467+0800 7fccbd3be700 15 mds.0.server  oldest_client_tid=12
2021-04-27T11:57:03.467+0800 7fccbd3be700  7 mds.0.cache request_start
request(client.3230166:12 nref=2 cr=0x56130db96480)
2021-04-27T11:57:03.467+0800 7fccbd3be700  7 mds.0.server
dispatch_client_request client_request(client.3230166:12 getattr
pAsLsXsFs #0x100000003ed 2021-04-27T11:57:03.469426+0800 caller_uid=0,
caller_gid=0{}) v2

2021-04-27T11:57:03.467+0800 7fccbd3be700 10 mds.0.locker
acquire_locks request(client.3230166:12 nref=3 cr=0x56130db96480)
2021-04-27T11:57:03.467+0800 7fccbd3be700 10
mds.0.cache.ino(0x100000003ed) auth_pin by 0x56130c584ed0 on [inode
0x100000003ed [2,head]
/QTS/VOL_1/.ovirt_data_domain/37065419-e7f3-47ca-97df-8af0c67d30a0/dom_md/ids
auth v91583 pv91585 ap=4 recovering s=1048576 n(v0
rc2021-04-27T11:57:02.625542+0800 b1048576 1=1+0) (ifile mix->sync
w=1) (iversion lock)
cr={3198501=0-4194304@1,3198504=0-4194304@1,3221169=0-4194304@1}
caps={3198501=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@17,3198504=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@9,3230166=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@2}
| ptrwaiter=1 request=1 lock=2 caps=1 dirty=1 waiter=0 authpin=1
0x56130c584a00] now 4
2021-04-27T11:57:03.467+0800 7fccbd3be700 15
mds.0.cache.dir(0x100000003eb) adjust_nested_auth_pins 1 on [dir
0x100000003eb /QTS/VOL_1/.ovirt_data_domain/37065419-e7f3-47ca-97df-8af0c67d30a0/dom_md/
[2,head] auth pv=91586 v=91584 cv=0/0 ap=1+4 state=1610874881|complete
f(v0 m2021-04-23T15:00:04.377198+0800 6=6+0) n(v3
rc2021-04-27T11:57:02.625542+0800 b38005818 6=6+0) hs=6+0,ss=0+0
dirty=4 | child=1 dirty=1 waiter=0 authpin=1 0x56130c586a00] by
0x56130c584a00 count now 1/4
2021-04-27T11:57:03.467+0800 7fccbd3be700 10 mds.0
RecoveryQueue::prioritize not queued [inode 0x100000003ed [2,head]
/QTS/VOL_1/.ovirt_data_domain/37065419-e7f3-47ca-97df-8af0c67d30a0/dom_md/ids
auth v91583 pv91585 ap=4 recovering s=1048576 n(v0
rc2021-04-27T11:57:02.625542+0800 b1048576 1=1+0) (ifile mix->sync
w=1) (iversion lock)
cr={3198501=0-4194304@1,3198504=0-4194304@1,3221169=0-4194304@1}
caps={3198501=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@17,3198504=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@9,3230166=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@2}
| ptrwaiter=1 request=1 lock=2 caps=1 dirty=1 waiter=0 authpin=1
0x56130c584a00]
2021-04-27T11:57:03.467+0800 7fccbd3be700  7 mds.0.locker rdlock_start
waiting on (ifile mix->sync w=1) on [inode 0x100000003ed [2,head]
/QTS/VOL_1/.ovirt_data_domain/37065419-e7f3-47ca-97df-8af0c67d30a0/dom_md/ids
auth v91583 pv91585 ap=4 recovering s=1048576 n(v0
rc2021-04-27T11:57:02.625542+0800 b1048576 1=1+0) (ifile mix->sync
w=1) (iversion lock)
cr={3198501=0-4194304@1,3198504=0-4194304@1,3221169=0-4194304@1}
caps={3198501=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@17,3198504=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@9,3230166=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@2}
| ptrwaiter=1 request=1 lock=2 caps=1 dirty=1 waiter=0 authpin=1
0x56130c584a00]
2021-04-27T11:57:03.467+0800 7fccbd3be700 10
mds.0.cache.ino(0x100000003ed) add_waiter tag 2000000040000000
0x56130ea1bbe0 !ambig 1 !frozen 1 !freezing 1
2021-04-27T11:57:03.467+0800 7fccbd3be700 15
mds.0.cache.ino(0x100000003ed) taking waiter here
2021-04-27T11:57:03.468+0800 7fccbd3be700 20 mds.0.locker
client.3230166 pending pAsLsXsFr allowed pAsLsXsFrl wanted
pAsxXsxFsxcrwb
2021-04-27T11:57:03.468+0800 7fccbd3be700  7 mds.0.locker
handle_client_caps  on 0x100000003ed tid 0 follows 0 op update flags
0x2
2021-04-27T11:57:03.468+0800 7fccbd3be700 20 mds.0.98 get_session have
0x56130b81f600 client.3198501 v1:192.168.50.108:0/2478094748 state
open
2021-04-27T11:57:03.468+0800 7fccbd3be700 10 mds.0.locker  head inode
[inode 0x100000003ed [2,head]
/QTS/VOL_1/.ovirt_data_domain/37065419-e7f3-47ca-97df-8af0c67d30a0/dom_md/ids
auth v91583 pv91585 ap=4 recovering s=1048576 n(v0
rc2021-04-27T11:57:02.625542+0800 b1048576 1=1+0) (ifile mix->sync
w=1) (iversion lock)
cr={3198501=0-4194304@1,3198504=0-4194304@1,3221169=0-4194304@1}
caps={3198501=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFsxcrwb@17,3198504=pAsLsXsFr/pAsxXsxFsxcrwb@9,3230166=pAsLsXsFr/pAsxXsxFsxcrwb@2}
| ptrwaiter=1 request=1 lock=2 caps=1 dirty=1 waiter=1 authpin=1
0x56130c584a00]
2021-04-27T11:57:03.468+0800 7fccbd3be700 10 mds.0.locker  follows 0
retains pAsLsXsFr dirty - on [inode 0x100000003ed [2,head]
/QTS/VOL_1/.ovirt_data_domain/37065419-e7f3-47ca-97df-8af0c67d30a0/dom_md/ids
auth v91583 pv91585 ap=4 recovering s=1048576 n(v0
rc2021-04-27T11:57:02.625542+0800 b1048576 1=1+0) (ifile mix->sync
w=1) (iversion lock)
cr={3198501=0-4194304@1,3198504=0-4194304@1,3221169=0-4194304@1}
caps={3198501=pAsLsXsFr/pAsxXsxFsxcrwb@17,3198504=pAsLsXsFr/pAsxXsxFsxcrwb@9,3230166=pAsLsXsFr/pAsxXsxFsxcrwb@2}
| ptrwaiter=1 request=1 lock=2 caps=1 dirty=1 waiter=1 authpin=1
0x56130c584a00]
2021-04-27T11:57:37.027+0800 7fccbb3ba700  0 log_channel(cluster) log
[WRN] : slow request 33.561029 seconds old, received at
2021-04-27T11:57:03.467164+0800: client_request(client.3230166:12
getattr pAsLsXsFs #0x100000003ed 2021-04-27T11:57:03.469426+0800
caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting

Any idea or insight to help to further investigate the issue are appreciated.

- Jerry

^ permalink raw reply	[flat|nested] 6+ messages in thread