linux-lvm.redhat.com archive mirror
 help / color / mirror / Atom feed
From: Damon Wang <damon.devops@gmail.com>
To: LVM general discussion and development <linux-lvm@redhat.com>,
	sanlock-devel@lists.fedorahosted.org
Subject: [linux-lvm] [lvmlockd] sanlock stuck while getting VGLK
Date: Tue, 19 May 2020 21:31:16 +0800	[thread overview]
Message-ID: <CABZYMH53DVpN0mPrrEacDedyyX-fdi4+QH7-hHc6Fvtjm8akUg@mail.gmail.com> (raw)

Hi,

I got a wired case recently with a 24-nodes sanlock cluster recently.
One sanlock process stuck with getting VGLK since it thinks there are
two live host using VGLK shared lock, here is sanlock log from this
host(host_id 19): https://pastebin.com/zJubYLRv

I find important messages:


    "2020-05-14 21:28:51 2018540 [14765]: r857 clear_dead_shared
host_id 61 gen 1 alive"
    "2020-05-14 21:28:51 2018540 [14765]: r857 clear_dead_shared
host_id 94 gen 1 alive"
    "2020-05-14 21:28:52 2018541 [14764]: r858 clear_dead_shared
host_id 61 gen 1 alive"
    "2020-05-14 21:28:52 2018541 [14764]: r858 clear_dead_shared
host_id 94 gen 1 alive"
    "2020-05-14 21:28:53 2018542 [14765]: r859 clear_dead_shared
host_id 61 gen 1 alive"
    "2020-05-14 21:28:53 2018542 [14765]: r859 clear_dead_shared
host_id 94 gen 1 alive"
    "2020-05-14 21:28:54 2018543 [14764]: r860 clear_dead_shared
host_id 61 gen 1 alive"
    "2020-05-14 21:28:54 2018543 [14764]: r860 clear_dead_shared
host_id 94 gen 1 alive"
    "2020-05-14 21:28:55 2018544 [14765]: r861 clear_dead_shared
host_id 61 gen 1 alive"
    "2020-05-14 21:28:55 2018544 [14765]: r861 clear_dead_shared
host_id 94 gen 1 alive"


and this last hours till I init VGLK manually: sanlock direct init -r xxx

So I try to find out what happend to host_id 61 and host_id 94, it
seems both two hosts tried to acquire VGLK and then released, though
with long time.

full log of host_id 61: https://pastebin.com/zDWQAtHE

useful log:


    2020-05-14 20:23:03 2545197 [33739]: cmd_acquire 6,16,33743 ci_in
6 fd 16 count 1 flags 9
    2020-05-14 20:23:03 2545197 [33739]: s5:r1044 resource
lvm_fc261665e94c4d0f8341e254398e7ed5:VGLK:/dev/mapper/fc261665e94c4d0f8341e254398e7ed5-lvmlock:69206016:SH
for 6,16,33743
    2020-05-14 20:23:03 2545197 [33739]: r1044 paxos_acquire begin e 0 0
    2020-05-14 20:23:03 2545197 [33739]: r1044 leader 6067 owner 36 2
0 dblocks 35:12132036:12132036:36:2:2099201:6067:1,
    2020-05-14 20:23:03 2545197 [33739]: r1044 paxos_acquire leader
6067 owner 36 2 0 max mbal[35] 12132036 our_dblock 12096061 12096061
61 1 2537794 6049
    2020-05-14 20:23:03 2545197 [33739]: r1044 paxos_acquire leader 6067 free
    2020-05-14 20:23:03 2545197 [33739]: r1044 ballot 6068 phase1
write mbal 12134061
    2020-05-14 20:23:03 2545198 [33739]: r1044 ballot 6068 mode[93]
shared 1 gen 1
    2020-05-14 20:23:03 2545198 [33739]: r1044 ballot 6068 phase1 read
60:12134061:0:0:0:0:6068:0,
    2020-05-14 20:23:03 2545198 [33739]: r1044 ballot 6068 phase2
write bal 12134061 inp 61 1 2545198 q_max -1
    2020-05-14 20:23:03 2545198 [33739]: r1044 ballot 6068 phase2 read
60:12134061:12134061:61:1:2545198:6068:0,
    2020-05-14 20:23:03 2545198 [33739]: r1044 ballot 6068 commit self
owner 61 1 2545198
    2020-05-14 20:23:03 2545198 [33739]: r1044 acquire_disk rv 1 lver
6068 at 2545198
    2020-05-14 20:23:03 2545198 [33739]: r1044 write_host_block
host_id 61 flags 1 gen 1 dblock
12134061:12134061:61:1:2545198:6068:RELEASED.
    2020-05-14 20:23:03 2545198 [33739]: r1044 paxos_release leader
6068 owner 61 1 2545198
    2020-05-14 20:23:03 2545198 [33739]: cmd_acquire 6,16,33743 result
0 pid_dead 0
    2020-05-14 20:23:03 2545198 [33740]: cmd_get_lvb ci 8 fd 20 result
0 res lvm_fc261665e94c4d0f8341e254398e7ed5:VGLK
    2020-05-14 20:23:03 2545198 [33739]: cmd_acquire 6,16,33743 ci_in
6 fd 16 count 1 flags 8
    2020-05-14 20:23:03 2545198 [33739]: s5:r1045 resource
lvm_fc261665e94c4d0f8341e254398e7ed5:lL0kHh-CZ1i-jtMi-1qHO-ykfR-6nmu-Gk0rES:/dev/mapper/fc261665e94c4d0f8341e254398e7ed5-lvmlock:104857600
for 6,16,33743
    2020-05-14 20:23:03 2545198 [33739]: r1045 paxos_acquire begin a 0 0
    2020-05-14 20:23:03 2545198 [33739]: r1045 leader 9 owner 94 1 0
dblocks 93:16094:16094:94:1:8414771:9:1,
    2020-05-14 20:23:03 2545198 [33739]: r1045 paxos_acquire leader 9
owner 94 1 0 max mbal[93] 16094 our_dblock 0 0 0 0 0 0
    2020-05-14 20:23:03 2545198 [33739]: r1045 paxos_acquire leader 9 free
    2020-05-14 20:23:03 2545198 [33739]: r1045 ballot 10 phase1 write mbal 18061
    2020-05-14 20:23:03 2545198 [33739]: r1045 ballot 10 phase1 read
60:18061:0:0:0:0:10:0,
    2020-05-14 20:23:03 2545198 [33739]: r1045 ballot 10 phase2 write
bal 18061 inp 61 1 2545198 q_max -1
    2020-05-14 20:23:03 2545198 [33739]: r1045 ballot 10 phase2 read
60:18061:18061:61:1:2545198:10:0,
    2020-05-14 20:23:03 2545198 [33739]: r1045 ballot 10 commit self
owner 61 1 2545198
    2020-05-14 20:23:03 2545198 [33739]: r1045 acquire_disk rv 1 lver
10 at 2545198
    2020-05-14 20:23:03 2545198 [33739]: cmd_acquire 6,16,33743 result
0 pid_dead 0
    ...
    2020-05-14 20:55:05 2547119 [33740]: cmd_release 6,16,33743 ci_in
6 fd 16 count 1 flags 0
    2020-05-14 20:55:05 2547119 [33740]: r1045 release_token r_flags 8 lver 10
    2020-05-14 20:55:05 2547119 [33740]: r1045 write_host_block
host_id 61 flags 0 gen 0 dblock 18061:18061:61:1:2545198:10:RELEASED.
    2020-05-14 20:55:05 2547119 [33740]: r1045 paxos_release leader 10
owner 61 1 2545198
    2020-05-14 20:55:05 2547119 [33740]: r1045 release_token done r_flags 8
    2020-05-14 20:55:05 2547119 [33740]: cmd_release 6,16,33743 result
0 pid_dead 0 count 1


full log of host_id 94: https://pastebin.com/H18vAHZq

useful log, it's similar to host_id 60, but shorter time to release:


    2020-05-14 20:01:20 8414771 [11203]: cmd_acquire 6,16,11206 ci_in
6 fd 16 count 1 flags 9
    2020-05-14 20:01:20 8414771 [11203]: s5:r3021 resource
lvm_fc261665e94c4d0f8341e254398e7ed5:VGLK:/dev/mapper/fc261665e94c4d0f8341e254398e7ed5-lvmlock:69206016:SH
for 6,16,11206
    2020-05-14 20:01:20 8414771 [11203]: r3021 paxos_acquire begin e 0 0
    2020-05-14 20:01:20 8414771 [11203]: r3021 leader 6063 owner 36 2
0 dblocks 35:12124036:12124036:36:2:2097966:6063:1,
    2020-05-14 20:01:20 8414771 [11203]: r3021 paxos_acquire leader
6063 owner 36 2 0 max mbal[35] 12124036 our_dblock 11840094 11840094
94 1 8325309 5921
    2020-05-14 20:01:20 8414771 [11203]: r3021 paxos_acquire leader 6063 free
    2020-05-14 20:01:20 8414771 [11203]: r3021 ballot 6064 phase1
write mbal 12126094
    2020-05-14 20:01:20 8414771 [11203]: r3021 ballot 6064 phase1 read
93:12126094:0:0:0:0:6064:0,
    2020-05-14 20:01:20 8414771 [11203]: r3021 ballot 6064 phase2
write bal 12126094 inp 94 1 8414771 q_max -1
    2020-05-14 20:01:20 8414771 [11203]: r3021 ballot 6064 phase2 read
93:12126094:12126094:94:1:8414771:6064:0,
    2020-05-14 20:01:20 8414771 [11203]: r3021 ballot 6064 commit self
owner 94 1 8414771
    2020-05-14 20:01:20 8414771 [11203]: r3021 acquire_disk rv 1 lver
6064 at 8414771
    2020-05-14 20:01:20 8414771 [11203]: r3021 write_host_block
host_id 94 flags 1 gen 1 dblock
12126094:12126094:94:1:8414771:6064:RELEASED.
    2020-05-14 20:01:20 8414771 [11203]: r3021 paxos_release leader
6064 owner 94 1 8414771
    2020-05-14 20:01:20 8414771 [11203]: cmd_acquire 6,16,11206 result
0 pid_dead 0
    2020-05-14 20:01:20 8414771 [11202]: cmd_get_lvb ci 8 fd 20 result
0 res lvm_fc261665e94c4d0f8341e254398e7ed5:VGLK
    2020-05-14 20:01:20 8414771 [11203]: cmd_acquire 6,16,11206 ci_in
6 fd 16 count 1 flags 8
    2020-05-14 20:01:20 8414771 [11203]: s5:r3022 resource
lvm_fc261665e94c4d0f8341e254398e7ed5:lL0kHh-CZ1i-jtMi-1qHO-ykfR-6nmu-Gk0rES:/dev/mapper/fc261665e94c4d0f8341e254398e7ed5-lvmlock:104857600:SH
for 6,16,11206
    2020-05-14 20:01:20 8414771 [11203]: r3022 paxos_acquire begin e 0 0
    2020-05-14 20:01:20 8414771 [11203]: r3022 leader 8 owner 36 2 0
dblocks 35:14036:14036:36:2:1566014:8:1,
    2020-05-14 20:01:20 8414771 [11203]: r3022 paxos_acquire leader 8
owner 36 2 0 max mbal[35] 14036 our_dblock 0 0 0 0 0 0
    2020-05-14 20:01:20 8414771 [11203]: r3022 paxos_acquire leader 8 free
    2020-05-14 20:01:20 8414771 [11203]: r3022 ballot 9 phase1 write mbal 16094
    2020-05-14 20:01:20 8414771 [11203]: r3022 ballot 9 mode[35] shared 1 gen 2
    2020-05-14 20:01:20 8414771 [11203]: r3022 ballot 9 phase1 read
93:16094:0:0:0:0:9:0,
    2020-05-14 20:01:20 8414771 [11203]: r3022 ballot 9 phase2 write
bal 16094 inp 94 1 8414771 q_max -1
    2020-05-14 20:01:20 8414771 [11203]: r3022 ballot 9 phase2 read
93:16094:16094:94:1:8414771:9:0,
    2020-05-14 20:01:20 8414771 [11203]: r3022 ballot 9 commit self
owner 94 1 8414771
    2020-05-14 20:01:20 8414771 [11203]: r3022 acquire_disk rv 1 lver
9 at 8414771
    2020-05-14 20:01:20 8414771 [11203]: r3022 write_host_block
host_id 94 flags 1 gen 1 dblock 16094:16094:94:1:8414771:9:RELEASED.
    2020-05-14 20:01:20 8414771 [11203]: r3022 paxos_release leader 9
owner 94 1 8414771
    2020-05-14 20:01:20 8414771 [11203]: cmd_acquire 6,16,11206 result
0 pid_dead 0
    ... ...
    2020-05-14 20:11:20 8415371 [11202]: cmd_release 6,16,11206 ci_in
6 fd 16 count 1 flags 0
    2020-05-14 20:11:20 8415371 [11202]: r3022 release_token r_flags 1 lver 9
    2020-05-14 20:11:20 8415371 [11202]: r3022 write_host_block
host_id 94 flags 0 gen 0 dblock 16094:16094:94:1:8414771:9:RELEASED.
    2020-05-14 20:11:20 8415371 [11202]: r3022 release_token done r_flags 1
    2020-05-14 20:11:20 8415371 [11202]: cmd_release 6,16,11206 result
0 pid_dead 0 count 1


So, I got several questions:

1. Why it takes so long to release lock, does it should blame to storage I/O ?
2. Why other host(host_id 19) can read pervious shared lock(host_id 60
& 94) after hour?
3. What's the "right" way to solve this problem?

Thanks a lot !

Damon Wang

             reply	other threads:[~2020-05-19 13:31 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-19 13:31 Damon Wang [this message]
2020-05-19 16:08 ` David Teigland
2020-05-20 15:18   ` Damon Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABZYMH53DVpN0mPrrEacDedyyX-fdi4+QH7-hHc6Fvtjm8akUg@mail.gmail.com \
    --to=damon.devops@gmail.com \
    --cc=linux-lvm@redhat.com \
    --cc=sanlock-devel@lists.fedorahosted.org \
    --subject='Re: [linux-lvm] [lvmlockd] sanlock stuck while getting VGLK' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).