linux-lvm.redhat.com archive mirror
 help / color / mirror / Atom feed
* [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
@ 2018-10-10  7:51 Damon Wang
  2018-10-10  7:58 ` Damon Wang
  2018-10-10 19:08 ` David Teigland
  0 siblings, 2 replies; 9+ messages in thread
From: Damon Wang @ 2018-10-10  7:51 UTC (permalink / raw)
  To: LVM general discussion and development

Hi,

I can not create lv in an environment and after some check I found a
server has hold the VGLK and not released for long time.

this is "sanlock client status" prints:

        [root@HLD-1-4-S10 ~]# sanlock client status
        daemon 091c17d0-822c6a4f-HLD-1-4-S10
        p -1 helper
        p -1 listener
        p 27919 lvmlockd
        p 27919 lvmlockd
        p 27919 lvmlockd
        p 27919 lvmlockd
        p -1 status
        s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
        s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
        s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
        s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0
        r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
p 27919        <------
        r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH
p 27919
        r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4
p 27919
        ... ...
        r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4
p 27919
        r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
p 27919 ADD        <------

As it shows, it is adding lock for
"lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD"
and holding VGLK with shared mode, so there is an command hanging:

        [root@HLD-1-4-S10 ~]# ps aux | grep lv
        root      7755  0.0  0.0 160248  5724 ?        S    14:12
0:00 lvchange -asy
/dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0
       <------
        root     11335  0.0  0.0 112660   972 pts/0    S+   14:40
0:00 grep --color=auto lv
        root     27919  0.0  0.0 423860  5088 ?        Ssl  11:07
0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40
        root     27926  0.0  0.0 114476  1116 ?        Ssl  11:07
0:00 /usr/sbin/lvmetad -f

As "ps" shows, I use lvmlockd with damon debug mode, so it will print
log in messages:

        Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0
disabled perfctr wrmsr: 0xc1 data 0xabcd
        Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root.
        Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root.
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv
lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode
sh flags 0
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at
/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101
0 3301
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301
our 3292
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg
version 3301
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send
lvchange[7755] cl 849 lock vg rv 0
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv
lvchange[7755] cl 849 lock lv "b075258f5b9547d7b4464fff246bbce1" mode
sh flags 1
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh
(e36d96c2a29716b999b2a33aa16e94e0)
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at
/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056
        Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of
user root.
        Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root.
        Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root.
        Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10
14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK
        Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root.
        Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root.
        Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root.
        Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root.
        Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10
14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK
        Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root.
        Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root.
        Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi
3 fd 14

the timestamp matches, it holds VGLK at 14:12:36 and not release.
So I keep digging log of sanlock and find this:

        2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17
lvm_091c17d096894a179313aad65cb4127a
        2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17
lvm_b075258f5b9547d7b4464fff246bbce1
        2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27916]: set_lockspace_event
lvm_091c17d096894a179313aad65cb4127a
        2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0
        2018-10-10 14:12:33 11492 [27917]: set_lockspace_event
lvm_b075258f5b9547d7b4464fff246bbce1
        2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0
        2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read
        2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write
for new ts 11493
        2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493
delta_length 0 interval 81
        2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
ci_in 5 fd 15 count 1 flags 9
        2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource
lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
for 5,15,27919
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0
        2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66
2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1,
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader
42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock
140695692434049 140695692434049 49 2 4303 42658
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
write mbal 140695692506049
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
read 48:140695692506049:0:0:0:0:42694:0,
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
write bal 140695692506049 inp 49 2 11494 q_max -1
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
read 48:140695692506049:140695692506049:49:2:11494:42694:0,
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit
self owner 49 2 11494
        2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1
lver 42694@11494
        2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block
host_id 49 flags 1 gen 2 dblock
140695692506049:140695692506049:49:2:11494:42694:RELEASED.
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader
42694 owner 49 2 11494
        2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
result 0 pid_dead 0
        2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17
result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK
        018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in
5 fd 15 count 1 flags 8
        018-10-10 14:12:36 11494 [27916]: s4:r3098 resource
lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
for 5,15,27919
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
        018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1
4854717 dblocks
8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0
3
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
1 4854717 host_status 19 2 10914 wait_start 11494
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
1 4854717 delta 19 2 10995 mismatch
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write
mbal 218748049
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218748049:0:0:0:0:3
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry
delay 887122 us
        018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write
mbal 218750049
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218750049:0:0:0:0:3
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry
delay 455075 us
        018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write
mbal 218752049
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218752049:0:0:0:0:3
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry
delay 904853 us
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
lvm_671bfaeb4e344014b041fc78115307bb
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
lvm_4c149e7262a64c728997954257bece01
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 5
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27917]: set_lockspace_event
lvm_671bfaeb4e344014b041fc78115307bb
        018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11496 [27917]: set_lockspace_event
lvm_4c149e7262a64c728997954257bece01
        018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
lvm_091c17d096894a179313aad65cb4127a
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 2
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
lvm_b075258f5b9547d7b4464fff246bbce1
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27917]: set_lockspace_event
lvm_091c17d096894a179313aad65cb4127a
        018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11497 [27917]: set_lockspace_event
lvm_b075258f5b9547d7b4464fff246bbce1
        018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
        018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write
mbal 218754049
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218754049:0:0:0:0:3
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry
delay 981162 us
        018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717

This is confusing to me, what means "ballot 3 abort1 larger lver in
bk" and what causes this?
Seems related to the bug I reported that io timeout when release vg
lock causes problems?

Besides, I found there are some hosts has different sanlock io timeout
than others, is this the root cause?

Thanks,
Damon

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-10-12 14:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-10  7:51 [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." Damon Wang
2018-10-10  7:58 ` Damon Wang
2018-10-10  8:12   ` Damon Wang
2018-10-10 19:08 ` David Teigland
2018-10-11 13:03   ` Damon Wang
2018-10-11 15:55     ` David Teigland
2018-10-11 18:58       ` David Teigland
2018-10-12  8:58         ` Damon Wang
2018-10-12 14:21           ` David Teigland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).