linux-lvm.redhat.com archive mirror
* [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
@ 2018-10-10  7:51 Damon Wang
  2018-10-10  7:58 ` Damon Wang
  2018-10-10 19:08 ` David Teigland
  0 siblings, 2 replies; 9+ messages in thread
From: Damon Wang @ 2018-10-10  7:51 UTC (permalink / raw)
  To: LVM general discussion and development

Hi,

I cannot create an LV in one of my environments, and after some checking
I found that a server has been holding the VGLK without releasing it for
a long time.
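
(A rough sketch of that check, looping "sanlock client status" over the
cluster hosts; the host list below is only a placeholder:)

        # list sanlock resources on each host and show who still holds
        # the VG lock (the ":VGLK:" resource plus the following "p <pid>" line)
        for h in HLD-1-4-S01 HLD-1-4-S10; do
            echo "== $h =="
            ssh "$h" 'sanlock client status | grep -A1 ":VGLK:"'
        done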

this is "sanlock client status" prints:

        [root@HLD-1-4-S10 ~]# sanlock client status
        daemon 091c17d0-822c6a4f-HLD-1-4-S10
        p -1 helper
        p -1 listener
        p 27919 lvmlockd
        p 27919 lvmlockd
        p 27919 lvmlockd
        p 27919 lvmlockd
        p -1 status
        s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
        s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
        s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
        s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0
        r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
p 27919        <------
        r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH
p 27919
        r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4
p 27919
        ... ...
        r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4
p 27919
        r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
p 27919 ADD        <------

As it shows, it is adding a lock for
"lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD"
while holding the VGLK in shared mode, so there is a command hanging:

        [root@HLD-1-4-S10 ~]# ps aux | grep lv
        root      7755  0.0  0.0 160248  5724 ?        S    14:12
0:00 lvchange -asy
/dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0
       <------
        root     11335  0.0  0.0 112660   972 pts/0    S+   14:40
0:00 grep --color=auto lv
        root     27919  0.0  0.0 423860  5088 ?        Ssl  11:07
0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40
        root     27926  0.0  0.0 114476  1116 ?        Ssl  11:07
0:00 /usr/sbin/lvmetad -f

As "ps" shows, I use lvmlockd with damon debug mode, so it will print
log in messages:

        Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0
disabled perfctr wrmsr: 0xc1 data 0xabcd
        Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root.
        Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root.
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv
lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode
sh flags 0
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at
/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101
0 3301
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301
our 3292
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg
version 3301
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send
lvchange[7755] cl 849 lock vg rv 0
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv
lvchange[7755] cl 849 lock lv "b075258f5b9547d7b4464fff246bbce1" mode
sh flags 1
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh
(e36d96c2a29716b999b2a33aa16e94e0)
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at
/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056
        Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of
user root.
        Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root.
        Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root.
        Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10
14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK
        Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root.
        Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root.
        Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root.
        Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root.
        Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10
14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK
        Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root.
        Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root.
        Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi
3 fd 14

The timestamps match: it took the VGLK at 14:12:36 and never released it.
So I kept digging through the sanlock log and found this:

        2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17
lvm_091c17d096894a179313aad65cb4127a
        2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17
lvm_b075258f5b9547d7b4464fff246bbce1
        2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27916]: set_lockspace_event
lvm_091c17d096894a179313aad65cb4127a
        2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0
        2018-10-10 14:12:33 11492 [27917]: set_lockspace_event
lvm_b075258f5b9547d7b4464fff246bbce1
        2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0
        2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read
        2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write
for new ts 11493
        2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493
delta_length 0 interval 81
        2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
ci_in 5 fd 15 count 1 flags 9
        2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource
lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
for 5,15,27919
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0
        2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66
2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1,
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader
42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock
140695692434049 140695692434049 49 2 4303 42658
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
write mbal 140695692506049
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
read 48:140695692506049:0:0:0:0:42694:0,
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
write bal 140695692506049 inp 49 2 11494 q_max -1
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
read 48:140695692506049:140695692506049:49:2:11494:42694:0,
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit
self owner 49 2 11494
        2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1
lver 42694@11494
        2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block
host_id 49 flags 1 gen 2 dblock
140695692506049:140695692506049:49:2:11494:42694:RELEASED.
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader
42694 owner 49 2 11494
        2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
result 0 pid_dead 0
        2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17
result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK
        018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in
5 fd 15 count 1 flags 8
        018-10-10 14:12:36 11494 [27916]: s4:r3098 resource
lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
for 5,15,27919
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
        018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1
4854717 dblocks
8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0
3
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
1 4854717 host_status 19 2 10914 wait_start 11494
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
1 4854717 delta 19 2 10995 mismatch
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write
mbal 218748049
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218748049:0:0:0:0:3
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry
delay 887122 us
        018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write
mbal 218750049
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218750049:0:0:0:0:3
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry
delay 455075 us
        018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write
mbal 218752049
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218752049:0:0:0:0:3
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry
delay 904853 us
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
lvm_671bfaeb4e344014b041fc78115307bb
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
lvm_4c149e7262a64c728997954257bece01
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 5
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27917]: set_lockspace_event
lvm_671bfaeb4e344014b041fc78115307bb
        018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11496 [27917]: set_lockspace_event
lvm_4c149e7262a64c728997954257bece01
        018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
lvm_091c17d096894a179313aad65cb4127a
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 2
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
lvm_b075258f5b9547d7b4464fff246bbce1
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27917]: set_lockspace_event
lvm_091c17d096894a179313aad65cb4127a
        018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11497 [27917]: set_lockspace_event
lvm_b075258f5b9547d7b4464fff246bbce1
        018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
        018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write
mbal 218754049
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218754049:0:0:0:0:3
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry
delay 981162 us
        018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717

This is confusing to me: what does "ballot 3 abort1 larger lver in
bk" mean, and what causes it?
Is it related to the bug I reported earlier, where an I/O timeout while
releasing the VG lock causes problems?

Besides, I found that some hosts have a different sanlock I/O timeout
than others. Could this be the root cause?
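
(That mismatch can be checked by comparing how the daemons were started
on each host, e.g. something like the loop below; the host names are
only placeholders:)

        # the timeout lvmlockd was started with is visible on its command
        # line (cf. "--sanlock-timeout 40" in the ps output above)
        for h in HLD-1-4-S01 HLD-1-4-S10; do
            echo "== $h =="
            ssh "$h" 'ps -o args= -C lvmlockd,sanlock'
        done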

Thanks,
Damon

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-10  7:51 [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." Damon Wang
@ 2018-10-10  7:58 ` Damon Wang
  2018-10-10  8:12   ` Damon Wang
  2018-10-10 19:08 ` David Teigland
  1 sibling, 1 reply; 9+ messages in thread
From: Damon Wang @ 2018-10-10  7:58 UTC (permalink / raw)
  To: LVM general discussion and development

The log format of the previous mail became horrible, so please read this
one instead, thanks.


Hi,

I cannot create an LV in one of my environments, and after some checking
I found that a server has been holding the VGLK without releasing it for
a long time.

This is "sanlock client status" prints:

        [root@HLD-1-4-S10 ~]# sanlock client status
        daemon 091c17d0-822c6a4f-HLD-1-4-S10
        p -1 helper
        p -1 listener
        p 27919 lvmlockd
        p 27919 lvmlockd
        p 27919 lvmlockd
        p 27919 lvmlockd
        p -1 status
        s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
        s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
        s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
        s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0
        r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
p 27919        <------
        r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH
p 27919
        r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4
p 27919
        ... ...
        r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4
p 27919
        r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
p 27919 ADD        <------

As it shows, it is adding a lock for
"lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD"
while holding the VGLK in shared mode, so there is a command hanging:

        [root@HLD-1-4-S10 ~]# ps aux | grep lv
        root      7755  0.0  0.0 160248  5724 ?        S    14:12
0:00 lvchange -asy
/dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0
        root     11335  0.0  0.0 112660   972 pts/0    S+   14:40
0:00 grep --color=auto lv
        root     27919  0.0  0.0 423860  5088 ?        Ssl  11:07
0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40
        root     27926  0.0  0.0 114476  1116 ?        Ssl  11:07
0:00 /usr/sbin/lvmetad -f

As "ps" shows, I use lvmlockd with damon debug mode, so it will print
log in messages:

        Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0
disabled perfctr wrmsr: 0xc1 data 0xabcd
        Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root.
        Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root.
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv
lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode
sh flags 0
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at
/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101
0 3301
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301
our 3292
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg
version 3301
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send
lvchange[7755] cl 849 lock vg rv 0
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv
lvchange[7755] cl 849 lock lv "b075258f5b9547d7b4464fff246bbce1" mode
sh flags 1
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh
(e36d96c2a29716b999b2a33aa16e94e0)
        Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
lvm_b075258f5b9547d7b4464fff246bbce1 R
u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at
/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056
        Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of
user root.
        Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root.
        Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root.
        Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10
14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK
        Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root.
        Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root.
        Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root.
        Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root.
        Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10
14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK
        Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root.
        Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root.
        Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi
3 fd 14

The timestamps match: it took the VGLK at 14:12:36 and never released it.
So I kept digging through the sanlock log and found this:

        2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17
lvm_091c17d096894a179313aad65cb4127a
        2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17
lvm_b075258f5b9547d7b4464fff246bbce1
        2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
        2018-10-10 14:12:33 11492 [27916]: set_lockspace_event
lvm_091c17d096894a179313aad65cb4127a
        2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0
        2018-10-10 14:12:33 11492 [27917]: set_lockspace_event
lvm_b075258f5b9547d7b4464fff246bbce1
        2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0
        2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read
        2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write
for new ts 11493
        2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493
delta_length 0 interval 81
        2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
ci_in 5 fd 15 count 1 flags 9
        2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource
lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
for 5,15,27919
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0
        2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66
2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1,
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader
42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock
140695692434049 140695692434049 49 2 4303 42658
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
write mbal 140695692506049
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
read 48:140695692506049:0:0:0:0:42694:0,
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
write bal 140695692506049 inp 49 2 11494 q_max -1
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
read 48:140695692506049:140695692506049:49:2:11494:42694:0,
        2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit
self owner 49 2 11494
        2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1
lver 42694 at 11494
        2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block
host_id 49 flags 1 gen 2 dblock
140695692506049:140695692506049:49:2:11494:42694:RELEASED.
        2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader
42694 owner 49 2 11494
        2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
result 0 pid_dead 0
        2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17
result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK
        018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in
5 fd 15 count 1 flags 8
        018-10-10 14:12:36 11494 [27916]: s4:r3098 resource
lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
for 5,15,27919
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
        018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1
4854717 dblocks
8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0
3
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
1 4854717 host_status 19 2 10914 wait_start 11494
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
1 4854717 delta 19 2 10995 mismatch
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write
mbal 218748049
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218748049:0:0:0:0:3
        018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry
delay 887122 us
        018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write
mbal 218750049
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218750049:0:0:0:0:3
        018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry
delay 455075 us
        018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write
mbal 218752049
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218752049:0:0:0:0:3
        018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry
delay 904853 us
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
lvm_671bfaeb4e344014b041fc78115307bb
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
lvm_4c149e7262a64c728997954257bece01
        018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 5
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11496 [27917]: set_lockspace_event
lvm_671bfaeb4e344014b041fc78115307bb
        018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11496 [27917]: set_lockspace_event
lvm_4c149e7262a64c728997954257bece01
        018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
lvm_091c17d096894a179313aad65cb4127a
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 2
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
lvm_b075258f5b9547d7b4464fff246bbce1
        018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
        018-10-10 14:12:38 11497 [27917]: set_lockspace_event
lvm_091c17d096894a179313aad65cb4127a
        018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
        018-10-10 14:12:38 11497 [27917]: set_lockspace_event
lvm_b075258f5b9547d7b4464fff246bbce1
        018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
        018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write
mbal 218754049
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger
lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
our dblock 218754049:0:0:0:0:3
        018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read
18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
        018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry
delay 981162 us
        018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2
owner 19 1 4854717


This is confusing to me: what does "ballot 3 abort1 larger lver in
bk" mean, and what causes it?
Is it related to the bug I reported earlier, where an I/O timeout while
releasing the VG lock causes problems?

Besides, I found that some hosts have a different sanlock I/O timeout
than others. Could this be the root cause?
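
(While the command is stuck, lvmlockd's own view of the lock can also be
captured; a minimal sketch, assuming the stock lvmlockctl options from
lvmlockctl(8):)

        # lock state as lvmlockd sees it
        lvmlockctl --info
        # lvmlockd's internal debug log for the same period
        lvmlockctl --dump > /tmp/lvmlockd.dump
        # the sanlock daemon log (by default /var/log/sanlock.log)
        tail -n 200 /var/log/sanlock.log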

Thanks,
Damon

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-10  7:58 ` Damon Wang
@ 2018-10-10  8:12   ` Damon Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Damon Wang @ 2018-10-10  8:12 UTC (permalink / raw)
  To: LVM general discussion and development

It seems the mail server is not friendly to long text, so I have pasted
them to pastebin:

messages: https://pastebin.com/FUHLuWry
sanlock.log: https://pastebin.com/5JwmSCWP
sanlock client status: https://pastebin.com/9cGbav5f
ps aux: https://pastebin.com/6t6SQU6X


> lvm_b075258f5b9547d7b4464fff246bbce1
>         018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
>         018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717
>         018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write
> mbal 218754049
>         018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger
> lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> our dblock 218754049:0:0:0:0:3
>         018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read
> 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
>         018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry
> delay 981162 us
>         018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717
>
>
> This is confusing to me. What does "ballot 3 abort1 larger lver in bk"
> mean, and what causes it?
> Could it be related to the bug I reported earlier, where an io timeout
> while releasing the VG lock causes problems?
>
> Besides, I found that some hosts have a different sanlock io timeout
> than others; could this be the root cause?
>
> Thanks,
> Damon
> On Wed, Oct 10, 2018 at 3:51 PM Damon Wang <damon.devops@gmail.com> wrote:
> >
> > Hi,
> >
> > I can not create lv in an environment and after some check I found a
> > server has hold the VGLK and not released for long time.
> >
> > this is "sanlock client status" prints:
> >
> >         [root@HLD-1-4-S10 ~]# sanlock client status
> >         daemon 091c17d0-822c6a4f-HLD-1-4-S10
> >         p -1 helper
> >         p -1 listener
> >         p 27919 lvmlockd
> >         p 27919 lvmlockd
> >         p 27919 lvmlockd
> >         p 27919 lvmlockd
> >         p -1 status
> >         s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
> >         s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
> >         s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
> >         s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0
> >         r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
> > p 27919        <------
> >         r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH
> > p 27919
> >         r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4
> > p 27919
> >         ... ...
> >         r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4
> > p 27919
> >         r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
> > p 27919 ADD        <------
> >
> > As it shows, it is adding lock for
> > "lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD"
> > and holding VGLK with shared mode, so there is an command hanging:
> >
> >         [root@HLD-1-4-S10 ~]# ps aux | grep lv
> >         root      7755  0.0  0.0 160248  5724 ?        S    14:12
> > 0:00 lvchange -asy
> > /dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0
> >        <------
> >         root     11335  0.0  0.0 112660   972 pts/0    S+   14:40
> > 0:00 grep --color=auto lv
> >         root     27919  0.0  0.0 423860  5088 ?        Ssl  11:07
> > 0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40
> >         root     27926  0.0  0.0 114476  1116 ?        Ssl  11:07
> > 0:00 /usr/sbin/lvmetad -f
> >
> > As "ps" shows, I use lvmlockd with damon debug mode, so it will print
> > log in messages:
> >
> >         Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0
> > disabled perfctr wrmsr: 0xc1 data 0xabcd
> >         Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root.
> >         Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root.
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv
> > lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode
> > sh flags 0
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at
> > /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101
> > 0 3301
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301
> > our 3292
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg
> > version 3301
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send
> > lvchange[7755] cl 849 lock vg rv 0
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv
> > lvchange[7755] cl 849 lock lv "b075258f5b9547d7b4464fff246bbce1" mode
> > sh flags 1
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R
> > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R
> > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh
> > (e36d96c2a29716b999b2a33aa16e94e0)
> >         Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> > lvm_b075258f5b9547d7b4464fff246bbce1 R
> > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at
> > /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056
> >         Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of
> > user root.
> >         Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root.
> >         Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root.
> >         Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10
> > 14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK
> >         Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root.
> >         Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root.
> >         Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root.
> >         Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root.
> >         Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10
> > 14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK
> >         Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root.
> >         Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root.
> >         Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi
> > 3 fd 14
> >
> > the timestamp matches, it holds VGLK at 14:12:36 and not release.
> > So I keep digging log of sanlock and find this:
> >
> >         2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17
> > lvm_091c17d096894a179313aad65cb4127a
> >         2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2
> >         2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> >         2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> >         2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17
> > lvm_b075258f5b9547d7b4464fff246bbce1
> >         2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2
> >         2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> >         2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> >         2018-10-10 14:12:33 11492 [27916]: set_lockspace_event
> > lvm_091c17d096894a179313aad65cb4127a
> >         2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0
> >         2018-10-10 14:12:33 11492 [27917]: set_lockspace_event
> > lvm_b075258f5b9547d7b4464fff246bbce1
> >         2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0
> >         2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read
> >         2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write
> > for new ts 11493
> >         2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493
> > delta_length 0 interval 81
> >         2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
> > ci_in 5 fd 15 count 1 flags 9
> >         2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource
> > lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
> > for 5,15,27919
> >         2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0
> >         2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66
> > 2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1,
> >         2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader
> > 42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock
> > 140695692434049 140695692434049 49 2 4303 42658
> >         2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free
> >         2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
> > write mbal 140695692506049
> >         2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
> > read 48:140695692506049:0:0:0:0:42694:0,
> >         2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
> > write bal 140695692506049 inp 49 2 11494 q_max -1
> >         2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
> > read 48:140695692506049:140695692506049:49:2:11494:42694:0,
> >         2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit
> > self owner 49 2 11494
> >         2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1
> > lver 42694 at 11494
> >         2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block
> > host_id 49 flags 1 gen 2 dblock
> > 140695692506049:140695692506049:49:2:11494:42694:RELEASED.
> >         2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader
> > 42694 owner 49 2 11494
> >         2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
> > result 0 pid_dead 0
> >         2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17
> > result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK
> >         018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in
> > 5 fd 15 count 1 flags 8
> >         018-10-10 14:12:36 11494 [27916]: s4:r3098 resource
> > lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
> > for 5,15,27919
> >         018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
> >         018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1
> > 4854717 dblocks
> > 8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,
> >         018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
> > owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0
> > 3
> >         018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
> > 1 4854717 host_status 19 2 10914 wait_start 11494
> >         018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
> > 1 4854717 delta 19 2 10995 mismatch
> >         018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
> > owner 19 1 4854717
> >         018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write
> > mbal 218748049
> >         018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger
> > lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> > our dblock 218748049:0:0:0:0:3
> >         018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read
> > 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> >         018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry
> > delay 887122 us
> >         018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2
> > owner 19 1 4854717
> >         018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write
> > mbal 218750049
> >         018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 abort1 larger
> > lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> > our dblock 218750049:0:0:0:0:3
> >         018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read
> > 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> >         018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry
> > delay 455075 us
> >         018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2
> > owner 19 1 4854717
> >         018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write
> > mbal 218752049
> >         018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger
> > lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> > our dblock 218752049:0:0:0:0:3
> >         018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read
> > 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> >         018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry
> > delay 904853 us
> >         018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
> > lvm_671bfaeb4e344014b041fc78115307bb
> >         018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2
> >         018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
> > lvm_4c149e7262a64c728997954257bece01
> >         018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 5
> >         018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11496 [27917]: set_lockspace_event
> > lvm_671bfaeb4e344014b041fc78115307bb
> >         018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
> >         018-10-10 14:12:38 11496 [27917]: set_lockspace_event
> > lvm_4c149e7262a64c728997954257bece01
> >         018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
> >         018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
> > lvm_091c17d096894a179313aad65cb4127a
> >         018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 2
> >         018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
> > lvm_b075258f5b9547d7b4464fff246bbce1
> >         018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3
> >         018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> >         018-10-10 14:12:38 11497 [27917]: set_lockspace_event
> > lvm_091c17d096894a179313aad65cb4127a
> >         018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
> >         018-10-10 14:12:38 11497 [27917]: set_lockspace_event
> > lvm_b075258f5b9547d7b4464fff246bbce1
> >         018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
> >         018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2
> > owner 19 1 4854717
> >         018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write
> > mbal 218754049
> >         018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger
> > lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> > our dblock 218754049:0:0:0:0:3
> >         018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read
> > 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> >         018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry
> > delay 981162 us
> >         018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2
> > owner 19 1 4854717
> >
> > This is confusing to me. What does "ballot 3 abort1 larger lver in
> > bk" mean, and what causes it?
> > Could it be related to the bug I reported earlier, where an io timeout
> > while releasing the VG lock causes problems?
> >
> > Besides, I found that some hosts have a different sanlock io timeout
> > than others; could this be the root cause?
> >
> > Thanks,
> > Damon

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-10  7:51 [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." Damon Wang
  2018-10-10  7:58 ` Damon Wang
@ 2018-10-10 19:08 ` David Teigland
  2018-10-11 13:03   ` Damon Wang
  1 sibling, 1 reply; 9+ messages in thread
From: David Teigland @ 2018-10-10 19:08 UTC (permalink / raw)
  To: Damon Wang; +Cc: linux-lvm

On Wed, Oct 10, 2018 at 03:51:14PM +0800, Damon Wang wrote:
> I can not create lv in an environment and after some check I found a
> server has hold the VGLK and not released for long time.

Hi, I'm curious how many hosts are using this VG?  (It's not immediately
relevant to this problem.)

I'm confused about how this host is using different host_id's in each VG
lockspace.  I would normally expect that /etc/lvm/lvmlocal.conf contains
local/host_id = N, and then find host_id N for all the lvm lockspaces:

s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0
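
For reference, the usual static setup is a single line in
/etc/lvm/lvmlocal.conf on each host; the value below is only an
illustration:

    # /etc/lvm/lvmlocal.conf
    local {
        # a small integer, unique among all hosts sharing these VGs
        host_id = 49
    }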

The VGLK being held for a long time is just a secondary effect of the real
problem, which is that acquiring an LV lock is stuck in a retry loop in
the low level sanlock code.  sanlock keeps retrying because there's bad
data in host_id 19's lease structure on disk.  host_id 19 failed sometime
before 18-10-10 14:12, and if you could find and send any messages from
that machine prior to it failing, that might help.

(Also, if you'd like to upload a copy of the hidden "lvmlock" LV from this
VG, I might try to pull more clues from that.)
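
Something along these lines would capture both; the paths and VG name are
taken from your output above, and the dd is a plain read so it is safe to
run on a live host:

    # on the host that was using host_id 19, grab sanlock/lvmlockd messages
    # from around the time it failed
    grep -h -e sanlock -e lvmlockd /var/log/messages* > host19-sanlock.txt
    # or, if that host only logs to the journal
    journalctl -u sanlock -u lvmlockd --since "2018-10-09" > host19-sanlock.txt

    # copy of the hidden lvmlock LV from this VG
    dd if=/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock \
       of=/tmp/b075-lvmlock.img bs=1M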


> r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
> p 27919 ADD        <------

sanlock is stuck repeatedly trying to acquire the lock for that LV (u3G3 /
r3098). The debug info for acquiring that lock begins here:

> 018-10-10 14:12:36 11494 [27916]: s4:r3098 resource
> lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH

> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
> 018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1 4854717
  dblocks 8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,

That shows the LV is already locked by host_id 19 (generation 1).

> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0 3

> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717
  host_status 19 2 10914 wait_start 11494

> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717
  delta 19 2 10995 mismatch

Those two messages show that host_id 19 is currently alive with generation
2.  Since the lock is owned by host_id 19 generation 1, it means that
host_id 19 previously failed while holding a lock on this LV, and has
since restarted and rejoined the lockspace (increasing the generation
number to 2 upon rejoining).  Since the owner of the lock is now stale,
this host should be able to acquire it immediately, but it doesn't...
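
As an aside, this host's current view of host_id 19's liveness and
generation can be checked with the host_status query (option syntax from
memory; see sanlock(8) if it differs on your build):

    sanlock client host_status -D -s lvm_b075258f5b9547d7b4464fff246bbce1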

> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717

> 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write mbal 218748049

> 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger
> lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> our dblock 218748049:0:0:0:0:3

While going through the steps to acquire this lock, it finds invalid
dblock data on disk from host_id 19.  There's no checksum error for the
invalid data, so I wonder if it could be bad data in the buffer which 19
wrote before failing.  This host is looking at the invalid lease version
from the dblock, which is causing it to retry (the correct thing when it
legitimately sees a larger lease version.)  The checksum should normally
prevent it from using the invalid data.
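
If you want to look at the raw on-disk lease area for that LV lock,
"sanlock direct dump" can read it from the lvmlock LV; the offset is the
one shown in the resource line above:

    sanlock direct dump /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056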

> Besides, I found that some hosts have a different sanlock io timeout
> than others; could this be the root cause?

Some hosts could still be using the previous 10 sec timeout if they have
not restarted the lockspace (or a particular delta lease has not been
reacquired using the newer timeout).  I don't think it would affect this
problem.
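
If you want to check, the debug output of the status command should show
the timeout each lockspace was started with; the exact field name may
vary between versions:

    sanlock client status -D | grep -i timeout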

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-10 19:08 ` David Teigland
@ 2018-10-11 13:03   ` Damon Wang
  2018-10-11 15:55     ` David Teigland
  0 siblings, 1 reply; 9+ messages in thread
From: Damon Wang @ 2018-10-11 13:03 UTC (permalink / raw)
  To: David Teigland; +Cc: LVM general discussion and development

Hi,

1. About host ID

This is because I regenerate a host id when the host joins a new
lockspace -- I found that a host id only needs to be unique within each
lockspace, not across all lockspaces.
It would be natural for a host to keep a single host id: because of the
global lock, the host id in the global-lock lockspace must be unique
across all hosts, and that same id could then be used in every lockspace.
But consider this situation:

three hosts a, b, c and three storage arrays 1, 2, 3
each host attaches only two of the arrays,
a possible combination: a(1,2), b(2,3), c(1,3)
so no single array is visible to every host, and none of them is a
proper place to hold the global lock!

So I gave up on the global lock setting and the global host id; I only
fix up the global lock when I need it (adding a vg, pv, etc.).
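
When a vg or pv operation does need the global lock, I move it to a VG
that the host doing the operation can see, roughly like this (VG names
are placeholders):

    lvmlockctl --gl-disable vg_on_array_3
    lvmlockctl --gl-enable  vg_on_array_1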


2. About host 19

I found that host 19 has indeed held the lease since 2018-10-09 20:49:15:

daemon 091c17d0-648eb28c-HLD-1-3-S07
p -1 helper
p -1 listener
p 2235 lvmlockd
p 2235 lvmlockd
p 2235 lvmlockd
p 2235 lvmlockd
p -1 status
s lvm_b075258f5b9547d7b4464fff246bbce1:19:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0

 2018-10-09 20:49:15 4854716 [29802]: s4:r2320 resource
lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-
dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
for 5,14,29715
 2018-10-09 20:49:15 4854716 [29802]: r2320 paxos_acquire begin e 0 0
 2018-10-09 20:49:15 4854716 [29802]: r2320 leader 1 owner 54 2 0
dblocks 53:54:54:54:2:4755629:1:1,
 2018-10-09 20:49:15 4854716 [29802]: r2320 paxos_acquire leader 1
owner 54 2 0 max mbal[53] 54 our_dblock 0 0 0 0 0 0
 2018-10-09 20:49:15 4854716 [29802]: r2320 paxos_acquire leader 1 free
 2018-10-09 20:49:15 4854716 [29802]: r2320 ballot 2 phase1 write mbal 2019
 2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 mode[53] shared 1 gen 2
 2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 phase1 read
18:2019:0:0:0:0:2:0,
 2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 phase2 write bal
2019 inp 19 1 4854717 q_max -1
 2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 abort2 larger
mbal in bk[79] 4080:0:0:0:0:2 our dblock 2019:2019: 19:1:4854717:2
 2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 phase2 read
18:2019:2019:19:1:4854717:2:0,79:4080:0:0:0:0:2:0,
 2018-10-09 20:49:15 4854717 [29802]: r2320 paxos_acquire 2 retry
delay 724895 us
 2018-10-09 20:49:16 4854717 [29802]: r2320 paxos_acquire leader 2
owner 19 1 4854717
 2018-10-09 20:49:16 4854717 [29802]: r2320 paxos_acquire 2 owner is
our inp 19 1 4854717 commited by 80
 2018-10-09 20:49:16 4854717 [29802]: r2320 acquire_disk rv 1 lver 2 at 4854717
 2018-10-09 20:49:16 4854717 [29802]: r2320 write_host_block host_id
19 flags 1 gen 1 dblock 29802:510:
140245418403952:140245440585933:140245418403840:4:RELEASED.
 2018-10-09 20:49:16 4854717 [29802]: r2320 paxos_release leader 2
owner 19 1 4854717
 2018-10-09 20:49:16 4854717 [29802]: r2320 paxos_release skip write
last lver 2 owner 19 1 4854717 writer 80 1        4854737 disk lver 2
owner 19 1 4854717 writer 80 1 4854737

does the "paxos_release skip write last lver" is abnormal?


3. Others

I set the lvmlock LV size to 1 GB, which may be too large to upload and
analyse, but I can upload it to s3 if we have no other clues.
Because of the multipath queue_if_no_path problem, it is difficult to
kill processes that are using an lv, so I may clear the lockspace
directly without the process being killed -- is that related to this
problem?
I'm also wondering why the host generation changed on host 19; does
clearing the lockspace and rejoining, or rebooting the host, cause this?

Thanks,
Damon

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-11 13:03   ` Damon Wang
@ 2018-10-11 15:55     ` David Teigland
  2018-10-11 18:58       ` David Teigland
  0 siblings, 1 reply; 9+ messages in thread
From: David Teigland @ 2018-10-11 15:55 UTC (permalink / raw)
  To: Damon Wang; +Cc: linux-lvm

On Thu, Oct 11, 2018 at 09:03:01PM +0800, Damon Wang wrote:
> This is because I regenerate a host id when the host joins a new
> lockspace

OK, I guess there's something about your setup that makes it difficult to
assign static host_id's.

>  2018-10-09 20:49:16 4854717 [29802]: r2320 write_host_block host_id
>  19 flags 1 gen 1 dblock 29802:510:140245418403952:140245440585933:140245418403840:4:RELEASED.

Perfect, that shows exactly where the bug is.  There's a case where it has
missed saving the latest dblock values, so random values are being copied
back to the dblock instead.  I'll push out a patch once I've had a chance
to test it.

> does the "paxos_release skip write last lver" is abnormal?

Not very unusual when you have lock contention (multiple hosts trying to
acquire the same lock concurrently.)

Thanks,
Dave

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-11 15:55     ` David Teigland
@ 2018-10-11 18:58       ` David Teigland
  2018-10-12  8:58         ` Damon Wang
  0 siblings, 1 reply; 9+ messages in thread
From: David Teigland @ 2018-10-11 18:58 UTC (permalink / raw)
  To: Damon Wang; +Cc: linux-lvm

> >  2018-10-09 20:49:16 4854717 [29802]: r2320 write_host_block host_id
> >  19 flags 1 gen 1 dblock 29802:510:140245418403952:140245440585933:140245418403840:4:RELEASED.
> 
> Perfect, that shows exactly where the bug is.  There's a case where it has
> missed saving the latest dblock values, so random values are being copied
> back to the dblock instead.  I'll push out a patch once I've had a chance
> to test it.

Here's the fix:
https://www.pagure.io/sanlock/c/1471e241a33abc1f0774ce7ec418a134b655f9ac?branch=master

Thanks for the help finding that.
Dave

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-11 18:58       ` David Teigland
@ 2018-10-12  8:58         ` Damon Wang
  2018-10-12 14:21           ` David Teigland
  0 siblings, 1 reply; 9+ messages in thread
From: Damon Wang @ 2018-10-12  8:58 UTC (permalink / raw)
  To: David Teigland; +Cc: LVM general discussion and development

Hi,

Thanks for your mail. I read the patch and have some questions:

1. Is this bug easy to reproduce (it seems to require that the lease
owner was committed by a host that is not the lease leader)?

2. I see that commit 6501351b742, which introduced the bug, was first
released in 3.6.0. If I want to fix it, which is recommended: building
from master, or downgrading to 3.5.0?

Thanks again,
Damon
On Fri, Oct 12, 2018 at 2:58 AM David Teigland <teigland@redhat.com> wrote:
>
> > >  2018-10-09 20:49:16 4854717 [29802]: r2320 write_host_block host_id
> > >  19 flags 1 gen 1 dblock 29802:510:140245418403952:140245440585933:140245418403840:4:RELEASED.
> >
> > Perfect, that shows exactly where the bug is.  There's a case where it has
> > missed saving the latest dblock values, so random values are being copied
> > back to the dblock instead.  I'll push out a patch once I've had a chance
> > to test it.
>
> Here's the fix:
> https://www.pagure.io/sanlock/c/1471e241a33abc1f0774ce7ec418a134b655f9ac?branch=master
>
> Thanks for the help finding that.
> Dave

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-12  8:58         ` Damon Wang
@ 2018-10-12 14:21           ` David Teigland
  0 siblings, 0 replies; 9+ messages in thread
From: David Teigland @ 2018-10-12 14:21 UTC (permalink / raw)
  To: Damon Wang; +Cc: LVM general discussion and development

On Fri, Oct 12, 2018 at 04:58:45PM +0800, Damon Wang wrote:
> 1. Is this bug easy to reproduce (it seems to require that the lease
> owner was committed by a host that is not the lease leader)?

It will happen occasionally, depending entirely on timing, when there are
commands running concurrently on different hosts that need the same lock.
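
A rough way to trigger it (not a guaranteed reproducer, it depends
entirely on timing) is to loop shared activation of the same LV from two
or more hosts at once:

    # run this concurrently on several hosts; VG/LV is a placeholder
    for i in $(seq 1 100); do
        lvchange -asy VG/LV
        lvchange -an  VG/LV
    done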

> 2. I see that commit 6501351b742, which introduced the bug, was first
> released in 3.6.0. If I want to fix it, which is recommended: building
> from master, or downgrading to 3.5.0?

I suggest applying the patch onto 3.6.0.
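
Roughly, assuming you build from the pagure git tree (the 3.6.0 tag name
below is a guess, check "git tag" for the exact name):

    git clone https://pagure.io/sanlock.git
    cd sanlock
    git checkout sanlock-3.6.0
    git cherry-pick 1471e241a33abc1f0774ce7ec418a134b655f9ac
    make -C wdmd && make -C src   # then install or package as usual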

Dave

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-10-12 14:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-10  7:51 [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." Damon Wang
2018-10-10  7:58 ` Damon Wang
2018-10-10  8:12   ` Damon Wang
2018-10-10 19:08 ` David Teigland
2018-10-11 13:03   ` Damon Wang
2018-10-11 15:55     ` David Teigland
2018-10-11 18:58       ` David Teigland
2018-10-12  8:58         ` Damon Wang
2018-10-12 14:21           ` David Teigland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).