* [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
@ 2018-10-10  7:51 Damon Wang
  2018-10-10  7:58 ` Damon Wang
  2018-10-10 19:08 ` David Teigland
  0 siblings, 2 replies; 9+ messages in thread
From: Damon Wang @ 2018-10-10 7:51 UTC (permalink / raw)
To: LVM general discussion and development

Hi,

I cannot create an LV in one environment, and after some checking I found that a server has been holding the VGLK without releasing it for a long time.

This is what "sanlock client status" prints:

[root@HLD-1-4-S10 ~]# sanlock client status
daemon 091c17d0-822c6a4f-HLD-1-4-S10
p -1 helper
p -1 listener
p 27919 lvmlockd
p 27919 lvmlockd
p 27919 lvmlockd
p 27919 lvmlockd
p -1 status
s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0
r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH p 27919 <------
r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH p 27919
r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4 p 27919
... ...
r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4 p 27919
r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH p 27919 ADD <------

As this shows, sanlock is adding a lock for "lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD" while holding the VGLK in shared mode, so a command is hanging:

[root@HLD-1-4-S10 ~]# ps aux | grep lv
root 7755 0.0 0.0 160248 5724 ? S 14:12 0:00 lvchange -asy /dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0 <------
root 11335 0.0 0.0 112660 972 pts/0 S+ 14:40 0:00 grep --color=auto lv
root 27919 0.0 0.0 423860 5088 ? Ssl 11:07 0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40
root 27926 0.0 0.0 114476 1116 ? Ssl 11:07 0:00 /usr/sbin/lvmetad -f

As "ps" shows, I run lvmlockd in daemon-debug mode, so it prints its log in /var/log/messages:

Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root.
Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root.
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode sh flags 0
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101 0 3301
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301 our 3292
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg version 3301
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send lvchange[7755] cl 849 lock vg rv 0
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv lvchange[7755] cl 849 lock lv "b075258f5b9547d7b4464fff246bbce1" mode sh flags 1
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh (e36d96c2a29716b999b2a33aa16e94e0)
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056
Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of user root.
Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root.
Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root.
Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10 14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK
Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root.
Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root.
Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root.
Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root.
Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10 14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK
Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root.
Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root.
Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi 3 fd 14

The timestamps match: it took the VGLK at 14:12:36 and never released it. So I kept digging in the sanlock log and found this:

2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 lvm_091c17d096894a179313aad65cb4127a
2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2
2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 lvm_b075258f5b9547d7b4464fff246bbce1
2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2
2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:33 11492 [27916]: set_lockspace_event lvm_091c17d096894a179313aad65cb4127a
2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0
2018-10-10 14:12:33 11492 [27917]: set_lockspace_event lvm_b075258f5b9547d7b4464fff246bbce1
2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0
2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read
2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write for new ts 11493
2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493 delta_length 0 interval 81
2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in 5 fd 15 count 1 flags 9
2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH for 5,15,27919
2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0
2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66 2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1,
2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock 140695692434049 140695692434049 49 2 4303 42658
2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free
2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1 write mbal 140695692506049
2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1 read 48:140695692506049:0:0:0:0:42694:0,
2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2 write bal 140695692506049 inp 49 2 11494 q_max -1
2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2 read 48:140695692506049:140695692506049:49:2:11494:42694:0,
2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit self owner 49 2 11494
2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1 lver 42694@11494
2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block host_id 49 flags 1 gen 2 dblock 140695692506049:140695692506049:49:2:11494:42694:RELEASED.
2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader 42694 owner 49 2 11494
2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 result 0 pid_dead 0
2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17 result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK
2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in 5 fd 15 count 1 flags 8
2018-10-10 14:12:36 11494 [27916]: s4:r3098 resource lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH for 5,15,27919
2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
2018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1 4854717 dblocks 8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,
2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0 3
2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717 host_status 19 2 10914 wait_start 11494
2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717 delta 19 2 10995 mismatch
2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
2018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write mbal 218748049
2018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218748049:0:0:0:0:3
2018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry delay 887122 us
2018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
2018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write mbal 218750049
2018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218750049:0:0:0:0:3
2018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
2018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry delay 455075 us
2018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
2018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write mbal 218752049
2018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218752049:0:0:0:0:3
2018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
2018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry delay 904853 us
2018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 lvm_671bfaeb4e344014b041fc78115307bb
2018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2
2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 lvm_4c149e7262a64c728997954257bece01
2018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 5
2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11496 [27917]: set_lockspace_event lvm_671bfaeb4e344014b041fc78115307bb
2018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
2018-10-10 14:12:38 11496 [27917]: set_lockspace_event lvm_4c149e7262a64c728997954257bece01
2018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
2018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 lvm_091c17d096894a179313aad65cb4127a
2018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 2
2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 lvm_b075258f5b9547d7b4464fff246bbce1
2018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3
2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
2018-10-10 14:12:38 11497 [27917]: set_lockspace_event lvm_091c17d096894a179313aad65cb4127a
2018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
2018-10-10 14:12:38 11497 [27917]: set_lockspace_event lvm_b075258f5b9547d7b4464fff246bbce1
2018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
2018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
2018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write mbal 218754049
2018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218754049:0:0:0:0:3
2018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
2018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry delay 981162 us
2018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717

This is confusing to me: what does "ballot 3 abort1 larger lver in bk" mean, and what causes it? It seems related to the bug I reported earlier, where an io timeout while releasing the VG lock causes problems. Besides, I found that some hosts have a different sanlock io timeout than the others; is this the root cause?

Thanks,
Damon

^ permalink raw reply	[flat|nested] 9+ messages in thread
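For readers puzzling over the same message: "abort1" means sanlock's paxos acquire aborted phase 1 of a ballot because a dblock it read back from disk refers to a newer lease version (lver) than the one it is balloting for, so it backs off and retries. A minimal sketch of that check, with illustrative names and structure (the real, more involved logic lives in sanlock's paxos_lease.c):

```python
# Hypothetical simplification of the phase-1 check behind sanlock's
# "ballot N abort1 larger lver in bk[i]" message. Field names mirror the
# dblock fields printed in the log (mbal:bal:inp...:lver); this is an
# illustration, not sanlock's actual implementation.
from dataclasses import dataclass

@dataclass
class DBlock:
    mbal: int   # highest ballot number this host has proposed
    bal: int    # ballot number of the last value it accepted
    lver: int   # lease version this dblock refers to

def phase1_check(our: DBlock, blocks: list[DBlock]) -> str:
    """Return 'ok' if phase 1 may proceed, else an abort reason."""
    for i, bk in enumerate(blocks):
        if bk.lver > our.lver:
            # Another host's dblock is working on a newer lease version
            # than ours: abort this ballot and retry from the top.
            return f"abort1 larger lver in bk[{i}]"
        if bk.lver == our.lver and bk.mbal > our.mbal:
            # A competing proposer used a higher ballot number.
            return f"abort1 larger mbal in bk[{i}]"
    return "ok"
```

In the log above, bk[18] carries lver 4 (29802:510:...:4) while our dblock carries lver 3 (218748049:0:0:0:0:3), so by this rule phase 1 aborts every time and the acquire keeps retrying.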
* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..."
  2018-10-10  7:51 [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." Damon Wang
@ 2018-10-10  7:58 ` Damon Wang
  2018-10-10  8:12   ` Damon Wang
  2018-10-10 19:08 ` David Teigland
  1 sibling, 1 reply; 9+ messages in thread
From: Damon Wang @ 2018-10-10 7:58 UTC (permalink / raw)
To: LVM general discussion and development

The log format of the previous mail came out garbled, so please read this one instead. Thanks.

Hi,

I cannot create an LV in one environment, and after some checking I found that a server has been holding the VGLK without releasing it for a long time.

This is what "sanlock client status" prints:

[root@HLD-1-4-S10 ~]# sanlock client status
daemon 091c17d0-822c6a4f-HLD-1-4-S10
p -1 helper
p -1 listener
p 27919 lvmlockd
p 27919 lvmlockd
p 27919 lvmlockd
p 27919 lvmlockd
p -1 status
s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0
r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH p 27919 <------
r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH p 27919
r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4 p 27919
... ...
r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4 p 27919
r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH p 27919 ADD <------

As this shows, sanlock is adding a lock for "lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD" while holding the VGLK in shared mode, so a command is hanging:

[root@HLD-1-4-S10 ~]# ps aux | grep lv
root 7755 0.0 0.0 160248 5724 ? S 14:12 0:00 lvchange -asy /dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0
root 11335 0.0 0.0 112660 972 pts/0 S+ 14:40 0:00 grep --color=auto lv
root 27919 0.0 0.0 423860 5088 ? Ssl 11:07 0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40
root 27926 0.0 0.0 114476 1116 ? Ssl 11:07 0:00 /usr/sbin/lvmetad -f

As "ps" shows, I run lvmlockd in daemon-debug mode, so it prints its log in /var/log/messages:

Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root.
Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root.
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode sh flags 0
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101 0 3301
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301 our 3292
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg version 3301
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send lvchange[7755] cl 849 lock vg rv 0
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv lvchange[7755] cl 849 lock lv "b075258f5b9547d7b4464fff246bbce1" mode sh flags 1
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh (e36d96c2a29716b999b2a33aa16e94e0)
Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S lvm_b075258f5b9547d7b4464fff246bbce1 R u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056
Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of user root.
Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root.
Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root.
Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10 14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root. Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root. Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root. Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root. Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10 14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root. Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root. Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi 3 fd 14 the timestamp matches, it holds VGLK at 14:12:36 and not release. So I keep digging log of sanlock and find this: 2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 lvm_091c17d096894a179313aad65cb4127a 2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0 2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 lvm_b075258f5b9547d7b4464fff246bbce1 2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0 2018-10-10 14:12:33 11492 [27916]: set_lockspace_event lvm_091c17d096894a179313aad65cb4127a 2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0 2018-10-10 14:12:33 11492 [27917]: set_lockspace_event lvm_b075258f5b9547d7b4464fff246bbce1 2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0 2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read 2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write for new ts 11493 2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493 delta_length 0 interval 81 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in 5 
fd 15 count 1 flags 9 2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH for 5,15,27919 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0 2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66 2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1, 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock 140695692434049 140695692434049 49 2 4303 42658 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1 write mbal 140695692506049 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1 read 48:140695692506049:0:0:0:0:42694:0, 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2 write bal 140695692506049 inp 49 2 11494 q_max -1 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2 read 48:140695692506049:140695692506049:49:2:11494:42694:0, 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit self owner 49 2 11494 2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1 lver 42694 at 11494 2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block host_id 49 flags 1 gen 2 dblock 140695692506049:140695692506049:49:2:11494:42694:RELEASED. 
2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader 42694 owner 49 2 11494 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 result 0 pid_dead 0 2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17 result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK 018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in 5 fd 15 count 1 flags 8 018-10-10 14:12:36 11494 [27916]: s4:r3098 resource lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH for 5,15,27919 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0 018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1 4854717 dblocks 8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0, 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0 3 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717 host_status 19 2 10914 wait_start 11494 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717 delta 19 2 10995 mismatch 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write mbal 218748049 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218748049:0:0:0:0:3 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1, 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry delay 887122 us 018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write mbal 218750049 018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 
29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218750049:0:0:0:0:3 018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1, 018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry delay 455075 us 018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write mbal 218752049 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218752049:0:0:0:0:3 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1, 018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry delay 904853 us 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 lvm_671bfaeb4e344014b041fc78115307bb 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 lvm_4c149e7262a64c728997954257bece01 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 5 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11496 [27917]: set_lockspace_event lvm_671bfaeb4e344014b041fc78115307bb 018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0 018-10-10 14:12:38 11496 [27917]: set_lockspace_event lvm_4c149e7262a64c728997954257bece01 018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 lvm_091c17d096894a179313aad65cb4127a 018-10-10 14:12:38 11497 
[27917]: cmd_examine 6,20 done 2 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 lvm_b075258f5b9547d7b4464fff246bbce1 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 018-10-10 14:12:38 11497 [27917]: set_lockspace_event lvm_091c17d096894a179313aad65cb4127a 018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0 018-10-10 14:12:38 11497 [27917]: set_lockspace_event lvm_b075258f5b9547d7b4464fff246bbce1 018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0 018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write mbal 218754049 018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218754049:0:0:0:0:3 018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1, 018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry delay 981162 us 018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 This is confusing to me, what means "ballot 3 abort1 larger lver inbk" and what causes this? Seems related to the bug I reported that io timeout when release vglock causes problems? Besides, I found there are some hosts has different sanlock io timeout than others, is this the root cause? Thanks, Damon On Wed, Oct 10, 2018@3:51 PM Damon Wang <damon.devops@gmail.com> wrote: > > Hi, > > I can not create lv in an environment and after some check I found a > server has hold the VGLK and not released for long time. 
> > this is "sanlock client status" prints: > > [root@HLD-1-4-S10 ~]# sanlock client status > daemon 091c17d0-822c6a4f-HLD-1-4-S10 > p -1 helper > p -1 listener > p 27919 lvmlockd > p 27919 lvmlockd > p 27919 lvmlockd > p 27919 lvmlockd > p -1 status > s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0 > s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0 > s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0 > s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0 > r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH > p 27919 <------ > r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH > p 27919 > r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4 > p 27919 > ... ... > r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4 > p 27919 > r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH > p 27919 ADD <------ > > As it shows, it is adding lock for > "lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD" > and holding VGLK with shared mode, so there is an command hanging: > > [root@HLD-1-4-S10 ~]# ps aux | grep lv > root 7755 0.0 0.0 160248 5724 ? S 14:12 > 0:00 lvchange -asy > /dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0 > <------ > root 11335 0.0 0.0 112660 972 pts/0 S+ 14:40 > 0:00 grep --color=auto lv > root 27919 0.0 0.0 423860 5088 ? 
Ssl 11:07 > 0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40 > root 27926 0.0 0.0 114476 1116 ? Ssl 11:07 > 0:00 /usr/sbin/lvmetad -f > > As "ps" shows, I use lvmlockd with damon debug mode, so it will print > log in messages: > > Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0 > disabled perfctr wrmsr: 0xc1 data 0xabcd > Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root. > Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root. > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv > lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode > sh flags 0 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at > /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101 > 0 3301 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301 > our 3292 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg > version 3301 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send > lvchange[7755] cl 849 lock vg rv 0 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv > lvchange[7755] cl 849 lock lv "b075258f5b9547d7b4464fff246bbce1" mode > sh flags 1 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 
R
> u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh
> (e36d96c2a29716b999b2a33aa16e94e0)
> Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S
> lvm_b075258f5b9547d7b4464fff246bbce1 R
> u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at
> /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056
> Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of
> user root.
> Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root.
> Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root.
> Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10
> 14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK
> Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root.
> Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root.
> Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root.
> Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root.
> Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10
> 14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK
> Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root.
> Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root.
> Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi
> 3 fd 14
>
> The timestamps match: it acquired the VGLK at 14:12:36 and has not released it since.
> So I kept digging through the sanlock log and found this:
>
> 2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17
> lvm_091c17d096894a179313aad65cb4127a
> 2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2
> 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> 2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17
> lvm_b075258f5b9547d7b4464fff246bbce1
> 2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2
> 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> 2018-10-10 14:12:33 11492 [27916]: set_lockspace_event
> lvm_091c17d096894a179313aad65cb4127a
> 2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0
> 2018-10-10 14:12:33 11492 [27917]: set_lockspace_event
> lvm_b075258f5b9547d7b4464fff246bbce1
> 2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0
> 2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read
> 2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write
> for new ts 11493
> 2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493
> delta_length 0 interval 81
> 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
> ci_in 5 fd 15 count 1 flags 9
> 2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource
> lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
> for 5,15,27919
> 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0
> 2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66
> 2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1,
> 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader
> 42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock
> 140695692434049 140695692434049 49 2 4303 42658
> 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free
> 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
> write mbal 140695692506049
>
2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1
> read 48:140695692506049:0:0:0:0:42694:0,
> 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
> write bal 140695692506049 inp 49 2 11494 q_max -1
> 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2
> read 48:140695692506049:140695692506049:49:2:11494:42694:0,
> 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit
> self owner 49 2 11494
> 2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1
> lver 42694 at 11494
> 2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block
> host_id 49 flags 1 gen 2 dblock
> 140695692506049:140695692506049:49:2:11494:42694:RELEASED.
> 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader
> 42694 owner 49 2 11494
> 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919
> result 0 pid_dead 0
> 2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17
> result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK
> 018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in
> 5 fd 15 count 1 flags 8
> 018-10-10 14:12:36 11494 [27916]: s4:r3098 resource
> lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
> for 5,15,27919
> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
> 018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1
> 4854717 dblocks
> 8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,
> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0
> 3
> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
> 1 4854717 host_status 19 2 10914 wait_start 11494
> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19
> 1 4854717 delta 19 2 10995 mismatch
> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2
> owner 19 1
> 4854717
> 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write
> mbal 218748049
> 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger
> lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> our dblock 218748049:0:0:0:0:3
> 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read
> 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry
> delay 887122 us
> 018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717
> 018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write
> mbal 218750049
> 018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 abort1 larger
> lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> our dblock 218750049:0:0:0:0:3
> 018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read
> 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> 018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry
> delay 455075 us
> 018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717
> 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write
> mbal 218752049
> 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger
> lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> our dblock 218752049:0:0:0:0:3
> 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read
> 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> 018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry
> delay 904853 us
> 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
> lvm_671bfaeb4e344014b041fc78115307bb
> 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2
> 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20
> lvm_4c149e7262a64c728997954257bece01
> 018-10-10 14:12:38 11496
> [27917]: cmd_examine 6,20 done 5
> 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11496 [27917]: set_lockspace_event
> lvm_671bfaeb4e344014b041fc78115307bb
> 018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
> 018-10-10 14:12:38 11496 [27917]: set_lockspace_event
> lvm_4c149e7262a64c728997954257bece01
> 018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
> 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
> lvm_091c17d096894a179313aad65cb4127a
> 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 2
> 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20
> lvm_b075258f5b9547d7b4464fff246bbce1
> 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3
> 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> 018-10-10 14:12:38 11497 [27917]: set_lockspace_event
> lvm_091c17d096894a179313aad65cb4127a
> 018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
> 018-10-10 14:12:38 11497 [27917]: set_lockspace_event
> lvm_b075258f5b9547d7b4464fff246bbce1
> 018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
> 018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717
> 018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write
> mbal 218754049
> 018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger
> lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4
> our dblock 218754049:0:0:0:0:3
> 018-10-10
14:12:39 11497 [27916]: r3098 ballot 3 phase1 read
> 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> 018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry
> delay 981162 us
> 018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2
> owner 19 1 4854717
>
> This is confusing to me: what does "ballot 3 abort1 larger lver in
> bk" mean, and what causes it?
> Could it be related to the bug I reported earlier, where an io timeout
> while releasing the VG lock causes problems?
>
> Besides, I found that some hosts have a different sanlock io timeout
> than the others; could that be the root cause?
>
> Thanks,
> Damon

^ permalink raw reply [flat|nested] 9+ messages in thread
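As background for the io-timeout question above: sanlock derives its lease timings by scaling the io timeout, so hosts configured with different io timeouts renew and expire leases on different schedules. A minimal sketch of that arithmetic for the `--sanlock-timeout 40` seen in the ps output, assuming the 2x renewal and 8x failure multipliers described in sanlock's timeout documentation:

```shell
# Derived sanlock delta-lease timings for a given io timeout.
# The 2x and 8x multipliers are assumptions taken from sanlock's timeout
# documentation; verify them against the sanlock version in use.
io_timeout=40                          # lvmlockd was started with --sanlock-timeout 40
renewal_interval=$((2 * io_timeout))   # normal delta-lease renewal period
renewal_fail=$((8 * io_timeout))       # renewal considered failed after this long
echo "renew every ${renewal_interval}s, fail after ${renewal_fail}s"
```

With io_timeout 40 the renewal period works out to 80 seconds, which is consistent with the "renewed 11493 delta_length 0 interval 81" line in the sanlock log above.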
* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." 2018-10-10 7:58 ` Damon Wang @ 2018-10-10 8:12 ` Damon Wang 0 siblings, 0 replies; 9+ messages in thread
From: Damon Wang @ 2018-10-10 8:12 UTC (permalink / raw)
To: LVM general discussion and development

It seems the mail server does not handle long text well, so I have pasted the logs to pastebin:

messages: https://pastebin.com/FUHLuWry
sanlock.log: https://pastebin.com/5JwmSCWP
sanlock client status: https://pastebin.com/9cGbav5f
ps aux: https://pastebin.com/6t6SQU6X

On Wed, Oct 10, 2018 at 3:58 PM Damon Wang <damon.devops@gmail.com> wrote:
>
> the log format of the previous mail came out garbled, so please read this one instead, thanks.
>
>
> Hi,
>
> I cannot create an LV in one environment, and after some checking I found
> a server has held the VGLK without releasing it for a long time.
>
> This is what "sanlock client status" prints:
>
> [root@HLD-1-4-S10 ~]# sanlock client status
> daemon 091c17d0-822c6a4f-HLD-1-4-S10
> p -1 helper
> p -1 listener
> p 27919 lvmlockd
> p 27919 lvmlockd
> p 27919 lvmlockd
> p 27919 lvmlockd
> p -1 status
> s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
> s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
> s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
> s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0
> r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH
> p 27919 <------
> r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH
> p 27919
> r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4
> p 27919
> ... ...
> r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4 > p 27919 > r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH > p 27919 ADD <------ > > As it shows, it is adding lock for > "lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD" > and holding VGLK with shared mode, so there is an command hanging: > > [root@HLD-1-4-S10 ~]# ps aux | grep lv > root 7755 0.0 0.0 160248 5724 ? S 14:12 > 0:00 lvchange -asy > /dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0 > root 11335 0.0 0.0 112660 972 pts/0 S+ 14:40 > 0:00 grep --color=auto lv > root 27919 0.0 0.0 423860 5088 ? Ssl 11:07 > 0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40 > root 27926 0.0 0.0 114476 1116 ? Ssl 11:07 > 0:00 /usr/sbin/lvmetad -f > > As "ps" shows, I use lvmlockd with damon debug mode, so it will print > log in messages: > > Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0 > disabled perfctr wrmsr: 0xc1 data 0xabcd > Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root. > Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root. 
> Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv > lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode > sh flags 0 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at > /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101 > 0 3301 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301 > our 3292 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg > version 3301 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send > lvchange[7755] cl 849 lock vg rv 0 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv > lvchange[7755] cl 849 lock lv "b075258f5b9547d7b4464fff246bbce1" mode > sh flags 1 > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh > (e36d96c2a29716b999b2a33aa16e94e0) > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > lvm_b075258f5b9547d7b4464fff246bbce1 R > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at > /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056 > Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of > user root. > Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root. 
> Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root. > Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10 > 14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK > Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root. > Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root. > Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root. > Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root. > Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10 > 14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK > Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root. > Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root. > Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi > 3 fd 14 > > the timestamp matches, it holds VGLK at 14:12:36 and not release. > So I keep digging log of sanlock and find this: > > 2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 > lvm_091c17d096894a179313aad65cb4127a > 2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2 > 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0 > 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0 > 2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 > lvm_b075258f5b9547d7b4464fff246bbce1 > 2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2 > 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0 > 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0 > 2018-10-10 14:12:33 11492 [27916]: set_lockspace_event > lvm_091c17d096894a179313aad65cb4127a > 2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0 > 2018-10-10 14:12:33 11492 [27917]: set_lockspace_event > lvm_b075258f5b9547d7b4464fff246bbce1 > 2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0 > 2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read > 2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write > for new ts 11493 
> 2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493 > delta_length 0 interval 81 > 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 > ci_in 5 fd 15 count 1 flags 9 > 2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource > lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH > for 5,15,27919 > 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0 > 2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66 > 2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1, > 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader > 42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock > 140695692434049 140695692434049 49 2 4303 42658 > 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1 > write mbal 140695692506049 > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1 > read 48:140695692506049:0:0:0:0:42694:0, > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2 > write bal 140695692506049 inp 49 2 11494 q_max -1 > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2 > read 48:140695692506049:140695692506049:49:2:11494:42694:0, > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit > self owner 49 2 11494 > 2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1 > lver 42694 at 11494 > 2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block > host_id 49 flags 1 gen 2 dblock > 140695692506049:140695692506049:49:2:11494:42694:RELEASED. 
> 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader > 42694 owner 49 2 11494 > 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 > result 0 pid_dead 0 > 2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17 > result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK > 018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in > 5 fd 15 count 1 flags 8 > 018-10-10 14:12:36 11494 [27916]: s4:r3098 resource > lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH > for 5,15,27919 > 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0 > 018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1 > 4854717 dblocks > 8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0, > 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 > owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0 > 3 > 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 > 1 4854717 host_status 19 2 10914 wait_start 11494 > 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 > 1 4854717 delta 19 2 10995 mismatch > 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 > owner 19 1 4854717 > 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write > mbal 218748049 > 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger > lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 > our dblock 218748049:0:0:0:0:3 > 018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read > 18:29802:510:140245418403952:140245440585933:140245418403840:4:1, > 018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry > delay 887122 us > 018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2 > owner 19 1 4854717 > 018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write > mbal 218750049 > 018-10-10 14:12:37 11495 
[27916]: r3098 ballot 3 abort1 larger > lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 > our dblock 218750049:0:0:0:0:3 > 018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read > 18:29802:510:140245418403952:140245440585933:140245418403840:4:1, > 018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry > delay 455075 us > 018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2 > owner 19 1 4854717 > 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write > mbal 218752049 > 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger > lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 > our dblock 218752049:0:0:0:0:3 > 018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read > 18:29802:510:140245418403952:140245440585933:140245418403840:4:1, > 018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry > delay 904853 us > 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 > lvm_671bfaeb4e344014b041fc78115307bb > 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2 > 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 > lvm_4c149e7262a64c728997954257bece01 > 018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 5 > 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11496 [27917]: set_lockspace_event > lvm_671bfaeb4e344014b041fc78115307bb > 018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0 > 018-10-10 14:12:38 11496 [27917]: set_lockspace_event > lvm_4c149e7262a64c728997954257bece01 > 018-10-10 14:12:38 11496 [27917]: 
cmd_set_event result 0 > 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 > lvm_091c17d096894a179313aad65cb4127a > 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 2 > 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 > lvm_b075258f5b9547d7b4464fff246bbce1 > 018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3 > 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0 > 018-10-10 14:12:38 11497 [27917]: set_lockspace_event > lvm_091c17d096894a179313aad65cb4127a > 018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0 > 018-10-10 14:12:38 11497 [27917]: set_lockspace_event > lvm_b075258f5b9547d7b4464fff246bbce1 > 018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0 > 018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2 > owner 19 1 4854717 > 018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write > mbal 218754049 > 018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger > lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 > our dblock 218754049:0:0:0:0:3 > 018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read > 18:29802:510:140245418403952:140245440585933:140245418403840:4:1, > 018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry > delay 981162 us > 018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2 > owner 19 1 4854717 > > > This is confusing to me, what means "ballot 3 abort1 larger lver inbk" > and what causes this? > Seems related to the bug I reported that io timeout when release > vglock causes problems? > > Besides, I found there are some hosts has different sanlock io timeout > than others, is this the root cause? 
> > Thanks, > Damon > On Wed, Oct 10, 2018 at 3:51 PM Damon Wang <damon.devops@gmail.com> wrote: > > > > Hi, > > > > I can not create lv in an environment and after some check I found a > > server has hold the VGLK and not released for long time. > > > > this is "sanlock client status" prints: > > > > [root@HLD-1-4-S10 ~]# sanlock client status > > daemon 091c17d0-822c6a4f-HLD-1-4-S10 > > p -1 helper > > p -1 listener > > p 27919 lvmlockd > > p 27919 lvmlockd > > p 27919 lvmlockd > > p 27919 lvmlockd > > p -1 status > > s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0 > > s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0 > > s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0 > > s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0 > > r lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH > > p 27919 <------ > > r lvm_4c149e7262a64c728997954257bece01:0fiorC-82Su-yVRq-dQPe-ed93-xYC5-Pou1i7:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:101711872:SH > > p 27919 > > r lvm_4c149e7262a64c728997954257bece01:TMG0ev-vvP0-kVWU-xRLE-G9y3-kMf0-bIgIrc:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:111149056:4 > > p 27919 > > ... ... > > r lvm_4c149e7262a64c728997954257bece01:yyjuZt-euSf-ZCG6-eJWF-a0f4-MtpA-KX2Jzr:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:119537664:4 > > p 27919 > > r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH > > p 27919 ADD <------ > > > > As it shows, it is adding lock for > > "lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD" > > and holding VGLK with shared mode, so there is an command hanging: > > > > [root@HLD-1-4-S10 ~]# ps aux | grep lv > > root 7755 0.0 0.0 160248 5724 ? 
S 14:12 > > 0:00 lvchange -asy > > /dev/b075258f5b9547d7b4464fff246bbce1/e36d96c2a29716b999b2a33aa16e94e0 > > <------ > > root 11335 0.0 0.0 112660 972 pts/0 S+ 14:40 > > 0:00 grep --color=auto lv > > root 27919 0.0 0.0 423860 5088 ? Ssl 11:07 > > 0:00 /usr/sbin/lvmlockd --daemon-debug --sanlock-timeout 40 > > root 27926 0.0 0.0 114476 1116 ? Ssl 11:07 > > 0:00 /usr/sbin/lvmetad -f > > > > As "ps" shows, I use lvmlockd with damon debug mode, so it will print > > log in messages: > > > > Oct 10 14:07:13 HLD-1-4-S10 kernel: kvm [39758]: vcpu0 > > disabled perfctr wrmsr: 0xc1 data 0xabcd > > Oct 10 14:10:01 HLD-1-4-S10 systemd: Started Session 248 of user root. > > Oct 10 14:10:01 HLD-1-4-S10 systemd: Starting Session 248 of user root. > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 new cl 849 pi 2 fd 9 > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv > > lvchange[7755] cl 849 lock vg "b075258f5b9547d7b4464fff246bbce1" mode > > sh flags 0 > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK action lock sh > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock cl 849 mode sh > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK lock_san sh at > > /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016 > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock rv 0 read vb 101 > > 0 3301 > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock got version 3301 > > our 3292 > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R VGLK res_lock set lvmetad vg > > version 3301 > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 send > > lvchange[7755] cl 849 lock vg rv 0 > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 recv > > lvchange[7755] cl 849 
lock lv "b075258f5b9547d7b4464fff246bbce1" mode > > sh flags 1 > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R > > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD action lock sh > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R > > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD res_lock cl 849 mode sh > > (e36d96c2a29716b999b2a33aa16e94e0) > > Oct 10 14:12:36 HLD-1-4-S10 lvmlockd: 1539151956 S > > lvm_b075258f5b9547d7b4464fff246bbce1 R > > u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD lock_san sh at > > /dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056 > > Oct 10 14:15:23 HLD-1-4-S10 systemd-logind: New session 249 of > > user root. > > Oct 10 14:15:23 HLD-1-4-S10 systemd: Started Session 249 of user root. > > Oct 10 14:15:23 HLD-1-4-S10 systemd: Starting Session 249 of user root. > > Oct 10 14:18:12 HLD-1-4-S10 sanlock[27912]: 2018-10-10 > > 14:18:12 11831 [27917]: cmd_release 2,9,27919 no resource VGLK > > Oct 10 14:20:01 HLD-1-4-S10 systemd: Started Session 250 of user root. > > Oct 10 14:20:01 HLD-1-4-S10 systemd: Starting Session 250 of user root. > > Oct 10 14:30:01 HLD-1-4-S10 systemd: Started Session 251 of user root. > > Oct 10 14:30:01 HLD-1-4-S10 systemd: Starting Session 251 of user root. > > Oct 10 14:32:22 HLD-1-4-S10 sanlock[27912]: 2018-10-10 > > 14:32:22 12680 [27917]: cmd_release 2,9,27919 no resource VGLK > > Oct 10 14:40:01 HLD-1-4-S10 systemd: Started Session 252 of user root. > > Oct 10 14:40:01 HLD-1-4-S10 systemd: Starting Session 252 of user root. > > Oct 10 14:40:28 HLD-1-4-S10 lvmlockd: 1539153628 new cl 850 pi > > 3 fd 14 > > > > the timestamp matches, it holds VGLK at 14:12:36 and not release. 
> > So I kept digging in the sanlock log and found this:
> >
> > 2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 lvm_091c17d096894a179313aad65cb4127a
> > 2018-10-10 14:12:33 11492 [27916]: cmd_examine 6,17 done 2
> > 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 lvm_b075258f5b9547d7b4464fff246bbce1
> > 2018-10-10 14:12:33 11492 [27917]: cmd_examine 6,17 done 2
> > 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:33 11492 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:33 11492 [27916]: set_lockspace_event lvm_091c17d096894a179313aad65cb4127a
> > 2018-10-10 14:12:33 11492 [27916]: cmd_set_event result 0
> > 2018-10-10 14:12:33 11492 [27917]: set_lockspace_event lvm_b075258f5b9547d7b4464fff246bbce1
> > 2018-10-10 14:12:33 11492 [27917]: cmd_set_event result 0
> > 2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin read
> > 2018-10-10 14:12:34 11493 [9550]: s2 delta_renew begin write for new ts 11493
> > 2018-10-10 14:12:34 11493 [9550]: s2 renewed 11493 delta_length 0 interval 81
> > 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in 5 fd 15 count 1 flags 9
> > 2018-10-10 14:12:36 11494 [27916]: s4:r3097 resource lvm_b075258f5b9547d7b4464fff246bbce1:VGLK:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:69206016:SH for 5,15,27919
> > 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire begin e 0 0
> > 2018-10-10 14:12:36 11494 [27916]: r3097 leader 42693 owner 66 2 0 dblocks 65:140695692504066:140695692504066:66:2:3805:42693:1,
> > 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 owner 66 2 0 max mbal[65] 140695692504066 our_dblock 140695692434049 140695692434049 49 2 4303 42658
> > 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_acquire leader 42693 free
> > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1 write mbal 140695692506049
> > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase1 read 48:140695692506049:0:0:0:0:42694:0,
> > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2 write bal 140695692506049 inp 49 2 11494 q_max -1
> > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 phase2 read 48:140695692506049:140695692506049:49:2:11494:42694:0,
> > 2018-10-10 14:12:36 11494 [27916]: r3097 ballot 42694 commit self owner 49 2 11494
> > 2018-10-10 14:12:36 11494 [27916]: r3097 acquire_disk rv 1 lver 42694 at 11494
> > 2018-10-10 14:12:36 11494 [27916]: r3097 write_host_block host_id 49 flags 1 gen 2 dblock 140695692506049:140695692506049:49:2:11494:42694:RELEASED.
> > 2018-10-10 14:12:36 11494 [27916]: r3097 paxos_release leader 42694 owner 49 2 11494
> > 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 result 0 pid_dead 0
> > 2018-10-10 14:12:36 11494 [27917]: cmd_get_lvb ci 6 fd 17 result 0 res lvm_b075258f5b9547d7b4464fff246bbce1:VGLK
> > 2018-10-10 14:12:36 11494 [27916]: cmd_acquire 5,15,27919 ci_in 5 fd 15 count 1 flags 8
> > 2018-10-10 14:12:36 11494 [27916]: s4:r3098 resource lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH for 5,15,27919
> > 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
> > 2018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1 4854717 dblocks 8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,
> > 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0 3
> > 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717 host_status 19 2 10914 wait_start 11494
> > 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717 delta 19 2 10995 mismatch
> > 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
> > 2018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write mbal 218748049
> > 2018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218748049:0:0:0:0:3
> > 2018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> > 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire 3 retry delay 887122 us
> > 2018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
> > 2018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 write mbal 218750049
> > 2018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218750049:0:0:0:0:3
> > 2018-10-10 14:12:37 11495 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> > 2018-10-10 14:12:37 11495 [27916]: r3098 paxos_acquire 3 retry delay 455075 us
> > 2018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
> > 2018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 write mbal 218752049
> > 2018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218752049:0:0:0:0:3
> > 2018-10-10 14:12:38 11496 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> > 2018-10-10 14:12:38 11496 [27916]: r3098 paxos_acquire 3 retry delay 904853 us
> > 2018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 lvm_671bfaeb4e344014b041fc78115307bb
> > 2018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 2
> > 2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 lvm_4c149e7262a64c728997954257bece01
> > 2018-10-10 14:12:38 11496 [27917]: cmd_examine 6,20 done 5
> > 2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11496 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11496 [27917]: set_lockspace_event lvm_671bfaeb4e344014b041fc78115307bb
> > 2018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
> > 2018-10-10 14:12:38 11496 [27917]: set_lockspace_event lvm_4c149e7262a64c728997954257bece01
> > 2018-10-10 14:12:38 11496 [27917]: cmd_set_event result 0
> > 2018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 lvm_091c17d096894a179313aad65cb4127a
> > 2018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 2
> > 2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 lvm_b075258f5b9547d7b4464fff246bbce1
> > 2018-10-10 14:12:38 11497 [27917]: cmd_examine 6,20 done 3
> > 2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11497 [27918]: examine_token rv 1 lver 0 mode 0
> > 2018-10-10 14:12:38 11497 [27917]: set_lockspace_event lvm_091c17d096894a179313aad65cb4127a
> > 2018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
> > 2018-10-10 14:12:38 11497 [27917]: set_lockspace_event lvm_b075258f5b9547d7b4464fff246bbce1
> > 2018-10-10 14:12:38 11497 [27917]: cmd_set_event result 0
> > 2018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
> > 2018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 write mbal 218754049
> > 2018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218754049:0:0:0:0:3
> > 2018-10-10 14:12:39 11497 [27916]: r3098 ballot 3 phase1 read 18:29802:510:140245418403952:140245440585933:140245418403840:4:1,
> > 2018-10-10 14:12:39 11497 [27916]: r3098 paxos_acquire 3 retry delay 981162 us
> > 2018-10-10 14:12:40 11498 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
> >
> > This is confusing to me -- what does "ballot 3 abort1 larger lver in bk" mean, and what causes it? It seems related to the bug I reported where an I/O timeout while releasing the VG lock causes problems.
> >
> > Besides, I found that some hosts have a different sanlock io timeout than others -- is this the root cause?
> >
> > Thanks,
> > Damon ^ permalink raw reply [flat|nested] 9+ messages in thread
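[Editorial note: the "paxos_acquire 3 retry delay NNNNNN us" lines above follow a pattern that can be sketched roughly as below. This is an illustrative simplification with invented names, not sanlock's actual code: each failed ballot retries after a randomized sub-second delay, which reduces the chance of two hosts colliding again on the next attempt.]

```python
import random
import time

def acquire_with_retry(try_ballot, max_attempts=5):
    """Retry a ballot with a randomized delay, as seen in the log above.

    try_ballot is a callable returning True on success. The delay bound
    (0..1 second) is inferred from the logged values, e.g. "887122 us".
    """
    for _attempt in range(max_attempts):
        if try_ballot():
            return True
        delay_us = random.randint(0, 1_000_000)  # e.g. "retry delay 887122 us"
        time.sleep(delay_us / 1_000_000)
    return False

# A ballot that fails twice and then succeeds on the third attempt:
outcomes = iter([False, False, True])
print(acquire_with_retry(lambda: next(outcomes)))  # True
```

In the hung case discussed in this thread, the ballot never succeeds, so the loop above would keep retrying indefinitely (sanlock's real loop has no small attempt cap for this path).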
* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." 2018-10-10 7:51 [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." Damon Wang 2018-10-10 7:58 ` Damon Wang @ 2018-10-10 19:08 ` David Teigland 2018-10-11 13:03 ` Damon Wang 1 sibling, 1 reply; 9+ messages in thread From: David Teigland @ 2018-10-10 19:08 UTC (permalink / raw) To: Damon Wang; +Cc: linux-lvm

On Wed, Oct 10, 2018 at 03:51:14PM +0800, Damon Wang wrote:
> I can not create lv in an environment and after some check I found a
> server has hold the VGLK and not released for long time.

Hi, I'm curious how many hosts are using this VG? (It's not immediately relevant to this problem.) I'm confused about how this host is using different host_id's in each VG lockspace. I would normally expect that /etc/lvm/lvmlocal.conf contains local/host_id = N, and then to find host_id N in all the lvm lockspaces:

s lvm_b075258f5b9547d7b4464fff246bbce1:49:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0
s lvm_671bfaeb4e344014b041fc78115307bb:30:/dev/mapper/671bfaeb4e344014b041fc78115307bb-lvmlock:0
s lvm_4c149e7262a64c728997954257bece01:94:/dev/mapper/4c149e7262a64c728997954257bece01-lvmlock:0
s lvm_091c17d096894a179313aad65cb4127a:87:/dev/mapper/091c17d096894a179313aad65cb4127a-lvmlock:0

The VGLK being held for a long time is just a secondary effect of the real problem, which is that acquiring an LV lock is stuck in a retry loop in the low level sanlock code. sanlock keeps retrying because there's bad data in host_id 19's lease structure on disk. host_id 19 failed sometime before 2018-10-10 14:12, and if you could find and send any messages from that machine prior to it failing, that might help. (Also, if you'd like to upload a copy of the hidden "lvmlock" LV from this VG, I might try to pull more clues from that.)
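[Editorial note: the static host_id David refers to is set once per host in lvmlocal.conf. A minimal sketch follows; the value 49 simply mirrors the first lockspace line above, and any number unique among the hosts sharing the VGs would do:]

```
# /etc/lvm/lvmlocal.conf
local {
    # Must be unique across all hosts that share sanlock-protected VGs
    host_id = 49
}
```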
> r lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
> p 27919 ADD <------

sanlock is stuck repeatedly trying to acquire the lock for that LV (u3G3 / r3098). The debug info for acquiring that lock begins here:

> 2018-10-10 14:12:36 11494 [27916]: s4:r3098 resource lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH
> 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire begin e 0 0
> 2018-10-10 14:12:36 11494 [27916]: r3098 leader 2 owner 19 1 4854717 dblocks 8:29802:510:140245418403952:140245440585933:140245418403840:4:1,48:202556049:0:0:0:0:3:0,65:218746066:0:0:0:0:3:0,79:202874080:0:0:0:0:3:0,

That shows the LV is already locked by host_id 19 (generation 1).

> 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717 max mbal[65] 218746066 our_dblock 202556049 0 0 0 0 3
> 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717 host_status 19 2 10914 wait_start 11494
> 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire owner 19 1 4854717 delta 19 2 10995 mismatch

Those two messages show that host_id 19 is currently alive with generation 2. Since the lock is owned by host_id 19 generation 1, it means that host_id 19 previously failed while holding a lock on this LV, and has since restarted and rejoined the lockspace (increasing the generation number to 2 upon rejoining). Since the owner of the lock is now stale, this host should be able to acquire it immediately, but it doesn't...
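[Editorial note: the staleness rule David describes can be sketched as below. This is a hypothetical simplification with invented names; the real logic lives in sanlock's paxos lease code. A lock owner is recorded as (host_id, generation), and a host's delta lease carries its current generation, which increments each time the host rejoins the lockspace after a failure.]

```python
def owner_is_stale(owner_host_id, owner_gen, live_generations):
    """Decide whether a recorded lock owner is stale.

    live_generations maps host_id -> generation currently seen in that
    host's delta lease.
    """
    live_gen = live_generations.get(owner_host_id)
    if live_gen is None:
        return True                # owner's host is not alive in the lockspace
    return live_gen > owner_gen    # host restarted since taking the lock

# Host 19 took the LV lock at generation 1, failed, and rejoined at
# generation 2 -- so its old lock should be treated as stale and taken over:
print(owner_is_stale(19, 1, {19: 2}))  # True
print(owner_is_stale(19, 2, {19: 2}))  # False: same incarnation still holds it
```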
> 2018-10-10 14:12:36 11494 [27916]: r3098 paxos_acquire leader 2 owner 19 1 4854717
> 2018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 phase1 write mbal 218748049
> 2018-10-10 14:12:36 11494 [27916]: r3098 ballot 3 abort1 larger lver in bk[18] 29802:510:140245418403952:140245440585933:140245418403840:4 our dblock 218748049:0:0:0:0:3

While going through the steps to acquire this lock, it finds invalid dblock data on disk from host_id 19. There's no checksum error for the invalid data, so I wonder if it could be bad data in the buffer which 19 wrote before failing. This host is looking at the invalid lease version from the dblock, which is causing it to retry (the correct thing when it legitimately sees a larger lease version). The checksum should normally prevent it from using the invalid data.

> Besides, I found there are some hosts has different sanlock io timeout
> than others, is this the root cause?

Some hosts could still be using the previous 10 sec timeout if they have not restarted the lockspace (or a particular delta lease has not been reacquired using the newer timeout). I don't think it would affect this problem. ^ permalink raw reply [flat|nested] 9+ messages in thread
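[Editorial note: the checksum guard David mentions can be illustrated with a toy round-trip like the one below. This is NOT sanlock's real on-disk dblock format or CRC; it only shows why a reader that verifies a checksum would normally discard garbage values, such as a bogus huge lease version, instead of balloting on them.]

```python
import zlib

def pack_dblock(mbal, bal, inp, lver):
    """Serialize ballot state with a trailing CRC32 checksum (toy format)."""
    payload = f"{mbal}:{bal}:{inp}:{lver}".encode()
    return payload + b"|" + str(zlib.crc32(payload)).encode()

def read_dblock(raw):
    """Return the ballot tuple, or None if the checksum does not match."""
    payload, _, cksum = raw.rpartition(b"|")
    if zlib.crc32(payload) != int(cksum):
        return None  # corrupt block: ignore it rather than ballot on it
    return tuple(int(x) for x in payload.split(b":"))

good = pack_dblock(218748049, 0, 0, 3)
assert read_dblock(good) == (218748049, 0, 0, 3)

corrupt = b"9" + good[1:]          # flip a payload byte, keep the old checksum
assert read_dblock(corrupt) is None
```

The puzzle in this thread is that the bad data passed the checksum, which is what pointed David toward a writer-side bug rather than on-disk corruption.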
* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." 2018-10-10 19:08 ` David Teigland @ 2018-10-11 13:03 ` Damon Wang 2018-10-11 15:55 ` David Teigland 0 siblings, 1 reply; 9+ messages in thread From: Damon Wang @ 2018-10-11 13:03 UTC (permalink / raw) To: David Teigland; +Cc: LVM general discussion and development Hi,

1. About host ID

This is because I regenerate a host_id when a host joins a new lockspace -- I found that a host_id needs to be unique only within each lockspace, not across all lockspaces. It would be natural for a host to keep a single host_id: because the global lock exists, the host_id in the global-lock lockspace must be unique across all hosts, and that value could simply be reused in every lockspace. But consider this situation: three hosts a, b, c and three storages 1, 2, 3, where each host attaches only two storages, in a possible combination a(1,2), b(2,3), c(1,3). None of these storages is visible to all three hosts, so none of them is a proper place to hold the global lock! So I gave up on configuring the global lock and a global host_id; I only repair the global lock when I need it (adding a VG, PV, etc.).

2. About host 19

I found host 19 had indeed held the lease since 2018-10-09 20:49:15:

daemon 091c17d0-648eb28c-HLD-1-3-S07
p -1 helper
p -1 listener
p 2235 lvmlockd
p 2235 lvmlockd
p 2235 lvmlockd
p 2235 lvmlockd
p -1 status
s lvm_b075258f5b9547d7b4464fff246bbce1:19:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:0

2018-10-09 20:49:15 4854716 [29802]: s4:r2320 resource lvm_b075258f5b9547d7b4464fff246bbce1:u3G3P3-5Ert-CPSB-TxjI-dREz-GB77-AefhQD:/dev/mapper/b075258f5b9547d7b4464fff246bbce1-lvmlock:111149056:SH for 5,14,29715
2018-10-09 20:49:15 4854716 [29802]: r2320 paxos_acquire begin e 0 0
2018-10-09 20:49:15 4854716 [29802]: r2320 leader 1 owner 54 2 0 dblocks 53:54:54:54:2:4755629:1:1,
2018-10-09 20:49:15 4854716 [29802]: r2320 paxos_acquire leader 1 owner 54 2 0 max mbal[53] 54 our_dblock 0 0 0 0 0 0
2018-10-09 20:49:15 4854716 [29802]: r2320 paxos_acquire leader 1 free
2018-10-09 20:49:15 4854716 [29802]: r2320 ballot 2 phase1 write mbal 2019
2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 mode[53] shared 1 gen 2
2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 phase1 read 18:2019:0:0:0:0:2:0,
2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 phase2 write bal 2019 inp 19 1 4854717 q_max -1
2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 abort2 larger mbal in bk[79] 4080:0:0:0:0:2 our dblock 2019:2019:19:1:4854717:2
2018-10-09 20:49:15 4854717 [29802]: r2320 ballot 2 phase2 read 18:2019:2019:19:1:4854717:2:0,79:4080:0:0:0:0:2:0,
2018-10-09 20:49:15 4854717 [29802]: r2320 paxos_acquire 2 retry delay 724895 us
2018-10-09 20:49:16 4854717 [29802]: r2320 paxos_acquire leader 2 owner 19 1 4854717
2018-10-09 20:49:16 4854717 [29802]: r2320 paxos_acquire 2 owner is our inp 19 1 4854717 commited by 80
2018-10-09 20:49:16 4854717 [29802]: r2320 acquire_disk rv 1 lver 2 at 4854717
2018-10-09 20:49:16 4854717 [29802]: r2320 write_host_block host_id 19 flags 1 gen 1 dblock 29802:510:140245418403952:140245440585933:140245418403840:4:RELEASED.
2018-10-09 20:49:16 4854717 [29802]: r2320 paxos_release leader 2 owner 19 1 4854717
2018-10-09 20:49:16 4854717 [29802]: r2320 paxos_release skip write last lver 2 owner 19 1 4854717 writer 80 1 4854737 disk lver 2 owner 19 1 4854717 writer 80 1 4854737

Is the "paxos_release skip write last lver" abnormal?

3. Others

I set the lvmlock LV size to 1GB; it may be too large to upload and analyse, but I can upload it to S3 if we have no other clues. Because of the multipath queue_if_no_path problem, it is difficult to kill the processes using an LV, so I may clear the lockspace directly without killing them -- is that related to this problem? I'm also wondering why the host generation changed on host 19: does clearing the lockspace and rejoining, or rebooting the host, cause this?

Thanks,
Damon ^ permalink raw reply [flat|nested] 9+ messages in thread
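[Editorial note: Damon's storage-attachment argument above -- that with a(1,2), b(2,3), c(1,3) no single storage is visible to every host, so none can carry the global lock -- can be checked mechanically. The data layout below is just his hypothetical example:]

```python
# Hosts mapped to the storages they can see, per the example above.
hosts = {"a": {1, 2}, "b": {2, 3}, "c": {1, 3}}

# A storage can hold the global lock only if every host can reach it,
# i.e. it appears in the intersection of all attachment sets.
candidates = set.intersection(*hosts.values())
print(candidates)  # set() -> no storage is eligible for the global lock
```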
* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." 2018-10-11 13:03 ` Damon Wang @ 2018-10-11 15:55 ` David Teigland 2018-10-11 18:58 ` David Teigland 0 siblings, 1 reply; 9+ messages in thread From: David Teigland @ 2018-10-11 15:55 UTC (permalink / raw) To: Damon Wang; +Cc: linux-lvm On Thu, Oct 11, 2018 at 09:03:01PM +0800, Damon Wang wrote: > This is because I regenerate a host id when the host join a new > lockspace OK, I guess there's something about your setup that makes it difficult to assign static host_id's. > 2018-10-09 20:49:16 4854717 [29802]: r2320 write_host_block host_id > 19 flags 1 gen 1 dblock 29802:510:140245418403952:140245440585933:140245418403840:4:RELEASED. Perfect, that shows exactly where the bug is. There's a case where it has missed saving the latest dblock values, so random values are being copied back to the dblock instead. I'll push out a patch once I've had a chance to test it. > does the "paxos_release skip write last lver" is abnormal? Not very unusual when you have lock contention (multiple hosts trying to acquire the same lock concurrently.) Thanks, Dave ^ permalink raw reply [flat|nested] 9+ messages in thread
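[Editorial note: the failure mode David identifies -- a code path that missed saving the latest dblock values, so leftover buffer contents were written back on release -- can be caricatured as below. This is hypothetical code with invented names, not sanlock's implementation; it only illustrates the invariant the fix restores: release must write back only values actually committed during the ballot.]

```python
class LeaseState:
    def __init__(self):
        self.saved_dblock = None                 # last values committed to disk
        # Scratch buffer holding stale junk (e.g. old pointer-sized values,
        # which other hosts would later misread as a huge lease version):
        self.buffer = (140245418403952, 140245440585933)

    def ballot_write(self, dblock):
        self.buffer = dblock
        self.saved_dblock = dblock               # the fix: always save what we wrote

    def release_dblock(self):
        # Never fall back to the raw buffer; zeros are safe for readers.
        return self.saved_dblock if self.saved_dblock is not None else (0, 0)

lease = LeaseState()
print(lease.release_dblock())  # (0, 0): nothing committed yet, write zeros
lease.ballot_write((2019, 2019))
print(lease.release_dblock())  # (2019, 2019): the values actually written
```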
* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." 2018-10-11 15:55 ` David Teigland @ 2018-10-11 18:58 ` David Teigland 2018-10-12 8:58 ` Damon Wang 0 siblings, 1 reply; 9+ messages in thread From: David Teigland @ 2018-10-11 18:58 UTC (permalink / raw) To: Damon Wang; +Cc: linux-lvm > > 2018-10-09 20:49:16 4854717 [29802]: r2320 write_host_block host_id > > 19 flags 1 gen 1 dblock 29802:510:140245418403952:140245440585933:140245418403840:4:RELEASED. > > Perfect, that shows exactly where the bug is. There's a case where it has > missed saving the latest dblock values, so random values are being copied > back to the dblock instead. I'll push out a patch once I've had a chance > to test it. Here's the fix: https://www.pagure.io/sanlock/c/1471e241a33abc1f0774ce7ec418a134b655f9ac?branch=master Thanks for the help finding that. Dave ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." 2018-10-11 18:58 ` David Teigland @ 2018-10-12 8:58 ` Damon Wang 2018-10-12 14:21 ` David Teigland 0 siblings, 1 reply; 9+ messages in thread From: Damon Wang @ 2018-10-12 8:58 UTC (permalink / raw) To: David Teigland; +Cc: LVM general discussion and development Hi,

Thanks for your mail. I read the patch and have some questions:

1. Is this bug easy to reproduce (it seems to require a lease owner committed by a host that is not the lease leader)?

2. I see the bug introduced by commit 6501351b742 was released with 3.6.0. If I want to fix it, which is recommended: building from master or downgrading to 3.5.0?

Thanks again,
Damon

On Fri, Oct 12, 2018 at 2:58 AM David Teigland <teigland@redhat.com> wrote:
>
> > > 2018-10-09 20:49:16 4854717 [29802]: r2320 write_host_block host_id
> > > 19 flags 1 gen 1 dblock 29802:510:140245418403952:140245440585933:140245418403840:4:RELEASED.
> >
> > Perfect, that shows exactly where the bug is. There's a case where it has
> > missed saving the latest dblock values, so random values are being copied
> > back to the dblock instead. I'll push out a patch once I've had a chance
> > to test it.
>
> Here's the fix:
> https://www.pagure.io/sanlock/c/1471e241a33abc1f0774ce7ec418a134b655f9ac?branch=master
>
> Thanks for the help finding that.
> Dave ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [linux-lvm] [lvmlockd] lvm command hung with sanlock log "ballot 3 abort1 larger lver in bk..." 2018-10-12 8:58 ` Damon Wang @ 2018-10-12 14:21 ` David Teigland 0 siblings, 0 replies; 9+ messages in thread From: David Teigland @ 2018-10-12 14:21 UTC (permalink / raw) To: Damon Wang; +Cc: LVM general discussion and development On Fri, Oct 12, 2018 at 04:58:45PM +0800, Damon Wang wrote: > 1. Is this bug easy to repeat (seems need lease owner committed by a > host is not the lease leader)? It will happen occasionally, depending entirely on timing, when there are commands running concurrently on different hosts that need the same lock. > 2. I find the commit 6501351b742 introduced bug is released with > 3.6.0, if I want to fix it, which is recommend, make a master build or > downgrade to 3.5.0? I suggest applying the patch onto 3.6.0. Dave ^ permalink raw reply [flat|nested] 9+ messages in thread