* [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
@ 2018-09-25 10:18 Damon Wang
  2018-09-25 16:44 ` David Teigland
  0 siblings, 1 reply; 7+ messages in thread

From: Damon Wang @ 2018-09-25 10:18 UTC (permalink / raw)
To: LVM general discussion and development

Hi,

AFAIK, once sanlock cannot access the lease storage, it sends "kill_vg"
to lvmlockd, and the standard procedure is then to deactivate the
logical volumes and drop the VG locks.

But sometimes the storage recovers after kill_vg (and before we
deactivate or drop the lock), and then lvm commands print "storage
failed for sanlock leases", like this:

[root@dev1-2 ~]# vgck 71b1110c97bd48aaa25366e2dc11f65f
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  VG 71b1110c97bd48aaa25366e2dc11f65f lock skipped: storage failed for sanlock leases
  Reading VG 71b1110c97bd48aaa25366e2dc11f65f without a lock.

So what should I do to recover from this, preferably without affecting
volumes that are in use?

I found a way, but it seems very tricky: save the "lvmlockctl -i"
output, run "lvmlockctl -r vg", and then activate volumes according to
the saved output.

Do we have an "official" way to handle this? It is pretty common that
by the time I find lvmlockd has failed, the storage has already
recovered.

Thanks,
Damon Wang

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-25 10:18 [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg Damon Wang
@ 2018-09-25 16:44 ` David Teigland
  2018-09-27 14:12   ` Damon Wang
  0 siblings, 1 reply; 7+ messages in thread

From: David Teigland @ 2018-09-25 16:44 UTC (permalink / raw)
To: Damon Wang; +Cc: LVM general discussion and development

On Tue, Sep 25, 2018 at 06:18:53PM +0800, Damon Wang wrote:
> Hi,
>
> AFAIK once sanlock can not access lease storage, it will run
> "kill_vg" to lvmlockd, and the standard process should be deactivate
> logical volumes and drop vg locks.
>
> But sometimes the storage will recovery after kill_vg (and before we
> deactivate or drop lock), and then it will prints "storage failed for
> sanlock leases" on lvm commands like this:
>
> [root@dev1-2 ~]# vgck 71b1110c97bd48aaa25366e2dc11f65f
>   WARNING: Not using lvmetad because config setting use_lvmetad=0.
>   WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
>   VG 71b1110c97bd48aaa25366e2dc11f65f lock skipped: storage failed for sanlock leases
>   Reading VG 71b1110c97bd48aaa25366e2dc11f65f without a lock.
>
> so what should I do to recovery this, (better) without affect
> volumes in using?
>
> I find a way but it seems very tricky: save "lvmlockctl -i" output,
> run lvmlockctl -r vg and then activate volumes as the previous output.
>
> Do we have an "official" way to handle this? Since it is pretty
> common that when I find lvmlockd failed, the storage has already
> recovered.

Hi, to figure out that workaround, you've probably already read the
section of the lvmlockd man page, "sanlock lease storage failure", which
gives some background about what's happening and why. What the man page
is missing is some help with false failure detections like you're seeing.

It sounds like io delays from your storage are a little longer than
sanlock is allowing for.
With the default 10 sec io timeout, sanlock will initiate recovery
(kill_vg in lvmlockd) after 80 seconds of no successful io from the
storage. After this, it decides the storage has failed. If it's not
failed, just slow, then the proper way to handle that is to increase the
timeouts. (Or perhaps try to configure the storage to avoid such lengthy
delays.) Once a failure is detected and recovery has begun, there's no
official way to back out of it.

You can increase the sanlock io timeout with lvmlockd -o <seconds>.
sanlock multiplies that by 8 to get the total length of time before
starting recovery. I'd look at how long your temporary storage outages
last and set io_timeout so that 8*io_timeout will cover it.

Dave
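[Editor's note] The sizing rule above (recovery starts after 8 × io_timeout seconds without successful I/O; the default io_timeout is 10 s) can be sketched as a small helper for picking a value to pass to `lvmlockd -o`. This is only a sketch: the function name, the safety margin, and the round-up are illustrative assumptions; only the factor of 8 and the 10 s default come from the message.

```python
def choose_io_timeout(max_outage_secs, margin=1.5, default=10):
    """Return an io_timeout (for lvmlockd -o) whose 8x recovery window
    covers the longest observed storage outage, with a safety margin.
    The margin and rounding are arbitrary choices, not sanlock defaults."""
    needed = max_outage_secs * margin / 8.0
    # Round up to whole seconds and never go below the sanlock default.
    return max(default, int(needed + 0.999))

# With the default 10 s timeout, recovery begins after 80 s of failed io,
# so a 30 s path flap is already covered; a 2-minute outage is not.
assert choose_io_timeout(30) == 10
assert choose_io_timeout(120) == 23          # 23 * 8 = 184 s window
assert choose_io_timeout(120) * 8 >= 120 * 1.5
```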
* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-25 16:44 ` David Teigland
@ 2018-09-27 14:12   ` Damon Wang
  2018-09-27 17:35     ` David Teigland
  0 siblings, 1 reply; 7+ messages in thread

From: Damon Wang @ 2018-09-27 14:12 UTC (permalink / raw)
To: David Teigland; +Cc: LVM general discussion and development

Thank you for your reply. I have another question under such circumstances.

I usually run "vgck" to check whether a VG is good, but sometimes it
seems to get stuck and leaves a VGLK on sanlock. (I'm sure io errors
can cause this, but sometimes it happens without any io error.)
Then I try "sanlock client release -r xxx" to release it, but that also
sometimes does not work (it gets stuck).
Then I may run "lvmlockctl -r" to drop the VG lockspace, but that can
still get stuck, and io is fine while it is stuck.

This usually happens on multipath storage. I suspect multipath queueing
io is to blame, but I'm not sure.

Any idea?

Thanks for your reply again,
Damon

On Wed, Sep 26, 2018 at 12:44 AM David Teigland <teigland@redhat.com> wrote:
>
> On Tue, Sep 25, 2018 at 06:18:53PM +0800, Damon Wang wrote:
> > Hi,
> >
> > AFAIK once sanlock can not access lease storage, it will run
> > "kill_vg" to lvmlockd, and the standard process should be deactivate
> > logical volumes and drop vg locks.
> >
> > But sometimes the storage will recovery after kill_vg (and before we
> > deactivate or drop lock), and then it will prints "storage failed for
> > sanlock leases" on lvm commands like this:
> >
> > [root@dev1-2 ~]# vgck 71b1110c97bd48aaa25366e2dc11f65f
> >   WARNING: Not using lvmetad because config setting use_lvmetad=0.
> >   WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
> >   VG 71b1110c97bd48aaa25366e2dc11f65f lock skipped: storage failed for sanlock leases
> >   Reading VG 71b1110c97bd48aaa25366e2dc11f65f without a lock.
> >
> > so what should I do to recovery this, (better) without affect
> > volumes in using?
> >
> > I find a way but it seems very tricky: save "lvmlockctl -i" output,
> > run lvmlockctl -r vg and then activate volumes as the previous output.
> >
> > Do we have an "official" way to handle this? Since it is pretty
> > common that when I find lvmlockd failed, the storage has already
> > recovered.
>
> Hi, to figure out that workaround, you've probably already read the
> section of the lvmlockd man page: "sanlock lease storage failure", which
> gives some background about what's happening and why. What the man page
> is missing is some help about false failure detections like you're seeing.
>
> It sounds like io delays from your storage are a little longer than
> sanlock is allowing for. With the default 10 sec io timeout, sanlock will
> initiate recovery (kill_vg in lvmlockd) after 80 seconds of no successful
> io from the storage. After this, it decides the storage has failed. If
> it's not failed, just slow, then the proper way to handle that is to
> increase the timeouts. (Or perhaps try to configure the storage to avoid
> such lengthy delays.) Once a failure is detected and recovery is begun,
> there's not an official way to back out of it.
>
> You can increase the sanlock io timeout with lvmlockd -o <seconds>.
> sanlock multiplies that by 8 to get the total length of time before
> starting recovery. I'd look at how long your temporary storage outages
> last and set io_timeout so that 8*io_timeout will cover it.
>
> Dave
* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-27 14:12   ` Damon Wang
@ 2018-09-27 17:35     ` David Teigland
  2018-09-28  3:14       ` Damon Wang
  0 siblings, 1 reply; 7+ messages in thread

From: David Teigland @ 2018-09-27 17:35 UTC (permalink / raw)
To: Damon Wang; +Cc: LVM general discussion and development

On Thu, Sep 27, 2018 at 10:12:44PM +0800, Damon Wang wrote:
> Thank you for your reply, I have another question under such circumstances.
>
> I usually run "vgck" to check whether the vg is good, but sometimes it
> seems to get stuck, and leaves a VGLK on sanlock. (I'm sure io error will
> cause it, but sometimes not because of io error)
> Then I'll try sanlock client release -r xxx to release it, but it
> also sometimes does not work (gets stuck).
> Then I may lvmlockctl -r to drop the vg lockspace, but it still may get
> stuck, and io is ok when it is stuck
>
> This usually happens on multipath storage, I suspect multipath queueing
> io is to blame, but not sure.
>
> Any idea?

First, you might be able to avoid this issue by doing the check using
something other than an lvm command, or perhaps an lvm command configured
to avoid taking locks (the --nolocking option in vgs/pvs/lvs). What's
appropriate depends on specifically what you want to know from the check.

I still haven't fixed the issue you found earlier, which sounds like it
could be the same as, or related to, what you're describing now.

https://www.redhat.com/archives/linux-lvm/2018-July/msg00011.html

As for manually cleaning up a stray lock using sanlock client, there may
be some limits on the situations that works in; I don't recall off hand.
You should try using the -p <pid> option with client release to match the
pid of lvmlockd.

Configuring multipath to fail more quickly instead of queueing might give
you a better chance of cleaning things up.

Dave
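[Editor's note] A minimal sketch of the "fail more quickly instead of queueing" change Dave suggests, assuming a device-mapper-multipath setup. Whether this belongs in the `defaults` section or a per-array `devices` section, and whether an outright `fail` or a bounded retry count is appropriate, depends on your storage; treat this as an illustration, not a recommendation for your array.

```
# /etc/multipath.conf fragment -- return io errors promptly when all
# paths are down, instead of queueing io indefinitely (queue_if_no_path).
defaults {
    # "fail" disables queueing; a number N would instead queue for
    # roughly N * polling_interval seconds before failing io.
    no_path_retry fail
}
```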
* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-27 17:35     ` David Teigland
@ 2018-09-28  3:14       ` Damon Wang
  2018-09-28 14:32         ` David Teigland
  0 siblings, 1 reply; 7+ messages in thread

From: Damon Wang @ 2018-09-28 3:14 UTC (permalink / raw)
To: David Teigland; +Cc: LVM general discussion and development

On Fri, Sep 28, 2018 at 1:35 AM David Teigland <teigland@redhat.com> wrote:
>
> On Thu, Sep 27, 2018 at 10:12:44PM +0800, Damon Wang wrote:
> > Thank you for your reply, I have another question under such circumstances.
> >
> > I usually run "vgck" to check whether the vg is good, but sometimes it
> > seems to get stuck, and leaves a VGLK on sanlock. (I'm sure io error will
> > cause it, but sometimes not because of io error)
> > Then I'll try sanlock client release -r xxx to release it, but it
> > also sometimes does not work (gets stuck).
> > Then I may lvmlockctl -r to drop the vg lockspace, but it still may get
> > stuck, and io is ok when it is stuck
> >
> > This usually happens on multipath storage, I suspect multipath queueing
> > io is to blame, but not sure.
> >
> > Any idea?
>
> First, you might be able to avoid this issue by doing the check using
> something other than an lvm command, or perhaps an lvm command configured
> to avoid taking locks (the --nolocking option in vgs/pvs/lvs). What's
> appropriate depends on specifically what you want to know from the check.
>

This is how I use sanlock and lvmlockd:

+-----------+       +------------+       +--------------+
|           |       |            |       |              |
|  sanlock  <-------> lvmlockd   <-------+ lvm commands |
|           |       |            |       |              |
+-----+-----+       +------------+       +--------------+
      |
      |
+-----v------+      +-------------+      +--------+
|            |      |             |      |        |
| multipath  <- - - + lvm volumes <------+  qemu  |
|            |      |             |      |        |
+-----+------+      +-------------+      +--------+
      |
+-----v-------+
|             |
| san storage |
|             |
+-------------+

As I mentioned in my first mail, sometimes I find lvm commands failing
with "sanlock lease storage failure". I guess this is because lvmlockd
kill_vg has been triggered; as the manual says, I should deactivate
volumes and drop the lockspace as quickly as possible, but I can't get
a proper alert in a programmatic way.

The TTY gets a message, but that's not a good way to listen or monitor,
so I run vgck periodically and parse its stdout and stderr. Once
"sanlock lease storage failure" or something else unusual appears, an
alert is triggered and I do some checks (I hope this whole process can
be automated).

If I don't require a lock (pvs/lvs/vgs --nolocking), these errors won't
be noticed, and since lots of san storage configures multipath to queue
io as long as possible (multipath -t | grep queue_if_no_path), getting
lvm errors early is pretty difficult. After trying various approaches,
running vgck and parsing its output is the way with the least load (it
only takes a shared VGLK) and the best efficiency (it usually takes
less than 0.1 s).

As you suggested, I'll extend the io timeout to avoid storage jitter,
and I believe that also resolves some of the problems caused by
multipath queueing io.

> I still haven't fixed the issue you found earlier, which sounds like it
> could be the same or related to what you're describing now.
> https://www.redhat.com/archives/linux-lvm/2018-July/msg00011.html
>
> As for manually cleaning up a stray lock using sanlock client, there may
> be some limits on the situations that works in, I don't recall off hand.
> You should try using the -p <pid> option with client release to match the
> pid of lvmlockd.

Yes, I added -p to release the lock. I want to write up a set of
"Emergency Procedures" for dealing with the different kinds of storage
failure; for me it's still unclear now. I'll do more experiments after
fixing these annoying storage failures, then write that summary.

> Configuring multipath to fail more quickly instead of queueing might give
> you a better chance of cleaning things up.
>
> Dave

Yeah, I believe multipath queueing io is to blame. I'm negotiating with
the storage vendor, since they think the multipath config is right :-(

Thank you for your patience!

Damon
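[Editor's note] The periodic vgck-and-parse monitoring described above can be sketched roughly as follows. This is a sketch only: the function name and pattern list are hypothetical, with the failure strings taken from the vgck output quoted earlier in the thread.

```python
import re

# Failure indicators taken from the vgck output quoted earlier in this
# thread; extend the list with any other message you consider alert-worthy.
FAILURE_PATTERNS = [
    r"storage failed for sanlock leases",
    r"lock skipped",
]

def vgck_alerts(stdout, stderr):
    """Return the failure patterns found in the combined vgck output.

    An empty list means the check looked healthy; a non-empty list is
    the condition on which an external alert would be raised.
    """
    text = stdout + "\n" + stderr
    return [p for p in FAILURE_PATTERNS if re.search(p, text)]
```

A caller would run vgck via its scheduler of choice, feed the captured streams into `vgck_alerts`, and page on a non-empty result.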
* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-28  3:14       ` Damon Wang
@ 2018-09-28 14:32         ` David Teigland
  2018-09-28 18:13           ` Damon Wang
  0 siblings, 1 reply; 7+ messages in thread

From: David Teigland @ 2018-09-28 14:32 UTC (permalink / raw)
To: Damon Wang; +Cc: LVM general discussion and development

On Fri, Sep 28, 2018 at 11:14:35AM +0800, Damon Wang wrote:

> as the manual says, it should deactivate volumes and drop lockspace as
> quick as possible,

A year ago we discussed a more automated solution for forcing a VG offline
when its sanlock lockspace was shut down:

https://www.redhat.com/archives/lvm-devel/2017-September/msg00011.html

The idea was to forcibly shut down LVs (using dmsetup wipe_table) in the
VG when the kill_vg happened, then to automatically do the 'lvmlockctl
--drop' when the LVs were safely shut down. There were some loose ends
around integrating this solution that I never sorted out, so it's remained
on my todo list.

> TTY can get a message, but it's not a good way to listen or monitor, so I
> run vgck periodically and parse its stdout and stderr, once "sanlock lease
> storage failure" or something unusual happens, an alert will be triggered
> and I'll do some check (I hope all this process can be automated).

If you are specifically interested in detecting this lease timeout
condition, there are some sanlock commands that can give you this info.
You can also detect ahead of time that a VG's lockspace is getting close
to the threshold. I can get back to you with more specific fields to look
at, but for now take a look at 'sanlock client renewal' and some of the
internal details that are printed by 'sanlock client status -D'.

Dave
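[Editor's note] The "detect ahead of time" idea above amounts to a threshold check on lease renewal age. A sketch under assumptions: the renewal timestamps would come from 'sanlock client renewal' output (whose exact format is not reproduced here), the function name and warn fraction are arbitrary, and only the 8 × io_timeout recovery window comes from the thread.

```python
def renewal_warning(last_renewal_ts, now_ts, io_timeout=10, warn_fraction=0.5):
    """True when the time since the last successful lease renewal has
    crossed warn_fraction of sanlock's 8 * io_timeout recovery threshold,
    i.e. the lockspace is drifting toward a kill_vg."""
    threshold = 8 * io_timeout            # seconds until sanlock starts recovery
    elapsed = now_ts - last_renewal_ts
    return elapsed >= warn_fraction * threshold

# With the default 10 s io timeout, recovery starts at 80 s, so warning
# at half the window fires once 40 s have passed without a renewal.
assert renewal_warning(0, 40) is True
assert renewal_warning(0, 39) is False
```

Feeding this from real timestamps still requires parsing the sanlock output, which is left out here.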
* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-28 14:32         ` David Teigland
@ 2018-09-28 18:13           ` Damon Wang
  0 siblings, 0 replies; 7+ messages in thread

From: Damon Wang @ 2018-09-28 18:13 UTC (permalink / raw)
To: David Teigland; +Cc: LVM general discussion and development

That really helps. I had noticed that "sanlock client renewal" returns
practical information, but had not noticed "sanlock client status -D";
the latter is exactly what I want, thanks!

Damon

On Fri, Sep 28, 2018 at 10:32 PM David Teigland <teigland@redhat.com> wrote:
>
> On Fri, Sep 28, 2018 at 11:14:35AM +0800, Damon Wang wrote:
>
> > as the manual says, it should deactivate volumes and drop lockspace as
> > quick as possible,
>
> A year ago we discussed a more automated solution for forcing a VG offline
> when its sanlock lockspace was shut down:
>
> https://www.redhat.com/archives/lvm-devel/2017-September/msg00011.html
>
> The idea was to forcibly shut down LVs (using dmsetup wipe_table) in the
> VG when the kill_vg happened, then to automatically do the 'lvmlockctl
> --drop' when the LVs were safely shut down. There were some loose ends
> around integrating this solution that I never sorted out, so it's remained
> on my todo list.
>
> > TTY can get a message, but it's not a good way to listen or monitor, so I
> > run vgck periodically and parse its stdout and stderr, once "sanlock lease
> > storage failure" or something unusual happens, an alert will be triggered
> > and I'll do some check (I hope all this process can be automated).
>
> If you are specifically interested in detecting this lease timeout
> condition, there are some sanlock commands that can give you this info.
> You can also detect ahead of time that a VG's lockspace is getting close
> to the threshold. I can get back to you with more specific fields to look
> at, but for now take a look at 'sanlock client renewal' and some of the
> internal details that are printed by 'sanlock client status -D'.
>
> Dave
end of thread, other threads: [~2018-09-28 18:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-25 10:18 [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg Damon Wang
2018-09-25 16:44 ` David Teigland
2018-09-27 14:12   ` Damon Wang
2018-09-27 17:35     ` David Teigland
2018-09-28  3:14       ` Damon Wang
2018-09-28 14:32         ` David Teigland
2018-09-28 18:13           ` Damon Wang