linux-lvm.redhat.com archive mirror
* [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
@ 2018-09-25 10:18 Damon Wang
  2018-09-25 16:44 ` David Teigland
  0 siblings, 1 reply; 7+ messages in thread
From: Damon Wang @ 2018-09-25 10:18 UTC (permalink / raw)
  To: LVM general discussion and development

Hi,

  AFAIK, once sanlock cannot access the lease storage, it sends
"kill_vg" to lvmlockd, and the standard process is to deactivate the
logical volumes and drop the VG locks.
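
  For reference, that response looks roughly like this on my hosts (the
VG name is the real one from the output below, the LV name is only a
placeholder, and the actual LV list comes from "lvmlockctl -i"):

    # deactivate every LV that is active in the VG
    lvchange -an 71b1110c97bd48aaa25366e2dc11f65f/lv1
    # then tell lvmlockd to drop the failed lockspace
    lvmlockctl --drop 71b1110c97bd48aaa25366e2dc11f65f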

  But sometimes the storage recovers after kill_vg (and before we
deactivate or drop the locks), and then lvm commands print "storage
failed for sanlock leases", like this:

[root@dev1-2 ~]# vgck 71b1110c97bd48aaa25366e2dc11f65f
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible
(pvscan --cache).
  VG 71b1110c97bd48aaa25366e2dc11f65f lock skipped: storage failed for
sanlock leases
  Reading VG 71b1110c97bd48aaa25366e2dc11f65f without a lock.

  So what should I do to recover from this, preferably without affecting
volumes that are in use?

  I found a way, but it seems very tricky: save the "lvmlockctl -i"
output, run "lvmlockctl -r <vg>", and then reactivate volumes according
to the saved output.
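
  Spelled out, the tricky workaround is roughly this (the VG name is
real, the LV name is a placeholder, and I'm assuming the lockspace has
to be started again before re-activating):

    lvmlockctl -i > /tmp/locks.txt                            # save which LVs hold locks
    lvmlockctl -r 71b1110c97bd48aaa25366e2dc11f65f            # drop the failed lockspace
    vgchange --lock-start 71b1110c97bd48aaa25366e2dc11f65f    # rejoin it once storage is back
    lvchange -ay 71b1110c97bd48aaa25366e2dc11f65f/lv1         # re-activate LVs listed in /tmp/locks.txt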

  Is there an "official" way to handle this? It is pretty common that by
the time I find that lvmlockd has failed, the storage has already
recovered.

Thanks,
Damon Wang

* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-25 10:18 [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg Damon Wang
@ 2018-09-25 16:44 ` David Teigland
  2018-09-27 14:12   ` Damon Wang
  0 siblings, 1 reply; 7+ messages in thread
From: David Teigland @ 2018-09-25 16:44 UTC (permalink / raw)
  To: Damon Wang; +Cc: LVM general discussion and development

On Tue, Sep 25, 2018 at 06:18:53PM +0800, Damon Wang wrote:
> Hi,
> 
>   AFAIK, once sanlock cannot access the lease storage, it sends
> "kill_vg" to lvmlockd, and the standard process is to deactivate the
> logical volumes and drop the VG locks.
> 
>   But sometimes the storage recovers after kill_vg (and before we
> deactivate or drop the locks), and then lvm commands print "storage
> failed for sanlock leases", like this:
> 
> [root@dev1-2 ~]# vgck 71b1110c97bd48aaa25366e2dc11f65f
>   WARNING: Not using lvmetad because config setting use_lvmetad=0.
>   WARNING: To avoid corruption, rescan devices to make changes visible
> (pvscan --cache).
>   VG 71b1110c97bd48aaa25366e2dc11f65f lock skipped: storage failed for
> sanlock leases
>   Reading VG 71b1110c97bd48aaa25366e2dc11f65f without a lock.
> 
>   So what should I do to recover from this, preferably without affecting
> volumes that are in use?
> 
>   I found a way, but it seems very tricky: save the "lvmlockctl -i"
> output, run "lvmlockctl -r <vg>", and then reactivate volumes according
> to the saved output.
> 
>   Is there an "official" way to handle this? It is pretty common that by
> the time I find that lvmlockd has failed, the storage has already
> recovered.

Hi, to figure out that workaround, you've probably already read the
section of the lvmlockd man page: "sanlock lease storage failure", which
gives some background about what's happening and why.  What the man page
is missing is some help about false failure detections like you're seeing.

It sounds like io delays from your storage are a little longer than
sanlock is allowing for.  With the default 10 sec io timeout, sanlock will
initiate recovery (kill_vg in lvmlockd) after 80 seconds of no successful
io from the storage.  After this, it decides the storage has failed.  If
it's not failed, just slow, then the proper way to handle that is to
increase the timeouts.  (Or perhaps try to configure the storage to avoid
such lengthy delays.)  Once a failure is detected and recovery is begun,
there's not an official way to back out of it.

You can increase the sanlock io timeout with lvmlockd -o <seconds>.
sanlock multiplies that by 8 to get the total length of time before
starting recovery.  I'd look at how long your temporary storage outages
last and set io_timeout so that 8*io_timeout will cover it.
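
For example, something along these lines (20 is only an illustration, and
where you pass the option depends on how lvmlockd is started on your
system):

  # start lvmlockd with a 20 second sanlock io timeout instead of the default 10
  lvmlockd -o 20
  # sanlock then allows 8 * 20 = 160 seconds of failed io before starting recovery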

Dave

* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-25 16:44 ` David Teigland
@ 2018-09-27 14:12   ` Damon Wang
  2018-09-27 17:35     ` David Teigland
  0 siblings, 1 reply; 7+ messages in thread
From: Damon Wang @ 2018-09-27 14:12 UTC (permalink / raw)
  To: David Teigland; +Cc: LVM general discussion and development

Thank you for your reply. I have another question about these circumstances.

I usually run "vgck" to check whether the VG is healthy, but sometimes it
seems to get stuck and leaves a VGLK held in sanlock. (I'm sure an io
error can cause this, but sometimes it happens without an io error.)
Then I'll try "sanlock client release -r xxx" to release it, but that
also sometimes does not work (it gets stuck).
Then I may run "lvmlockctl -r" to drop the VG lockspace, but that may
still get stuck, and io is fine while it is stuck.

This usually happens on multipath storage; I suspect multipath queueing
io is to blame, but I'm not sure.

Any idea?

Thanks again for your reply.

Damon
On Wed, Sep 26, 2018 at 12:44 AM David Teigland <teigland@redhat.com> wrote:
>
> On Tue, Sep 25, 2018 at 06:18:53PM +0800, Damon Wang wrote:
> > Hi,
> >
> >   AFAIK, once sanlock cannot access the lease storage, it sends
> > "kill_vg" to lvmlockd, and the standard process is to deactivate the
> > logical volumes and drop the VG locks.
> >
> >   But sometimes the storage recovers after kill_vg (and before we
> > deactivate or drop the locks), and then lvm commands print "storage
> > failed for sanlock leases", like this:
> >
> > [root@dev1-2 ~]# vgck 71b1110c97bd48aaa25366e2dc11f65f
> >   WARNING: Not using lvmetad because config setting use_lvmetad=0.
> >   WARNING: To avoid corruption, rescan devices to make changes visible
> > (pvscan --cache).
> >   VG 71b1110c97bd48aaa25366e2dc11f65f lock skipped: storage failed for
> > sanlock leases
> >   Reading VG 71b1110c97bd48aaa25366e2dc11f65f without a lock.
> >
> >   So what should I do to recover from this, preferably without affecting
> > volumes that are in use?
> >
> >   I found a way, but it seems very tricky: save the "lvmlockctl -i"
> > output, run "lvmlockctl -r <vg>", and then reactivate volumes according
> > to the saved output.
> >
> >   Is there an "official" way to handle this? It is pretty common that by
> > the time I find that lvmlockd has failed, the storage has already
> > recovered.
>
> Hi, to figure out that workaround, you've probably already read the
> section of the lvmlockd man page: "sanlock lease storage failure", which
> gives some background about what's happening and why.  What the man page
> is missing is some help about false failure detections like you're seeing.
>
> It sounds like io delays from your storage are a little longer than
> sanlock is allowing for.  With the default 10 sec io timeout, sanlock will
> initiate recovery (kill_vg in lvmlockd) after 80 seconds of no successful
> io from the storage.  After this, it decides the storage has failed.  If
> it's not failed, just slow, then the proper way to handle that is to
> increase the timeouts.  (Or perhaps try to configure the storage to avoid
> such lengthy delays.)  Once a failure is detected and recovery is begun,
> there's not an official way to back out of it.
>
> You can increase the sanlock io timeout with lvmlockd -o <seconds>.
> sanlock multiplies that by 8 to get the total length of time before
> starting recovery.  I'd look at how long your temporary storage outages
> last and set io_timeout so that 8*io_timeout will cover it.
>
> Dave

* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-27 14:12   ` Damon Wang
@ 2018-09-27 17:35     ` David Teigland
  2018-09-28  3:14       ` Damon Wang
  0 siblings, 1 reply; 7+ messages in thread
From: David Teigland @ 2018-09-27 17:35 UTC (permalink / raw)
  To: Damon Wang; +Cc: LVM general discussion and development

On Thu, Sep 27, 2018 at 10:12:44PM +0800, Damon Wang wrote:
> Thank you for your reply. I have another question about these circumstances.
> 
> I usually run "vgck" to check whether the VG is healthy, but sometimes it
> seems to get stuck and leaves a VGLK held in sanlock. (I'm sure an io
> error can cause this, but sometimes it happens without an io error.)
> Then I'll try "sanlock client release -r xxx" to release it, but that
> also sometimes does not work (it gets stuck).
> Then I may run "lvmlockctl -r" to drop the VG lockspace, but that may
> still get stuck, and io is fine while it is stuck.
> 
> This usually happens on multipath storage; I suspect multipath queueing
> io is to blame, but I'm not sure.
> 
> Any idea?

First, you might be able to avoid this issue by doing the check using
something other than an lvm command, or perhaps an lvm command configured
to avoid taking locks (the --nolocking option in vgs/pvs/lvs).  What's
appropriate depends on specifically what you want to know from the check.
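
For example, something like this reads the VG without touching the VGLK
(vgs is just one option; pvs and lvs work the same way):

  vgs --nolocking 71b1110c97bd48aaa25366e2dc11f65f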

I still haven't fixed the issue you found earlier, which sounds like it
could be the same or related to what you're describing now.
https://www.redhat.com/archives/linux-lvm/2018-July/msg00011.html

As for manually cleaning up a stray lock using sanlock client, there may
be some limits on the situations it works in; I don't recall offhand.
You should try using the -p <pid> option with client release to match the
pid of lvmlockd.
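
Roughly (the resource string is whatever 'sanlock client status' prints
for the stuck VGLK; the one below is only a placeholder):

  # find the pid that lvmlockd is running as
  pidof lvmlockd
  # release the stuck resource on behalf of that pid
  sanlock client release -r <lockspace>:VGLK:<leases_path>:<offset> -p <lvmlockd_pid>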

Configuring multipath to fail more quickly instead of queueing might give
you a better chance of cleaning things up.
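
For instance, something along these lines in /etc/multipath.conf (check
the exact values with your storage vendor first):

  defaults {
      # fail io after a bounded number of path checks instead of queueing forever;
      # "fail" would return errors immediately once all paths are down
      no_path_retry 5
  }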

Dave

* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-27 17:35     ` David Teigland
@ 2018-09-28  3:14       ` Damon Wang
  2018-09-28 14:32         ` David Teigland
  0 siblings, 1 reply; 7+ messages in thread
From: Damon Wang @ 2018-09-28  3:14 UTC (permalink / raw)
  To: David Teigland; +Cc: LVM general discussion and development

On Fri, Sep 28, 2018 at 1:35 AM David Teigland <teigland@redhat.com> wrote:
>
> On Thu, Sep 27, 2018 at 10:12:44PM +0800, Damon Wang wrote:
> > Thank you for your reply. I have another question about these
> > circumstances.
> >
> > I usually run "vgck" to check whether the VG is healthy, but sometimes it
> > seems to get stuck and leaves a VGLK held in sanlock. (I'm sure an io
> > error can cause this, but sometimes it happens without an io error.)
> > Then I'll try "sanlock client release -r xxx" to release it, but that
> > also sometimes does not work (it gets stuck).
> > Then I may run "lvmlockctl -r" to drop the VG lockspace, but that may
> > still get stuck, and io is fine while it is stuck.
> >
> > This usually happens on multipath storage; I suspect multipath queueing
> > io is to blame, but I'm not sure.
> >
> > Any idea?
>
> First, you might be able to avoid this issue by doing the check using
> something other than an lvm command, or perhaps an lvm command configured
> to avoid taking locks (the --nolocking option in vgs/pvs/lvs).  What's
> appropriate depends on specifically what you want to know from the check.
>

This is how I use sanlock and lvmlockd:


 +------------------+       +------------------+       +------------------+
 |                  |       |                  |       |                  |
 |     sanlock      <------->     lvmlockd     <-------+   lvm commands   |
 |                  |       |                  |       |                  |
 +------------------+       +------------------+       +------------------+
       |
       |       +------------------+       +------------------+       +------------+
       |       |                  |       |                  |       |            |
       +------->     multipath    < - - - +   lvm volumes    <-------+    qemu    |
               |                  |       |                  |       |            |
               +------------------+       +------------------+       +------------+
                        |
                        |
               +------------------+
               |                  |
               |   san storage    |
               |                  |
               +------------------+

As I mentioned in my first mail, I sometimes find lvm commands failing
with "sanlock lease storage failure". I guess this is because lvmlockd
kill_vg has been triggered; as the manual says, I should deactivate
volumes and drop the lockspace as quickly as possible, but I cannot get
a proper alert programmatically.

The TTY gets a message, but that is not a good way to listen or monitor,
so I run vgck periodically and parse its stdout and stderr; once "sanlock
lease storage failure" or something else unusual appears, an alert is
triggered and I do some checks (I hope this whole process can be
automated).

If I do not require the lock (pvs/lvs/vgs --nolocking), these errors will
not be noticed. Since a lot of SAN storage setups configure multipath to
queue io for as long as possible (multipath -t | grep queue_if_no_path),
catching lvm errors early is pretty difficult; after trying various
approaches, running vgck and parsing its output gives the least load (it
only takes a shared VGLK) and the best efficiency (it usually takes less
than 0.1s).
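
The check itself is basically a tiny script like this (the mail command
at the end is just a placeholder for our real alerting):

    #!/bin/sh
    # periodic health check: run vgck and alert on sanlock lease errors
    VG=71b1110c97bd48aaa25366e2dc11f65f
    out=$(vgck "$VG" 2>&1)
    if echo "$out" | grep -q "storage failed for sanlock leases"; then
        echo "$out" | mail -s "lvmlockd lease failure on $VG" ops@example.com
    fi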

As you suggested, I'll extend the io timeout to ride out storage jitter,
and I believe that also resolves some of the problems caused by multipath
io queueing.


> I still haven't fixed the issue you found earlier, which sounds like it
> could be the same or related to what you're describing now.
> https://www.redhat.com/archives/linux-lvm/2018-July/msg00011.html
>
> As for manually cleaning up a stray lock using sanlock client, there may
> be some limits on the situations it works in; I don't recall offhand.
> You should try using the -p <pid> option with client release to match the
> pid of lvmlockd.


Yes, I added -p to release the lock. I also want to write up an "Emergency
Procedures" guide for dealing with the different storage failures; for me
it is still unclear right now. I'll do more experiments after fixing these
annoying storage failures, and then write that summary.


> Configuring multipath to fail more quickly instead of queueing might give
> you a better chance of cleaning things up.
>
> Dave
>

Yeah, I believe multipath io queueing is to blame. I'm negotiating with
the storage vendor, since they think the multipath config is right :-(


Thank you for your patience!

Damon

* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-28  3:14       ` Damon Wang
@ 2018-09-28 14:32         ` David Teigland
  2018-09-28 18:13           ` Damon Wang
  0 siblings, 1 reply; 7+ messages in thread
From: David Teigland @ 2018-09-28 14:32 UTC (permalink / raw)
  To: Damon Wang; +Cc: LVM general discussion and development

On Fri, Sep 28, 2018 at 11:14:35AM +0800, Damon Wang wrote:

> as the manual says, I should deactivate volumes and drop the lockspace
> as quickly as possible,

A year ago we discussed a more automated solution for forcing a VG offline
when its sanlock lockspace was shut down:

https://www.redhat.com/archives/lvm-devel/2017-September/msg00011.html

The idea was to forcibly shut down LVs (using dmsetup wipe_table) in the
VG when the kill_vg happened, then to automatically do the 'lvmlockctl
--drop' when the LVs were safely shut down.  There were some loose ends
around integrating this solution that I never sorted out, so it's remained
on my todo list.
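
Manually, the same idea looks roughly like this (device names are
illustrative; the dm names are the VG-LV form shown by 'dmsetup ls'):

  # point each active LV at the error target so its io fails instead of hanging
  dmsetup wipe_table vgname-lvname
  # once every LV in the VG has been disabled this way, drop the lockspace
  lvmlockctl --drop vgname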

> The TTY gets a message, but that is not a good way to listen or monitor,
> so I run vgck periodically and parse its stdout and stderr; once "sanlock
> lease storage failure" or something else unusual appears, an alert is
> triggered and I do some checks (I hope this whole process can be
> automated).

If you are specifically interested in detecting this lease timeout
condition, there are some sanlock commands that can give you this info.
You can also detect ahead of time that a VG's lockspace is getting close
to the threshold.  I can get back to you with more specific fields to look
at, but for now take a look at 'sanlock client renewal' and some of the
internal details that are printed by 'sanlock client status -D'.
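
e.g. something like the following (the -s form and the lockspace naming
are from memory, so double check with sanlock(8); lvm lockspaces are
named lvm_<vgname>):

  # renewal history for the VG's lockspace
  sanlock client renewal -s lvm_71b1110c97bd48aaa25366e2dc11f65f
  # internal daemon state, including renewal/timeout details per lockspace
  sanlock client status -D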

Dave

* Re: [linux-lvm] [lvmlockd] recovery lvmlockd after kill_vg
  2018-09-28 14:32         ` David Teigland
@ 2018-09-28 18:13           ` Damon Wang
  0 siblings, 0 replies; 7+ messages in thread
From: Damon Wang @ 2018-09-28 18:13 UTC (permalink / raw)
  To: David Teigland; +Cc: LVM general discussion and development

That really helps. I had noticed that "sanlock client renewal" returns
practical information, but had not noticed "sanlock client status -D";
the latter is exactly what I want, thanks!

Damon

On Fri, Sep 28, 2018 at 10:32 PM David Teigland <teigland@redhat.com> wrote:
>
> On Fri, Sep 28, 2018 at 11:14:35AM +0800, Damon Wang wrote:
>
> > as the manual says, I should deactivate volumes and drop the lockspace
> > as quickly as possible,
>
> A year ago we discussed a more automated solution for forcing a VG offline
> when its sanlock lockspace was shut down:
>
> https://www.redhat.com/archives/lvm-devel/2017-September/msg00011.html
>
> The idea was to forcibly shut down LVs (using dmsetup wipe_table) in the
> VG when the kill_vg happened, then to automatically do the 'lvmlockctl
> --drop' when the LVs were safely shut down.  There were some loose ends
> around integrating this solution that I never sorted out, so it's remained
> on my todo list.
>
> > The TTY gets a message, but that is not a good way to listen or monitor,
> > so I run vgck periodically and parse its stdout and stderr; once "sanlock
> > lease storage failure" or something else unusual appears, an alert is
> > triggered and I do some checks (I hope this whole process can be
> > automated).
>
> If you are specifically interested in detecting this lease timeout
> condition, there are some sanlock commands that can give you this info.
> You can also detect ahead of time that a VG's lockspace is getting close
> to the threshold.  I can get back to you with more specific fields to look
> at, but for now take a look at 'sanlock client renewal' and some of the
> internal details that are printed by 'sanlock client status -D'.
>
> Dave
