* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
From: Gang He @ 2021-08-11 10:38 UTC
  To: cluster-devel.redhat.com

Hello List,

I am using kernel 5.13.4 (some older kernel versions have the same problem).
Node A held a DLM lock in EX mode, and node B tried to acquire the same lock, so node A received a BAST message.
Node A then downconverted its lock to NL, but the dlm_lock function failed with error -16 (-EBUSY).
The failure does not happen every time, but in some cases I can hit it.
Why does dlm_lock fail when downconverting a DLM lock? Is there any documentation describing these error cases?
If the code on node A ignores the dlm_lock error, node B will never be granted the lock.
How should we handle this situation? Should we call dlm_lock again to retry the downconvert?
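
For reference, the downconvert is essentially a convert request like the
sketch below (a minimal illustration against the kernel dlm_lock() API from
<linux/dlm.h>; the lockspace, lksb, and callback names are placeholders,
not our actual code):

#include <linux/dlm.h>

/* Hypothetical completion AST; the real caller installs its own. */
static void example_ast(void *astarg)
{
	/* the result of the convert arrives in lksb->sb_status */
}

static int example_downconvert(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
			       const char *name, unsigned int namelen)
{
	/*
	 * Convert the existing lock (identified by lksb->sb_lkid) down
	 * to NL mode.  DLM_LKF_CONVERT marks this as a conversion of an
	 * existing lock rather than a new request.  A synchronous return
	 * of -EBUSY (-16) means a previous request on this lock has not
	 * completed yet.
	 */
	return dlm_lock(ls, DLM_LOCK_NL, lksb, DLM_LKF_CONVERT,
			name, namelen, 0, example_ast, lksb, NULL);
}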

Thanks
Gang

 





* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
From: Alexander Aring @ 2021-08-11 20:35 UTC
  To: cluster-devel.redhat.com

Hi,

On Wed, Aug 11, 2021 at 6:41 AM Gang He <GHe@suse.com> wrote:
>
> Hello List,
>
> I am using kernel 5.13.4 (some older kernel versions have the same problem).
> Node A held a DLM lock in EX mode, and node B tried to acquire the same lock, so node A received a BAST message.
> Node A then downconverted its lock to NL, but the dlm_lock function failed with error -16 (-EBUSY).
> The failure does not happen every time, but in some cases I can hit it.
> Why does dlm_lock fail when downconverting a DLM lock? Is there any documentation describing these error cases?
> If the code on node A ignores the dlm_lock error, node B will never be granted the lock.
> How should we handle this situation? Should we call dlm_lock again to retry the downconvert?

What is your DLM user? Is it in the kernel (e.g. gfs2/ocfs2/md) or in user space (libdlm)?

I believe you are running into case [0]. Can you provide the
corresponding log_debug() message? Add "log_debug=1" to your dlm.conf;
the message will then be reported at the KERN_DEBUG level in your
kernel log.
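
For example, a minimal dlm.conf for this might be (the path is an
assumption; /etc/dlm/dlm.conf is the usual location read by dlm_controld):

# /etc/dlm/dlm.conf
log_debug=1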

Thanks.

- Alex

[0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886




* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
From: Gang He @ 2021-08-12  5:44 UTC
  To: cluster-devel.redhat.com

Hi Alexander,


On 2021/8/12 4:35, Alexander Aring wrote:
> Hi,
> 
> On Wed, Aug 11, 2021 at 6:41 AM Gang He <GHe@suse.com> wrote:
>>
>> Hello List,
>>
>> I am using kernel 5.13.4 (some older kernel versions have the same problem).
>> Node A held a DLM lock in EX mode, and node B tried to acquire the same lock, so node A received a BAST message.
>> Node A then downconverted its lock to NL, but the dlm_lock function failed with error -16 (-EBUSY).
>> The failure does not happen every time, but in some cases I can hit it.
>> Why does dlm_lock fail when downconverting a DLM lock? Is there any documentation describing these error cases?
>> If the code on node A ignores the dlm_lock error, node B will never be granted the lock.
>> How should we handle this situation? Should we call dlm_lock again to retry the downconvert?
> 
> What is your DLM user? Is it in the kernel (e.g. gfs2/ocfs2/md) or in user space (libdlm)?
The ocfs2 file system, i.e. a kernel user.

> 
> I believe you are running into case [0]. Can you provide the
> corresponding log_debug() message? Add "log_debug=1" to your dlm.conf;
> the message will then be reported at the KERN_DEBUG level in your
> kernel log.
[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 100000
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 100000
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 100000 10c 2 0 M0000000000000000046e0200000000
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M0000000000000000046e0200000000
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap

The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The kernel logs from the other two nodes:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb

In fact, I can reproduce this problem reliably.
First, I want to know whether this error is expected, since no extreme
stress test is running.
Second, how should we handle these error cases? Call dlm_lock again?
The function may fail again, and multiple retries could lead to a
kernel soft lockup.

Thanks
Gang

> 
> Thanks.
> 
> - Alex
> 
> [0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886
> 




* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
From: David Teigland @ 2021-08-12 17:45 UTC
  To: cluster-devel.redhat.com

On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> In fact, I can reproduce this problem reliably.
> First, I want to know whether this error is expected, since no extreme
> stress test is running.
> Second, how should we handle these error cases? Call dlm_lock again?
> The function may fail again, and multiple retries could lead to a
> kernel soft lockup.

What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
an in-progress dlm_lock() request.  Before the cancel completes (or the
original request completes), ocfs2 calls dlm_lock() again on the same
resource.  This dlm_lock() returns -EBUSY because the previous request has
not completed, either normally or by cancellation.  This is expected.

A couple options to try: wait for the original request to complete
(normally or by cancellation) before calling dlm_lock() again, or retry
dlm_lock() on -EBUSY.
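
If you go the retry route, a rough sketch might look like the following
(illustrative only; the helper and its arguments are placeholders, and
real code would want to bound the number of retries):

#include <linux/delay.h>
#include <linux/dlm.h>

/*
 * Retry a downconvert while a previous request on the same lock is
 * still in flight.  Sleeping (rather than spinning) between attempts
 * avoids burning CPU and triggering a soft lockup.
 */
static int downconvert_with_retry(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
				  const char *name, unsigned int namelen,
				  void (*ast)(void *astarg))
{
	int ret;

	do {
		ret = dlm_lock(ls, DLM_LOCK_NL, lksb, DLM_LKF_CONVERT,
			       name, namelen, 0, ast, lksb, NULL);
		if (ret == -EBUSY)
			msleep(20);	/* back off while the prior request finishes */
	} while (ret == -EBUSY);

	return ret;
}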

Dave




* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
From: Gang He @ 2021-08-13  6:49 UTC
  To: cluster-devel.redhat.com

Hi David,

On 2021/8/13 1:45, David Teigland wrote:
> On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
>> In fact, I can reproduce this problem reliably.
>> First, I want to know whether this error is expected, since no extreme
>> stress test is running.
>> Second, how should we handle these error cases? Call dlm_lock again?
>> The function may fail again, and multiple retries could lead to a
>> kernel soft lockup.
> 
> What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> an in-progress dlm_lock() request.  Before the cancel completes (or the
> original request completes), ocfs2 calls dlm_lock() again on the same
> resource.  This dlm_lock() returns -EBUSY because the previous request has
> not completed, either normally or by cancellation.  This is expected.
Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
different nodes?

> 
> A couple options to try: wait for the original request to complete
> (normally or by cancellation) before calling dlm_lock() again, or retry
> dlm_lock() on -EBUSY.
If I retry dlm_lock() repeatedly, I wonder whether this will lead to a
kernel soft lockup or waste a lot of CPU.
If dlm_lock() returns -EAGAIN, how should we handle that case?
Retry repeatedly?

Thanks
Gang

> 
> Dave
> 




* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
From: David Teigland @ 2021-08-16 14:41 UTC
  To: cluster-devel.redhat.com

On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> Hi David,
> 
> On 2021/8/13 1:45, David Teigland wrote:
> > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > In fact, I can reproduce this problem reliably.
> > > First, I want to know whether this error is expected, since no extreme
> > > stress test is running.
> > > Second, how should we handle these error cases? Call dlm_lock again?
> > > The function may fail again, and multiple retries could lead to a
> > > kernel soft lockup.
> > 
> > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> > an in-progress dlm_lock() request.  Before the cancel completes (or the
> > original request completes), ocfs2 calls dlm_lock() again on the same
> > resource.  This dlm_lock() returns -EBUSY because the previous request has
> > not completed, either normally or by cancellation.  This is expected.
> > Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
> > different nodes?

different

> > A couple options to try: wait for the original request to complete
> > (normally or by cancellation) before calling dlm_lock() again, or retry
> > dlm_lock() on -EBUSY.
> If I retry dlm_lock() repeatedly, I wonder whether this will lead to a
> kernel soft lockup or waste a lot of CPU.

I'm not aware of other code doing this, so I can't tell you with certainty.
It would depend largely on the implementation in the caller.

> If dlm_lock() returns -EAGAIN, how should we handle that case?
> Retry repeatedly?

Again, this is a question more about the implementation of the calling
code and what it wants to do.  EAGAIN is specifically related to the
DLM_LKF_NOQUEUE flag.
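
For illustration, the -EAGAIN case arises on a trylock-style request like
this sketch (placeholder names; a minimal example of the NOQUEUE behavior,
not code from any existing caller):

#include <linux/dlm.h>

/*
 * Request an EX lock without queueing.  With DLM_LKF_NOQUEUE the DLM
 * does not wait for a conflicting holder: the "would block" result is
 * -EAGAIN, reported either as the return value or via lksb->sb_status
 * delivered to the AST.
 */
static int try_lock_ex(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
		       const char *name, unsigned int namelen,
		       void (*ast)(void *astarg))
{
	return dlm_lock(ls, DLM_LOCK_EX, lksb, DLM_LKF_NOQUEUE,
			name, namelen, 0, ast, lksb, NULL);
}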




* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
From: David Teigland @ 2021-08-16 14:50 UTC
  To: cluster-devel.redhat.com

On Mon, Aug 16, 2021 at 09:41:18AM -0500, David Teigland wrote:
> On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> > Hi David,
> > 
> > On 2021/8/13 1:45, David Teigland wrote:
> > > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > > In fact, I can reproduce this problem reliably.
> > > > First, I want to know whether this error is expected, since no extreme
> > > > stress test is running.
> > > > Second, how should we handle these error cases? Call dlm_lock again?
> > > > The function may fail again, and multiple retries could lead to a
> > > > kernel soft lockup.
> > > 
> > > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> > > an in-progress dlm_lock() request.  Before the cancel completes (or the
> > > original request completes), ocfs2 calls dlm_lock() again on the same
> > > resource.  This dlm_lock() returns -EBUSY because the previous request has
> > > not completed, either normally or by cancellation.  This is expected.
> > Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
> > different nodes?
> 
> different

Sorry, same node



