* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
@ 2021-08-11 10:38 Gang He
2021-08-11 20:35 ` Alexander Aring
0 siblings, 1 reply; 7+ messages in thread
From: Gang He @ 2021-08-11 10:38 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hello List,
I am using kernel 5.13.4 (some older kernel versions have the same problem).
Node A held a DLM lock in EX mode; when node B tried to acquire the lock,
node A received a BAST message. Node A then downconverted the lock to NL,
and dlm_lock() failed with error -16.
The failure does not always happen, but in some cases I can reproduce it.
Why does dlm_lock() fail when downconverting a DLM lock? Are there any
documents describing these error cases?
If the code on node A ignores the dlm_lock() error, node B will never be
granted the lock.
How should we handle this situation? Call dlm_lock() again to retry the
downconvert?
Thanks
Gang
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
2021-08-11 10:38 [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock? Gang He
@ 2021-08-11 20:35 ` Alexander Aring
2021-08-12 5:44 ` Gang He
0 siblings, 1 reply; 7+ messages in thread
From: Alexander Aring @ 2021-08-11 20:35 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On Wed, Aug 11, 2021 at 6:41 AM Gang He <GHe@suse.com> wrote:
>
> Hello List,
>
> I am using kernel 5.13.4 (some older kernel versions have the same problem).
> Node A held a DLM lock in EX mode; when node B tried to acquire the lock,
> node A received a BAST message. Node A then downconverted the lock to NL,
> and dlm_lock() failed with error -16.
> The failure does not always happen, but in some cases I can reproduce it.
> Why does dlm_lock() fail when downconverting a DLM lock? Are there any
> documents describing these error cases?
> If the code on node A ignores the dlm_lock() error, node B will never be
> granted the lock.
> How should we handle this situation? Call dlm_lock() again to retry the
> downconvert?
What is your DLM user? Is it in-kernel (e.g. gfs2/ocfs2/md) or userspace (libdlm)?
I believe you are running into case [0]. Can you provide the corresponding
log_debug() message? You need to add "log_debug=1" to your dlm.conf; the
messages will then be reported at the KERN_DEBUG level in your kernel log.
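For reference, the setting is a single line in dlm.conf (typically
/etc/dlm/dlm.conf, read by dlm_controld at startup, so the daemon may need
a restart for it to take effect):

```
log_debug=1
```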
Thanks.
- Alex
[0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886
* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
2021-08-11 20:35 ` Alexander Aring
@ 2021-08-12 5:44 ` Gang He
2021-08-12 17:45 ` David Teigland
0 siblings, 1 reply; 7+ messages in thread
From: Gang He @ 2021-08-12 5:44 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi Alexander,
On 2021/8/12 4:35, Alexander Aring wrote:
> Hi,
>
> On Wed, Aug 11, 2021 at 6:41 AM Gang He <GHe@suse.com> wrote:
>>
>> Hello List,
>>
>> I am using kernel 5.13.4 (some older kernel versions have the same problem).
>> Node A held a DLM lock in EX mode; when node B tried to acquire the lock,
>> node A received a BAST message. Node A then downconverted the lock to NL,
>> and dlm_lock() failed with error -16.
>> The failure does not always happen, but in some cases I can reproduce it.
>> Why does dlm_lock() fail when downconverting a DLM lock? Are there any
>> documents describing these error cases?
>> If the code on node A ignores the dlm_lock() error, node B will never be
>> granted the lock.
>> How should we handle this situation? Call dlm_lock() again to retry the
>> downconvert?
>
> What is your DLM user? Is it in-kernel (e.g. gfs2/ocfs2/md) or userspace (libdlm)?
The ocfs2 file system.
>
> I believe you are running into case [0]. Can you provide the corresponding
> log_debug() message? You need to add "log_debug=1" to your dlm.conf; the
> messages will then be reported at the KERN_DEBUG level in your kernel log.
[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 100000
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 100000
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 100000 10c 2 0 M0000000000000000046e0200000000
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M0000000000000000046e0200000000
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The kernel logs from the other two nodes:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb
In fact, I can reproduce this problem reliably.
First, I want to know whether this error is expected, since this is not an
extreme stress test.
Second, how should we handle these error cases? Call dlm_lock() again? The
function may fail again, and repeated retries could lead to a kernel soft
lockup.
Thanks
Gang
>
> Thanks.
>
> - Alex
>
> [0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886
>
* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
2021-08-12 5:44 ` Gang He
@ 2021-08-12 17:45 ` David Teigland
2021-08-13 6:49 ` Gang He
0 siblings, 1 reply; 7+ messages in thread
From: David Teigland @ 2021-08-12 17:45 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> In fact, I can reproduce this problem reliably.
> First, I want to know whether this error is expected, since this is not an
> extreme stress test.
> Second, how should we handle these error cases? Call dlm_lock() again? The
> function may fail again, and repeated retries could lead to a kernel soft
> lockup.
What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
an in-progress dlm_lock() request. Before the cancel completes (or the
original request completes), ocfs2 calls dlm_lock() again on the same
resource. This dlm_lock() returns -EBUSY because the previous request has
not completed, either normally or by cancellation. This is expected.
A couple options to try: wait for the original request to complete
(normally or by cancellation) before calling dlm_lock() again, or retry
dlm_lock() on -EBUSY.
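The second option (retry on -EBUSY) can be sketched as below. This is a
standalone illustration, not ocfs2 code: fake_dlm_lock() is a stand-in, not
the real fs/dlm API (the real dlm_lock() takes a lockspace, mode, lksb,
flags, resource name and AST callbacks), and busy_count merely simulates a
previous request that has not yet completed or been cancelled.

```c
/* Sketch: retrying a downconvert when the lock request returns -EBUSY. */
#include <errno.h>

static int busy_count;  /* how many times the stub still reports -EBUSY */
static int calls_made;  /* instrumentation for the example */

/* Hypothetical stand-in for dlm_lock(): fails with -EBUSY while the
 * previous request (or its cancellation) is still outstanding. */
static int fake_dlm_lock(void)
{
	calls_made++;
	if (busy_count > 0) {
		busy_count--;
		return -EBUSY;  /* previous request has not completed yet */
	}
	return 0;               /* downconvert accepted */
}

/* Bounded retry loop: giving up after max_retries attempts avoids
 * unbounded spinning. In the kernel one would sleep/schedule between
 * attempts so the outstanding request can actually complete. */
static int downconvert_with_retry(int max_retries)
{
	int ret;

	do {
		ret = fake_dlm_lock();
		if (ret != -EBUSY)
			return ret;
	} while (max_retries-- > 0);

	return ret;
}
```

Bounding the retries, and sleeping between attempts rather than spinning,
is what keeps this pattern from turning into the soft lockup Gang is
worried about.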
Dave
* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
2021-08-12 17:45 ` David Teigland
@ 2021-08-13 6:49 ` Gang He
2021-08-16 14:41 ` David Teigland
0 siblings, 1 reply; 7+ messages in thread
From: Gang He @ 2021-08-13 6:49 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi David,
On 2021/8/13 1:45, David Teigland wrote:
> On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
>> In fact, I can reproduce this problem reliably.
>> First, I want to know whether this error is expected, since this is not an
>> extreme stress test.
>> Second, how should we handle these error cases? Call dlm_lock() again? The
>> function may fail again, and repeated retries could lead to a kernel soft
>> lockup.
>
> What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> an in-progress dlm_lock() request. Before the cancel completes (or the
> original request completes), ocfs2 calls dlm_lock() again on the same
> resource. This dlm_lock() returns -EBUSY because the previous request has
> not completed, either normally or by cancellation. This is expected.
Are these dlm_lock() and dlm_unlock() calls invoked on the same node, or on
different nodes?
>
> A couple options to try: wait for the original request to complete
> (normally or by cancellation) before calling dlm_lock() again, or retry
> dlm_lock() on -EBUSY.
If I retry dlm_lock() repeatedly, I wonder whether this will lead to a
kernel soft lockup or waste a lot of CPU.
If dlm_lock() returns -EAGAIN, how should we handle that case? Retry it
repeatedly as well?
Thanks
Gang
>
> Dave
>
* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
2021-08-13 6:49 ` Gang He
@ 2021-08-16 14:41 ` David Teigland
2021-08-16 14:50 ` David Teigland
0 siblings, 1 reply; 7+ messages in thread
From: David Teigland @ 2021-08-16 14:41 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> Hi David,
>
> On 2021/8/13 1:45, David Teigland wrote:
> > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > In fact, I can reproduce this problem reliably.
> > > First, I want to know whether this error is expected, since this is not an
> > > extreme stress test.
> > > Second, how should we handle these error cases? Call dlm_lock() again? The
> > > function may fail again, and repeated retries could lead to a kernel soft
> > > lockup.
> >
> > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> > an in-progress dlm_lock() request. Before the cancel completes (or the
> > original request completes), ocfs2 calls dlm_lock() again on the same
> > resource. This dlm_lock() returns -EBUSY because the previous request has
> > not completed, either normally or by cancellation. This is expected.
> > Are these dlm_lock() and dlm_unlock() calls invoked on the same node, or on
> > different nodes?
different
> > A couple options to try: wait for the original request to complete
> > (normally or by cancellation) before calling dlm_lock() again, or retry
> > dlm_lock() on -EBUSY.
> > If I retry dlm_lock() repeatedly, I wonder whether this will lead to a
> > kernel soft lockup or waste a lot of CPU.
I'm not aware of other code doing this, so I can't tell you with certainty.
It would depend largely on the implementation in the caller.
> If dlm_lock() returns -EAGAIN, how should we handle that case? Retry it
> repeatedly as well?
Again, this is a question more about the implementation of the calling
code and what it wants to do. EAGAIN is specifically related to the
DLM_LKF_NOQUEUE flag.
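To illustrate: DLM_LKF_NOQUEUE turns the request into a try-lock, and
-EAGAIN means the lock could not be granted immediately. The sketch below
is a stand-in model, not the real dlm_lock() API; only the flag value is
taken from include/uapi/linux/dlm.h. A caller that gets -EAGAIN typically
backs off or re-requests without NOQUEUE rather than retrying in a loop.

```c
/* Stand-in model of the -EAGAIN semantics of DLM_LKF_NOQUEUE.
 * fake_dlm_lock() is NOT the real dlm_lock() signature. */
#include <errno.h>

#define DLM_LKF_NOQUEUE 0x00000001  /* flag value from linux/dlm.h */

static int lock_grantable;  /* pretend cluster state: grantable right now? */

static int fake_dlm_lock(unsigned int flags)
{
	if (!lock_grantable && (flags & DLM_LKF_NOQUEUE))
		return -EAGAIN;  /* try-lock failed: would have to wait */
	/* Without NOQUEUE the request queues and completes later via the
	 * AST callback; modeled here as immediate success. */
	return 0;
}
```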
* [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?
2021-08-16 14:41 ` David Teigland
@ 2021-08-16 14:50 ` David Teigland
0 siblings, 0 replies; 7+ messages in thread
From: David Teigland @ 2021-08-16 14:50 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Mon, Aug 16, 2021 at 09:41:18AM -0500, David Teigland wrote:
> On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> > Hi David,
> >
> > On 2021/8/13 1:45, David Teigland wrote:
> > > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > > In fact, I can reproduce this problem reliably.
> > > > First, I want to know whether this error is expected, since this is not an
> > > > extreme stress test.
> > > > Second, how should we handle these error cases? Call dlm_lock() again? The
> > > > function may fail again, and repeated retries could lead to a kernel soft
> > > > lockup.
> > >
> > > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> > > an in-progress dlm_lock() request. Before the cancel completes (or the
> > > original request completes), ocfs2 calls dlm_lock() again on the same
> > > resource. This dlm_lock() returns -EBUSY because the previous request has
> > > not completed, either normally or by cancellation. This is expected.
> > Are these dlm_lock() and dlm_unlock() calls invoked on the same node, or on
> > different nodes?
>
> different
Sorry, same node