* [PATCH] ocfs2: fix cluster hang after a node dies
@ 2017-10-17 6:48 ` Changwei Ge
0 siblings, 0 replies; 12+ messages in thread
From: Changwei Ge @ 2017-10-17 6:48 UTC (permalink / raw)
To: ocfs2-devel, Mark Fasheh, Junxiao Bi, Joseph Qi, Joel Becker
Cc: Vitaly Mayatskih, Andrew Morton, linux-fsdevel
When a node dies, the other live nodes have to choose a new master
for each existing lock resource that was mastered by the dead node.
In the ocfs2/dlm implementation, this is done by the function
dlm_move_lockres_to_recovery_list, which marks those lock resources
as DLM_LOCK_RES_RECOVERING and manages them via a list from which
DLM changes the lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list, no master will
be chosen after dlm recovery completes, since no lock resource can
be found through ::resource list.
What's worse is that if DLM_LOCK_RES_RECOVERING is not set on
lock resources mastered by a dead node, synchronization among nodes
will break.
So invoke dlm_move_lockres_to_recovery_list again.
Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
lockres when recovery master goes down")'
Reported-by: Vitaly Mayatskih <v.mayatskih@gmail.com>
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
---
fs/ocfs2/dlm/dlmrecovery.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 74407c6..ec8f758 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
dlm_lockres_put(res);
continue;
}
+ dlm_move_lockres_to_recovery_list(dlm, res);
} else if (res->owner == dlm->node_num) {
dlm_free_dead_locks(dlm, res, dead_node);
__dlm_lockres_calc_usage(dlm, res);
--
1.7.9.5
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: fix cluster hang after a node dies
2017-10-17 6:48 ` [Ocfs2-devel] " Changwei Ge
(?)
@ 2017-10-17 14:28 ` Vitaly Mayatskikh
-1 siblings, 0 replies; 12+ messages in thread
From: Vitaly Mayatskikh @ 2017-10-17 14:28 UTC (permalink / raw)
To: Changwei Ge
Cc: ocfs2-devel, Mark Fasheh, Junxiao Bi, Joseph Qi, Joel Becker,
Andrew Morton, linux-fsdevel
On Tue, 17 Oct 2017 02:48:21 -0400,
Changwei Ge wrote:
>
> When a node dies, the other live nodes have to choose a new master
> for each existing lock resource that was mastered by the dead node.
>
> In the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list, which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM changes the lock resource's master later.
>
> So without invoking dlm_move_lockres_to_recovery_list, no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through ::resource list.
>
> What's worse is that if DLM_LOCK_RES_RECOVERING is not set on
> lock resources mastered by a dead node, synchronization among nodes
> will break.
>
> So invoke dlm_move_lockres_to_recovery_list again.
I did some brief testing and all looks good now. Thanks!
> Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
>
> Reported-by: Vitaly Mayatskih <v.mayatskih@gmail.com>
> Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
> ---
> fs/ocfs2/dlm/dlmrecovery.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
> dlm_lockres_put(res);
> continue;
> }
> + dlm_move_lockres_to_recovery_list(dlm, res);
> } else if (res->owner == dlm->node_num) {
> dlm_free_dead_locks(dlm, res, dead_node);
> __dlm_lockres_calc_usage(dlm, res);
> --
> 1.7.9.5
--
wbr, Vitaly
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
2017-10-17 6:48 ` [Ocfs2-devel] " Changwei Ge
@ 2017-10-18 8:17 ` piaojun
-1 siblings, 0 replies; 12+ messages in thread
From: piaojun @ 2017-10-18 8:17 UTC (permalink / raw)
To: Changwei Ge, ocfs2-devel, Mark Fasheh, Junxiao Bi, Joseph Qi,
Joel Becker
Cc: linux-fsdevel, Vitaly Mayatskih
Hi Changwei,
Could you share the method to reproduce the problem?
On 2017/10/17 14:48, Changwei Ge wrote:
> When a node dies, the other live nodes have to choose a new master
> for each existing lock resource that was mastered by the dead node.
>
> In the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list, which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM changes the lock resource's master later.
>
> So without invoking dlm_move_lockres_to_recovery_list, no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through ::resource list.
>
> What's worse is that if DLM_LOCK_RES_RECOVERING is not set on
> lock resources mastered by a dead node, synchronization among nodes
> will break.
>
> So invoke dlm_move_lockres_to_recovery_list again.
>
> Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
>
> Reported-by: Vitaly Mayatskih <v.mayatskih@gmail.com>
> Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
> ---
> fs/ocfs2/dlm/dlmrecovery.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
> dlm_lockres_put(res);
> continue;
> }
> + dlm_move_lockres_to_recovery_list(dlm, res);
> } else if (res->owner == dlm->node_num) {
> dlm_free_dead_locks(dlm, res, dead_node);
> __dlm_lockres_calc_usage(dlm, res);
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
2017-10-18 8:17 ` piaojun
@ 2017-10-18 8:42 ` Changwei Ge
-1 siblings, 0 replies; 12+ messages in thread
From: Changwei Ge @ 2017-10-18 8:42 UTC (permalink / raw)
To: piaojun, ocfs2-devel, Mark Fasheh, Junxiao Bi, Joseph Qi, Joel Becker
Cc: linux-fsdevel, Vitaly Mayatskih
On 2017/10/18 16:21, piaojun wrote:
> Hi Changwei,
>
> Could you share the method to reproduce the problem?
Hi Jun,
It's easy to reproduce this issue: just make a node die, no matter how.
For example, crash a node via 'magic sysrq'.
You can also refer to Vitaly's mail. Perhaps he can provide some better
or more detailed clues.
Thanks,
Changwei.
>
> On 2017/10/17 14:48, Changwei Ge wrote:
>> When a node dies, the other live nodes have to choose a new master
>> for each existing lock resource that was mastered by the dead node.
>>
>> In the ocfs2/dlm implementation, this is done by the function
>> dlm_move_lockres_to_recovery_list, which marks those lock resources
>> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
>> DLM changes the lock resource's master later.
>>
>> So without invoking dlm_move_lockres_to_recovery_list, no master will
>> be chosen after dlm recovery completes, since no lock resource can
>> be found through ::resource list.
>>
>> What's worse is that if DLM_LOCK_RES_RECOVERING is not set on
>> lock resources mastered by a dead node, synchronization among nodes
>> will break.
>>
>> So invoke dlm_move_lockres_to_recovery_list again.
>>
>> Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
>> lockres when recovery master goes down")'
>>
>> Reported-by: Vitaly Mayatskih <v.mayatskih@gmail.com>
>> Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
>> ---
>> fs/ocfs2/dlm/dlmrecovery.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>> index 74407c6..ec8f758 100644
>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
>> dlm_lockres_put(res);
>> continue;
>> }
>> + dlm_move_lockres_to_recovery_list(dlm, res);
>> } else if (res->owner == dlm->node_num) {
>> dlm_free_dead_locks(dlm, res, dead_node);
>> __dlm_lockres_calc_usage(dlm, res);
>>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
2017-10-17 6:48 ` [Ocfs2-devel] " Changwei Ge
@ 2017-10-18 9:09 ` piaojun
-1 siblings, 0 replies; 12+ messages in thread
From: piaojun @ 2017-10-18 9:09 UTC (permalink / raw)
To: Changwei Ge, ocfs2-devel, Mark Fasheh, Junxiao Bi, Joseph Qi,
Joel Becker
Cc: linux-fsdevel, Vitaly Mayatskih
On 2017/10/17 14:48, Changwei Ge wrote:
> When a node dies, the other live nodes have to choose a new master
> for each existing lock resource that was mastered by the dead node.
>
> In the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list, which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM changes the lock resource's master later.
>
> So without invoking dlm_move_lockres_to_recovery_list, no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through ::resource list.
>
> What's worse is that if DLM_LOCK_RES_RECOVERING is not set on
> lock resources mastered by a dead node, synchronization among nodes
> will break.
>
> So invoke dlm_move_lockres_to_recovery_list again.
>
> Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
>
> Reported-by: Vitaly Mayatskih <v.mayatskih@gmail.com>
> Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
> ---
> fs/ocfs2/dlm/dlmrecovery.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
> dlm_lockres_put(res);
> continue;
> }
> + dlm_move_lockres_to_recovery_list(dlm, res);
> } else if (res->owner == dlm->node_num) {
> dlm_free_dead_locks(dlm, res, dead_node);
> __dlm_lockres_calc_usage(dlm, res);
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
2017-10-18 8:17 ` piaojun
(?)
(?)
@ 2017-10-18 11:37 ` Vitaly Mayatskikh
-1 siblings, 0 replies; 12+ messages in thread
From: Vitaly Mayatskikh @ 2017-10-18 11:37 UTC (permalink / raw)
To: piaojun
Cc: Changwei Ge, ocfs2-devel, Mark Fasheh, Junxiao Bi, Joseph Qi,
Joel Becker, linux-fsdevel
On Wed, 18 Oct 2017 04:17:31 -0400,
piaojun wrote:
>
> Hi Changwei,
>
> Could you share the method to reproduce the problem?
See here: https://lkml.org/lkml/2017/10/10/1444
--
wbr, Vitaly
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] ocfs2: fix cluster hang after a node dies
2017-10-17 6:48 ` [Ocfs2-devel] " Changwei Ge
@ 2017-10-23 3:51 ` Joseph Qi
-1 siblings, 0 replies; 12+ messages in thread
From: Joseph Qi @ 2017-10-23 3:51 UTC (permalink / raw)
To: Changwei Ge, ocfs2-devel, Mark Fasheh, Junxiao Bi, Joel Becker
Cc: Vitaly Mayatskih, Andrew Morton, linux-fsdevel
On 17/10/17 14:48, Changwei Ge wrote:
> When a node dies, the other live nodes have to choose a new master
> for each existing lock resource that was mastered by the dead node.
>
> In the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list, which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM changes the lock resource's master later.
>
> So without invoking dlm_move_lockres_to_recovery_list, no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through ::resource list.
>
> What's worse is that if DLM_LOCK_RES_RECOVERING is not set on
> lock resources mastered by a dead node, synchronization among nodes
> will break.
>
> So invoke dlm_move_lockres_to_recovery_list again.
>
> Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
>
There is a typo here; it should be:
Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
We should also Cc stable.
The rest looks good to me.
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
> Reported-by: Vitaly Mayatskih <v.mayatskih@gmail.com>
> Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
> ---
> fs/ocfs2/dlm/dlmrecovery.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
> dlm_lockres_put(res);
> continue;
> }
> + dlm_move_lockres_to_recovery_list(dlm, res);
> } else if (res->owner == dlm->node_num) {
> dlm_free_dead_locks(dlm, res, dead_node);
> __dlm_lockres_calc_usage(dlm, res);
>
^ permalink raw reply [flat|nested] 12+ messages in thread
Thread overview: 12+ messages
2017-10-17 6:48 [PATCH] ocfs2: fix cluster hang after a node dies Changwei Ge
2017-10-17 6:48 ` [Ocfs2-devel] " Changwei Ge
2017-10-17 14:28 ` Vitaly Mayatskikh
2017-10-18 8:17 ` [Ocfs2-devel] " piaojun
2017-10-18 8:17 ` piaojun
2017-10-18 8:42 ` Changwei Ge
2017-10-18 8:42 ` Changwei Ge
2017-10-18 11:37 ` Vitaly Mayatskikh
2017-10-18 9:09 ` piaojun
2017-10-18 9:09 ` piaojun
2017-10-23 3:51 ` Joseph Qi
2017-10-23 3:51 ` [Ocfs2-devel] " Joseph Qi