Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job

From: "heming.zhao@suse.com" <heming.zhao@suse.com>
To: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>,
	linux-raid@vger.kernel.org, song@kernel.org
Cc: lidong.zhong@suse.com, xni@redhat.com, neilb@suse.de, colyli@suse.de
Subject: Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job
Date: Wed, 11 Nov 2020 20:13:12 +0800	[thread overview]
Message-ID: <e5ab9dfc-46bd-ac84-e79f-468f4ed0512f@suse.com> (raw)
In-Reply-To: <b6ef581d-2eaf-37d3-7a21-a91630b0836c@cloud.ionos.com>

On 11/10/20 2:36 PM, Guoqing Jiang wrote:
> 
> 
> On 11/8/20 15:53, Zhao Heming wrote:
>> There is a similar deadlock in commit 0ba959774e93
>> ("md-cluster: use sync way to handle METADATA_UPDATED msg")
>> ... ...
>>   static void unlock_comm(struct md_cluster_info *cinfo)
>> @@ -784,9 +794,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
>>   {
>>       int ret;
>> -    lock_comm(cinfo, mddev_locked);
>> -    ret = __sendmsg(cinfo, cmsg);
>> -    unlock_comm(cinfo);
>> +    ret = lock_comm(cinfo, mddev_locked);
>> +    if (!ret) {
>> +        ret = __sendmsg(cinfo, cmsg);
>> +        unlock_comm(cinfo);
>> +    }
>>       return ret;
>>   }
> 
> What happens if the cluster has latency issue? Let's say, node normally can't get lock during the 5s window. IOW, is the timeout value justified well?

5 seconds is a random number first from my brain when I was coding. 

below I gave more info for my patch.

The key of patch 2 is to change from wait_event to wait_event_timeout.
After applying patch, code logic of some functions are also changed.
I will analyze the two sides: send & receive

*** send side***

Before applying patch, sendmsg only fail when __sendmsg return error. it means dlm layer fails to convert lock. But currently code should do access the return value of lock_comm.

After applying patch, there will have new fail cases: 5s timeout.

all related functions:
resync_bitmap
- void, only print error info when error. 

update_bitmap_size
- return err. caller already handles error case.

resync_info_update
- return err. caller ignore. because this is info msg, it can safely ignore.

remove_disk
- return err. part of caller handle error case. 
  I forgot to modify hot_remove_disk(), will add it in v2 patch.
  So in future v2 patch, all callers will handle error case.

gather_bitmaps - return err, caller already handles error case

We can see there is only one function which doesn't care return value: resync_bitmap
this function is used in leave path. if the msg doesn't send out (5s timeout), the result is other nodes won't know the failure event by BITMAP_NEEDS_SYNC. But if my understand is correct, even if missing BITMAP_NEEDS_SYNC, there is another api recover_slot(), which is triggered by dlm and do the same job.

So all the sending side related functions are safe.

*** receive side ***

process_metadata_update
the patch means it won't do metadata update job when 5s timeout. 
and I change my mind, 5s timeout is wrong. Receive side should do as more as possible to handle incomming msg. if there is a 5s timeout code branch, there will tigger inconsistent issue.
e.g. 
A does --faulty, send METADATA_UPDATE to B, 
B receives the msg, but meets 5s timeout. it won't to update kernel md info. and it will have a gap between kernel md info and disk metadata.