From: Changwei Ge <ge.changwei@h3c.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability
Date: Tue, 8 Aug 2017 10:56:43 +0000
Message-ID: <63ADC13FD55D6546B7DECE290D39E373AC2CB9E5@H3CMLB14-EX.srv.huawei-3com.com>
In-Reply-To: CAAXPY_+YpQgRPBu1AeDMj-a3ouUxVktL7gdkSkSaEdeTKTUCvw@mail.gmail.com



On 2017/8/8 4:20, Mark Fasheh wrote:
> On Mon, Aug 7, 2017 at 2:13 AM, Changwei Ge <ge.changwei@h3c.com> wrote:
>> Hi,
>>
>> In current code, while flushing AST, we don't handle an exception that
>> sending AST or BAST is failed.
>> But it is indeed possible that AST or BAST is lost due to some kind of
>> networks fault.
>>
>> If above exception happens, the requesting node will never obtain an AST
>> back, hence, it will never acquire the lock or abort current locking.
>>
>> With this patch, I'd like to fix this issue by re-queuing the AST or
>> BAST if sending is failed due to networks fault.
>>
>> And the re-queuing AST or BAST will be dropped if the requesting node is
>> dead!
>>
>> It will improve the reliability a lot.
> Can you detail your testing? Code-wise this looks fine to me but as
> you note, this is a pretty hard to hit corner case so it'd be nice to
> hear that you were able to exercise it.
>
> Thanks,
>    --Mark
Hi Mark,

My test is quite simple to perform.
The test environment consists of 7 hosts. The Ethernet devices on 6 of
them are repeatedly brought down and then up again.
After several rounds of this, some file operations hang.

Running the debugfs.ocfs2 tool on NODE 2, which was the owner of
lock resource 'O000000000000000011150300000000',
showed:

debugfs: dlm_locks O000000000000000011150300000000
Lockres: O000000000000000011150300000000   Owner: 2    State: 0x0
Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
Refs: 4    Locks: 2    On Lists: None
Reference Map: 3
 Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
 Granted     2     PR     -1    2:53             2     No   No    None
 Granted     3     PR     -1    3:48             2     No   No    None

This means NODE 2 had granted the lock to NODE 3 and that the AST had
been sent to NODE 3.

Meanwhile, running the debugfs.ocfs2 tool on NODE 3
showed:
debugfs: dlm_locks O000000000000000011150300000000
Lockres: O000000000000000011150300000000   Owner: 2    State: 0x0
Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
Refs: 3    Locks: 1    On Lists: None
Reference Map:
 Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
 Blocked     3     PR     -1    3:48             2     No   No    None

This means NODE 3 never received an AST that would move its local lock
from the blocked list to the granted list.
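
The comparison of the two dumps above can also be scripted. The following
is only a minimal sketch: the function names are hypothetical and not part
of ocfs2-tools, and it assumes that lock rows start with one of the queue
names Granted, Converting or Blocked, followed by node, level, conv and
cookie, as in the output shown above.

```python
def parse_dlm_locks(text):
    """Return {(node, cookie): queue} for the lock rows of a dlm_locks dump."""
    queues = ("Granted", "Converting", "Blocked")  # assumed queue names
    locks = {}
    for line in text.splitlines():
        fields = line.split()
        # Lock rows: queue, node, level, conv, cookie, ...
        if len(fields) >= 5 and fields[0] in queues:
            locks[(int(fields[1]), fields[4])] = fields[0]
    return locks


def find_lost_asts(owner_dump, requester_dump):
    """Locks the owner lists as Granted but the requester still lists as Blocked."""
    owner = parse_dlm_locks(owner_dump)
    requester = parse_dlm_locks(requester_dump)
    return sorted(
        key for key, queue in requester.items()
        if queue == "Blocked" and owner.get(key) == "Granted"
    )


# Lock rows copied from the two dumps above.
NODE2_DUMP = """\
 Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
 Granted     2     PR     -1    2:53             2     No   No    None
 Granted     3     PR     -1    3:48             2     No   No    None
"""
NODE3_DUMP = """\
 Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
 Blocked     3     PR     -1    3:48             2     No   No    None
"""

print(find_lost_asts(NODE2_DUMP, NODE3_DUMP))  # [(3, '3:48')]
```

A non-empty result only hints at a lost AST. A lock can legitimately be
in this state for a short time while the AST is still in flight, so the
check is meaningful only once the operation has visibly hung.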

This result makes sense, since sending the AST failed, as can be seen
in the kernel log.

The BAST case is more or less the same.

Thanks
Changwei



Thread overview: 16+ messages
2017-08-07  7:13 [Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability Changwei Ge
2017-08-07  7:43 ` Gang He
2017-08-07  7:55   ` Changwei Ge
2017-08-07 20:19 ` Mark Fasheh
2017-08-08 10:56   ` Changwei Ge [this message]
2017-08-22 20:49     ` Mark Fasheh
2017-08-23  1:06       ` Joseph Qi
2017-08-09 11:32 ` Joseph Qi
2017-08-09 15:24   ` ge changwei
2017-08-10  9:34     ` Joseph Qi
2017-08-10 10:49       ` Changwei Ge
2017-08-23  2:23         ` Junxiao Bi
2017-08-23  3:34           ` Joseph Qi
2017-08-23  4:47             ` Gang He
2017-08-23  5:56               ` Changwei Ge
     [not found]                 ` <63ADC13FD55D6546B7DECE290D39E373CED4F4ED@H3CMLB14-EX.srv.huawei-3com.com>
2017-09-13  7:03                   ` Changwei Ge
