All of lore.kernel.org
 help / color / mirror / Atom feed
* [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate
@ 2009-02-03 20:37 Sunil Mushran
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: Cleanup the lockname print in dlmglue.c Sunil Mushran
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Sunil Mushran @ 2009-02-03 20:37 UTC (permalink / raw)
  To: ocfs2-devel

Mainline commit d4f7e650e55af6b235871126f747da88600e8040 attempts to delay
the dlm_thread from sending the drop ref message if the lockres is being
migrated. The problem is that we make the dlm_thread wait for the migration
to complete. This causes a deadlock as dlm_thread also participates in the
lockres migration process.

A better fix for the original oss bugzilla#1012 is in testing.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
---
 fs/ocfs2/dlm/dlmthread.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
index d129520..4060bb3 100644
--- a/fs/ocfs2/dlm/dlmthread.c
+++ b/fs/ocfs2/dlm/dlmthread.c
@@ -181,8 +181,7 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
 
 		spin_lock(&res->spinlock);
 		/* This ensures that clear refmap is sent after the set */
-		__dlm_wait_on_lockres_flags(res, (DLM_LOCK_RES_SETREF_INPROG |
-						  DLM_LOCK_RES_MIGRATING));
+		__dlm_wait_on_lockres_flags(res, DLM_LOCK_RES_SETREF_INPROG);
 		spin_unlock(&res->spinlock);
 
 		/* clear our bit from the master's refmap, ignore errors */
-- 
1.5.6.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH 2/4] ocfs2: Cleanup the lockname print in dlmglue.c
  2009-02-03 20:37 [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate Sunil Mushran
@ 2009-02-03 20:37 ` Sunil Mushran
  2009-02-11  7:31   ` Joel Becker
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 3/4] ocfs2/dlm: Use ast_lock to protect ast_list Sunil Mushran
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Sunil Mushran @ 2009-02-03 20:37 UTC (permalink / raw)
  To: ocfs2-devel

The dentry lock has a different format than other locks. This patch fixes
ocfs2_log_dlm_error() macro to make it print the dentry lock correctly.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
---
 fs/ocfs2/dlmglue.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 206a237..7219a86 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -320,9 +320,14 @@ static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb,
 					struct ocfs2_lock_res *lockres);
 static inline void ocfs2_recover_from_dlm_error(struct ocfs2_lock_res *lockres,
 						int convert);
-#define ocfs2_log_dlm_error(_func, _err, _lockres) do {			\
-	mlog(ML_ERROR, "DLM error %d while calling %s on resource %s\n", \
-	     _err, _func, _lockres->l_name);				\
+#define ocfs2_log_dlm_error(_func, _err, _lockres) do {					\
+	if ((_lockres)->l_type != OCFS2_LOCK_TYPE_DENTRY)				\
+		mlog(ML_ERROR, "DLM error %d while calling %s on resource %s\n",	\
+		     _err, _func, _lockres->l_name);					\
+	else										\
+		mlog(ML_ERROR, "DLM error %d while calling %s on resource %.*s%08x\n",	\
+		     _err, _func, OCFS2_DENTRY_LOCK_INO_START - 1, (_lockres)->l_name,	\
+		     (unsigned int)ocfs2_get_dentry_lock_ino(_lockres));		\
 } while (0)
 static int ocfs2_downconvert_thread(void *arg);
 static void ocfs2_downconvert_on_unlock(struct ocfs2_super *osb,
-- 
1.5.6.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH 3/4] ocfs2/dlm: Use ast_lock to protect ast_list
  2009-02-03 20:37 [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate Sunil Mushran
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: Cleanup the lockname print in dlmglue.c Sunil Mushran
@ 2009-02-03 20:37 ` Sunil Mushran
  2009-02-11  7:31   ` Joel Becker
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter Sunil Mushran
  2009-02-11  7:31 ` [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate Joel Becker
  3 siblings, 1 reply; 9+ messages in thread
From: Sunil Mushran @ 2009-02-03 20:37 UTC (permalink / raw)
  To: ocfs2-devel

The code was using dlm->spinlock instead of dlm->ast_lock to protect the
ast_list. This patch fixes the issue.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
---
 fs/ocfs2/dlm/dlmunlock.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c
index 86ca085..fcf879e 100644
--- a/fs/ocfs2/dlm/dlmunlock.c
+++ b/fs/ocfs2/dlm/dlmunlock.c
@@ -117,11 +117,11 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
 	else
 		BUG_ON(res->owner == dlm->node_num);
 
-	spin_lock(&dlm->spinlock);
+	spin_lock(&dlm->ast_lock);
 	/* We want to be sure that we're not freeing a lock
 	 * that still has AST's pending... */
 	in_use = !list_empty(&lock->ast_list);
-	spin_unlock(&dlm->spinlock);
+	spin_unlock(&dlm->ast_lock);
 	if (in_use) {
 	       mlog(ML_ERROR, "lockres %.*s: Someone is calling dlmunlock "
 		    "while waiting for an ast!", res->lockname.len,
-- 
1.5.6.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter
  2009-02-03 20:37 [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate Sunil Mushran
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: Cleanup the lockname print in dlmglue.c Sunil Mushran
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 3/4] ocfs2/dlm: Use ast_lock to protect ast_list Sunil Mushran
@ 2009-02-03 20:37 ` Sunil Mushran
  2009-02-11  7:35   ` Joel Becker
  2009-02-11  7:31 ` [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate Joel Becker
  3 siblings, 1 reply; 9+ messages in thread
From: Sunil Mushran @ 2009-02-03 20:37 UTC (permalink / raw)
  To: ocfs2-devel

In dlm_assert_master_handler(), if we get an incorrect assert master from a node
that, we reply with EINVAL asking the asserter to die. The problem is that an
assert is sent after so many hoops, it is invariably the node that thinks the
asserter is wrong, is actually wrong. So instead of killing the asserter, this
patch kills the assertee.

This patch papers over a race that is still being addressed.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
---
 fs/ocfs2/dlm/dlmmaster.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index 54e182a..0a28139 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -1849,12 +1849,12 @@ int dlm_assert_master_handler(struct o2net_msg *msg, u32 len, void *data,
 		if (!mle) {
 			if (res->owner != DLM_LOCK_RES_OWNER_UNKNOWN &&
 			    res->owner != assert->node_idx) {
-				mlog(ML_ERROR, "assert_master from "
-					  "%u, but current owner is "
-					  "%u! (%.*s)\n",
-				       assert->node_idx, res->owner,
-				       namelen, name);
-				goto kill;
+				mlog(ML_ERROR, "DIE! Mastery assert from %u, "
+				     "but current owner is %u! (%.*s)\n",
+				     assert->node_idx, res->owner, namelen,
+				     name);
+				__dlm_print_one_lock_resource(res);
+				BUG();
 			}
 		} else if (mle->type != DLM_MLE_MIGRATION) {
 			if (res->owner != DLM_LOCK_RES_OWNER_UNKNOWN) {
-- 
1.5.6.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate
  2009-02-03 20:37 [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate Sunil Mushran
                   ` (2 preceding siblings ...)
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter Sunil Mushran
@ 2009-02-11  7:31 ` Joel Becker
  3 siblings, 0 replies; 9+ messages in thread
From: Joel Becker @ 2009-02-11  7:31 UTC (permalink / raw)
  To: ocfs2-devel

looks good.

On Tue, Feb 03, 2009 at 12:37:13PM -0800, Sunil Mushran wrote:
> Mainline commit d4f7e650e55af6b235871126f747da88600e8040 attempts to delay
> the dlm_thread from sending the drop ref message if the lockres is being
> migrated. The problem is that we make the dlm_thread wait for the migration
> to complete. This causes a deadlock as dlm_thread also participates in the
> lockres migration process.
> 
> A better fix for the original oss bugzilla#1012 is in testing.
> 
> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
> ---
>  fs/ocfs2/dlm/dlmthread.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c
> index d129520..4060bb3 100644
> --- a/fs/ocfs2/dlm/dlmthread.c
> +++ b/fs/ocfs2/dlm/dlmthread.c
> @@ -181,8 +181,7 @@ static int dlm_purge_lockres(struct dlm_ctxt *dlm,
>  
>  		spin_lock(&res->spinlock);
>  		/* This ensures that clear refmap is sent after the set */
> -		__dlm_wait_on_lockres_flags(res, (DLM_LOCK_RES_SETREF_INPROG |
> -						  DLM_LOCK_RES_MIGRATING));
> +		__dlm_wait_on_lockres_flags(res, DLM_LOCK_RES_SETREF_INPROG);
>  		spin_unlock(&res->spinlock);
>  
>  		/* clear our bit from the master's refmap, ignore errors */
> -- 
> 1.5.6.3
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel

-- 

"Behind every successful man there's a lot of unsuccessful years."
        - Bob Brown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH 2/4] ocfs2: Cleanup the lockname print in dlmglue.c
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: Cleanup the lockname print in dlmglue.c Sunil Mushran
@ 2009-02-11  7:31   ` Joel Becker
  0 siblings, 0 replies; 9+ messages in thread
From: Joel Becker @ 2009-02-11  7:31 UTC (permalink / raw)
  To: ocfs2-devel

sobby

On Tue, Feb 03, 2009 at 12:37:14PM -0800, Sunil Mushran wrote:
> The dentry lock has a different format than other locks. This patch fixes
> ocfs2_log_dlm_error() macro to make it print the dentry lock correctly.
> 
> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
> ---
>  fs/ocfs2/dlmglue.c |   11 ++++++++---
>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 206a237..7219a86 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -320,9 +320,14 @@ static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb,
>  					struct ocfs2_lock_res *lockres);
>  static inline void ocfs2_recover_from_dlm_error(struct ocfs2_lock_res *lockres,
>  						int convert);
> -#define ocfs2_log_dlm_error(_func, _err, _lockres) do {			\
> -	mlog(ML_ERROR, "DLM error %d while calling %s on resource %s\n", \
> -	     _err, _func, _lockres->l_name);				\
> +#define ocfs2_log_dlm_error(_func, _err, _lockres) do {					\
> +	if ((_lockres)->l_type != OCFS2_LOCK_TYPE_DENTRY)				\
> +		mlog(ML_ERROR, "DLM error %d while calling %s on resource %s\n",	\
> +		     _err, _func, _lockres->l_name);					\
> +	else										\
> +		mlog(ML_ERROR, "DLM error %d while calling %s on resource %.*s%08x\n",	\
> +		     _err, _func, OCFS2_DENTRY_LOCK_INO_START - 1, (_lockres)->l_name,	\
> +		     (unsigned int)ocfs2_get_dentry_lock_ino(_lockres));		\
>  } while (0)
>  static int ocfs2_downconvert_thread(void *arg);
>  static void ocfs2_downconvert_on_unlock(struct ocfs2_super *osb,
> -- 
> 1.5.6.3
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel

-- 

"There is no sincerer love than the love of food."  
         - George Bernard Shaw 

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH 3/4] ocfs2/dlm: Use ast_lock to protect ast_list
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 3/4] ocfs2/dlm: Use ast_lock to protect ast_list Sunil Mushran
@ 2009-02-11  7:31   ` Joel Becker
  0 siblings, 0 replies; 9+ messages in thread
From: Joel Becker @ 2009-02-11  7:31 UTC (permalink / raw)
  To: ocfs2-devel

whoops.  sob.

On Tue, Feb 03, 2009 at 12:37:15PM -0800, Sunil Mushran wrote:
> The code was using dlm->spinlock instead of dlm->ast_lock to protect the
> ast_list. This patch fixes the issue.
> 
> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
> ---
>  fs/ocfs2/dlm/dlmunlock.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c
> index 86ca085..fcf879e 100644
> --- a/fs/ocfs2/dlm/dlmunlock.c
> +++ b/fs/ocfs2/dlm/dlmunlock.c
> @@ -117,11 +117,11 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>  	else
>  		BUG_ON(res->owner == dlm->node_num);
>  
> -	spin_lock(&dlm->spinlock);
> +	spin_lock(&dlm->ast_lock);
>  	/* We want to be sure that we're not freeing a lock
>  	 * that still has AST's pending... */
>  	in_use = !list_empty(&lock->ast_list);
> -	spin_unlock(&dlm->spinlock);
> +	spin_unlock(&dlm->ast_lock);
>  	if (in_use) {
>  	       mlog(ML_ERROR, "lockres %.*s: Someone is calling dlmunlock "
>  		    "while waiting for an ast!", res->lockname.len,
> -- 
> 1.5.6.3
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel

-- 

"Well-timed silence hath more eloquence than speech."  
         - Martin Fraquhar Tupper

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter
  2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter Sunil Mushran
@ 2009-02-11  7:35   ` Joel Becker
  2009-02-12  0:19     ` Sunil Mushran
  0 siblings, 1 reply; 9+ messages in thread
From: Joel Becker @ 2009-02-11  7:35 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Feb 03, 2009 at 12:37:16PM -0800, Sunil Mushran wrote:
> In dlm_assert_master_handler(), if we get an incorrect assert master from a node
> that, we reply with EINVAL asking the asserter to die. The problem is that an
> assert is sent after so many hoops, it is invariably the node that thinks the
> asserter is wrong, is actually wrong. So instead of killing the asserter, this
> patch kills the assertee.

	You mean that the node asserting mastery is probably correct,
and the node that sees a disconnect between mastery information and the
asserter is confused?

> This patch papers over a race that is still being addressed.
> 
> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
> ---
>  fs/ocfs2/dlm/dlmmaster.c |   12 ++++++------
>  1 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
> index 54e182a..0a28139 100644
> --- a/fs/ocfs2/dlm/dlmmaster.c
> +++ b/fs/ocfs2/dlm/dlmmaster.c
> @@ -1849,12 +1849,12 @@ int dlm_assert_master_handler(struct o2net_msg *msg, u32 len, void *data,
>  		if (!mle) {
>  			if (res->owner != DLM_LOCK_RES_OWNER_UNKNOWN &&
>  			    res->owner != assert->node_idx) {
> -				mlog(ML_ERROR, "assert_master from "
> -					  "%u, but current owner is "
> -					  "%u! (%.*s)\n",
> -				       assert->node_idx, res->owner,
> -				       namelen, name);
> -				goto kill;
> +				mlog(ML_ERROR, "DIE! Mastery assert from %u, "
> +				     "but current owner is %u! (%.*s)\n",
> +				     assert->node_idx, res->owner, namelen,
> +				     name);
> +				__dlm_print_one_lock_resource(res);
> +				BUG();

	BUG() isn't much of a die.  Are you figuring soft lockup code
will eventually kill this?

Joel

-- 

"And yet I fight,
 And yet I fight this battle all alone.
 No one to cry to;
 No place to call home."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter
  2009-02-11  7:35   ` Joel Becker
@ 2009-02-12  0:19     ` Sunil Mushran
  0 siblings, 0 replies; 9+ messages in thread
From: Sunil Mushran @ 2009-02-12  0:19 UTC (permalink / raw)
  To: ocfs2-devel

Joel Becker wrote:
> 	You mean that the node asserting mastery is probably correct,
> and the node that sees a disconnect between mastery information and the
> asserter is confused?

Yes. It's empirical, ofcourse. This is not easy to trigger. Though
now I have a testcase + env in which I can trigger it more commonly.

The real issue is still there. I am working on it. The reason I want
to push this now is because there are cases when the assertee
kills off multiple asserting nodes.


> 	BUG() isn't much of a die.  Are you figuring soft lockup code
> will eventually kill this?

I've never heard of a BUG() that did not kill a node. Atleast on x86.
Are you sure about this?

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2009-02-12  0:19 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-03 20:37 [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate Sunil Mushran
2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 2/4] ocfs2: Cleanup the lockname print in dlmglue.c Sunil Mushran
2009-02-11  7:31   ` Joel Becker
2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 3/4] ocfs2/dlm: Use ast_lock to protect ast_list Sunil Mushran
2009-02-11  7:31   ` Joel Becker
2009-02-03 20:37 ` [Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter Sunil Mushran
2009-02-11  7:35   ` Joel Becker
2009-02-12  0:19     ` Sunil Mushran
2009-02-11  7:31 ` [Ocfs2-devel] [PATCH 1/4] ocfs2/dlm: Retract fix for race between purge and migrate Joel Becker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.