* [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
@ 2011-07-06 22:49 greearb
  2011-07-06 23:45   ` Trond Myklebust
  0 siblings, 1 reply; 20+ messages in thread
From: greearb @ 2011-07-06 22:49 UTC (permalink / raw)
  To: linux-nfs, linux-kernel; +Cc: Ben Greear

From: Ben Greear <greearb@candelatech.com>

The rpc_killall_tasks logic is not locked against
the work-queue thread, but it still directly modifies
function pointers and data in the task objects.

This patch changes the killall-tasks logic to set a flag
that tells the work-queue thread to terminate the task
instead of directly calling the terminate logic.

Signed-off-by: Ben Greear <greearb@candelatech.com>
---

NOTE:  This needs review, as I am still struggling to understand
the rpc code, and it's quite possible this patch either doesn't
fully fix the problem or actually causes other issues.  That said,
my nfs stress test seems to run a bit more stable with this patch applied.

:100644 100644 fe2d8e6... b238944... M	include/linux/sunrpc/sched.h
:100644 100644 8c91415... 6851f84... M	net/sunrpc/clnt.c
:100644 100644 1cbbed5... 0fc559e... M	net/sunrpc/sched.c
 include/linux/sunrpc/sched.h |   10 ++++++++++
 net/sunrpc/clnt.c            |    3 +--
 net/sunrpc/sched.c           |    6 ++++++
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index fe2d8e6..b238944 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -76,6 +76,7 @@ struct rpc_task {
 
 	pid_t			tk_owner;	/* Process id for batching tasks */
 	int			tk_status;	/* result of last operation */
+	int			tk_killme_errno;/* For RPC_TASK_KILLME */
 	unsigned short		tk_flags;	/* misc flags */
 	unsigned short		tk_timeouts;	/* maj timeouts */
 
@@ -130,6 +131,7 @@ struct rpc_task_setup {
 #define RPC_TASK_SOFTCONN	0x0400		/* Fail if can't connect */
 #define RPC_TASK_SENT		0x0800		/* message was sent */
 #define RPC_TASK_TIMEOUT	0x1000		/* fail with ETIMEDOUT on timeout */
+#define RPC_TASK_KILLME		0x2000		/* Need to die ASAP. */
 
 #define RPC_IS_ASYNC(t)		((t)->tk_flags & RPC_TASK_ASYNC)
 #define RPC_IS_SWAPPER(t)	((t)->tk_flags & RPC_TASK_SWAPPER)
@@ -138,6 +140,7 @@ struct rpc_task_setup {
 #define RPC_IS_SOFT(t)		((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
 #define RPC_IS_SOFTCONN(t)	((t)->tk_flags & RPC_TASK_SOFTCONN)
 #define RPC_WAS_SENT(t)		((t)->tk_flags & RPC_TASK_SENT)
+#define RPC_SHOULD_KILLME(t)	((t)->tk_flags & RPC_TASK_KILLME)
 
 #define RPC_TASK_RUNNING	0
 #define RPC_TASK_QUEUED		1
@@ -269,4 +272,11 @@ static inline const char * rpc_qname(struct rpc_wait_queue *q)
 }
 #endif
 
+static inline void rpc_task_killme(struct rpc_task *task, int exit_errno)
+{
+	task->tk_killme_errno = exit_errno;
+	task->tk_flags |= RPC_TASK_KILLME;
+}
+
+
 #endif /* _LINUX_SUNRPC_SCHED_H_ */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 8c91415..6851f84 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -437,8 +437,7 @@ void rpc_killall_tasks(struct rpc_clnt *clnt)
 		if (!RPC_IS_ACTIVATED(rovr))
 			continue;
 		if (!(rovr->tk_flags & RPC_TASK_KILLED)) {
-			rovr->tk_flags |= RPC_TASK_KILLED;
-			rpc_exit(rovr, -EIO);
+			rpc_task_killme(rovr, -EIO);
 			if (RPC_IS_QUEUED(rovr))
 				rpc_wake_up_queued_task(rovr->tk_waitqueue,
 							rovr);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 1cbbed5..0fc559e 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -646,6 +646,12 @@ static void __rpc_execute(struct rpc_task *task)
 			task->tk_action(task);
 		}
 
+		/* If we should die, do it now. */
+		if (RPC_SHOULD_KILLME(task)) {
+			task->tk_flags |= RPC_TASK_KILLED;
+			rpc_exit(task, task->tk_killme_errno);
+		}
+
 		/*
 		 * Lockless check for whether task is sleeping or not.
 		 */
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-06 22:49 [RFC] sunrpc: Fix race between work-queue and rpc_killall_tasks greearb
@ 2011-07-06 23:45   ` Trond Myklebust
  0 siblings, 0 replies; 20+ messages in thread
From: Trond Myklebust @ 2011-07-06 23:45 UTC (permalink / raw)
  To: greearb; +Cc: linux-nfs, linux-kernel

On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote: 
> From: Ben Greear <greearb@candelatech.com>
> 
> The rpc_killall_tasks logic is not locked against
> the work-queue thread, but it still directly modifies
> function pointers and data in the task objects.
> 
> This patch changes the killall-tasks logic to set a flag
> that tells the work-queue thread to terminate the task
> instead of directly calling the terminate logic.
> 
> Signed-off-by: Ben Greear <greearb@candelatech.com>
> ---
> 
> NOTE:  This needs review, as I am still struggling to understand
> the rpc code, and it's quite possible this patch either doesn't
> fully fix the problem or actually causes other issues.  That said,
> my nfs stress test seems to run a bit more stable with this patch applied.

Yes, but I don't see why you are adding a new flag, nor do I see why we
want to keep checking for that flag in the rpc_execute() loop.
rpc_killall_tasks() is not a frequent operation that we want to optimise
for.

How about the following instead?

8<---------------------------------------------------------------------------------- 
From ecb7244b661c3f9d2008ef6048733e5cea2f98ab Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Wed, 6 Jul 2011 19:44:52 -0400
Subject: [PATCH] SUNRPC: Fix a race between work-queue and rpc_killall_tasks

Since rpc_killall_tasks may modify the rpc_task's tk_action field
without any locking, we need to be careful when dereferencing it.

Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 net/sunrpc/sched.c |   27 +++++++++++----------------
 1 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index a27406b..4814e24 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -616,30 +616,25 @@ static void __rpc_execute(struct rpc_task *task)
 	BUG_ON(RPC_IS_QUEUED(task));
 
 	for (;;) {
+		void (*do_action)(struct rpc_task *);
 
 		/*
-		 * Execute any pending callback.
+		 * Execute any pending callback first.
 		 */
-		if (task->tk_callback) {
-			void (*save_callback)(struct rpc_task *);
-
-			/*
-			 * We set tk_callback to NULL before calling it,
-			 * in case it sets the tk_callback field itself:
-			 */
-			save_callback = task->tk_callback;
-			task->tk_callback = NULL;
-			save_callback(task);
-		} else {
+		do_action = task->tk_callback;
+		task->tk_callback = NULL;
+		if (do_action == NULL) {
 			/*
 			 * Perform the next FSM step.
-			 * tk_action may be NULL when the task has been killed
-			 * by someone else.
+			 * tk_action may be NULL if the task has been killed.
+			 * In particular, note that rpc_killall_tasks may
+			 * do this at any time, so beware when dereferencing.
 			 */
-			if (task->tk_action == NULL)
+			do_action = task->tk_action;
+			if (do_action == NULL)
 				break;
-			task->tk_action(task);
 		}
+		do_action(task);
 
 		/*
 		 * Lockless check for whether task is sleeping or not.
-- 
1.7.6


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-06 23:45   ` Trond Myklebust
  (?)
@ 2011-07-07  0:07   ` Ben Greear
  2011-07-07  0:17     ` Trond Myklebust
  -1 siblings, 1 reply; 20+ messages in thread
From: Ben Greear @ 2011-07-07  0:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, linux-kernel

On 07/06/2011 04:45 PM, Trond Myklebust wrote:
> On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
>> From: Ben Greear<greearb@candelatech.com>
>>
>> The rpc_killall_tasks logic is not locked against
>> the work-queue thread, but it still directly modifies
>> function pointers and data in the task objects.
>>
>> This patch changes the killall-tasks logic to set a flag
>> that tells the work-queue thread to terminate the task
>> instead of directly calling the terminate logic.
>>
>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>> ---
>>
>> NOTE:  This needs review, as I am still struggling to understand
>> the rpc code, and it's quite possible this patch either doesn't
>> fully fix the problem or actually causes other issues.  That said,
>> my nfs stress test seems to run a bit more stable with this patch applied.
>
> Yes, but I don't see why you are adding a new flag, nor do I see why we
> want to keep checking for that flag in the rpc_execute() loop.
> rpc_killall_tasks() is not a frequent operation that we want to optimise
> for.

I was hoping that if the killall logic never set anything that was also
set by the work-queue thread it would be lock-safe without needing
explicit locking.

I was a bit concerned that my flags |= KILLME logic would potentially
over-write flags that were being simultaneously written elsewhere
(so maybe I'd have to add a completely new variable for that KILLME flag
to really be safe.)
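
Spelled out, that concern is the usual lost-update problem with a non-atomic
read-modify-write on a shared word; a hypothetical interleaving, where
RPC_TASK_SENT just stands in for any other writer to tk_flags and rovr/task
are the same rpc_task seen from the two threads:

/*
 *   rpc_killall_tasks()                 work-queue thread
 *   -------------------                 -----------------
 *   tmp = rovr->tk_flags;
 *                                       tmp2 = task->tk_flags;
 *                                       tmp2 |= RPC_TASK_SENT;
 *                                       task->tk_flags = tmp2;
 *   tmp |= RPC_TASK_KILLME;
 *   rovr->tk_flags = tmp;               <-- the RPC_TASK_SENT update is lost
 */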

>
> How about the following instead?

I think it still races..more comments below.

>
> 8<----------------------------------------------------------------------------------
>  From ecb7244b661c3f9d2008ef6048733e5cea2f98ab Mon Sep 17 00:00:00 2001
> From: Trond Myklebust<Trond.Myklebust@netapp.com>
> Date: Wed, 6 Jul 2011 19:44:52 -0400
> Subject: [PATCH] SUNRPC: Fix a race between work-queue and rpc_killall_tasks
>
> Since rpc_killall_tasks may modify the rpc_task's tk_action field
> without any locking, we need to be careful when dereferencing it.

> +		do_action = task->tk_callback;
> +		task->tk_callback = NULL;
> +		if (do_action == NULL) {

I think the race still exists, though it would be harder to hit.
What if the killall logic sets task->tk_callback right after you assign do_action, but before
you set tk_callback to NULL?  Or after you set tk_callback to NULL for
that matter.
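
For what it's worth, one way to close that particular window would be to
take-and-clear the pointer atomically; a rough sketch only, not what either
patch actually does:

		void (*do_action)(struct rpc_task *);

		/* Atomically snapshot and clear tk_callback, so a concurrent
		 * writer cannot slip a value in between the read and the
		 * clear.  (Sketch; assumes a pointer-sized xchg().) */
		do_action = xchg(&task->tk_callback, NULL);
		if (do_action == NULL) {
			do_action = task->tk_action;
			if (do_action == NULL)
				break;
		}
		do_action(task);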

>   			/*
>   			 * Perform the next FSM step.
> -			 * tk_action may be NULL when the task has been killed
> -			 * by someone else.
> +			 * tk_action may be NULL if the task has been killed.
> +			 * In particular, note that rpc_killall_tasks may
> +			 * do this at any time, so beware when dereferencing.
>   			 */
> -			if (task->tk_action == NULL)
> +			do_action = task->tk_action;
> +			if (do_action == NULL)
>   				break;
> -			task->tk_action(task);
>   		}
> +		do_action(task);
>
>   		/*
>   		 * Lockless check for whether task is sleeping or not.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-07  0:07   ` Ben Greear
@ 2011-07-07  0:17     ` Trond Myklebust
  2011-07-07  0:35       ` Ben Greear
  0 siblings, 1 reply; 20+ messages in thread
From: Trond Myklebust @ 2011-07-07  0:17 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-nfs, linux-kernel

On Wed, 2011-07-06 at 17:07 -0700, Ben Greear wrote: 
> On 07/06/2011 04:45 PM, Trond Myklebust wrote:
> > On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
> >> From: Ben Greear<greearb@candelatech.com>
> >>
> >> The rpc_killall_tasks logic is not locked against
> >> the work-queue thread, but it still directly modifies
> >> function pointers and data in the task objects.
> >>
> >> This patch changes the killall-tasks logic to set a flag
> >> that tells the work-queue thread to terminate the task
> >> instead of directly calling the terminate logic.
> >>
> >> Signed-off-by: Ben Greear<greearb@candelatech.com>
> >> ---
> >>
> >> NOTE:  This needs review, as I am still struggling to understand
> >> the rpc code, and it's quite possible this patch either doesn't
> >> fully fix the problem or actually causes other issues.  That said,
> >> my nfs stress test seems to run a bit more stable with this patch applied.
> >
> > Yes, but I don't see why you are adding a new flag, nor do I see why we
> > want to keep checking for that flag in the rpc_execute() loop.
> > rpc_killall_tasks() is not a frequent operation that we want to optimise
> > for.
> 
> I was hoping that if the killall logic never set anything that was also
> set by the work-queue thread it would be lock-safe without needing
> explicit locking.
> 
> I was a bit concerned that my flags |= KILLME logic would potentially
> over-write flags that were being simultaneously written elsewhere
> (so maybe I'd have to add a completely new variable for that KILLME flag
> to really be safe.)
> 
> >
> > How about the following instead?
> 
> I think it still races..more comments below.
> 
> >
> > 8<----------------------------------------------------------------------------------
> >  From ecb7244b661c3f9d2008ef6048733e5cea2f98ab Mon Sep 17 00:00:00 2001
> > From: Trond Myklebust<Trond.Myklebust@netapp.com>
> > Date: Wed, 6 Jul 2011 19:44:52 -0400
> > Subject: [PATCH] SUNRPC: Fix a race between work-queue and rpc_killall_tasks
> >
> > Since rpc_killall_tasks may modify the rpc_task's tk_action field
> > without any locking, we need to be careful when dereferencing it.
> 
> > +		do_action = task->tk_callback;
> > +		task->tk_callback = NULL;
> > +		if (do_action == NULL) {
> 
> I think the race still exists, though it would be harder to hit.
> What if the killall logic sets task->tk_callback right after you assign do_action, but before
> you set tk_callback to NULL?  Or after you set tk_callback to NULL for
> that matter.

What if it does? The rpc call will continue to execute until it
completes.

rpc_killall_tasks is really only useful for signalling to tasks that are
hanging on a completely unresponsive server that we want them to stop.
The only case where we really care is in rpc_shutdown_client(), where we
sleep and loop anyway.

IOW: I really don't care about 'fixing' rpc_killall_tasks to perfection.
All I care about is that it doesn't Oops.
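
(The sleep-and-loop referred to has roughly the following shape; this is an
approximation for context, the identifiers may not match the source exactly:)

void rpc_shutdown_client(struct rpc_clnt *clnt)
{
	/* keep killing and sleeping until every task on this client
	 * has drained, then drop the client */
	while (!list_empty(&clnt->cl_tasks)) {
		rpc_killall_tasks(clnt);
		wait_event_timeout(destroy_wait,
				   list_empty(&clnt->cl_tasks), 1*HZ);
	}
	rpc_release_client(clnt);
}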

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-07  0:17     ` Trond Myklebust
@ 2011-07-07  0:35       ` Ben Greear
  0 siblings, 0 replies; 20+ messages in thread
From: Ben Greear @ 2011-07-07  0:35 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, linux-kernel

On 07/06/2011 05:17 PM, Trond Myklebust wrote:
> On Wed, 2011-07-06 at 17:07 -0700, Ben Greear wrote:
>> On 07/06/2011 04:45 PM, Trond Myklebust wrote:
>>> On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
>>>> From: Ben Greear<greearb@candelatech.com>
>>>>
>>>> The rpc_killall_tasks logic is not locked against
>>>> the work-queue thread, but it still directly modifies
>>>> function pointers and data in the task objects.
>>>>
>>>> This patch changes the killall-tasks logic to set a flag
>>>> that tells the work-queue thread to terminate the task
>>>> instead of directly calling the terminate logic.
>>>>
>>>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>>>> ---
>>>>
>>>> NOTE:  This needs review, as I am still struggling to understand
>>>> the rpc code, and it's quite possible this patch either doesn't
>>>> fully fix the problem or actually causes other issues.  That said,
>>>> my nfs stress test seems to run a bit more stable with this patch applied.
>>>
>>> Yes, but I don't see why you are adding a new flag, nor do I see why we
>>> want to keep checking for that flag in the rpc_execute() loop.
>>> rpc_killall_tasks() is not a frequent operation that we want to optimise
>>> for.
>>
>> I was hoping that if the killall logic never set anything that was also
>> set by the work-queue thread it would be lock-safe without needing
>> explicit locking.
>>
>> I was a bit concerned that my flags |= KILLME logic would potentially
>> over-write flags that were being simultaneously written elsewhere
>> (so maybe I'd have to add a completely new variable for that KILLME flag
>> to really be safe.)
>>
>>>
>>> How about the following instead?
>>
>> I think it still races..more comments below.
>>
>>>
>>> 8<----------------------------------------------------------------------------------
>>>   From ecb7244b661c3f9d2008ef6048733e5cea2f98ab Mon Sep 17 00:00:00 2001
>>> From: Trond Myklebust<Trond.Myklebust@netapp.com>
>>> Date: Wed, 6 Jul 2011 19:44:52 -0400
>>> Subject: [PATCH] SUNRPC: Fix a race between work-queue and rpc_killall_tasks
>>>
>>> Since rpc_killall_tasks may modify the rpc_task's tk_action field
>>> without any locking, we need to be careful when dereferencing it.
>>
>>> +		do_action = task->tk_callback;
>>> +		task->tk_callback = NULL;
>>> +		if (do_action == NULL) {
>>
>> I think the race still exists, though it would be harder to hit.
>> What if the killall logic sets task->tk_callback right after you assign do_action, but before
>> you set tk_callback to NULL?  Or after you set tk_callback to NULL for
>> that matter.
>
> What if it does? The rpc call will continue to execute until it
> completes.
>
> rpc_killall_tasks is really only useful for signalling to tasks that are
> hanging on a completely unresponsive server that we want them to stop.
> The only case where we really care is in rpc_shutdown_client(), where we
> sleep and loop anyway.
>
> IOW: I really don't care about 'fixing' rpc_killall_tasks to perfection.
> All I care about is that it doesn't Oops.

That is my concern as well.  I'll try your patch and see if it fixes
the crashes I'm seeing in this area.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-06 23:45   ` Trond Myklebust
  (?)
  (?)
@ 2011-07-07 20:38   ` Ben Greear
  2011-07-08 15:03     ` Ben Greear
  -1 siblings, 1 reply; 20+ messages in thread
From: Ben Greear @ 2011-07-07 20:38 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, linux-kernel

On 07/06/2011 04:45 PM, Trond Myklebust wrote:
> On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
>> From: Ben Greear<greearb@candelatech.com>
>>
>> The rpc_killall_tasks logic is not locked against
>> the work-queue thread, but it still directly modifies
>> function pointers and data in the task objects.
>>
>> This patch changes the killall-tasks logic to set a flag
>> that tells the work-queue thread to terminate the task
>> instead of directly calling the terminate logic.
>>
>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>> ---
>>
>> NOTE:  This needs review, as I am still struggling to understand
>> the rpc code, and it's quite possible this patch either doesn't
>> fully fix the problem or actually causes other issues.  That said,
>> my nfs stress test seems to run a bit more stable with this patch applied.
>
> Yes, but I don't see why you are adding a new flag, nor do I see why we
> want to keep checking for that flag in the rpc_execute() loop.
> rpc_killall_tasks() is not a frequent operation that we want to optimise
> for.
>
> How about the following instead?
>
> 8<----------------------------------------------------------------------------------
>  From ecb7244b661c3f9d2008ef6048733e5cea2f98ab Mon Sep 17 00:00:00 2001
> From: Trond Myklebust<Trond.Myklebust@netapp.com>
> Date: Wed, 6 Jul 2011 19:44:52 -0400
> Subject: [PATCH] SUNRPC: Fix a race between work-queue and rpc_killall_tasks
>
> Since rpc_killall_tasks may modify the rpc_task's tk_action field
> without any locking, we need to be careful when dereferencing it.
>
> Reported-by: Ben Greear<greearb@candelatech.com>
> Signed-off-by: Trond Myklebust<Trond.Myklebust@netapp.com>

I've been testing this for 4+ hours, and it seems to fix the problem.  We'll
continue to burn on it for a day or two just in case we're
getting (un)lucky in our testing.

Thanks,
Ben

> ---
>   net/sunrpc/sched.c |   27 +++++++++++----------------
>   1 files changed, 11 insertions(+), 16 deletions(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index a27406b..4814e24 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -616,30 +616,25 @@ static void __rpc_execute(struct rpc_task *task)
>   	BUG_ON(RPC_IS_QUEUED(task));
>
>   	for (;;) {
> +		void (*do_action)(struct rpc_task *);
>
>   		/*
> -		 * Execute any pending callback.
> +		 * Execute any pending callback first.
>   		 */
> -		if (task->tk_callback) {
> -			void (*save_callback)(struct rpc_task *);
> -
> -			/*
> -			 * We set tk_callback to NULL before calling it,
> -			 * in case it sets the tk_callback field itself:
> -			 */
> -			save_callback = task->tk_callback;
> -			task->tk_callback = NULL;
> -			save_callback(task);
> -		} else {
> +		do_action = task->tk_callback;
> +		task->tk_callback = NULL;
> +		if (do_action == NULL) {
>   			/*
>   			 * Perform the next FSM step.
> -			 * tk_action may be NULL when the task has been killed
> -			 * by someone else.
> +			 * tk_action may be NULL if the task has been killed.
> +			 * In particular, note that rpc_killall_tasks may
> +			 * do this at any time, so beware when dereferencing.
>   			 */
> -			if (task->tk_action == NULL)
> +			do_action = task->tk_action;
> +			if (do_action == NULL)
>   				break;
> -			task->tk_action(task);
>   		}
> +		do_action(task);
>
>   		/*
>   		 * Lockless check for whether task is sleeping or not.


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-07 20:38   ` Ben Greear
@ 2011-07-08 15:03     ` Ben Greear
  0 siblings, 0 replies; 20+ messages in thread
From: Ben Greear @ 2011-07-08 15:03 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, linux-kernel

On 07/07/2011 01:38 PM, Ben Greear wrote:
> On 07/06/2011 04:45 PM, Trond Myklebust wrote:
>> On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
>>> From: Ben Greear<greearb@candelatech.com>
>>>
>>> The rpc_killall_tasks logic is not locked against
>>> the work-queue thread, but it still directly modifies
>>> function pointers and data in the task objects.
>>>
>>> This patch changes the killall-tasks logic to set a flag
>>> that tells the work-queue thread to terminate the task
>>> instead of directly calling the terminate logic.
>>>
>>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>>> ---
>>>
>>> NOTE: This needs review, as I am still struggling to understand
>>> the rpc code, and it's quite possible this patch either doesn't
>>> fully fix the problem or actually causes other issues. That said,
>>> my nfs stress test seems to run a bit more stable with this patch applied.
>>
>> Yes, but I don't see why you are adding a new flag, nor do I see why we
>> want to keep checking for that flag in the rpc_execute() loop.
>> rpc_killall_tasks() is not a frequent operation that we want to optimise
>> for.
>>
>> How about the following instead?
>>
>> 8<----------------------------------------------------------------------------------
>> From ecb7244b661c3f9d2008ef6048733e5cea2f98ab Mon Sep 17 00:00:00 2001
>> From: Trond Myklebust<Trond.Myklebust@netapp.com>
>> Date: Wed, 6 Jul 2011 19:44:52 -0400
>> Subject: [PATCH] SUNRPC: Fix a race between work-queue and rpc_killall_tasks
>>
>> Since rpc_killall_tasks may modify the rpc_task's tk_action field
>> without any locking, we need to be careful when dereferencing it.
>>
>> Reported-by: Ben Greear<greearb@candelatech.com>
>> Signed-off-by: Trond Myklebust<Trond.Myklebust@netapp.com>
>
> I've been testing this for 4+ hours, and it seems to fix the problem. We'll
> continue to burn on it for a day or two just in case we're
> getting (un)lucky in our testing.
>
> Thanks,
> Ben

Well, we still hit the bug in overnight testing.  Maybe there are other races...

=============================================================================
BUG kmalloc-64: Object is on free-list
-----------------------------------------------------------------------------

INFO: Allocated in rpcb_getport_async+0x39c/0x5a5 [sunrpc] age=52 cpu=6 pid=18908
         __slab_alloc+0x348/0x3ba
         kmem_cache_alloc_trace+0x67/0xe7
         rpcb_getport_async+0x39c/0x5a5 [sunrpc]
         call_bind+0x70/0x75 [sunrpc]
         __rpc_execute+0x80/0x253 [sunrpc]
         rpc_execute+0x3d/0x42 [sunrpc]
         rpc_run_task+0x79/0x81 [sunrpc]
         rpc_call_sync+0x3f/0x60 [sunrpc]
         rpc_ping+0x42/0x58 [sunrpc]
         rpc_create+0x4aa/0x527 [sunrpc]
         nfs_create_rpc_client+0xb1/0xf6 [nfs]
         nfs_init_client+0x3b/0x7d [nfs]
         nfs_get_client+0x453/0x5ab [nfs]
         nfs_create_server+0x10b/0x437 [nfs]
         nfs_fs_mount+0x4ca/0x708 [nfs]
         mount_fs+0x6b/0x152
INFO: Freed in rpcb_map_release+0x3f/0x44 [sunrpc] age=119 cpu=5 pid=13934
         __slab_free+0x57/0x150
         kfree+0x107/0x13a
         rpcb_map_release+0x3f/0x44 [sunrpc]
         rpc_release_calldata+0x12/0x14 [sunrpc]
         rpc_free_task+0x72/0x7a [sunrpc]
         rpc_final_put_task+0x82/0x8a [sunrpc]
         __rpc_execute+0x244/0x253 [sunrpc]
         rpc_async_schedule+0x10/0x12 [sunrpc]
         process_one_work+0x230/0x41d
         worker_thread+0x133/0x217
         kthread+0x7d/0x85
         kernel_thread_helper+0x4/0x10
INFO: Slab 0xffffea0002c38b90 objects=20 used=13 fp=0xffff8800ca27e620 flags=0x20000000004081
INFO: Object 0xffff8800ca27e620 @offset=1568 fp=0xffff8800ca27f880
Bytes b4 0xffff8800ca27e610:  d0 c4 1a 02 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ÐÄ......ZZZZZZZZ
   Object 0xffff8800ca27e620:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
   Object 0xffff8800ca27e630:  6b 6b 6b 6b 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkk..kkkkkkkkkk
   Object 0xffff8800ca27e640:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
   Object 0xffff8800ca27e650:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk¥
  Redzone 0xffff8800ca27e660:  bb bb bb bb bb bb bb bb                         »»»»»»»»
  Padding 0xffff8800ca27e7a0:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ
Pid: 13035, comm: kworker/0:1 Not tainted 3.0.0-rc6+ #13
Call Trace:
  [<ffffffff81105907>] print_trailer+0x131/0x13a
  [<ffffffff81105945>] object_err+0x35/0x3e
  [<ffffffff811077b3>] verify_mem_not_deleted+0x7a/0xb7
  [<ffffffffa02891e5>] rpcb_getport_done+0x23/0x126 [sunrpc]
  [<ffffffffa02810df>] rpc_exit_task+0x3f/0x6d [sunrpc]
  [<ffffffffa02814d8>] __rpc_execute+0x80/0x253 [sunrpc]
  [<ffffffffa02816ed>] ? rpc_execute+0x42/0x42 [sunrpc]
  [<ffffffffa02816fd>] rpc_async_schedule+0x10/0x12 [sunrpc]
  [<ffffffff81061343>] process_one_work+0x230/0x41d
  [<ffffffff8106128e>] ? process_one_work+0x17b/0x41d
  [<ffffffff8106379f>] worker_thread+0x133/0x217
  [<ffffffff8106366c>] ? manage_workers+0x191/0x191
  [<ffffffff81066f9c>] kthread+0x7d/0x85
  [<ffffffff81485ee4>] kernel_thread_helper+0x4/0x10
  [<ffffffff8147f0d8>] ? retint_restore_args+0x13/0x13
  [<ffffffff81066f1f>] ? __init_kthread_worker+0x56/0x56
  [<ffffffff81485ee0>] ? gs_change+0x13/0x13
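
(The report above comes from SLUB debugging; it only shows up when poisoning
and allocation tracking are enabled, for example by booting with something
like the option below, flag letters as in the SLUB documentation of that era:)

	slub_debug=FZPU

F enables sanity checks, Z red zoning, P poisoning (the 0x6b bytes in the
object dump), and U alloc/free user tracking (the "Allocated in"/"Freed in"
traces).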

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-06 23:45   ` Trond Myklebust
                     ` (2 preceding siblings ...)
  (?)
@ 2011-07-08 17:18   ` Ben Greear
  2011-07-08 18:11       ` Myklebust, Trond
  -1 siblings, 1 reply; 20+ messages in thread
From: Ben Greear @ 2011-07-08 17:18 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, linux-kernel

On 07/06/2011 04:45 PM, Trond Myklebust wrote:
> On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
>> From: Ben Greear<greearb@candelatech.com>
>>
>> The rpc_killall_tasks logic is not locked against
>> the work-queue thread, but it still directly modifies
>> function pointers and data in the task objects.
>>
>> This patch changes the killall-tasks logic to set a flag
>> that tells the work-queue thread to terminate the task
>> instead of directly calling the terminate logic.
>>
>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>> ---
>>
>> NOTE:  This needs review, as I am still struggling to understand
>> the rpc code, and it's quite possible this patch either doesn't
>> fully fix the problem or actually causes other issues.  That said,
>> my nfs stress test seems to run a bit more stable with this patch applied.
>
> Yes, but I don't see why you are adding a new flag, nor do I see why we
> want to keep checking for that flag in the rpc_execute() loop.
> rpc_killall_tasks() is not a frequent operation that we want to optimise
> for.
>
> How about the following instead?

Ok, I looked at your patch closer.  I think it can still cause
bad race conditions.

For instance:

Assume that tk_callback is NULL at beginning of while loop in __rpc_execute,
and tk_action is rpc_exit_task.

While do_action(task) is being called, tk_action is set to NULL in rpc_exit_task.

But, right after tk_action is set to NULL in rpc_exit_task, the rpc_killall_tasks
method calls rpc_exit, which sets tk_action back to rpc_exit_task.

I believe this could cause the xprt_release(task) logic to be called in the
work-queue's execution of rpc_exit_task due to tk_action != NULL when
it should not be.
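
As a timeline, the interleaving I have in mind looks something like this
(hypothetical, based on my reading of rpc_exit_task and rpc_exit):

/*
 *   work-queue thread (__rpc_execute)       rpc_killall_tasks()
 *   ---------------------------------       -------------------
 *   do_action = task->tk_action;
 *       (== rpc_exit_task)
 *   do_action(task) -> rpc_exit_task():
 *       task->tk_action = NULL;
 *                                           rpc_exit(task, -EIO);
 *                                               tk_action = rpc_exit_task again
 *       tk_ops->rpc_call_done(task, calldata);
 *       sees tk_action != NULL -> xprt_release(task);
 *
 *   ...and the next loop iteration finds tk_action == rpc_exit_task and
 *   runs rpc_call_done() a second time.
 */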

I have no hard evidence this exact scenario is happening in my case, but I
believe the code is still racy with your patch.

For that matter, is it safe to modify the flags in rpc_killall_tasks:

rovr->tk_flags |= RPC_TASK_KILLED;

Is that guaranteed to be atomic with any other modification of flags?

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-08 17:18   ` Ben Greear
@ 2011-07-08 18:11       ` Myklebust, Trond
  0 siblings, 0 replies; 20+ messages in thread
From: Myklebust, Trond @ 2011-07-08 18:11 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-nfs, linux-kernel

> -----Original Message-----
> From: Ben Greear [mailto:greearb@candelatech.com]
> Sent: Friday, July 08, 2011 1:19 PM
> To: Myklebust, Trond
> Cc: linux-nfs@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] sunrpc: Fix race between work-queue and
> rpc_killall_tasks.
> 
> On 07/06/2011 04:45 PM, Trond Myklebust wrote:
> > On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
> >> From: Ben Greear<greearb@candelatech.com>
> >>
> >> The rpc_killall_tasks logic is not locked against
> >> the work-queue thread, but it still directly modifies
> >> function pointers and data in the task objects.
> >>
> >> This patch changes the killall-tasks logic to set a flag
> >> that tells the work-queue thread to terminate the task
> >> instead of directly calling the terminate logic.
> >>
> >> Signed-off-by: Ben Greear<greearb@candelatech.com>
> >> ---
> >>
> >> NOTE:  This needs review, as I am still struggling to understand
> >> the rpc code, and it's quite possible this patch either doesn't
> >> fully fix the problem or actually causes other issues.  That said,
> >> my nfs stress test seems to run a bit more stable with this patch
> applied.
> >
> > Yes, but I don't see why you are adding a new flag, nor do I see why
> we
> > want to keep checking for that flag in the rpc_execute() loop.
> > rpc_killall_tasks() is not a frequent operation that we want to
> optimise
> > for.
> >
> > How about the following instead?
> 
> Ok, I looked at your patch closer.  I think it can still cause
> bad race conditions.
> 
> For instance:
> 
> Assume that tk_callback is NULL at beginning of while loop in
> __rpc_execute,
> and tk_action is rpc_exit_task.
> 
> While do_action(task) is being called, tk_action is set to NULL in
> rpc_exit_task.
> 
> But, right after tk_action is set to NULL in rpc_exit_task, the
> rpc_killall_tasks
> method calls rpc_exit, which sets tk_action back to rpc_exit_task.
> 
> I believe this could cause the xprt_release(task) logic to be called in
> the
> work-queue's execution of rpc_exit_task due to tk_action != NULL when
> it should not be.

Why would this be a problem? xprt_release() can certainly be called multiple times on an rpc_task. Ditto rpcb_getport_done.

The only thing that is not re-entrant there is rpcb_map_release, which should only ever be called once whether or not something calls rpc_killall_tasks.

 
> I have no hard evidence this exact scenario is happening in my case,
> but I
> believe the code is still racy with your patch.
> 
> For that matter, is it safe to modify the flags in rpc_killall_tasks:
> 
> rovr->tk_flags |= RPC_TASK_KILLED;
> 
> Is that guaranteed to be atomic with any other modification of flags?

Task->tk_flags should never change after the rpc_task is set up. The only allowed change is the RPC_TASK_KILLED. We could convert that into an atomic bit in task->tk_runstate, but again, this isn't something that is likely to be responsible for the problem you are seeing.
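
A sketch of what that conversion might look like (the bit name and number are
made up for illustration; only RPC_TASK_RUNNING/QUEUED/ACTIVE exist today):

/* hypothetical new bit in task->tk_runstate */
#define RPC_TASK_KILLED_BIT	3

/* in rpc_killall_tasks(): atomic RMW, so no concurrent update of
 * tk_runstate can be lost */
set_bit(RPC_TASK_KILLED_BIT, &rovr->tk_runstate);

/* at the consumer: */
if (test_bit(RPC_TASK_KILLED_BIT, &task->tk_runstate))
	rpc_exit(task, -EIO);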


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-08 18:11       ` Myklebust, Trond
  (?)
@ 2011-07-08 22:03       ` Ben Greear
  2011-07-08 22:14           ` Myklebust, Trond
  -1 siblings, 1 reply; 20+ messages in thread
From: Ben Greear @ 2011-07-08 22:03 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: linux-nfs, linux-kernel

On 07/08/2011 11:11 AM, Myklebust, Trond wrote:
>> -----Original Message-----
>> From: Ben Greear [mailto:greearb@candelatech.com]
>> Sent: Friday, July 08, 2011 1:19 PM
>> To: Myklebust, Trond
>> Cc: linux-nfs@vger.kernel.org; linux-kernel@vger.kernel.org
>> Subject: Re: [RFC] sunrpc: Fix race between work-queue and
>> rpc_killall_tasks.
>>
>> On 07/06/2011 04:45 PM, Trond Myklebust wrote:
>>> On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
>>>> From: Ben Greear<greearb@candelatech.com>
>>>>
>>>> The rpc_killall_tasks logic is not locked against
>>>> the work-queue thread, but it still directly modifies
>>>> function pointers and data in the task objects.
>>>>
>>>> This patch changes the killall-tasks logic to set a flag
>>>> that tells the work-queue thread to terminate the task
>>>> instead of directly calling the terminate logic.
>>>>
>>>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>>>> ---
>>>>
>>>> NOTE:  This needs review, as I am still struggling to understand
>>>> the rpc code, and it's quite possible this patch either doesn't
>>>> fully fix the problem or actually causes other issues.  That said,
>>>> my nfs stress test seems to run a bit more stable with this patch
>> applied.
>>>
>>> Yes, but I don't see why you are adding a new flag, nor do I see why
>> we
>>> want to keep checking for that flag in the rpc_execute() loop.
>>> rpc_killall_tasks() is not a frequent operation that we want to
>> optimise
>>> for.
>>>
>>> How about the following instead?
>>
>> Ok, I looked at your patch closer.  I think it can still cause
>> bad race conditions.
>>
>> For instance:
>>
>> Assume that tk_callback is NULL at beginning of while loop in
>> __rpc_execute,
>> and tk_action is rpc_exit_task.
>>
>> While do_action(task) is being called, tk_action is set to NULL in
>> rpc_exit_task.
>>
>> But, right after tk_action is set to NULL in rpc_exit_task, the
>> rpc_killall_tasks
>> method calls rpc_exit, which sets tk_action back to rpc_exit_task.
>>
>> I believe this could cause the xprt_release(task) logic to be called in
>> the
>> work-queue's execution of rpc_exit_task due to tk_action != NULL when
>> it should not be.
>
> Why would this be a problem? xprt_release() can certainly be called multiple times on an rpc_task. Ditto rpbc_getport_done.
>
> The only thing that is not re-entrant there is rpcb_map_release, which should only ever be called once whether or not something calls rpc_killall_tasks.


 From the trace I posted, this stack trace below is being
called with the void *data object already freed.

One way for this to happen would be to have rpc_exit_task call task->tk_ops->rpc_call_done
more than once (I believe).  Two calls to rpc_exit_task could do that, and since the
rpc_exit_task method is assigned to tk_action, I *think* the race I mention above could cause
rpc_exit_task to be called twice.
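
For reference, the ops split involved is of the shape below (reconstructed
from the trace, so treat the names and comments as approximate):

static const struct rpc_call_ops rpcb_getport_ops = {
	.rpc_call_done	= rpcb_getport_done,	/* run from rpc_exit_task()  */
	.rpc_release	= rpcb_map_release,	/* kfree()s the calldata     */
};

If rpc_exit_task() can run a second time after rpc_release has already freed
the calldata, rpcb_getport_done() dereferences freed memory, which is exactly
what the slab report complains about.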

  [<ffffffff81105907>] print_trailer+0x131/0x13a
  [<ffffffff81105945>] object_err+0x35/0x3e
  [<ffffffff811077b3>] verify_mem_not_deleted+0x7a/0xb7
  [<ffffffffa02891e5>] rpcb_getport_done+0x23/0x126 [sunrpc]
  [<ffffffffa02810df>] rpc_exit_task+0x3f/0x6d [sunrpc]
  [<ffffffffa02814d8>] __rpc_execute+0x80/0x253 [sunrpc]
  [<ffffffffa02816ed>] ? rpc_execute+0x42/0x42 [sunrpc]
  [<ffffffffa02816fd>] rpc_async_schedule+0x10/0x12 [sunrpc]
  [<ffffffff81061343>] process_one_work+0x230/0x41d
  [<ffffffff8106128e>] ? process_one_work+0x17b/0x41d
  [<ffffffff8106379f>] worker_thread+0x133/0x217
  [<ffffffff8106366c>] ? manage_workers+0x191/0x191
  [<ffffffff81066f9c>] kthread+0x7d/0x85
  [<ffffffff81485ee4>] kernel_thread_helper+0x4/0x10
  [<ffffffff8147f0d8>] ? retint_restore_args+0x13/0x13
  [<ffffffff81066f1f>] ? __init_kthread_worker+0x56/0x56
  [<ffffffff81485ee0>] ? gs_change+0x13/0x13

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-08 22:03       ` Ben Greear
@ 2011-07-08 22:14           ` Myklebust, Trond
  0 siblings, 0 replies; 20+ messages in thread
From: Myklebust, Trond @ 2011-07-08 22:14 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-nfs, linux-kernel

> -----Original Message-----
> From: Ben Greear [mailto:greearb@candelatech.com]
> Sent: Friday, July 08, 2011 6:03 PM
> To: Myklebust, Trond
> Cc: linux-nfs@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] sunrpc: Fix race between work-queue and
> rpc_killall_tasks.
> 
> On 07/08/2011 11:11 AM, Myklebust, Trond wrote:
> >> -----Original Message-----
> >> From: Ben Greear [mailto:greearb@candelatech.com]
> >> Sent: Friday, July 08, 2011 1:19 PM
> >> To: Myklebust, Trond
> >> Cc: linux-nfs@vger.kernel.org; linux-kernel@vger.kernel.org
> >> Subject: Re: [RFC] sunrpc: Fix race between work-queue and
> >> rpc_killall_tasks.
> >>
> >> On 07/06/2011 04:45 PM, Trond Myklebust wrote:
> >>> On Wed, 2011-07-06 at 15:49 -0700, greearb@candelatech.com wrote:
> >>>> From: Ben Greear<greearb@candelatech.com>
> >>>>
> >>>> The rpc_killall_tasks logic is not locked against
> >>>> the work-queue thread, but it still directly modifies
> >>>> function pointers and data in the task objects.
> >>>>
> >>>> This patch changes the killall-tasks logic to set a flag
> >>>> that tells the work-queue thread to terminate the task
> >>>> instead of directly calling the terminate logic.
> >>>>
> >>>> Signed-off-by: Ben Greear<greearb@candelatech.com>
> >>>> ---
> >>>>
> >>>> NOTE:  This needs review, as I am still struggling to understand
> >>>> the rpc code, and it's quite possible this patch either doesn't
> >>>> fully fix the problem or actually causes other issues.  That said,
> >>>> my nfs stress test seems to run a bit more stable with this patch
> >> applied.
> >>>
> >>> Yes, but I don't see why you are adding a new flag, nor do I see
> why
> >> we
> >>> want to keep checking for that flag in the rpc_execute() loop.
> >>> rpc_killall_tasks() is not a frequent operation that we want to
> >> optimise
> >>> for.
> >>>
> >>> How about the following instead?
> >>
> >> Ok, I looked at your patch closer.  I think it can still cause
> >> bad race conditions.
> >>
> >> For instance:
> >>
> >> Assume that tk_callback is NULL at beginning of while loop in
> >> __rpc_execute,
> >> and tk_action is rpc_exit_task.
> >>
> >> While do_action(task) is being called, tk_action is set to NULL in
> >> rpc_exit_task.
> >>
> >> But, right after tk_action is set to NULL in rpc_exit_task, the
> >> rpc_killall_tasks
> >> method calls rpc_exit, which sets tk_action back to rpc_exit_task.
> >>
> >> I believe this could cause the xprt_release(task) logic to be called
> in
> >> the
> >> work-queue's execution of rpc_exit_task due to tk_action != NULL
> when
> >> it should not be.
> >
> > Why would this be a problem? xprt_release() can certainly be called
> multiple times on an rpc_task. Ditto rpbc_getport_done.
> >
> > The only thing that is not re-entrant there is rpcb_map_release,
> which should only ever be called once whether or not something calls
> rpc_killall_tasks.
> 
> 
>  From the trace I posted, this stack trace below is being
> called with the void *data object already freed.
> 
> One way for this to happen would be to have rpc_exit_task call task-
> >tk_ops->rpc_call_done
> more than once (I believe).  Two calls to rpc_exit_task could do that,
> and since the
> rpc_exit_task method is assigned to tk_action, I *think* the race I
> mention above could cause
> rpc_exit_task to be called twice.
> 
>   [<ffffffff81105907>] print_trailer+0x131/0x13a
>   [<ffffffff81105945>] object_err+0x35/0x3e
>   [<ffffffff811077b3>] verify_mem_not_deleted+0x7a/0xb7
>   [<ffffffffa02891e5>] rpcb_getport_done+0x23/0x126 [sunrpc]
>   [<ffffffffa02810df>] rpc_exit_task+0x3f/0x6d [sunrpc]
>   [<ffffffffa02814d8>] __rpc_execute+0x80/0x253 [sunrpc]
>   [<ffffffffa02816ed>] ? rpc_execute+0x42/0x42 [sunrpc]
>   [<ffffffffa02816fd>] rpc_async_schedule+0x10/0x12 [sunrpc]
>   [<ffffffff81061343>] process_one_work+0x230/0x41d
>   [<ffffffff8106128e>] ? process_one_work+0x17b/0x41d
>   [<ffffffff8106379f>] worker_thread+0x133/0x217
>   [<ffffffff8106366c>] ? manage_workers+0x191/0x191
>   [<ffffffff81066f9c>] kthread+0x7d/0x85
>   [<ffffffff81485ee4>] kernel_thread_helper+0x4/0x10
>   [<ffffffff8147f0d8>] ? retint_restore_args+0x13/0x13
>   [<ffffffff81066f1f>] ? __init_kthread_worker+0x56/0x56
>   [<ffffffff81485ee0>] ? gs_change+0x13/0x13

The calldata gets freed in the rpc_final_put_task() which shouldn't ever be run while the task is still referenced in __rpc_execute

IOW: it should be impossible to call rpc_exit_task() after rpc_final_put_task
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-08 22:14           ` Myklebust, Trond
  (?)
@ 2011-07-09 16:34           ` Ben Greear
  -1 siblings, 0 replies; 20+ messages in thread
From: Ben Greear @ 2011-07-09 16:34 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: linux-nfs, linux-kernel

On 07/08/2011 03:14 PM, Myklebust, Trond wrote:

> The calldata gets freed in the rpc_final_put_task() which shouldn't ever be run while the task is still referenced in __rpc_execute

Ok, please go ahead and use your patch for the killall-tasks race.  My problem remains with or without
your patch, and with or without my version, so I'm hitting something else.

I'm running low on ideas about how exactly I am hitting the bug, but I will keep poking around.

Thanks,
Ben

>
> IOW: it should be impossible to call rpc_exit_task() after rpc_final_put_task
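
As a generic illustration of the invariant quoted above, here is a minimal
reference-counting sketch in plain C11 (hypothetical names and a user-space
atomic counter, not the actual sunrpc task code): the final put is the only
place the calldata is freed, and it cannot run while the executor still
holds its own reference.

#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical task whose reference count guards its calldata. */
struct task {
	atomic_int refcount;	/* starts at 1, held by the creator */
	void *calldata;
};

void task_get(struct task *t)
{
	atomic_fetch_add(&t->refcount, 1);
}

void task_put(struct task *t)
{
	if (atomic_fetch_sub(&t->refcount, 1) == 1) {
		free(t->calldata);	/* the only place calldata is freed */
		free(t);
	}
}

/* The executor takes its own reference for the whole run, so the final
 * put (and thus the free of calldata) cannot happen underneath it; an
 * exit callback invoked from here should always see live calldata. */
void execute(struct task *t)
{
	task_get(t);
	/* ... drive the task's state machine, maybe call its exit callback ... */
	task_put(t);
}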


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-08 22:14           ` Myklebust, Trond
  (?)
  (?)
@ 2011-07-12 17:14           ` Ben Greear
  2011-07-12 17:25               ` Myklebust, Trond
  -1 siblings, 1 reply; 20+ messages in thread
From: Ben Greear @ 2011-07-12 17:14 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: linux-nfs, linux-kernel

On 07/08/2011 03:14 PM, Myklebust, Trond wrote:

>>    [<ffffffff81105907>] print_trailer+0x131/0x13a
>>    [<ffffffff81105945>] object_err+0x35/0x3e
>>    [<ffffffff811077b3>] verify_mem_not_deleted+0x7a/0xb7
>>    [<ffffffffa02891e5>] rpcb_getport_done+0x23/0x126 [sunrpc]
>>    [<ffffffffa02810df>] rpc_exit_task+0x3f/0x6d [sunrpc]
>>    [<ffffffffa02814d8>] __rpc_execute+0x80/0x253 [sunrpc]
>>    [<ffffffffa02816ed>] ? rpc_execute+0x42/0x42 [sunrpc]
>>    [<ffffffffa02816fd>] rpc_async_schedule+0x10/0x12 [sunrpc]
>>    [<ffffffff81061343>] process_one_work+0x230/0x41d
>>    [<ffffffff8106128e>] ? process_one_work+0x17b/0x41d
>>    [<ffffffff8106379f>] worker_thread+0x133/0x217
>>    [<ffffffff8106366c>] ? manage_workers+0x191/0x191
>>    [<ffffffff81066f9c>] kthread+0x7d/0x85
>>    [<ffffffff81485ee4>] kernel_thread_helper+0x4/0x10
>>    [<ffffffff8147f0d8>] ? retint_restore_args+0x13/0x13
>>    [<ffffffff81066f1f>] ? __init_kthread_worker+0x56/0x56
>>    [<ffffffff81485ee0>] ? gs_change+0x13/0x13
>
> The calldata gets freed in the rpc_final_put_task() which shouldn't ever be run while the task is still referenced in __rpc_execute
>
> IOW: it should be impossible to call rpc_exit_task() after rpc_final_put_task

I added lots of locking around the calldata, the work-queue logic, and such, and
the problem still persists without hitting any of the debug warnings or poisoned
values I put in.  It almost seems like the same tk_calldata is simply assigned to two
different tasks.

While poking through the code, I noticed that 'map' is static in rpcb_getport_async.

That would seem to cause problems if two threads called this method at
the same time, possibly causing tk_calldata to be assigned to two different
tasks???

Any idea why it is static?

I'm going to start another test run with 'map' made non-static
to see if that resolves things...
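
For what it's worth, here is a minimal stand-alone sketch of the suspected
race (hypothetical names, ordinary user-space pthreads code, not the real
rpcb_getport_async): with a static local, both threads share the single
'map' slot, so one caller can return, and both tasks can end up holding,
the other caller's allocation.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct calldata { int prog; };

/* BUG under concurrency: 'map' is static, so every caller shares it.
 * (Error handling is omitted to keep the demo short.) */
static struct calldata *setup_calldata(int prog)
{
	static struct calldata *map;

	map = calloc(1, sizeof(*map));
	usleep(1000);		/* widen the race window for the demo */
	map->prog = prog;
	return map;		/* may now be the other caller's allocation */
}

static void *worker(void *arg)
{
	return setup_calldata((int)(long)arg);
}

int main(void)
{
	pthread_t t1, t2;
	void *d1, *d2;

	pthread_create(&t1, NULL, worker, (void *)100003L);
	pthread_create(&t2, NULL, worker, (void *)100005L);
	pthread_join(t1, &d1);
	pthread_join(t2, &d2);

	/* With a non-static 'map', d1 and d2 would always differ. */
	printf("task1 calldata=%p, task2 calldata=%p%s\n", d1, d2,
	       d1 == d2 ? "  <-- same pointer on two tasks" : "");
	return 0;
}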

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-12 17:14           ` Ben Greear
@ 2011-07-12 17:25               ` Myklebust, Trond
  0 siblings, 0 replies; 20+ messages in thread
From: Myklebust, Trond @ 2011-07-12 17:25 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-nfs, linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2443 bytes --]

> -----Original Message-----
> From: Ben Greear [mailto:greearb@candelatech.com]
> Sent: Tuesday, July 12, 2011 1:15 PM
> To: Myklebust, Trond
> Cc: linux-nfs@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] sunrpc: Fix race between work-queue and
> rpc_killall_tasks.
> 
> On 07/08/2011 03:14 PM, Myklebust, Trond wrote:
> 
> >>    [<ffffffff81105907>] print_trailer+0x131/0x13a
> >>    [<ffffffff81105945>] object_err+0x35/0x3e
> >>    [<ffffffff811077b3>] verify_mem_not_deleted+0x7a/0xb7
> >>    [<ffffffffa02891e5>] rpcb_getport_done+0x23/0x126 [sunrpc]
> >>    [<ffffffffa02810df>] rpc_exit_task+0x3f/0x6d [sunrpc]
> >>    [<ffffffffa02814d8>] __rpc_execute+0x80/0x253 [sunrpc]
> >>    [<ffffffffa02816ed>] ? rpc_execute+0x42/0x42 [sunrpc]
> >>    [<ffffffffa02816fd>] rpc_async_schedule+0x10/0x12 [sunrpc]
> >>    [<ffffffff81061343>] process_one_work+0x230/0x41d
> >>    [<ffffffff8106128e>] ? process_one_work+0x17b/0x41d
> >>    [<ffffffff8106379f>] worker_thread+0x133/0x217
> >>    [<ffffffff8106366c>] ? manage_workers+0x191/0x191
> >>    [<ffffffff81066f9c>] kthread+0x7d/0x85
> >>    [<ffffffff81485ee4>] kernel_thread_helper+0x4/0x10
> >>    [<ffffffff8147f0d8>] ? retint_restore_args+0x13/0x13
> >>    [<ffffffff81066f1f>] ? __init_kthread_worker+0x56/0x56
> >>    [<ffffffff81485ee0>] ? gs_change+0x13/0x13
> >
> > The calldata gets freed in the rpc_final_put_task() which shouldn't
> ever be run while the task is still referenced in __rpc_execute
> >
> > IOW: it should be impossible to call rpc_exit_task() after
> rpc_final_put_task
> 
> I added lots of locking around the calldata, work-queue logic, and
> such, and
> still the problem persists w/out hitting any of the debug warnings or
> poisoned
> values I put in.  It almost seems like tk_calldata is just assigned to
> two
> different tasks.
> 
> While poking through the code, I noticed that 'map' is static in
> rpcb_getport_async.
> 
> That would seem to cause problems if two threads called this method at
> the same time, possibly causing tk_calldata to be assigned to two
> different
> tasks???
> 
> Any idea why it is static?

Doh! That is clearly a typo dating all the way back to when Chuck wrote that function.

Yes, that would definitely explain your problem.
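
Presumably the eventual fix is simply to drop the 'static' so that each
invocation gets its own pointer; as a rough, hypothetical sketch (reusing
the names from the user-space demo earlier in the thread, not the actual
kernel patch):

#include <stdlib.h>

struct calldata { int prog; };

/* Same setup as the earlier demo, with the 'static' dropped: 'map' is now
 * an automatic variable, so each caller owns its own pointer and two
 * concurrent tasks can no longer be handed the same calldata. */
struct calldata *setup_calldata_fixed(int prog)
{
	struct calldata *map = calloc(1, sizeof(*map));

	if (!map)
		return NULL;
	map->prog = prog;
	return map;
}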

Cheers
  Trond

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-12 17:25               ` Myklebust, Trond
  (?)
@ 2011-07-12 17:30               ` Ben Greear
  2011-07-14 16:20                 ` Ben Greear
  -1 siblings, 1 reply; 20+ messages in thread
From: Ben Greear @ 2011-07-12 17:30 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: linux-nfs, linux-kernel

On 07/12/2011 10:25 AM, Myklebust, Trond wrote:
>> -----Original Message-----
>> From: Ben Greear [mailto:greearb@candelatech.com]
>> Sent: Tuesday, July 12, 2011 1:15 PM
>> To: Myklebust, Trond
>> Cc: linux-nfs@vger.kernel.org; linux-kernel@vger.kernel.org
>> Subject: Re: [RFC] sunrpc: Fix race between work-queue and
>> rpc_killall_tasks.
>>
>> On 07/08/2011 03:14 PM, Myklebust, Trond wrote:
>>
>>>>     [<ffffffff81105907>] print_trailer+0x131/0x13a
>>>>     [<ffffffff81105945>] object_err+0x35/0x3e
>>>>     [<ffffffff811077b3>] verify_mem_not_deleted+0x7a/0xb7
>>>>     [<ffffffffa02891e5>] rpcb_getport_done+0x23/0x126 [sunrpc]
>>>>     [<ffffffffa02810df>] rpc_exit_task+0x3f/0x6d [sunrpc]
>>>>     [<ffffffffa02814d8>] __rpc_execute+0x80/0x253 [sunrpc]
>>>>     [<ffffffffa02816ed>] ? rpc_execute+0x42/0x42 [sunrpc]
>>>>     [<ffffffffa02816fd>] rpc_async_schedule+0x10/0x12 [sunrpc]
>>>>     [<ffffffff81061343>] process_one_work+0x230/0x41d
>>>>     [<ffffffff8106128e>] ? process_one_work+0x17b/0x41d
>>>>     [<ffffffff8106379f>] worker_thread+0x133/0x217
>>>>     [<ffffffff8106366c>] ? manage_workers+0x191/0x191
>>>>     [<ffffffff81066f9c>] kthread+0x7d/0x85
>>>>     [<ffffffff81485ee4>] kernel_thread_helper+0x4/0x10
>>>>     [<ffffffff8147f0d8>] ? retint_restore_args+0x13/0x13
>>>>     [<ffffffff81066f1f>] ? __init_kthread_worker+0x56/0x56
>>>>     [<ffffffff81485ee0>] ? gs_change+0x13/0x13
>>>
>>> The calldata gets freed in the rpc_final_put_task() which shouldn't
>> ever be run while the task is still referenced in __rpc_execute
>>>
>>> IOW: it should be impossible to call rpc_exit_task() after
>> rpc_final_put_task
>>
>> I added lots of locking around the calldata, work-queue logic, and
>> such, and
>> still the problem persists w/out hitting any of the debug warnings or
>> poisoned
>> values I put in.  It almost seems like tk_calldata is just assigned to
>> two
>> different tasks.
>>
>> While poking through the code, I noticed that 'map' is static in
>> rpcb_getport_async.
>>
>> That would seem to cause problems if two threads called this method at
>> the same time, possibly causing tk_calldata to be assigned to two
>> different
>> tasks???
>>
>> Any idea why it is static?
>
> Doh! That is clearly a typo dating all the way back to when Chuck wrote that function.
>
> Yes, that would definitely explain your problem.

Ok, patch sent.  I assume someone will propagate this to stable
as desired?

And assuming this fixes it, can I get some brownie points towards
review of the ip-addr binding patches? :)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.
  2011-07-12 17:30               ` Ben Greear
@ 2011-07-14 16:20                 ` Ben Greear
  0 siblings, 0 replies; 20+ messages in thread
From: Ben Greear @ 2011-07-14 16:20 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: linux-nfs, linux-kernel

On 07/12/2011 10:30 AM, Ben Greear wrote:
> On 07/12/2011 10:25 AM, Myklebust, Trond wrote:
>>> -----Original Message-----
>>> From: Ben Greear [mailto:greearb@candelatech.com]
>>> Sent: Tuesday, July 12, 2011 1:15 PM
>>> To: Myklebust, Trond
>>> Cc: linux-nfs@vger.kernel.org; linux-kernel@vger.kernel.org
>>> Subject: Re: [RFC] sunrpc: Fix race between work-queue and
>>> rpc_killall_tasks.
>>>

>>> I added lots of locking around the calldata, work-queue logic, and
>>> such, and
>>> still the problem persists w/out hitting any of the debug warnings or
>>> poisoned
>>> values I put in. It almost seems like tk_calldata is just assigned to
>>> two
>>> different tasks.
>>>
>>> While poking through the code, I noticed that 'map' is static in
>>> rpcb_getport_async.
>>>
>>> That would seem to cause problems if two threads called this method at
>>> the same time, possibly causing tk_calldata to be assigned to two
>>> different
>>> tasks???
>>>
>>> Any idea why it is static?
>>
>> Doh! That is clearly a typo dating all the way back to when Chuck
>> wrote that function.
>>
>> Yes, that would definitely explain your problem.
>
> Ok, patch sent. I assume someone will propagate this to stable
> as desired?
>
> And assuming this fixes it, can I get some brownie points towards
> review of the ip-addr binding patches? :)

Just to close this issue:  We ran a clean 24+ hour test mounting and
unmounting 200 mounts every 30 seconds, and it ran with zero problems.

This was with 2.6.38.8+ with this fix applied.

3.0-rc7+ is still flaky in various other ways, but I see no more
NFS problems at least.

So, that was the problem I was hitting, and it appears to be the
last problem in this area.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2011-07-14 16:21 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-06 22:49 [RFC] sunrpc: Fix race between work-queue and rpc_killall_tasks greearb
2011-07-06 23:45 ` Trond Myklebust
2011-07-06 23:45   ` Trond Myklebust
2011-07-07  0:07   ` Ben Greear
2011-07-07  0:17     ` Trond Myklebust
2011-07-07  0:35       ` Ben Greear
2011-07-07 20:38   ` Ben Greear
2011-07-08 15:03     ` Ben Greear
2011-07-08 17:18   ` Ben Greear
2011-07-08 18:11     ` Myklebust, Trond
2011-07-08 18:11       ` Myklebust, Trond
2011-07-08 22:03       ` Ben Greear
2011-07-08 22:14         ` Myklebust, Trond
2011-07-08 22:14           ` Myklebust, Trond
2011-07-09 16:34           ` Ben Greear
2011-07-12 17:14           ` Ben Greear
2011-07-12 17:25             ` Myklebust, Trond
2011-07-12 17:25               ` Myklebust, Trond
2011-07-12 17:30               ` Ben Greear
2011-07-14 16:20                 ` Ben Greear
