linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Race in RPC code
@ 2003-02-07 12:31 Jakob Oestergaard
  2003-02-07 13:21 ` Trond Myklebust
  0 siblings, 1 reply; 5+ messages in thread
From: Jakob Oestergaard @ 2003-02-07 12:31 UTC (permalink / raw)
  To: linux-kernel; +Cc: Trond Myklebust

[-- Attachment #1: Type: text/plain, Size: 2329 bytes --]


Hello all,

I think there is a race in the RPC code in 2.4.20, related to timeout
and congestion window handling.

The problem code is in net/sunrpc/xprt.c:

static void
xprt_timer(struct rpc_task *task)
{       
        struct rpc_rqst *req = task->tk_rqstp;
        struct rpc_xprt *xprt = req->rq_xprt;

        spin_lock(&xprt->sock_lock);
        if (req->rq_received)
                goto out;

        if (!xprt->nocong) {
                if (xprt_expbackoff(task, req)) {
                        rpc_add_timer(task, xprt_timer);
                        goto out_unlock;
                }
                rpc_inc_timeo(&task->tk_client->cl_rtt);
                xprt_adjust_cwnd(req->rq_xprt, -ETIMEDOUT);
        }
        req->rq_nresend++;

The call to xprt_adjust_cwnd is the problem - I experienced a panic
(null pointer dereference) in 

static void
xprt_adjust_cwnd(struct rpc_xprt *xprt, int result)
{
        unsigned long   cwnd;

        cwnd = xprt->cwnd;
        if (result >= 0 && cwnd <= xprt->cong) {

Here it is the "cwnd = xprt->cwnd" that causes the panic. xprt was 0.

This means, in the xprt_timer code, that req->rq_xprt must have been 0.

I guess this can happen because of the sequence:

 xprt = req->rq_xprt;
 spin_lock(&xprt->sock_lock);
...
 xprt_adjust_cwnd(req->rq_xprt);

We don't know whether req has been modified between the assignment and
the spin_lock.

Attached is a patch to solve the problem - please comment.

It does not solve the other potential (?) problem with:

 xprt = req->rq_xprt;
 spin_lock(&xprt->sock_lock);
 ...
                if (xprt_expbackoff(task, req)) {
 ...

Any suggestions to that one?

I cannot test whether my patch solve the problem, because this panic has
happened once on a *heavily* loaded box which has run 2.4.20 ever since
it came out.  The race is extremely rare.  It is an SMP box by the way.

Thanks a lot to "baldrick" on kernel-newbies for the help!

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

[-- Attachment #2: race-diff --]
[-- Type: text/plain, Size: 335 bytes --]

--- linux-2.4.20-unpatched/net/sunrpc/xprt.c	Fri Feb  7 12:22:19 2003
+++ linux-2.4.20/net/sunrpc/xprt.c	Fri Feb  7 13:13:40 2003
@@ -1021,7 +1021,7 @@
 			goto out_unlock;
 		}
 		rpc_inc_timeo(&task->tk_client->cl_rtt);
-		xprt_adjust_cwnd(req->rq_xprt, -ETIMEDOUT);
+		xprt_adjust_cwnd(xprt, -ETIMEDOUT);
 	}
 	req->rq_nresend++;
 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Race in RPC code
  2003-02-07 12:31 Race in RPC code Jakob Oestergaard
@ 2003-02-07 13:21 ` Trond Myklebust
  2003-02-07 13:44   ` Jakob Oestergaard
  0 siblings, 1 reply; 5+ messages in thread
From: Trond Myklebust @ 2003-02-07 13:21 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: linux-kernel

>>>>> " " == Jakob Oestergaard <jakob@unthought.net> writes:


     > We don't know whether req has been modified between the
     > assignment and the spin_lock.

It had better not be. If it is, then I want to know where so that we
can fix it.

req->rq_xprt is set up when the request is initialized. It
is not meant to change until the request gets released. This again
should not happen while the request is still on the wait queue.

IOW the fix you propose would just be papering over another problem.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Race in RPC code
  2003-02-07 13:21 ` Trond Myklebust
@ 2003-02-07 13:44   ` Jakob Oestergaard
  2003-02-07 18:18     ` Ion Badulescu
  0 siblings, 1 reply; 5+ messages in thread
From: Jakob Oestergaard @ 2003-02-07 13:44 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

On Fri, Feb 07, 2003 at 02:21:50PM +0100, Trond Myklebust wrote:
> >>>>> " " == Jakob Oestergaard <jakob@unthought.net> writes:
> 
> 
>      > We don't know whether req has been modified between the
>      > assignment and the spin_lock.
> 
> It had better not be. If it is, then I want to know where so that we
> can fix it.
> 
> req->rq_xprt is set up when the request is initialized. It
> is not meant to change until the request gets released. This again
> should not happen while the request is still on the wait queue.
> 
> IOW the fix you propose would just be papering over another problem.

Any suggestions as to how it could happen?

The box is running huge compile jobs (>100MB memory used by each
compiler - runs 2-3 compilers concurrently) all day long - we never had
a GCC sig11 error. It has 512 MB of ECC memory (and ECC is enabled) - I
seriously doubt that we have a memory corruption problem.

The panic has happened once, just today.

I will be happy to try other solutions, but I can't verify whether they
work - I mean, if the box runs another few months without crashing that
doesn't really prove anything...

Thanks for commenting!

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Race in RPC code
  2003-02-07 13:44   ` Jakob Oestergaard
@ 2003-02-07 18:18     ` Ion Badulescu
  2003-02-07 19:51       ` Duncan Sands
  0 siblings, 1 reply; 5+ messages in thread
From: Ion Badulescu @ 2003-02-07 18:18 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: linux-kernel, Trond Myklebust

Hi Jakob, Trond,

On Fri, 7 Feb 2003 14:44:46 +0100, Jakob Oestergaard <jakob@unthought.net> wrote:

> The panic has happened once, just today.

I've seen this multiple times, and even reported it to the nfs mailing list.

I'm just glad someone else is seeing the same kind of oops with a vanilla 
kernel, because I was seeing it with a heavily patched 
redhat+nfs2.4.20+nfsall2.4.20 kernel and wasn't sure if that had anything
to do with it.

I have at least 5-6 such oopsen recorded, if anyone cares.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
            than to open it and remove all doubt.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Race in RPC code
  2003-02-07 18:18     ` Ion Badulescu
@ 2003-02-07 19:51       ` Duncan Sands
  0 siblings, 0 replies; 5+ messages in thread
From: Duncan Sands @ 2003-02-07 19:51 UTC (permalink / raw)
  To: Ion Badulescu, Jakob Oestergaard; +Cc: linux-kernel, Trond Myklebust

On Friday 07 February 2003 19:18, Ion Badulescu wrote:
> Hi Jakob, Trond,
>
> On Fri, 7 Feb 2003 14:44:46 +0100, Jakob Oestergaard <jakob@unthought.net> wrote:
> > The panic has happened once, just today.
>
> I've seen this multiple times, and even reported it to the nfs mailing
> list.
>
> I'm just glad someone else is seeing the same kind of oops with a vanilla
> kernel, because I was seeing it with a heavily patched
> redhat+nfs2.4.20+nfsall2.4.20 kernel and wasn't sure if that had anything
> to do with it.
>
> I have at least 5-6 such oopsen recorded, if anyone cares.

Please send them.

Duncan.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2003-02-07 19:41 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-07 12:31 Race in RPC code Jakob Oestergaard
2003-02-07 13:21 ` Trond Myklebust
2003-02-07 13:44   ` Jakob Oestergaard
2003-02-07 18:18     ` Ion Badulescu
2003-02-07 19:51       ` Duncan Sands

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).