* Thread scheduling issues
From: Lyle Seaman @ 2010-10-14 12:38 UTC
  To: linux-nfs

Hi guys.  I've been trying to figure out a performance problem with
some of my production servers, and I have run into something that
strikes me as odd.  Is this the right place for discussion of bugs in
sunrpc/svc_xprt.c, or is there a better place?


* Re: Thread scheduling issues
From: J. Bruce Fields @ 2010-10-14 14:12 UTC
  To: Lyle Seaman; +Cc: linux-nfs

On Thu, Oct 14, 2010 at 08:38:21AM -0400, Lyle Seaman wrote:
> Hi guys.  I've been trying to figure out a performance problem with
> some of my production servers, and I have run into something that
> strikes me as odd.  Is this the right place for discussion of bugs in
> sunrpc/svc_xprt.c, or is there a better place?

Go for it.

--b.


* Re: Thread scheduling issues
From: Lyle Seaman @ 2010-10-15  0:09 UTC
  To: linux-nfs

Sorry to add to the confusion.  It is a 2.6.32.23 kernel, allegedly:
Linux prodLMS03 2.6.32-23-server #37-Ubuntu SMP Fri Jun 11 09:11:11
UTC 2010 x86_64 GNU/Linux

Though I won't be 100% sure what's in it unless I build one myself from source.

>>  363        if (pool->sp_nwaking >= SVC_MAX_WAKING) { /* == 5 -lws */
>>  364                /* too many threads are runnable and trying to wake up */
>>  365                thread_avail = 0;
>>  366                pool->sp_stats.overloads_avoided++;
>>  367        }
>
> That strikes me as a more likely cause of your problem.  But it was
> reverted in v2.6.33, and you say you're using v2.6.35.22, so there's
> some confusion here?

I think that the problem is the combination of the above with the
half-second sleep, and my next step is to build a kernel and tweak
both of those.
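Concretely, the experiment I have in mind is something like this -
untested sketch, and the new constants are exactly the thing I want
to measure rather than values I'm confident in:

	/* raise the wake-up throttle so a handful of threads stuck in
	 * the alloc_page() backoff can't gate the whole pool */
	#define SVC_MAX_WAKING 20	/* was 5 */

	/* and in svc_recv()'s failure path, make the backoff closer
	 * to "brief" than 500ms */
	schedule_timeout(msecs_to_jiffies(20));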

> I think the code was trying to avoid reading an rpc request
> until it was reasonably sure it would have the resources to complete it,
> so we didn't end up with a thread stuck waiting on a half-received rpc.

Yes, that's what it looks like to me too.  I'd have to think hard
about how to balance the different factors and avoid deadlock,
particularly where the client and server are sharing the same VM
pool.  It seems to me that preventing stuck threads matters most if
threads are the scarce resource, but I suppose memory could be a
problem too.  One issue is that data is acknowledged when it is
consumed by NFS, so you can't just decide "whoops, I'm out of
memory", throw it away, and let the client resend - at least, not
over RPC/TCP.  (We could do that in AFS because we had two different
kinds of acknowledgements, incidentally.)

The symptoms of the problem are:

- workload is biased towards small writes, ~4k average size
- op mix: getattr 40%, setattr 5%, lookup 10%, access 19%,
  read 5%, write 8%, commit 8%
- low CPU utilization: usr+sys < 5%, idle == 0, wait > 95%
- high disk-wait percentage ==> "disk bound", but that's a bit
  misleading, see below
- low number of nfsds in D state, usually 3-8, very occasionally 20;
  50 nfsd threads are configured, but those last 30 just never run
- small numbers (3-ish) of simultaneous I/Os delivered to the disk
  subsystem, as reported by 'sar -d'
- 2 clients with lots of pent-up demand (> 50 threads on each are
  blocked waiting for NFS)

local processes on the NFS server are still getting great performance
out of the disk subsystem, so the driver/HBA isn't the bottleneck.
sar reports mean svc times <4ms.  I don't know how much to trust sar
though.

The disk subsystem is one LUN in a shared SAN with lots of capacity,
if only I could deliver more than a handful of ops to it at one time.
"disk-bound" just means that adding cpu won't help with the same
workload, but it doesn't mean that the disk subsystem is incapable of
handling additional load if I can get it there.

Watching /proc/fs/nfsd/pool_stats shows deltas from the previous
sample in the range of:

  pool               0
  packets            ~4x
  sockets_queued     x
  threads_woken      ~2x
  overloads_avoided  x
  threads_timedout   0

That is, at every sample, the value of sockets_queued is exactly equal
to overloads_avoided, and can be as much as half the number of total
calls handled (though usually it is much less).
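(For the record, I'm collecting those deltas with a throwaway
userspace sampler along these lines - a sketch that assumes the
single-pool, six-column layout listed above:)

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* print 10-second deltas of the pool_stats counters:
	 * pool packets sockets_queued threads_woken
	 * overloads_avoided threads_timedout (assumed order) */
	int main(void)
	{
		unsigned long long cur[6], prev[6] = { 0 };
		char line[256];

		for (;;) {
			FILE *f = fopen("/proc/fs/nfsd/pool_stats", "r");
			if (!f)
				return 1;
			while (fgets(line, sizeof(line), f)) {
				if (line[0] == '#')	/* header */
					continue;
				if (sscanf(line, "%llu %llu %llu %llu %llu %llu",
					   &cur[0], &cur[1], &cur[2],
					   &cur[3], &cur[4], &cur[5]) != 6)
					continue;
				printf("queued %llu woken %llu avoided %llu "
				       "timedout %llu\n",
				       cur[2] - prev[2], cur[3] - prev[3],
				       cur[4] - prev[4], cur[5] - prev[5]);
				memcpy(prev, cur, sizeof(prev));
			}
			fclose(f);
			sleep(10);
		}
	}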

Five threads blocked waiting for the underlying filesystem op do not
cause increments of the "overloads_avoided" counter.  That looks
like a, hmm, heuristic for systems with more disk capacity and less
CPU than I have.  I don't absolutely *know* that this particular
alloc_page is failing, but it sure fits the data.  Next step for me
is instrumenting and counting that branch, but I thought I'd check
to see if this had been talked about before, since I can't find it
in the archives or bugzilla.
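The instrumentation I have in mind is minimal - bump a counter in
the failure path so it shows up somewhere observable.  Sketch only:
the counter is my invention, and the ratelimited printk is just one
way to surface it:

	/* file scope in svc_xprt.c */
	static atomic_t svc_page_alloc_fails = ATOMIC_INIT(0);

	/* in svc_recv(), in the alloc_page() failure branch */
	struct page *p = alloc_page(GFP_KERNEL);
	if (!p) {
		atomic_inc(&svc_page_alloc_fails);
		if (printk_ratelimit())
			printk(KERN_INFO "svc_recv: alloc_page failed "
			       "(%d so far)\n",
			       atomic_read(&svc_page_alloc_fails));
		/* existing sleep/backoff code follows */
	}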

Exporting the fs with -o async helps, but I don't like doing it.


* Re: Thread scheduling issues
From: J. Bruce Fields @ 2010-10-14 17:15 UTC
  To: Lyle Seaman; +Cc: linux-nfs

On Thu, Oct 14, 2010 at 11:21:33AM -0400, Lyle Seaman wrote:
> Ok.  I've been trying to figure out a performance problem with some of
> my production servers (I'm just not able to get enough simultaneous
> I/Os queued up to get decent performance out of my disk subsystem),
> and I have run into something that strikes me as odd.  For reference,
> the code I'm looking at is from a 2.6.35.22 kernel, and I know there
> have been some relevant changes since then but I am not seeing the
> discussion of this in the mailing list archives. If I have to figure
> out the new source control system to look at the head-of-tree code,
> say so.
> 
> in sunrpc/svc_xprt.c:svc_recv() I find this:
> 
>         /* now allocate needed pages.  If we get a failure, sleep briefly */
>  620        pages = (serv->sv_max_mesg + PAGE_SIZE) / PAGE_SIZE;
>  621        for (i = 0; i < pages ; i++)
>  622                while (rqstp->rq_pages[i] == NULL) {
>  623                        struct page *p = alloc_page(GFP_KERNEL);
>  624                        if (!p) {
>  625                                set_current_state(TASK_INTERRUPTIBLE);
>  626                                if (signalled() || kthread_should_stop()) {
>  627                                        set_current_state(TASK_RUNNING);
>  628                                        return -EINTR;
>  629                                }
>  630                                schedule_timeout(msecs_to_jiffies(500));
>  631                        }
>  632                        rqstp->rq_pages[i] = p;
>  633                }
> 
> First of all, 500ms is a long way from "brief".  Second, it seems like
> a really bad idea to sleep while holding a contended resource.
> Shouldn't you drop those pages before sleeping?  It's been a long time
> since I knew anything about the Linux vm system, but presumably the
> reason that alloc_page has failed is because there aren't 26 pages
> available*.  So now I'm going to sleep while holding, say, the last 24
> free pages?  This looks like deadlock city, except for the valiant
> efforts by the vm system to keep freeing pages, and other adjustments
> to vary sv_max_mesg depending on the number of threads which mean that
> in practice, deadlock is unlikely.  But figuring out the interaction
> of all those other systems requires global knowledge, which is asking
> for trouble in the long run.  And there are intermediate degrees of
> cascading interlocked interdependencies that result in poor
> performance which aren't technically a complete deadlock.
> 
> (* I'm going to have to do some more digging here, because in my
> specific situation, vmstat generally reports between 100M and 600M
> free, so I'm admittedly not clear on why alloc_page is failing me,
> unless there is a difference between "free" and "available for nfsd".)

OK, I'm more than willing to consider proposals for improvement here,
but I'm losing the connection to your problem: how do you know that this
particular alloc_page is failing in your case, and how do you know
that's the cause of your problem?

> And then there's this, in svc_xprt_enqueue()
> 
>  360 process:
>  361        /* Work out whether threads are available */
>  362        thread_avail = !list_empty(&pool->sp_threads);  /* threads are asleep */
>  363        if (pool->sp_nwaking >= SVC_MAX_WAKING) { /* == 5 -lws */
>  364                /* too many threads are runnable and trying to wake up */
>  365                thread_avail = 0;
>  366                pool->sp_stats.overloads_avoided++;
>  367        }

That strikes me as a more likely cause of your problem.  But it was
reverted in v2.6.33, and you say you're using v2.6.35.22, so there's
some confusion here?

But again we're jumping to conclusions: what's your setup, what symptoms
are you seeing, what did you try to do to fix it, and what were the
results?

> Now, the sp_nwaking counter is only decremented after the
> page-allocation code shown earlier, so if I've got five threads that
> happen to all run into a transient problem getting the pages they
> need, they're going to sleep for 500 millis and -nothing- is going
> to happen.
> 
> I see that this overloads_avoided branch disappeared in more recent
> kernels, was there some discussion of it that you can point me to so I
> don't have to rehash an old topic? I think that removing it actually
> increases the likelihood of deadlock.
> 
> Finally, zero-copy is great, awesome, I love it.  But it seems
> profligate to allocate an entire 26 pages for every operation when
> only 9% of them (at least in my workload) are writes.  All the rest of
> the operations only need a small fraction of that to be preallocated.
> I don't have a concrete suggestion here, I'd have to do some tinkering
> with the code first.  I'm just... saying.  Maybe allocate two pages up
> front, then wait to see if the others are needed and allocate them at
> that time.  If you're willing to sleep in svc_recv while holding lots
> of pages, it's no worse to sleep while holding one or two.

Could be.  I think the code was trying to avoid reading an rpc request
until it was reasonably sure it would have the resources to complete it,
so we didn't end up with a thread stuck waiting on a half-received rpc.

--b.


* Thread scheduling issues
From: Lyle Seaman @ 2010-10-14 15:21 UTC
  To: linux-nfs

Ok.  I've been trying to figure out a performance problem with some of
my production servers (I'm just not able to get enough simultaneous
I/Os queued up to get decent performance out of my disk subsystem),
and I have run into something that strikes me as odd.  For reference,
the code I'm looking at is from a 2.6.35.22 kernel, and I know there
have been some relevant changes since then but I am not seeing the
discussion of this in the mailing list archives. If I have to figure
out the new source control system to look at the head-of-tree code,
say so.

in sunrpc/svc_xprt.c:svc_recv() I find this:

        /* now allocate needed pages.  If we get a failure, sleep briefly */
 620        pages = (serv->sv_max_mesg + PAGE_SIZE) / PAGE_SIZE;
 621        for (i = 0; i < pages ; i++)
 622                while (rqstp->rq_pages[i] == NULL) {
 623                        struct page *p = alloc_page(GFP_KERNEL);
 624                        if (!p) {
 625                                set_current_state(TASK_INTERRUPTIBLE);
 626                                if (signalled() || kthread_should_stop()) {
 627                                        set_current_state(TASK_RUNNING);
 628                                        return -EINTR;
 629                                }
 630                                schedule_timeout(msecs_to_jiffies(500));
 631                        }
 632                        rqstp->rq_pages[i] = p;
 633                }

First of all, 500ms is a long way from "brief".  Second, it seems like
a really bad idea to sleep while holding a contended resource.
Shouldn't you drop those pages before sleeping?  It's been a long time
since I knew anything about the Linux vm system, but presumably the
reason that alloc_page has failed is because there aren't 26 pages
available*.  So now I'm going to sleep while holding, say, the last 24
free pages?  This looks like deadlock city, except for the valiant
efforts by the vm system to keep freeing pages, and other adjustments
to vary sv_max_mesg depending on the number of threads which mean that
in practice, deadlock is unlikely.  But figuring out the interaction
of all those other systems requires global knowledge, which is asking
for trouble in the long run.  And there are intermediate degrees of
cascading interlocked interdependencies that result in poor
performance which aren't technically a complete deadlock.

(* I'm going to have to do some more digging here, because in my
specific situation, vmstat generally reports between 100M and 600M
free, so I'm admittedly not clear on why alloc_page is failing me,
unless there is a difference between "free" and "available for nfsd".)
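What I mean by "drop those pages before sleeping" is roughly this -
an untested sketch of the same loop that releases everything before
parking and restarts the pass afterwards (it also gives back pages
retained from the previous request, which may or may not be
desirable):

 retry:
	for (i = 0; i < pages; i++)
		while (rqstp->rq_pages[i] == NULL) {
			struct page *p = alloc_page(GFP_KERNEL);
			if (!p) {
				int j;

				/* give back what we hold so someone
				 * else can make progress */
				for (j = 0; j < pages; j++)
					if (rqstp->rq_pages[j]) {
						put_page(rqstp->rq_pages[j]);
						rqstp->rq_pages[j] = NULL;
					}
				set_current_state(TASK_INTERRUPTIBLE);
				if (signalled() || kthread_should_stop()) {
					set_current_state(TASK_RUNNING);
					return -EINTR;
				}
				schedule_timeout(msecs_to_jiffies(500));
				goto retry;
			}
			rqstp->rq_pages[i] = p;
		}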

And then there's this, in svc_xprt_enqueue()

 360 process:
 361        /* Work out whether threads are available */
 362        thread_avail = !list_empty(&pool->sp_threads);  /* threads are asleep */
 363        if (pool->sp_nwaking >= SVC_MAX_WAKING) { /* == 5 -lws */
 364                /* too many threads are runnable and trying to wake up */
 365                thread_avail = 0;
 366                pool->sp_stats.overloads_avoided++;
 367        }

Now, the sp_nwaking counter is only decremented after the
page-allocation code shown earlier, so if I've got five threads that
happen to all run into a transient problem getting the pages they
need, they're going to sleep for 500 millis and -nothing- is going
to happen.
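The obvious band-aid is to stop counting a thread as "waking" once
it parks itself in that backoff, so it no longer gates
svc_xprt_enqueue() for the full 500 millis.  Sketch below - I'm
guessing at the locking, and rq_waking is my reading of the same
patch that added sp_nwaking:

	if (!p) {
		/* about to sleep: don't keep counting against
		 * SVC_MAX_WAKING while we're parked */
		if (rqstp->rq_waking) {
			spin_lock_bh(&pool->sp_lock);
			pool->sp_nwaking--;
			rqstp->rq_waking = 0;
			spin_unlock_bh(&pool->sp_lock);
		}
		set_current_state(TASK_INTERRUPTIBLE);
		/* existing signal check and schedule_timeout() follow */
	}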

I see that this overloads_avoided branch disappeared in more recent
kernels, was there some discussion of it that you can point me to so I
don't have to rehash an old topic? I think that removing it actually
increases the likelihood of deadlock.

Finally, zero-copy is great, awesome, I love it.  But it seems
profligate to allocate an entire 26 pages for every operation when
only 9% of them (at least in my workload) are writes.  All the rest of
the operations only need a small fraction of that to be preallocated.
I don't have a concrete suggestion here, I'd have to do some tinkering
with the code first.  I'm just... saying.  Maybe allocate two pages up
front, then wait to see if the others are needed and allocate them at
that time.  If you're willing to sleep in svc_recv while holding lots
of pages, it's no worse to sleep while holding one or two.
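To make the "couple of pages up front" idea concrete, I'm picturing
something like the following.  Pure sketch: svc_alloc_arg_pages() is
an invented helper, and the write path would have to call it again
once the decoded request size is known:

	/* invented helper: top rq_pages[] up to 'needed' pages,
	 * reusing anything kept from the previous request */
	static int svc_alloc_arg_pages(struct svc_rqst *rqstp, int needed)
	{
		int i;

		for (i = 0; i < needed; i++) {
			if (rqstp->rq_pages[i])
				continue;
			rqstp->rq_pages[i] = alloc_page(GFP_KERNEL);
			if (!rqstp->rq_pages[i])
				return -ENOMEM;	/* caller decides how to wait */
		}
		return 0;
	}

	/* in svc_recv(): just enough for the rpc header and a small op */
	err = svc_alloc_arg_pages(rqstp, 2);

	/* in the NFS write path, once the request size is known */
	err = svc_alloc_arg_pages(rqstp,
				  (serv->sv_max_mesg + PAGE_SIZE) / PAGE_SIZE);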

