From: Lyle Seaman
To: linux-nfs@vger.kernel.org
Date: Thu, 14 Oct 2010 20:09:26 -0400
Subject: Re: Thread scheduling issues
In-Reply-To: <20101014171522.GA553@fieldses.org>

Sorry to add to the confusion.  It is a 2.6.32.23 kernel, allegedly:

Linux prodLMS03 2.6.32-23-server #37-Ubuntu SMP Fri Jun 11 09:11:11
UTC 2010 x86_64 GNU/Linux

Though I won't be 100% sure what's in it unless I build one myself from
source.

>>  363        if (pool->sp_nwaking >= SVC_MAX_WAKING) {
>> //  == 5 -lws
>>  364                /* too many threads are runnable and trying to wake up */
>>  365                thread_avail = 0;
>>  366                pool->sp_stats.overloads_avoided++;
>>  367        }
>
> That strikes me as a more likely cause of your problem.  But it was
> reverted in v2.6.33, and you say you're using v2.6.35.22, so there's
> some confusion here?

I think that the problem is the combination of the above with the
half-second sleep, and my next step is to build a kernel and tweak both
of those.  (The sleep I'm thinking of is the alloc_page() retry in
svc_recv(); there's a paraphrase of it after the symptom list below.)

> I think the code was trying to avoid reading an rpc request
> until it was reasonably sure it would have the resources to complete it,
> so we didn't end up with a thread stuck waiting on a half-received rpc.

Yes, that's what it looks like to me too.  I'd have to think hard about
how to balance the different factors and avoid deadlock, particularly
where the client and server are sharing the same VM pool.  It seems to
me that preventing stuck threads is most important if threads are the
scarce resource, but I suppose memory could be a problem too.  One issue
is that data is acknowledged when it is consumed by NFS, so you can't
just decide "whoops, I'm out of memory", throw it away, and let the
client resend - at least, not if using RPC/TCP.  (We could do that in
AFS because we had two different kinds of acknowledgements,
incidentally.)

The symptoms of the problem are:

- workload is biased towards small writes, ~4k average size.  Ops are:
  getattr 40%, setattr 5%, lookup 10%, access 19%, read 5%, write 8%,
  commit 8%.
- low CPU utilization - usr+sys < 5%, idle == 0, wait > 95%.
- high disk-wait percentage ==> "disk bound", but that's a bit
  misleading, see below.
- low number of nfsd threads in D state, usually 3-8, very occasionally
  20.  50 nfsd threads are configured, but those last 30 just never run.
- small numbers (3-ish) of simultaneous I/Os delivered to the disk
  subsystem, as reported by 'sar -d'.
- 2 clients with lots of pent-up demand (> 50 threads on each are
  blocked waiting for NFS).
- local processes on the NFS server are still getting great performance
  out of the disk subsystem, so the driver/HBA isn't the bottleneck.
  sar reports mean svc times < 4ms.  I don't know how much to trust sar,
  though.

The disk subsystem is one LUN in a shared SAN with lots of capacity, if
only I could deliver more than a handful of ops to it at one time.
"Disk-bound" just means that adding CPU won't help with the same
workload, but it doesn't mean that the disk subsystem is incapable of
handling additional load if I can get it there.
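To be concrete about the half-second sleep: it's the page
pre-allocation loop at the top of svc_recv() in net/sunrpc/svc_xprt.c.
Roughly, paraphrased from my reading of the 2.6.32 source (typed by
hand, so don't trust the details):

        /* svc_recv() grabs enough pages for a worst-case request and
         * reply before it will even read the RPC off the socket.  If
         * alloc_page() fails, the nfsd thread naps for half a second
         * and tries again, hanging on to whatever pages it has
         * already collected in the meantime.
         */
        pages = (serv->sv_max_mesg + PAGE_SIZE) / PAGE_SIZE;
        for (i = 0; i < pages; i++)
                while (rqstp->rq_pages[i] == NULL) {
                        struct page *p = alloc_page(GFP_KERNEL);
                        if (p == NULL) {
                                set_current_state(TASK_INTERRUPTIBLE);
                                if (signalled() || kthread_should_stop()) {
                                        set_current_state(TASK_RUNNING);
                                        return -EINTR;
                                }
                                schedule_timeout(msecs_to_jiffies(500));
                        }
                        rqstp->rq_pages[i] = p;
                }

If I'm reading it right, a thread that lands here backs off for a full
500ms even if memory frees up sooner, which together with the
SVC_MAX_WAKING cap above would put a fairly low ceiling on how quickly
new requests get picked up.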
Watching /proc/fs/nfsd/pool_stats shows deltas from the previous sample
in the range of

  0  ~4x  x  ~2x  x  0
  (pool, packets, sockets_queued, threads_woken, overloads_avoided,
   threads_timedout)

That is, at every sample, the value of sockets_queued is exactly equal
to overloads_avoided, and can be as much as half the number of total
calls handled (though usually it is much less).  5 threads which are
blocked waiting for the underlying filesystem op do not cause
increments of the "overloads_avoided" counter.

That looks like a, hmm, heuristic for systems with more disk capacity
and less CPU than I have.  I don't absolutely *know* that this
particular alloc_page is failing, but it sure fits the data.  Next step
for me is instrumenting and counting that branch (sketch at the end of
this mail), but I thought I'd check to see if this had been talked
about before, since I can't find it in the archives or bugzilla.

Exporting the fs with -o async helps, but I don't like doing it.
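In case anyone is curious, the instrumentation I have in mind is
nothing fancy - a rate-limited printk in the overload branch, something
like this (untested sketch against my reading of the 2.6.32 source, so
the field names may be slightly off):

        if (pool->sp_nwaking >= SVC_MAX_WAKING) {
                /* too many threads are runnable and trying to wake up */
                thread_avail = 0;
                pool->sp_stats.overloads_avoided++;
                /* temporary debug: how often and how hard we hit this */
                if (printk_ratelimit())
                        printk(KERN_INFO "nfsd: pool %u overload avoided,"
                               " nwaking=%u avoided=%lu\n",
                               pool->sp_id, pool->sp_nwaking,
                               pool->sp_stats.overloads_avoided);
        }

plus a similar counter where alloc_page() fails in svc_recv(), so I can
see which of the two is actually biting.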