linux-kernel.vger.kernel.org archive mirror
* More XFS resource starvation?
@ 2010-11-15 18:30 J.H.
  2010-11-15 19:08 ` Simon Kirby
  2010-11-15 23:49 ` Dave Chinner
  0 siblings, 2 replies; 3+ messages in thread
From: J.H. @ 2010-11-15 18:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: jaxboe, Dave Chinner, Christoph Hellwig

So apparently I'm having fun tripping over all kinds of bugs lately.
I've seen this a couple of times now on the box in question.  Usually
happens after a few days, or after particularly heavy rsync traffic on
the box.

http://pastebin.osuosl.org/36014

Christoph seemed to think it's a memory exhaustion problem, so I've
included the /proc/meminfo and as you can see there's plenty of memory
around on the system.

Load has, as expected, climbed; it is currently around 1250.05 and growing slowly.

Quick overview of the underlying storage:

xfs -> md (raid 0) -+--> P812 hardware raid6 (cciss driver)
                    |
                    +--> P812 hardware raid6 (cciss driver)

This is running on an HP DL380 G7.

I saw this both on an older 2.6.30.10-105.2.23.fc11.x86_64, and
currently on 2.6.34.7-61.fc13.x86_64 (both being Fedora stock kernels)

I have not seen this on a very similar DL380 G6, with the same storage
setup and it is currently running the 2.6.30 kernel from above.

Christoph suggested increasing the nr_requests values for each of the
underlying devices, but this didn't seem to change anything
significantly on the system.
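
For reference, a minimal sketch of how the current queue depths could be
listed, assuming the standard sysfs layout where each block device exposes
/sys/block/<device>/queue/nr_requests. The function name read_nr_requests
is made up for this illustration and is not part of any existing tool.

```python
from pathlib import Path


def read_nr_requests(sys_block="/sys/block"):
    """Return {device: nr_requests} for every block device that exposes it.

    Assumes the usual sysfs layout: <sys_block>/<device>/queue/nr_requests.
    """
    depths = {}
    for queue_file in sorted(Path(sys_block).glob("*/queue/nr_requests")):
        device = queue_file.parent.parent.name
        try:
            depths[device] = int(queue_file.read_text().strip())
        except (OSError, ValueError):
            # Device disappeared or the file held something unexpected.
            continue
    return depths


if __name__ == "__main__":
    for device, depth in read_nr_requests().items():
        print(f"{device}: nr_requests={depth}")
```

Comparing the values before and after a change confirms that the new
setting took effect on every underlying device, not only on the md device.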

Anyone have any ideas on what's going on?

- John 'Warthog9' Hawley

* Re: More XFS resource starvation?
  2010-11-15 18:30 More XFS resource starvation? J.H.
@ 2010-11-15 19:08 ` Simon Kirby
  2010-11-15 23:49 ` Dave Chinner
  1 sibling, 0 replies; 3+ messages in thread
From: Simon Kirby @ 2010-11-15 19:08 UTC (permalink / raw)
  To: J.H.; +Cc: linux-kernel, jaxboe, Dave Chinner, Christoph Hellwig

On Mon, Nov 15, 2010 at 10:30:38AM -0800, J.H. wrote:

> So apparently I'm having fun tripping over all kinds of bugs lately.
> I've seen this a couple of times now on the box in question.  Usually
> happens after a few days, or after particularly heavy rsync traffic on
> the box.
> 
> http://pastebin.osuosl.org/36014
> 
> Christoph seemed to think it's a memory exhaustion problem, so I've
> included the /proc/meminfo and as you can see there's plenty of memory
> around on the system.
> 
> Load has, as expected, climbed; it is currently around 1250.05 and growing slowly.
> 
> Quick overview of the underlying storage:
> 
> xfs -> md (raid 0) -+--> P812 hardware raid6 (cciss driver)
>                     |
>                     +--> P812 hardware raid6 (cciss driver)
> 
> This is running on an HP DL380 G7.
> 
> I saw this both on an older 2.6.30.10-105.2.23.fc11.x86_64, and
> currently on 2.6.34.7-61.fc13.x86_64 (both being Fedora stock kernels)
> 
> I have not seen this on a very similar DL380 G6, with the same storage
> setup and it is currently running the 2.6.30 kernel from above.
> 
> Christoph suggested increasing the nr_requests values for each of the
> underlying devices, but this didn't seem to change anything
> significantly on the system.
> 
> Anyone have any ideas on what's going on?

What does this show?

	iostat -x -k 1

In particular, "avgqu-sz" aka "average queue size" would be non-zero if
there are requests pending.  If r/s and w/s are zero over a long time
with the queue size being non-zero, the issuing of commands to the
hardware raid controller is stuck for some reason.
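
The check described above can be sketched in a few lines, assuming the
extended device table that "iostat -x" prints (a header row starting with
"Device", then one row per device). The function names are invented for
this illustration; newer sysstat releases label the queue column "aqu-sz"
instead of "avgqu-sz", so the sketch accepts either.

```python
def parse_iostat_report(text):
    """Parse one 'iostat -x' device table into {device: {column: value}}."""
    stats = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # a blank line ends the current table
            continue
        if fields[0].startswith("Device"):
            columns = fields[1:]
            continue
        if columns is not None and len(fields) == len(columns) + 1:
            try:
                stats[fields[0]] = dict(zip(columns, map(float, fields[1:])))
            except ValueError:
                continue  # not a numeric device row
    return stats


def is_stalled(row):
    """True if requests are queued but no reads or writes are completing."""
    queue = row.get("avgqu-sz", row.get("aqu-sz", 0.0))
    return row.get("r/s", 0.0) == 0.0 and row.get("w/s", 0.0) == 0.0 and queue > 0.0


def stalled_devices(reports):
    """Return the devices that look stalled in every one of the reports."""
    stalled = None
    for text in reports:
        current = {
            dev for dev, row in parse_iostat_report(text).items() if is_stalled(row)
        }
        stalled = current if stalled is None else stalled & current
    return sorted(stalled or [])
```

Feeding it several consecutive one-second reports approximates the "over a
long time" condition: a device is only reported if it shows a non-empty
queue with zero r/s and w/s in all of them.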

Since your Dirty and Writeback are pretty high, it sounds like this is the
issue.  Not sure where to go from there.

Simon-

* Re: More XFS resource starvation?
  2010-11-15 18:30 More XFS resource starvation? J.H.
  2010-11-15 19:08 ` Simon Kirby
@ 2010-11-15 23:49 ` Dave Chinner
  1 sibling, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2010-11-15 23:49 UTC (permalink / raw)
  To: J.H.; +Cc: linux-kernel, jaxboe, Christoph Hellwig

On Mon, Nov 15, 2010 at 10:30:38AM -0800, J.H. wrote:
> So apparently I'm having fun tripping over all kinds of bugs lately.
> I've seen this a couple of times now on the box in question.  Usually
> happens after a few days, or after particularly heavy rsync traffic on
> the box.
> 
> http://pastebin.osuosl.org/36014
> 
> Christoph seemed to think it's a memory exhaustion problem, so I've
> included the /proc/meminfo and as you can see there's plenty of memory
> around on the system.

That looks very much like some IOs have not been completed and XFS
is waiting around for them to complete. Both the xfsbufd and the
flush daemons are stuck in get_request_wait(), which implies that
the request queue is full and not being serviced. Various rsync processes
are stuck waiting for log buffer completion, waiting for buffer reads
to complete, etc, which implies that IO is simply not being completed.
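
A rough way to get the same picture without a full stack dump is to list
the tasks in uninterruptible sleep (state "D") together with the kernel
symbol they are waiting in. The sketch below assumes the usual procfs
files /proc/<pid>/stat and /proc/<pid>/wchan; the helper names are made up
for this illustration, and it only walks the top-level /proc entries, not
the individual threads of each process.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """Return sorted (pid, comm, wchan) tuples for tasks in state 'D'."""
    tasks = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited or its stat file was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = ""  # wchan may be unavailable or restricted
        tasks.append((pid, comm, wchan))
    return sorted(tasks)


if __name__ == "__main__":
    for pid, comm, wchan in blocked_tasks():
        print(f"{pid:>7} {comm:<16} {wchan}")
```

If many tasks sit in the same block-layer wait function for minutes at a
time, that is consistent with a request queue that is not being serviced.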

My experience with such hangs is that they are typically caused by a
storage problem (e.g. lost interrupt, IO not completed, controller
firmware problem, etc).

> Load has, as expected, climbed; it is currently around 1250.05 and growing slowly.
> 
> Quick overview of the underlying storage:
> 
> xfs -> md (raid 0) -+--> P812 hardware raid6 (cciss driver)
>                     |
>                     +--> P812 hardware raid6 (cciss driver)
> 
> This is running on an HP DL380 G7.
> 
> I saw this both on an older 2.6.30.10-105.2.23.fc11.x86_64, and
> currently on 2.6.34.7-61.fc13.x86_64 (both being Fedora stock kernels)
> 
> I have not seen this on a very similar DL380 G6, with the same storage
> setup and it is currently running the 2.6.30 kernel from above.
> 
> Christoph suggested increasing the nr_requests values for each of the
> underlying devices, but this didn't seem to change anything
> significantly on the system.
> 
> Anyone have any ideas on what's going on?

Any other information in the log (e.g. from the cciss driver)? Are
the raid controllers all running the latest (and same) firmware? I'd
be wanting to make sure that all the storage below the filesystem is
working correctly before looking at anything filesystem related...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
