From: "Bruno Prémont" <bonbons@linux-vserver.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-kernel@vger.kernel.org,
	Markus Trippelsdorf <markus@trippelsdorf.de>,
	xfs-masters@oss.sgi.com, xfs@oss.sgi.com,
	Christoph Hellwig <hch@infradead.org>,
	Alex Elder <aelder@sgi.com>, Dave Chinner <dchinner@redhat.com>
Subject: Re: 2.6.39-rc3, 2.6.39-rc4: XFS lockup - regression since 2.6.38
Date: Thu, 5 May 2011 22:35:13 +0200
Message-ID: <20110505223513.3654c041@neptune.home>
In-Reply-To: <20110505122117.GB26837@dastard>

On Thu, 05 May 2011 Dave Chinner wrote:
> On Thu, May 05, 2011 at 12:26:13PM +1000, Dave Chinner wrote:
> > On Thu, May 05, 2011 at 10:21:26AM +1000, Dave Chinner wrote:
> > > On Wed, May 04, 2011 at 12:57:36AM +0000, Jamie Heilman wrote:
> > > > Dave Chinner wrote:
> > > > > OK, so the common element here appears to be root filesystems
> > > > > with small log sizes, which means they are tail pushing all the
> > > > > time metadata operations are in progress. Definitely seems like a
> > > > > race in the AIL workqueue trigger mechanism. I'll see if I can
> > > > > reproduce this and cook up a patch to fix it.
> > > > 
> > > > Is there value in continuing to post sysrq-w, sysrq-l, xfs_info, and
> > > > other assorted feedback wrt this issue?  I've had it happen twice now
> > > > myself in the past week or so, though I have no reliable reproduction
> > > > technique.  Just wondering if more data points will help isolate the
> > > > cause, and if so, how to be prepared to get them.
> > > > 
> > > > For whatever it's worth, my last lockup was while running
> > > > 2.6.39-rc5-00127-g1be6a1f with a preempt config without cgroups.
> > > 
> > > Can you all try the patch below? I've managed to trigger a couple of
> > > xlog_wait() lockups in some controlled load tests. The lockups don't
> > > appear to occur with the following patch to the race condition in
> > > the AIL workqueue trigger.
> > 
> > They are still there, just harder to hit.
> > 
> > FWIW, I've also discovered that "echo 2 > /proc/sys/vm/drop_caches"
> > gets the system moving again because that changes the push target.
> > 
> > I've found two more bugs, and my test case now reliably
> > reproduces a 5-10s pause at ~1M created 1-byte files and then
> > hangs at about 1.25M files. So there's yet another problem lurking
> > that I need to get to the bottom of.
> 
> Which, of course, was the real regression. The patch below, which
> fixes all 4 of the problems I found, has survived a couple of hours
> of testing. Please test.

It survived today's 2-hour session. I will continue testing over the
weekend to see whether it also survives the longer whole-day sessions.

I will report results at the end of the weekend (or earlier in case of trouble).

Thanks,
Bruno

> Cheers,
> 
> Dave.
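
For anyone wanting to generate a comparable load: the quoted test case
(creating on the order of a million 1-byte files) could be approximated
with a sketch along these lines. This is not Dave's actual test harness.
The function name, the directory layout, and the per-directory fan-out
are all assumptions made for illustration, and the default count is kept
small so it is safe to run anywhere.

```python
import os
import tempfile


def create_small_files(root, count, files_per_dir=1000):
    """Create `count` one-byte files under `root`.

    Files are spread over subdirectories of `files_per_dir` entries each
    (an assumed layout, chosen only to keep directories manageable).
    Returns the number of files created.
    """
    created = 0
    for i in range(count):
        subdir = os.path.join(root, "d%05d" % (i // files_per_dir))
        # Create each subdirectory once, when its first file is due.
        if i % files_per_dir == 0:
            os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, "f%07d" % i), "wb") as f:
            f.write(b"x")  # exactly one byte per file
        created += 1
    return created


if __name__ == "__main__":
    # Hypothetical small run in a temporary directory. To approach the
    # load described in the thread, point `root` at a directory on the
    # XFS filesystem under test and raise `count` towards 1.25 million.
    with tempfile.TemporaryDirectory() as tmp:
        n = create_small_files(tmp, 2000)
        print("created", n, "files under", tmp)
```

Whether this triggers the lockup will depend on the log size and kernel
version involved; it only mimics the metadata-heavy create workload
described above.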
