All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org, Eric Sandeen <sandeen@sandeen.net>,
	Christoph Hellwig <hch@lst.de>,
	xfs@oss.sgi.com
Subject: Re: panic on 4.20 server exporting xfs filesystem
Date: Fri, 6 Mar 2015 07:59:22 +1100	[thread overview]
Message-ID: <20150305205922.GF18360@dastard> (raw)
In-Reply-To: <20150305204749.GA17934@fieldses.org>

On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote:
> On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote:
> > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote:
> > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote:
> > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> > > > > Ah-hah:
> > > > > 
> > > > > 	static void
> > > > > 	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> > > > > 	{
> > > > > 		...
> > > > > 		nfsd4_cb_layout_fail(ls);
> > > > > 
> > > > > That'd do it!
> > > > > 
> > > > > Haven't tried to figure out why exactly that's getting called, and why
> > > > > only rarely.  Some intermittent problem with the callback path, I guess.
> > > > > 
> > > > > Anyway, I think that solves most of the mystery....
> > > > 
> > > > Ooops, that was a nasty git merge error in the last rebase, see the fix
> > > > below.
> > > 
> > > Thanks!
> > 
> > And with that fix things look good.
> > 
> > I'm still curious why the callbacks are failling.  It's also logging
> > "nfsd: client 192.168.122.32 failed to respond to layout recall".
> 
> I spoke too soon, I'm still not getting through my usual test run--the most
> recent run is hanging in generic/247 with the following in the server logs.
> 
> But I probably still won't get a chance to look at this any closer till after
> vault.
> 
> --b.
> 
> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62
> kswapd0: page allocation failure: order:0, mode:0x120

[snip network driver memory allocation failure]

> active_anon:7053 inactive_anon:2435 isolated_anon:0
>  active_file:88743 inactive_file:89505 isolated_file:32
>  unevictable:0 dirty:9786 writeback:0 unstable:0
>  free:3571 slab_reclaimable:227807 slab_unreclaimable:75772
>  mapped:21010 shmem:380 pagetables:1567 bounce:0
>  free_cma:0

Looks like there should be heaps of reclaimable memory...

> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..

So there's a layout recall pending...

> nfsd: fence failed for client 192.168.122.32: -2!
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf
> INFO: task nfsd:17653 blocked for more than 120 seconds.
>       Not tainted 4.0.0-rc2-09922-g26cbcc7 #89
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> nfsd            D ffff8800753a7848 11720 17653      2 0x00000000
>  ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580
>  ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848
>  ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50
> Call Trace:
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff81a95737>] schedule+0x37/0x90
>  [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30
>  [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400
>  [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130
>  [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0

and the nfsd got hung up on the inode mutex during  a write.

Which means there's some other process blocked holding the i_mutex.
sysrq-w and sysrq-t is probably going to tell us more here.

I suspect we'll have another write stuck in break_layout().....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Christoph Hellwig <hch@lst.de>,
	Eric Sandeen <sandeen@sandeen.net>,
	linux-nfs@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: panic on 4.20 server exporting xfs filesystem
Date: Fri, 6 Mar 2015 07:59:22 +1100	[thread overview]
Message-ID: <20150305205922.GF18360@dastard> (raw)
In-Reply-To: <20150305204749.GA17934@fieldses.org>

On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote:
> On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote:
> > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote:
> > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote:
> > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> > > > > Ah-hah:
> > > > > 
> > > > > 	static void
> > > > > 	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> > > > > 	{
> > > > > 		...
> > > > > 		nfsd4_cb_layout_fail(ls);
> > > > > 
> > > > > That'd do it!
> > > > > 
> > > > > Haven't tried to figure out why exactly that's getting called, and why
> > > > > only rarely.  Some intermittent problem with the callback path, I guess.
> > > > > 
> > > > > Anyway, I think that solves most of the mystery....
> > > > 
> > > > Ooops, that was a nasty git merge error in the last rebase, see the fix
> > > > below.
> > > 
> > > Thanks!
> > 
> > And with that fix things look good.
> > 
> > I'm still curious why the callbacks are failling.  It's also logging
> > "nfsd: client 192.168.122.32 failed to respond to layout recall".
> 
> I spoke too soon, I'm still not getting through my usual test run--the most
> recent run is hanging in generic/247 with the following in the server logs.
> 
> But I probably still won't get a chance to look at this any closer till after
> vault.
> 
> --b.
> 
> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62
> kswapd0: page allocation failure: order:0, mode:0x120

[snip network driver memory allocation failure]

> active_anon:7053 inactive_anon:2435 isolated_anon:0
>  active_file:88743 inactive_file:89505 isolated_file:32
>  unevictable:0 dirty:9786 writeback:0 unstable:0
>  free:3571 slab_reclaimable:227807 slab_unreclaimable:75772
>  mapped:21010 shmem:380 pagetables:1567 bounce:0
>  free_cma:0

Looks like there should be heaps of reclaimable memory...

> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..

So there's a layout recall pending...

> nfsd: fence failed for client 192.168.122.32: -2!
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf
> INFO: task nfsd:17653 blocked for more than 120 seconds.
>       Not tainted 4.0.0-rc2-09922-g26cbcc7 #89
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> nfsd            D ffff8800753a7848 11720 17653      2 0x00000000
>  ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580
>  ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848
>  ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50
> Call Trace:
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff81a95737>] schedule+0x37/0x90
>  [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30
>  [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400
>  [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130
>  [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0

and the nfsd got hung up on the inode mutex during  a write.

Which means there's some other process blocked holding the i_mutex.
sysrq-w and sysrq-t is probably going to tell us more here.

I suspect we'll have another write stuck in break_layout().....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2015-03-05 20:59 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-03 22:10 panic on 4.20 server exporting xfs filesystem J. Bruce Fields
2015-03-03 22:10 ` J. Bruce Fields
2015-03-03 22:44 ` Dave Chinner
2015-03-03 22:44   ` Dave Chinner
2015-03-04  2:08   ` J. Bruce Fields
2015-03-04  2:08     ` J. Bruce Fields
2015-03-04  4:41     ` Dave Chinner
2015-03-04  4:41       ` Dave Chinner
2015-03-05 13:19       ` Christoph Hellwig
2015-03-05 13:19         ` Christoph Hellwig
2015-03-05 15:21         ` J. Bruce Fields
2015-03-05 15:21           ` J. Bruce Fields
2015-03-08 13:08         ` Tom Haynes
2015-03-08 13:08           ` Tom Haynes
2015-03-04 15:54     ` J. Bruce Fields
2015-03-04 15:54       ` J. Bruce Fields
2015-03-04 22:09       ` Dave Chinner
2015-03-04 22:09         ` Dave Chinner
2015-03-04 22:27         ` J. Bruce Fields
2015-03-04 22:27           ` J. Bruce Fields
2015-03-04 22:45           ` Dave Chinner
2015-03-04 22:45             ` Dave Chinner
2015-03-04 22:49             ` Eric Sandeen
2015-03-04 22:49               ` Eric Sandeen
2015-03-04 22:56               ` Dave Chinner
2015-03-04 22:56                 ` Dave Chinner
2015-03-05  4:08                 ` J. Bruce Fields
2015-03-05  4:08                   ` J. Bruce Fields
2015-03-05 13:17                   ` Christoph Hellwig
2015-03-05 13:17                     ` Christoph Hellwig
2015-03-05 15:01                     ` J. Bruce Fields
2015-03-05 15:01                       ` J. Bruce Fields
2015-03-05 17:02                       ` J. Bruce Fields
2015-03-05 17:02                         ` J. Bruce Fields
2015-03-05 20:47                         ` J. Bruce Fields
2015-03-05 20:47                           ` J. Bruce Fields
2015-03-05 20:59                           ` Dave Chinner [this message]
2015-03-05 20:59                             ` Dave Chinner
2015-03-06 20:47                             ` J. Bruce Fields
2015-03-06 20:47                               ` J. Bruce Fields
2015-03-19 17:27                               ` Christoph Hellwig
2015-03-19 17:27                                 ` Christoph Hellwig
2015-03-19 18:47                                 ` J. Bruce Fields
2015-03-19 18:47                                   ` J. Bruce Fields
2015-03-20  6:49                                   ` Christoph Hellwig
2015-03-20  6:49                                     ` Christoph Hellwig
2015-03-08 15:30                           ` Christoph Hellwig
2015-03-08 15:30                             ` Christoph Hellwig
2015-03-09 19:45                             ` J. Bruce Fields
2015-03-09 19:45                               ` J. Bruce Fields
2015-03-20  4:06                     ` Kinglong Mee
2015-03-20  4:06                       ` Kinglong Mee
2015-03-20  6:50                       ` Christoph Hellwig
2015-03-20  6:50                         ` Christoph Hellwig
2015-03-20  7:56                         ` [PATCH] NFSD: Fix infinite loop in nfsd4_cb_layout_fail() Kinglong Mee
2015-03-20  7:56                           ` Kinglong Mee
2015-03-15 12:58 ` panic on 4.20 server exporting xfs filesystem Christoph Hellwig
2015-03-15 12:58   ` Christoph Hellwig
2015-03-16 14:27   ` J. Bruce Fields
2015-03-16 14:27     ` J. Bruce Fields
2015-03-17 10:30     ` Christoph Hellwig
2015-03-17 10:30       ` Christoph Hellwig
2015-03-18 10:50     ` Christoph Hellwig
2015-03-18 10:50       ` Christoph Hellwig
2015-03-27 10:41 ` Christoph Hellwig
2015-03-27 14:50   ` Jeff Layton
2015-03-30 16:44     ` Christoph Hellwig
2015-03-27 15:13   ` J. Bruce Fields
2015-04-26 16:19   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150305205922.GF18360@dastard \
    --to=david@fromorbit.com \
    --cc=bfields@fieldses.org \
    --cc=hch@lst.de \
    --cc=linux-nfs@vger.kernel.org \
    --cc=sandeen@sandeen.net \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.