All of lore.kernel.org
 help / color / mirror / Atom feed
From: "J. Bruce Fields" <bfields@fieldses.org>
To: Nix <nix@esperi.org.uk>
Cc: "Ted Ts'o" <tytso@mit.edu>,
	linux-kernel@vger.kernel.org,
	Bryan Schumaker <bjschuma@netapp.com>,
	Peng Tao <bergwolf@gmail.com>,
	Trond.Myklebust@netapp.com, gregkh@linuxfoundation.org,
	linux-nfs@vger.kernel.org,
	Stanislav Kinsbursky <skinsbursky@parallels.com>
Subject: Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
Date: Tue, 23 Oct 2012 12:46:21 -0400	[thread overview]
Message-ID: <20121023164621.GC3040@fieldses.org> (raw)
In-Reply-To: <874nllxi7e.fsf_-_@spindle.srvr.nix>

On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote:
> On 23 Oct 2012, J. Bruce Fields uttered the following:
> > nfs-utils shouldn't be capable of oopsing the kernel, so from my
> > (selfish) point of view I'd actually rather you stick with whatever you
> > have and try to reproduce the oops.
> 
> Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was
> rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss
> in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I
> have runs of NULs in my logs now, which keep eating the oopses):
> 
> [while in 3.6.1]
> [   88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> [   88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> [   89.648152] ------------[ cut here ]------------
> [   89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42()
> [   89.648614] Hardware name: empty
> [   89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> [   89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1
> [   89.649610] Call Trace:
> [   89.649832]  [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b
> [   89.650063]  [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c
> [   89.650292]  [<ffffffff8112efbf>] drop_nlink+0x25/0x42
> [   89.650533]  [<ffffffff81187112>] ext4_dec_count+0x26/0x28
> [   89.650763]  [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4
> [   89.650993]  [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7
> [   89.651224]  [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7
> [   89.651454]  [<ffffffff81127a53>] sys_renameat+0x1b1/0x228
> [   89.651682]  [<ffffffff8114bda1>] ? fsnotify+0x226/0x249
> [   89.651911]  [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20
> [   89.652145]  [<ffffffff8111a240>] ? vfs_write+0x116/0x142
> [   89.652372]  [<ffffffff81127ae5>] sys_rename+0x1b/0x1e
> [   89.652601]  [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b
> [...]
> [while having just booted into 3.6.1 after some time in 3.6.3: the FS
>  was clean, and fscked on the previous boot into 3.6.3 after a previous
>  instance of this bug]
> Oct 23 17:18:26 spindle crit: [   67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828)
> 
> This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to
> tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now.
> 
> 
> Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to
> turn off panic_on_oops when I come home from the holidays).
> 
> It's a lockd problem, and probably happens during delivery of mail over
> NFS (my mailserver load soars when it happens):
> 
> [  813.110354] ------------[ cut here ]------------
> [  813.110585] kernel BUG at fs/lockd/mon.c:150!

So nsm_mon_unmon() is being passed a NULL client.

There are three container patches between 3.6.1 and 3.6.3:

	lockd: per-net NSM client creation and destruction helpers introduced
	lockd: use rpc client's cl_nodename for id encoding
	lockd: create and use per-net NSM RPC clients on MON/UNMON requests

and that last does change nsm_monitor's call to nsm_mon_unmon, so that's
almost certainly it....

Looks like there's some confusion about whether nsm_client_get() returns
NULL or an error?

--b.

> [  813.110878] invalid opcode: 0000 [#1] SMP
> [  813.111173] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> [  813.111727] CPU 1
> [  813.111772] Pid: 1040, comm: lockd Not tainted 3.6.3-dirty #1 empty empty/S7010
> [  813.112388] RIP: 0010:[<ffffffff8120fbbc>]  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
> [  813.112840] RSP: 0018:ffff88062163dcf0  EFLAGS: 00010246
> [  813.113069] RAX: ffff88062163dcf8 RBX: 0000000000000000 RCX: 0000000000000000
> [  813.113303] RDX: ffff88062163dd68 RSI: 0000000000000002 RDI: ffff88062163dd40
> [  813.113537] RBP: ffff88062163dd50 R08: 0000000000000000 R09: ffffffff8120c463
> [  813.113771] R10: ffffffff8120c463 R11: 0000000000000001 R12: 0000000000000000
> [  813.114006] R13: ffff88061f067e40 R14: ffff88059c763a00 R15: ffff88062163de28
> [  813.114241] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000
> [  813.114651] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  813.114882] CR2: ffffffffff600400 CR3: 0000000001a0b000 CR4: 00000000000007e0
> [  813.115116] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  813.115350] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  813.115584] Process lockd (pid: 1040, threadinfo ffff88062163c000, task ffff880621611640)
> [  813.115994] Stack:
> [  813.116210]  ffff88062163dd90 ffff8805f4e044b1 00000003000186b5 ffffffff00000010
> [  813.116761]  ffff8805f4e044c1 000000000000008c 0000000000000000 ffff88062163dcf8
> [  813.117310]  ffff88062163dd68 0000000000000000 ffff8806200ce088 ffff8805f4e04400
> [  813.117861] Call Trace:
> [  813.118083]  [<ffffffff8120fee4>] nsm_monitor+0x100/0x157
> [  813.118315]  [<ffffffff81211b4a>] nlm4svc_retrieve_args+0x62/0xd7
> [  813.118547]  [<ffffffff81211f49>] nlm4svc_proc_lock+0x3c/0xb5
> [  813.118779]  [<ffffffff8121167b>] ? nlm4svc_decode_lockargs+0x47/0xb2
> [  813.119016]  [<ffffffff814d89ca>] svc_process+0x3bf/0x6a1
> [  813.119246]  [<ffffffff8120d5f0>] lockd+0x127/0x164
> [  813.119474]  [<ffffffff8120d4c9>] ? set_grace_period+0x8a/0x8a
> [  813.119708]  [<ffffffff8107bcbc>] kthread+0x8b/0x93
> [  813.119938]  [<ffffffff815012f4>] kernel_thread_helper+0x4/0x10
> [  813.120170]  [<ffffffff8107bc31>] ? kthread_worker_fn+0xe1/0xe1
> [  813.120401]  [<ffffffff815012f0>] ? gs_change+0xb/0xb
> [  813.120628] Code: b8 10 00 00 00 48 89 45 c0 48 8d 81 8c 00 00 00 b9 08 00 00 00 48 89 45 c8 89 d8 f3 ab 48 8d 45 a8 48 89 55 e0 48 89 45 d8 75 02 <0f> 0b 89 f6 48 c7 02 00 00 00 00 4c 89 c7 48 6b f6 38 ba 00 04
> [  813.123627] RIP  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
> [  813.123894]  RSP <ffff88062163dcf0>
> [  813.124129] ---[ end trace 11eb11a091ffd910 ]---
> 
> Aside: I am running with this patch, to prevent mounting of one
> NFS-mounted directory under another from failing all accesses with a
> spurious -EBUSY, a 3.6 regression -- but I can't see how this could
> cause it. I'm running with this in 3.6.1 as well:
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 8086636..649a112 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2404,6 +2404,10 @@ out_unalias:
>  	if (likely(!d_mountpoint(alias))) {
>  		__d_move(alias, dentry);
>  		ret = alias;
> +	} else {
> +		printk(KERN_WARNING "VFS: __d_move()ing a d_mountpoint(), uh oh\n");
> +		__d_move(alias, dentry);
> +		ret = alias;
>  	}
>  out_err:
>  	spin_unlock(&inode->i_lock);
> 

  reply	other threads:[~2012-10-23 16:46 UTC|newest]

Thread overview: 105+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-10-22 16:17 Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression? Nix
2012-10-23  1:33 ` J. Bruce Fields
2012-10-23 14:07   ` Nix
2012-10-23 14:30     ` J. Bruce Fields
2012-10-23 16:32       ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix
2012-10-23 16:46         ` J. Bruce Fields [this message]
2012-10-23 16:54           ` J. Bruce Fields
2012-10-23 16:56           ` Myklebust, Trond
2012-10-23 16:56             ` Myklebust, Trond
2012-10-23 17:05             ` Nix
2012-10-23 17:36               ` Nix
2012-10-23 17:43                 ` J. Bruce Fields
2012-10-23 17:44                 ` Myklebust, Trond
2012-10-23 17:57                   ` Myklebust, Trond
2012-10-23 17:57                     ` Myklebust, Trond
     [not found]                   ` <1351015039.4622.23.camel@lade.trondhjem.org>
2012-10-23 18:23                     ` Myklebust, Trond
2012-10-23 18:23                       ` Myklebust, Trond
2012-10-23 19:49                       ` Nix
2012-10-24 10:18                         ` [PATCH] lockd: fix races in per-net NSM client handling Stanislav Kinsbursky
2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
2012-10-23 20:57           ` Nix
2012-10-23 22:19           ` Theodore Ts'o
2012-10-23 22:47             ` Nix
2012-10-23 23:16               ` Theodore Ts'o
2012-10-23 23:06             ` Nix
2012-10-23 23:28               ` Theodore Ts'o
2012-10-23 23:34                 ` Nix
2012-10-24  0:57             ` Eric Sandeen
2012-10-24 20:17               ` Jan Kara
2012-10-26 15:25                 ` Eric Sandeen
2012-10-24 19:13             ` Jannis Achstetter
2012-10-24 19:13               ` Jannis Achstetter
2012-10-24 21:31               ` Theodore Ts'o
2012-10-24 22:05                 ` Jannis Achstetter
2012-10-24 23:47                 ` Nix
2012-10-25 17:02                 ` Felipe Contreras
2012-10-24 21:04             ` Jannis Achstetter
2012-10-24  1:13           ` Eric Sandeen
2012-10-24  1:13             ` Eric Sandeen
2012-10-24  4:15             ` Nix
2012-10-24  4:27               ` Eric Sandeen
2012-10-24  5:23                 ` Theodore Ts'o
2012-10-24  7:00                   ` Hugh Dickins
2012-10-24 11:46                     ` Nix
2012-10-24 11:45                   ` Nix
2012-10-24 17:22                   ` Eric Sandeen
2012-10-24 19:49                   ` Nix
2012-10-24 19:54                     ` Nix
2012-10-24 20:30                     ` Eric Sandeen
2012-10-24 20:34                       ` Nix
2012-10-24 20:45                     ` Nix
2012-10-24 21:08                     ` Theodore Ts'o
2012-10-24 23:27                       ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) Nix
2012-10-24 23:42                         ` Nix
2012-10-25  1:10                         ` Theodore Ts'o
2012-10-25  1:45                           ` Nix
2012-10-25  1:45                             ` Nix
2012-10-25 14:12                             ` Theodore Ts'o
2012-10-25 14:15                               ` Nix
2012-10-25 17:39                                 ` Nix
2012-10-25 11:06                           ` Nix
2012-10-26  0:22                           ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) (possibly blockdev / arcmsr at fault??) Nix
2012-10-26  0:11               ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Ric Wheeler
2012-10-26  0:43                 ` Theodore Ts'o
2012-10-26 12:12                   ` Nix
2012-10-26 20:35           ` Eric Sandeen
2012-10-26 20:37             ` Nix
2012-10-26 20:56               ` Theodore Ts'o
2012-10-26 20:56                 ` Theodore Ts'o
2012-10-26 20:59                 ` Nix
2012-10-26 20:59                   ` Nix
2012-10-26 21:15                   ` Theodore Ts'o
2012-10-26 21:15                     ` Theodore Ts'o
2012-10-26 21:19                     ` Nix
2012-10-27  0:22                       ` Theodore Ts'o
2012-10-27  0:22                         ` Theodore Ts'o
2012-10-27 12:45                         ` Nix
2012-10-27 17:55                           ` Theodore Ts'o
2012-10-27 18:47                             ` Nix
2012-10-27 21:19                               ` Eric Sandeen
2012-10-27 21:21                                 ` Nix
2012-10-27 21:23                                   ` Eric Sandeen
2012-10-27 21:29                                     ` Nix
2012-10-27 21:34                                       ` Eric Sandeen
2012-10-27 21:40                                         ` Nix
     [not found]                                         ` <09758CEA-74B5-48D0-8075-BB723A2CABBB@dilger.ca>
2012-10-29  2:09                                           ` Eric Sandeen
2012-10-27 22:42                                 ` Eric Sandeen
2012-10-29  1:00                                   ` Theodore Ts'o
2012-10-29  1:04                                     ` Nix
2012-10-29  2:24                                     ` Eric Sandeen
2012-10-29  2:34                                       ` Theodore Ts'o
2012-10-29  2:35                                         ` Eric Sandeen
2012-10-29  2:42                                           ` Theodore Ts'o
2012-10-27 18:30                           ` Eric Sandeen
2012-10-27  3:11                     ` Jim Rees
2012-10-27  3:11                       ` Jim Rees
2012-10-27  8:01             ` Testing ext4's journal via simulating a reboot via KVM Theodore Ts'o
2012-10-28  4:23           ` [PATCH] ext4: fix unjournaled inode bitmap modification Eric Sandeen
2012-10-28  4:23             ` Eric Sandeen
2012-10-28 13:59             ` Nix
2012-10-29  2:30             ` [PATCH -v3] " Theodore Ts'o
2012-10-29  2:30               ` Theodore Ts'o
2012-10-29  3:24               ` Eric Sandeen
2012-10-29  5:07               ` Andreas Dilger
2012-10-29 17:08               ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121023164621.GC3040@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=Trond.Myklebust@netapp.com \
    --cc=bergwolf@gmail.com \
    --cc=bjschuma@netapp.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=nix@esperi.org.uk \
    --cc=skinsbursky@parallels.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.