linux-kernel.vger.kernel.org archive mirror
* Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression?
@ 2012-10-22 16:17 Nix
  2012-10-23  1:33 ` J. Bruce Fields
  0 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-22 16:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh

I just had a panic/oops on upgrading from 3.6.1 to 3.6.3, after weeks of
smooth operation on 3.6.1: one of the NFS changes that went into one of
the two latest stable kernels appears to be lethal after around half an
hour of uptime. The oops came from NFSv4, IIRC (relying on memory since
my camera was recharging and there is no netconsole from that box
because it is where the netconsole logs go, so I'll have to reproduce it
later today). The machine is an NFSv3 server only at present, with no
NFSv4 running (though NFSv4 is built in).

This is just a heads-up, not intended to be sufficient to track it down
unless it is instantly obvious: more later, including a .config dump and
proper oops report if it turns out to be reproducible. I just have to
recover from the usual panic-induced fs corruption and take a backup
first.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression?
  2012-10-22 16:17 Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression? Nix
@ 2012-10-23  1:33 ` J. Bruce Fields
  2012-10-23 14:07   ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: J. Bruce Fields @ 2012-10-23  1:33 UTC (permalink / raw)
  To: Nix
  Cc: linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	linux-nfs

On Mon, Oct 22, 2012 at 05:17:04PM +0100, Nix wrote:
> I just had a panic/oops on upgrading from 3.6.1 to 3.6.3, after weeks of
> smooth operation on 3.6.1: one of the NFS changes that went into one of
> the two latest stable kernels appears to be lethal after around half an
> hour of uptime. The oops came from NFSv4, IIRC (relying on memory since
> my camera was recharging and there is no netconsole from that box
> because it is where the netconsole logs go, so I'll have to reproduce it
> later today). The machine is an NFSv3 server only at present, with no
> NFSv4 running (though NFSv4 is built in).

Note recent clients may try to negotiate NFSv4 by default, so it's
possible to use it without knowing.

You didn't change anything else about your server or clients recently?

> This is just a heads-up, not intended to be sufficient to track it down
> unless it is instantly obvious: more later, including a .config dump and
> proper oops report if it turns out to be reproducible. I just have to
> recover from the usual panic-induced fs corruption and take a backup
> first.

Thanks.

I don't see an obvious candidate on a quick skim of v3.6.1..v3.6.3
commits, but of course I could be missing something.

--b.


* Re: Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression?
  2012-10-23  1:33 ` J. Bruce Fields
@ 2012-10-23 14:07   ` Nix
  2012-10-23 14:30     ` J. Bruce Fields
  0 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-23 14:07 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	linux-nfs

On 23 Oct 2012, J. Bruce Fields uttered the following:

> On Mon, Oct 22, 2012 at 05:17:04PM +0100, Nix wrote:
>> I just had a panic/oops on upgrading from 3.6.1 to 3.6.3, after weeks of
>> smooth operation on 3.6.1: one of the NFS changes that went into one of
>> the two latest stable kernels appears to be lethal after around half an
>> hour of uptime. The oops came from NFSv4, IIRC (relying on memory since
>> my camera was recharging and there is no netconsole from that box
>> because it is where the netconsole logs go, so I'll have to reproduce it
>> later today). The machine is an NFSv3 server only at present, with no
>> NFSv4 running (though NFSv4 is built in).
>
> Note recent clients may try to negotiate NFSv4 by default, so it's
> possible to use it without knowing.

Every NFS import from all this server's clients has 'vers=3' forced on
(for now, until I get around to figuring out what if anything needs to
be done to move to NFSv4: it may be that the answer is 'nothing', but I tread
quite carefully with this server, since my $HOME is there).

/proc/fs/nfsfs/volumes on the clients confirms that everything is v3.

> You didn't change anything else about your server or clients recently?

Nope (other than upgrading the clients to 3.6.3 in concert). Running
3.6.1 here on that server now and 3.6.3 on all the clients, and no crash
in over a day.

nfs-utils is a bit old, 1.2.6-rc6, I should probably upgrade it...

> I don't see an obvious candidate on a quick skim of v3.6.1..v3.6.3
> commits, but of course I could be missing something.

I'll try rebooting into it again soon, and get an oops report if it
should happen again (and hopefully less filesystem corruption this time,
right after a reboot into a new kernel was clearly the wrong time to
move a bunch of directories around).

Sorry for the continuing lack of useful information... I'll try to fix
that shortly (even if only with a report of a false alarm, since I really want
this stable kernel: my server is affected by the tx stalls fixed by
3468e9d and I'm getting tired of the frequent 1s lags talking to it: I
could just apply that one fix, but I'd rather track this down properly).


* Re: Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression?
  2012-10-23 14:07   ` Nix
@ 2012-10-23 14:30     ` J. Bruce Fields
  2012-10-23 16:32       ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix
  0 siblings, 1 reply; 83+ messages in thread
From: J. Bruce Fields @ 2012-10-23 14:30 UTC (permalink / raw)
  To: Nix
  Cc: linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	linux-nfs

On Tue, Oct 23, 2012 at 03:07:59PM +0100, Nix wrote:
> On 23 Oct 2012, J. Bruce Fields uttered the following:
> 
> > On Mon, Oct 22, 2012 at 05:17:04PM +0100, Nix wrote:
> >> I just had a panic/oops on upgrading from 3.6.1 to 3.6.3, after weeks of
> >> smooth operation on 3.6.1: one of the NFS changes that went into one of
> >> the two latest stable kernels appears to be lethal after around half an
> >> hour of uptime. The oops came from NFSv4, IIRC (relying on memory since
> >> my camera was recharging and there is no netconsole from that box
> >> because it is where the netconsole logs go, so I'll have to reproduce it
> >> later today). The machine is an NFSv3 server only at present, with no
> >> NFSv4 running (though NFSv4 is built in).
> >
> > Note recent clients may try to negotiate NFSv4 by default, so it's
> > possible to use it without knowing.
> 
> Every NFS import from all this server's clients has 'vers=3' forced on
> (for now, until I get around to figuring out what if anything needs to
> be done to move to NFSv4: it may be that the answer is 'nothing', but I tread
> quite carefully with this server, since my $HOME is there).
> 
> /proc/fs/nfsfs/volumes on the clients confirms that everything is v3.
> 
> > You didn't change anything else about your server or clients recently?
> 
> Nope (other than upgrading the clients to 3.6.3 in concert). Running
> 3.6.1 here on that server now and 3.6.3 on all the clients, and no crash
> in over a day.
> 
> nfs-utils is a bit old, 1.2.6-rc6, I should probably upgrade it...

nfs-utils shouldn't be capable of oopsing the kernel, so from my
(selfish) point of view I'd actually rather you stick with whatever you
have and try to reproduce the oops.

(It's unlikely to make any difference anyway.)

--b.

> > I don't see an obvious candidate on a quick skim of v3.6.1..v3.6.3
> > commits, but of course I could be missing something.
> 
> I'll try rebooting into it again soon, and get an oops report if it
> should happen again (and hopefully less filesystem corruption this time,
> right after a reboot into a new kernel was clearly the wrong time to
> move a bunch of directories around).
> 
> Sorry for the continuing lack of useful information... I'll try to fix
> that shortly (even if only with a report of a false alarm, since I really want
> this stable kernel: my server is affected by the tx stalls fixed by
> 3468e9d and I'm getting tired of the frequent 1s lags talking to it: I
> could just apply that one fix, but I'd rather track this down properly).


* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 14:30     ` J. Bruce Fields
@ 2012-10-23 16:32       ` Nix
  2012-10-23 16:46         ` J. Bruce Fields
  2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
  0 siblings, 2 replies; 83+ messages in thread
From: Nix @ 2012-10-23 16:32 UTC (permalink / raw)
  To: J. Bruce Fields, Ted Ts'o
  Cc: linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	linux-nfs

On 23 Oct 2012, J. Bruce Fields uttered the following:
> nfs-utils shouldn't be capable of oopsing the kernel, so from my
> (selfish) point of view I'd actually rather you stick with whatever you
> have and try to reproduce the oops.

Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was
rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss
in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I
have runs of NULs in my logs now, which keep eating the oopses):

[while in 3.6.1]
[   88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[   88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[   89.648152] ------------[ cut here ]------------
[   89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42()
[   89.648614] Hardware name: empty
[   89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
[   89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1
[   89.649610] Call Trace:
[   89.649832]  [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b
[   89.650063]  [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c
[   89.650292]  [<ffffffff8112efbf>] drop_nlink+0x25/0x42
[   89.650533]  [<ffffffff81187112>] ext4_dec_count+0x26/0x28
[   89.650763]  [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4
[   89.650993]  [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7
[   89.651224]  [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7
[   89.651454]  [<ffffffff81127a53>] sys_renameat+0x1b1/0x228
[   89.651682]  [<ffffffff8114bda1>] ? fsnotify+0x226/0x249
[   89.651911]  [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20
[   89.652145]  [<ffffffff8111a240>] ? vfs_write+0x116/0x142
[   89.652372]  [<ffffffff81127ae5>] sys_rename+0x1b/0x1e
[   89.652601]  [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b
[...]
[while having just booted into 3.6.1 after some time in 3.6.3: the FS
 was clean, and fscked on the previous boot into 3.6.3 after a previous
 instance of this bug]
Oct 23 17:18:26 spindle crit: [   67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828)

This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to
tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now.


Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to
turn off panic_on_oops when I come home from the holidays).

It's a lockd problem, and probably happens during delivery of mail over
NFS (my mailserver load soars when it happens):

[  813.110354] ------------[ cut here ]------------
[  813.110585] kernel BUG at fs/lockd/mon.c:150!
[  813.110878] invalid opcode: 0000 [#1] SMP
[  813.111173] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
[  813.111727] CPU 1
[  813.111772] Pid: 1040, comm: lockd Not tainted 3.6.3-dirty #1 empty empty/S7010
[  813.112388] RIP: 0010:[<ffffffff8120fbbc>]  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
[  813.112840] RSP: 0018:ffff88062163dcf0  EFLAGS: 00010246
[  813.113069] RAX: ffff88062163dcf8 RBX: 0000000000000000 RCX: 0000000000000000
[  813.113303] RDX: ffff88062163dd68 RSI: 0000000000000002 RDI: ffff88062163dd40
[  813.113537] RBP: ffff88062163dd50 R08: 0000000000000000 R09: ffffffff8120c463
[  813.113771] R10: ffffffff8120c463 R11: 0000000000000001 R12: 0000000000000000
[  813.114006] R13: ffff88061f067e40 R14: ffff88059c763a00 R15: ffff88062163de28
[  813.114241] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000
[  813.114651] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  813.114882] CR2: ffffffffff600400 CR3: 0000000001a0b000 CR4: 00000000000007e0
[  813.115116] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  813.115350] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  813.115584] Process lockd (pid: 1040, threadinfo ffff88062163c000, task ffff880621611640)
[  813.115994] Stack:
[  813.116210]  ffff88062163dd90 ffff8805f4e044b1 00000003000186b5 ffffffff00000010
[  813.116761]  ffff8805f4e044c1 000000000000008c 0000000000000000 ffff88062163dcf8
[  813.117310]  ffff88062163dd68 0000000000000000 ffff8806200ce088 ffff8805f4e04400
[  813.117861] Call Trace:
[  813.118083]  [<ffffffff8120fee4>] nsm_monitor+0x100/0x157
[  813.118315]  [<ffffffff81211b4a>] nlm4svc_retrieve_args+0x62/0xd7
[  813.118547]  [<ffffffff81211f49>] nlm4svc_proc_lock+0x3c/0xb5
[  813.118779]  [<ffffffff8121167b>] ? nlm4svc_decode_lockargs+0x47/0xb2
[  813.119016]  [<ffffffff814d89ca>] svc_process+0x3bf/0x6a1
[  813.119246]  [<ffffffff8120d5f0>] lockd+0x127/0x164
[  813.119474]  [<ffffffff8120d4c9>] ? set_grace_period+0x8a/0x8a
[  813.119708]  [<ffffffff8107bcbc>] kthread+0x8b/0x93
[  813.119938]  [<ffffffff815012f4>] kernel_thread_helper+0x4/0x10
[  813.120170]  [<ffffffff8107bc31>] ? kthread_worker_fn+0xe1/0xe1
[  813.120401]  [<ffffffff815012f0>] ? gs_change+0xb/0xb
[  813.120628] Code: b8 10 00 00 00 48 89 45 c0 48 8d 81 8c 00 00 00 b9 08 00 00 00 48 89 45 c8 89 d8 f3 ab 48 8d 45 a8 48 89 55 e0 48 89 45 d8 75 02 <0f> 0b 89 f6 48 c7 02 00 00 00 00 4c 89 c7 48 6b f6 38 ba 00 04
[  813.123627] RIP  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
[  813.123894]  RSP <ffff88062163dcf0>
[  813.124129] ---[ end trace 11eb11a091ffd910 ]---

Aside: I am running with this patch, to prevent mounting of one
NFS-mounted directory under another from failing all accesses with a
spurious -EBUSY, a 3.6 regression -- but I can't see how this could
cause it. I'm running with this in 3.6.1 as well:

diff --git a/fs/dcache.c b/fs/dcache.c
index 8086636..649a112 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2404,6 +2404,10 @@ out_unalias:
 	if (likely(!d_mountpoint(alias))) {
 		__d_move(alias, dentry);
 		ret = alias;
+	} else {
+		printk(KERN_WARNING "VFS: __d_move()ing a d_mountpoint(), uh oh\n");
+		__d_move(alias, dentry);
+		ret = alias;
 	}
 out_err:
 	spin_unlock(&inode->i_lock);



* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 16:32       ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix
@ 2012-10-23 16:46         ` J. Bruce Fields
  2012-10-23 16:54           ` J. Bruce Fields
  2012-10-23 16:56           ` Myklebust, Trond
  2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
  1 sibling, 2 replies; 83+ messages in thread
From: J. Bruce Fields @ 2012-10-23 16:46 UTC (permalink / raw)
  To: Nix
  Cc: Ted Ts'o, linux-kernel, Bryan Schumaker, Peng Tao,
	Trond.Myklebust, gregkh, linux-nfs, Stanislav Kinsbursky

On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote:
> On 23 Oct 2012, J. Bruce Fields uttered the following:
> > nfs-utils shouldn't be capable of oopsing the kernel, so from my
> > (selfish) point of view I'd actually rather you stick with whatever you
> > have and try to reproduce the oops.
> 
> Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was
> rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss
> in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I
> have runs of NULs in my logs now, which keep eating the oopses):
> 
> [while in 3.6.1]
> [   88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> [   88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> [   89.648152] ------------[ cut here ]------------
> [   89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42()
> [   89.648614] Hardware name: empty
> [   89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> [   89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1
> [   89.649610] Call Trace:
> [   89.649832]  [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b
> [   89.650063]  [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c
> [   89.650292]  [<ffffffff8112efbf>] drop_nlink+0x25/0x42
> [   89.650533]  [<ffffffff81187112>] ext4_dec_count+0x26/0x28
> [   89.650763]  [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4
> [   89.650993]  [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7
> [   89.651224]  [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7
> [   89.651454]  [<ffffffff81127a53>] sys_renameat+0x1b1/0x228
> [   89.651682]  [<ffffffff8114bda1>] ? fsnotify+0x226/0x249
> [   89.651911]  [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20
> [   89.652145]  [<ffffffff8111a240>] ? vfs_write+0x116/0x142
> [   89.652372]  [<ffffffff81127ae5>] sys_rename+0x1b/0x1e
> [   89.652601]  [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b
> [...]
> [while having just booted into 3.6.1 after some time in 3.6.3: the FS
>  was clean, and fscked on the previous boot into 3.6.3 after a previous
>  instance of this bug]
> Oct 23 17:18:26 spindle crit: [   67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828)
> 
> This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to
> tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now.
> 
> 
> Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to
> turn off panic_on_oops when I come home from the holidays).
> 
> It's a lockd problem, and probably happens during delivery of mail over
> NFS (my mailserver load soars when it happens):
> 
> [  813.110354] ------------[ cut here ]------------
> [  813.110585] kernel BUG at fs/lockd/mon.c:150!

So nsm_mon_unmon() is being passed a NULL client.

There are three container patches between 3.6.1 and 3.6.3:

	lockd: per-net NSM client creation and destruction helpers introduced
	lockd: use rpc client's cl_nodename for id encoding
	lockd: create and use per-net NSM RPC clients on MON/UNMON requests

and that last does change nsm_monitor's call to nsm_mon_unmon, so that's
almost certainly it....

Looks like there's some confusion about whether nsm_client_get() returns
NULL or an error?

--b.

> [  813.110878] invalid opcode: 0000 [#1] SMP
> [  813.111173] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> [  813.111727] CPU 1
> [  813.111772] Pid: 1040, comm: lockd Not tainted 3.6.3-dirty #1 empty empty/S7010
> [  813.112388] RIP: 0010:[<ffffffff8120fbbc>]  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
> [  813.112840] RSP: 0018:ffff88062163dcf0  EFLAGS: 00010246
> [  813.113069] RAX: ffff88062163dcf8 RBX: 0000000000000000 RCX: 0000000000000000
> [  813.113303] RDX: ffff88062163dd68 RSI: 0000000000000002 RDI: ffff88062163dd40
> [  813.113537] RBP: ffff88062163dd50 R08: 0000000000000000 R09: ffffffff8120c463
> [  813.113771] R10: ffffffff8120c463 R11: 0000000000000001 R12: 0000000000000000
> [  813.114006] R13: ffff88061f067e40 R14: ffff88059c763a00 R15: ffff88062163de28
> [  813.114241] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000
> [  813.114651] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  813.114882] CR2: ffffffffff600400 CR3: 0000000001a0b000 CR4: 00000000000007e0
> [  813.115116] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  813.115350] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  813.115584] Process lockd (pid: 1040, threadinfo ffff88062163c000, task ffff880621611640)
> [  813.115994] Stack:
> [  813.116210]  ffff88062163dd90 ffff8805f4e044b1 00000003000186b5 ffffffff00000010
> [  813.116761]  ffff8805f4e044c1 000000000000008c 0000000000000000 ffff88062163dcf8
> [  813.117310]  ffff88062163dd68 0000000000000000 ffff8806200ce088 ffff8805f4e04400
> [  813.117861] Call Trace:
> [  813.118083]  [<ffffffff8120fee4>] nsm_monitor+0x100/0x157
> [  813.118315]  [<ffffffff81211b4a>] nlm4svc_retrieve_args+0x62/0xd7
> [  813.118547]  [<ffffffff81211f49>] nlm4svc_proc_lock+0x3c/0xb5
> [  813.118779]  [<ffffffff8121167b>] ? nlm4svc_decode_lockargs+0x47/0xb2
> [  813.119016]  [<ffffffff814d89ca>] svc_process+0x3bf/0x6a1
> [  813.119246]  [<ffffffff8120d5f0>] lockd+0x127/0x164
> [  813.119474]  [<ffffffff8120d4c9>] ? set_grace_period+0x8a/0x8a
> [  813.119708]  [<ffffffff8107bcbc>] kthread+0x8b/0x93
> [  813.119938]  [<ffffffff815012f4>] kernel_thread_helper+0x4/0x10
> [  813.120170]  [<ffffffff8107bc31>] ? kthread_worker_fn+0xe1/0xe1
> [  813.120401]  [<ffffffff815012f0>] ? gs_change+0xb/0xb
> [  813.120628] Code: b8 10 00 00 00 48 89 45 c0 48 8d 81 8c 00 00 00 b9 08 00 00 00 48 89 45 c8 89 d8 f3 ab 48 8d 45 a8 48 89 55 e0 48 89 45 d8 75 02 <0f> 0b 89 f6 48 c7 02 00 00 00 00 4c 89 c7 48 6b f6 38 ba 00 04
> [  813.123627] RIP  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
> [  813.123894]  RSP <ffff88062163dcf0>
> [  813.124129] ---[ end trace 11eb11a091ffd910 ]---
> 
> Aside: I am running with this patch, to prevent mounting of one
> NFS-mounted directory under another from failing all accesses with a
> spurious -EBUSY, a 3.6 regression -- but I can't see how this could
> cause it. I'm running with this in 3.6.1 as well:
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 8086636..649a112 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2404,6 +2404,10 @@ out_unalias:
>  	if (likely(!d_mountpoint(alias))) {
>  		__d_move(alias, dentry);
>  		ret = alias;
> +	} else {
> +		printk(KERN_WARNING "VFS: __d_move()ing a d_mountpoint(), uh oh\n");
> +		__d_move(alias, dentry);
> +		ret = alias;
>  	}
>  out_err:
>  	spin_unlock(&inode->i_lock);
> 


* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 16:46         ` J. Bruce Fields
@ 2012-10-23 16:54           ` J. Bruce Fields
  2012-10-23 16:56           ` Myklebust, Trond
  1 sibling, 0 replies; 83+ messages in thread
From: J. Bruce Fields @ 2012-10-23 16:54 UTC (permalink / raw)
  To: Nix
  Cc: Ted Ts'o, linux-kernel, Bryan Schumaker, Peng Tao,
	Trond.Myklebust, gregkh, linux-nfs, Stanislav Kinsbursky

On Tue, Oct 23, 2012 at 12:46:21PM -0400, J. Bruce Fields wrote:
> On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote:
> > On 23 Oct 2012, J. Bruce Fields uttered the following:
> > > nfs-utils shouldn't be capable of oopsing the kernel, so from my
> > > (selfish) point of view I'd actually rather you stick with whatever you
> > > have and try to reproduce the oops.
> > 
> > Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was
> > rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss
> > in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I
> > have runs of NULs in my logs now, which keep eating the oopses):
> > 
> > [while in 3.6.1]
> > [   88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> > [   88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> > [   89.648152] ------------[ cut here ]------------
> > [   89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42()
> > [   89.648614] Hardware name: empty
> > [   89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> > [   89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1
> > [   89.649610] Call Trace:
> > [   89.649832]  [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b
> > [   89.650063]  [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c
> > [   89.650292]  [<ffffffff8112efbf>] drop_nlink+0x25/0x42
> > [   89.650533]  [<ffffffff81187112>] ext4_dec_count+0x26/0x28
> > [   89.650763]  [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4
> > [   89.650993]  [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7
> > [   89.651224]  [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7
> > [   89.651454]  [<ffffffff81127a53>] sys_renameat+0x1b1/0x228
> > [   89.651682]  [<ffffffff8114bda1>] ? fsnotify+0x226/0x249
> > [   89.651911]  [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20
> > [   89.652145]  [<ffffffff8111a240>] ? vfs_write+0x116/0x142
> > [   89.652372]  [<ffffffff81127ae5>] sys_rename+0x1b/0x1e
> > [   89.652601]  [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b
> > [...]
> > [while having just booted into 3.6.1 after some time in 3.6.3: the FS
> >  was clean, and fscked on the previous boot into 3.6.3 after a previous
> >  instance of this bug]
> > Oct 23 17:18:26 spindle crit: [   67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828)
> > 
> > This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to
> > tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now.
> > 
> > 
> > Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to
> > turn off panic_on_oops when I come home from the holidays).
> > 
> > It's a lockd problem, and probably happens during delivery of mail over
> > NFS (my mailserver load soars when it happens):
> > 
> > [  813.110354] ------------[ cut here ]------------
> > [  813.110585] kernel BUG at fs/lockd/mon.c:150!
> 
> So nsm_mon_unmon() is being passed a NULL client.
> 
> There are three container patches between 3.6.1 and 3.6.3:
> 
> 	lockd: per-net NSM client creation and destruction helpers introduced
> 	lockd: use rpc client's cl_nodename for id encoding
> 	lockd: create and use per-net NSM RPC clients on MON/UNMON requests
> 
> and that last does change nsm_monitor's call to nsm_mon_unmon, so that's
> almost certainly it....
> 
> Looks like there's some confusion about whether nsm_client_get() returns
> NULL or an error?

The return from nsm_client_get() is either from nsm_create() or from
ln->nsm_clnt.

nsm_create()'s return comes from rpc_create(), and it doesn't look possible
for rpc_create() to return NULL.

So probably we have some case where, while holding ln->nsm_clnt_lock,
you can see ln->nsm_users nonzero, but ln->nsm_clnt NULL ?

--b.


* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 16:46         ` J. Bruce Fields
  2012-10-23 16:54           ` J. Bruce Fields
@ 2012-10-23 16:56           ` Myklebust, Trond
  2012-10-23 17:05             ` Nix
  1 sibling, 1 reply; 83+ messages in thread
From: Myklebust, Trond @ 2012-10-23 16:56 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Nix, Ted Ts'o, linux-kernel, Schumaker, Bryan, Peng Tao,
	gregkh, linux-nfs, Stanislav Kinsbursky


On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote:
> On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote:
> > On 23 Oct 2012, J. Bruce Fields uttered the following:
> > > nfs-utils shouldn't be capable of oopsing the kernel, so from my
> > > (selfish) point of view I'd actually rather you stick with whatever you
> > > have and try to reproduce the oops.
> > 
> > Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was
> > rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss
> > in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I
> > have runs of NULs in my logs now, which keep eating the oopses):
> > 
> > [while in 3.6.1]
> > [   88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> > [   88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> > [   89.648152] ------------[ cut here ]------------
> > [   89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42()
> > [   89.648614] Hardware name: empty
> > [   89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> > [   89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1
> > [   89.649610] Call Trace:
> > [   89.649832]  [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b
> > [   89.650063]  [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c
> > [   89.650292]  [<ffffffff8112efbf>] drop_nlink+0x25/0x42
> > [   89.650533]  [<ffffffff81187112>] ext4_dec_count+0x26/0x28
> > [   89.650763]  [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4
> > [   89.650993]  [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7
> > [   89.651224]  [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7
> > [   89.651454]  [<ffffffff81127a53>] sys_renameat+0x1b1/0x228
> > [   89.651682]  [<ffffffff8114bda1>] ? fsnotify+0x226/0x249
> > [   89.651911]  [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20
> > [   89.652145]  [<ffffffff8111a240>] ? vfs_write+0x116/0x142
> > [   89.652372]  [<ffffffff81127ae5>] sys_rename+0x1b/0x1e
> > [   89.652601]  [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b
> > [...]
> > [while having just booted into 3.6.1 after some time in 3.6.3: the FS
> >  was clean, and fscked on the previous boot into 3.6.3 after a previous
> >  instance of this bug]
> > Oct 23 17:18:26 spindle crit: [   67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828)
> > 
> > This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to
> > tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now.
> > 
> > 
> > Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to
> > turn off panic_on_oops when I come home from the holidays).
> > 
> > It's a lockd problem, and probably happens during delivery of mail over
> > NFS (my mailserver load soars when it happens):
> > 
> > [  813.110354] ------------[ cut here ]------------
> > [  813.110585] kernel BUG at fs/lockd/mon.c:150!
> 
> So nsm_mon_unmon() is being passed a NULL client.
> 
> There are three container patches between 3.6.1 and 3.6.3:
> 
> 	lockd: per-net NSM client creation and destruction helpers introduced
> 	lockd: use rpc client's cl_nodename for id encoding
> 	lockd: create and use per-net NSM RPC clients on MON/UNMON requests
> 
> and that last does change nsm_monitor's call to nsm_mon_unmon, so that's
> almost certainly it....
> 
> Looks like there's some confusion about whether nsm_client_get() returns
> NULL or an error?

nsm_client_get() looks extremely racy in the case where ln->nsm_users ==
0.

Since we never recheck the value of ln->nsm_users after taking
nsm_create_mutex, what is stopping 2 different threads from both setting
ln->nsm_clnt and re-initialising ln->nsm_users?


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 16:56           ` Myklebust, Trond
@ 2012-10-23 17:05             ` Nix
  2012-10-23 17:36               ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-23 17:05 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: J. Bruce Fields, Ted Ts'o, linux-kernel, Schumaker, Bryan,
	Peng Tao, gregkh, linux-nfs, Stanislav Kinsbursky

On 23 Oct 2012, Trond Myklebust spake thusly:
> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote:
>> Looks like there's some confusion about whether nsm_client_get() returns
>> NULL or an error?
>
> nsm_client_get() looks extremely racy in the case where ln->nsm_users ==
> 0.  Since we never recheck the value of ln->nsm_users after taking
> nsm_create_mutex, what is stopping 2 different threads from both setting
> ln->nsm_clnt and re-initialising ln->nsm_users?

Yep. At the worst possible time:

	spin_lock(&ln->nsm_clnt_lock);
	if (ln->nsm_users) {
		if (--ln->nsm_users)
			ln->nsm_clnt = NULL;
(1)		shutdown = !ln->nsm_users;
	}
	spin_unlock(&ln->nsm_clnt_lock);

If a thread reinitializes nsm_users at point (1), after the assignment,
we could well end up with ln->nsm_clnt NULL and shutdown false. A bit
later, nsm_mon_unmon gets called with a NULL clnt, and boom.

This seems particularly likely if there is only one nsm_user (which is
true in my case, since I have only one active network namespace).

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 17:05             ` Nix
@ 2012-10-23 17:36               ` Nix
  2012-10-23 17:43                 ` J. Bruce Fields
  2012-10-23 17:44                 ` Myklebust, Trond
  0 siblings, 2 replies; 83+ messages in thread
From: Nix @ 2012-10-23 17:36 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: J. Bruce Fields, Ted Ts'o, linux-kernel, Schumaker, Bryan,
	Peng Tao, gregkh, linux-nfs, Stanislav Kinsbursky

On 23 Oct 2012, nix@esperi.org.uk uttered the following:

> On 23 Oct 2012, Trond Myklebust spake thusly:
>> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote:
>>> Looks like there's some confusion about whether nsm_client_get() returns
>>> NULL or an error?
>>
>> nsm_client_get() looks extremely racy in the case where ln->nsm_users ==
>> 0.  Since we never recheck the value of ln->nsm_users after taking
>> nsm_create_mutex, what is stopping 2 different threads from both setting
>> ln->nsm_clnt and re-initialising ln->nsm_users?
>
> Yep. At the worst possible time:
>
> 	spin_lock(&ln->nsm_clnt_lock);
> 	if (ln->nsm_users) {
> 		if (--ln->nsm_users)
> 			ln->nsm_clnt = NULL;
> (1)		shutdown = !ln->nsm_users;
> 	}
> 	spin_unlock(&ln->nsm_clnt_lock);
>
> If a thread reinitializes nsm_users at point (1), after the assignment,
> we could well end up with ln->nsm_clnt NULL and shutdown false. A bit
> later, nsm_mon_unmon gets called with a NULL clnt, and boom.

Possible fix if so, utterly untested so far (will test when I can face
yet another reboot and fs-corruption-recovery-hell cycle, in a few
hours), may ruin performance, violate locking hierarchies, and consume
kittens:

diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
index e4fb3ba..da91cdf 100644
--- a/fs/lockd/mon.c
+++ b/fs/lockd/mon.c
@@ -98,7 +98,6 @@ static struct rpc_clnt *nsm_client_get(struct net *net)
 		spin_unlock(&ln->nsm_clnt_lock);
 		goto out;
 	}
-	spin_unlock(&ln->nsm_clnt_lock);
 
 	mutex_lock(&nsm_create_mutex);
 	clnt = nsm_create(net);
@@ -108,6 +107,7 @@ static struct rpc_clnt *nsm_client_get(struct net *net)
 		ln->nsm_users = 1;
 	}
 	mutex_unlock(&nsm_create_mutex);
+	spin_unlock(&ln->nsm_clnt_lock);
 out:
 	return clnt;
 }


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 17:36               ` Nix
@ 2012-10-23 17:43                 ` J. Bruce Fields
  2012-10-23 17:44                 ` Myklebust, Trond
  1 sibling, 0 replies; 83+ messages in thread
From: J. Bruce Fields @ 2012-10-23 17:43 UTC (permalink / raw)
  To: Nix
  Cc: Myklebust, Trond, Ted Ts'o, linux-kernel, Schumaker, Bryan,
	Peng Tao, gregkh, linux-nfs, Stanislav Kinsbursky

On Tue, Oct 23, 2012 at 06:36:15PM +0100, Nix wrote:
> On 23 Oct 2012, nix@esperi.org.uk uttered the following:
> 
> > On 23 Oct 2012, Trond Myklebust spake thusly:
> >> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote:
> >>> Looks like there's some confusion about whether nsm_client_get() returns
> >>> NULL or an error?
> >>
> >> nsm_client_get() looks extremely racy in the case where ln->nsm_users ==
> >> 0.  Since we never recheck the value of ln->nsm_users after taking
> >> nsm_create_mutex, what is stopping 2 different threads from both setting
> >> ln->nsm_clnt and re-initialising ln->nsm_users?
> >
> > Yep. At the worst possible time:
> >
> > 	spin_lock(&ln->nsm_clnt_lock);
> > 	if (ln->nsm_users) {
> > 		if (--ln->nsm_users)
> > 			ln->nsm_clnt = NULL;
> > (1)		shutdown = !ln->nsm_users;
> > 	}
> > 	spin_unlock(&ln->nsm_clnt_lock);
> >
> > If a thread reinitializes nsm_users at point (1), after the assignment,
> > we could well end up with ln->nsm_clnt NULL and shutdown false. A bit
> > later, nsm_mon_unmon gets called with a NULL clnt, and boom.
> 
> Possible fix if so, utterly untested so far (will test when I can face
> yet another reboot and fs-corruption-recovery-hell cycle, in a few
> hours), may ruin performance, violate locking hierarchies, and consume
> kittens:

Right, mutexes can't be taken while holding spinlocks.  Keep the kittens
well away from the computer.

--b.

> 
> diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
> index e4fb3ba..da91cdf 100644
> --- a/fs/lockd/mon.c
> +++ b/fs/lockd/mon.c
> @@ -98,7 +98,6 @@ static struct rpc_clnt *nsm_client_get(struct net *net)
>  		spin_unlock(&ln->nsm_clnt_lock);
>  		goto out;
>  	}
> -	spin_unlock(&ln->nsm_clnt_lock);
>  
>  	mutex_lock(&nsm_create_mutex);
>  	clnt = nsm_create(net);
> @@ -108,6 +107,7 @@ static struct rpc_clnt *nsm_client_get(struct net *net)
>  		ln->nsm_users = 1;
>  	}
>  	mutex_unlock(&nsm_create_mutex);
> +	spin_unlock(&ln->nsm_clnt_lock);
>  out:
>  	return clnt;
>  }
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* RE: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 17:36               ` Nix
  2012-10-23 17:43                 ` J. Bruce Fields
@ 2012-10-23 17:44                 ` Myklebust, Trond
  2012-10-23 17:57                   ` Myklebust, Trond
       [not found]                   ` <1351015039.4622.23.camel@lade.trondhjem.org>
  1 sibling, 2 replies; 83+ messages in thread
From: Myklebust, Trond @ 2012-10-23 17:44 UTC (permalink / raw)
  To: Nix
  Cc: J. Bruce Fields, Ted Ts'o, linux-kernel, Schumaker, Bryan,
	Peng Tao, gregkh, linux-nfs, Stanislav Kinsbursky

> -----Original Message-----
> From: Nix [mailto:nix@esperi.org.uk]
> Sent: Tuesday, October 23, 2012 1:36 PM
> To: Myklebust, Trond
> Cc: J. Bruce Fields; Ted Ts'o; linux-kernel@vger.kernel.org; Schumaker,
> Bryan; Peng Tao; gregkh@linuxfoundation.org; linux-nfs@vger.kernel.org;
> Stanislav Kinsbursky
> Subject: Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also
> an unrelated ext4 data loss bug)
> 
> On 23 Oct 2012, nix@esperi.org.uk uttered the following:
> 
> > On 23 Oct 2012, Trond Myklebust spake thusly:
> >> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote:
> >>> Looks like there's some confusion about whether nsm_client_get()
> >>> returns NULL or an error?
> >>
> >> nsm_client_get() looks extremely racy in the case where ln->nsm_users
> >> == 0.  Since we never recheck the value of ln->nsm_users after taking
> >> nsm_create_mutex, what is stopping 2 different threads from both
> >> setting
> >> ln->nsm_clnt and re-initialising ln->nsm_users?
> >
> > Yep. At the worst possible time:
> >
> > 	spin_lock(&ln->nsm_clnt_lock);
> > 	if (ln->nsm_users) {
> > 		if (--ln->nsm_users)
> > 			ln->nsm_clnt = NULL;
> > (1)		shutdown = !ln->nsm_users;
> > 	}
> > 	spin_unlock(&ln->nsm_clnt_lock);
> >
> > If a thread reinitializes nsm_users at point (1), after the
> > assignment, we could well end up with ln->nsm_clnt NULL and shutdown
> > false. A bit later, nsm_mon_unmon gets called with a NULL clnt, and boom.
> 
> Possible fix if so, utterly untested so far (will test when I can face yet another
> reboot and fs-corruption-recovery-hell cycle, in a few hours), may ruin
> performance, violate locking hierarchies, and consume
> kittens:
> 
> diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c index e4fb3ba..da91cdf 100644
> --- a/fs/lockd/mon.c
> +++ b/fs/lockd/mon.c
> @@ -98,7 +98,6 @@ static struct rpc_clnt *nsm_client_get(struct net *net)
>  		spin_unlock(&ln->nsm_clnt_lock);
>  		goto out;
>  	}
> -	spin_unlock(&ln->nsm_clnt_lock);
> 
>  	mutex_lock(&nsm_create_mutex);
>  	clnt = nsm_create(net);
> @@ -108,6 +107,7 @@ static struct rpc_clnt *nsm_client_get(struct net *net)
>  		ln->nsm_users = 1;
>  	}
>  	mutex_unlock(&nsm_create_mutex);
> +	spin_unlock(&ln->nsm_clnt_lock);

You can't hold a spinlock while sleeping. Both mutex_lock() and nsm_create() can definitely sleep.

The correct way to do this is to grab the spinlock and recheck the value of ln->nsm_users inside the 'if (!IS_ERR())' condition. If it is still zero, bump it and set ln->nsm_clnt; otherwise bump it, take the existing ln->nsm_clnt, and call rpc_shutdown_client() on the redundant nsm client after dropping the spinlock.

Cheers
  Trond

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 17:44                 ` Myklebust, Trond
@ 2012-10-23 17:57                   ` Myklebust, Trond
       [not found]                   ` <1351015039.4622.23.camel@lade.trondhjem.org>
  1 sibling, 0 replies; 83+ messages in thread
From: Myklebust, Trond @ 2012-10-23 17:57 UTC (permalink / raw)
  To: Nix
  Cc: J. Bruce Fields, Ted Ts'o, linux-kernel, Schumaker, Bryan,
	Peng Tao, gregkh, linux-nfs, Stanislav Kinsbursky


On Tue, 2012-10-23 at 17:44 +0000, Myklebust, Trond wrote:
> You can't hold a spinlock while sleeping. Both mutex_lock() and nsm_create() can definitely sleep.
> 
> The correct way to do this is to grab the spinlock and recheck the value of ln->nsm_users inside the 'if (!IS_ERR())' condition. If it is still zero, bump it and set ln->nsm_clnt, otherwise bump it, get the existing ln->nsm_clnt and call rpc_shutdown_clnt() on the redundant nsm client after dropping the spinlock.
> 
> Cheers
>   Trond

Can you please check if the following patch fixes the issue?

Cheers
  Trond

8<--------------------------------------------------------
From 44a070455d246e09de0cefc8875833f21ca655e8 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Tue, 23 Oct 2012 13:51:58 -0400
Subject: [PATCH] LOCKD: fix races in nsm_client_get

Commit e9406db20fecbfcab646bad157b4cfdc7cadddfb (lockd: per-net
NSM client creation and destruction helpers introduced) contains
a nasty race on initialisation of the per-net NSM client because
it doesn't check whether or not the client is set after grabbing
the nsm_create_mutex.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/lockd/mon.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
index e4fb3ba..9755603 100644
--- a/fs/lockd/mon.c
+++ b/fs/lockd/mon.c
@@ -88,7 +88,7 @@ static struct rpc_clnt *nsm_create(struct net *net)
 static struct rpc_clnt *nsm_client_get(struct net *net)
 {
 	static DEFINE_MUTEX(nsm_create_mutex);
-	struct rpc_clnt	*clnt;
+	struct rpc_clnt	*clnt, *new;
 	struct lockd_net *ln = net_generic(net, lockd_net_id);
 
 	spin_lock(&ln->nsm_clnt_lock);
@@ -101,11 +101,19 @@ static struct rpc_clnt *nsm_client_get(struct net *net)
 	spin_unlock(&ln->nsm_clnt_lock);
 
 	mutex_lock(&nsm_create_mutex);
-	clnt = nsm_create(net);
-	if (!IS_ERR(clnt)) {
-		ln->nsm_clnt = clnt;
-		smp_wmb();
-		ln->nsm_users = 1;
+	new = nsm_create(net);
+	clnt = new;
+	if (!IS_ERR(new)) {
+		spin_lock(&ln->nsm_clnt_lock);
+		if (!ln->nsm_users) {
+			ln->nsm_clnt = new;
+			new = NULL;
+		}
+		clnt = ln->nsm_clnt;
+		ln->nsm_users++;
+		spin_unlock(&ln->nsm_clnt_lock);
+		if (new)
+			rpc_shutdown_client(new);
 	}
 	mutex_unlock(&nsm_create_mutex);
 out:
-- 
1.7.11.7


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
       [not found]                   ` <1351015039.4622.23.camel@lade.trondhjem.org>
@ 2012-10-23 18:23                     ` Myklebust, Trond
  2012-10-23 19:49                       ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: Myklebust, Trond @ 2012-10-23 18:23 UTC (permalink / raw)
  To: Nix
  Cc: J. Bruce Fields, Ted Ts'o, linux-kernel, Schumaker, Bryan,
	Peng Tao, gregkh, linux-nfs, Stanislav Kinsbursky


On Tue, 2012-10-23 at 13:57 -0400, Trond Myklebust wrote:
> On Tue, 2012-10-23 at 17:44 +0000, Myklebust, Trond wrote:
> > You can't hold a spinlock while sleeping. Both mutex_lock() and nsm_create() can definitely sleep.
> > 
> > The correct way to do this is to grab the spinlock and recheck the value of ln->nsm_users inside the 'if (!IS_ERR())' condition. If it is still zero, bump it and set ln->nsm_clnt, otherwise bump it, get the existing ln->nsm_clnt and call rpc_shutdown_clnt() on the redundant nsm client after dropping the spinlock.
> > 
> > Cheers
> >   Trond
> 
> Can you please check if the following patch fixes the issue?
> 
> Cheers
>   Trond
> 
Meh... This one gets rid of the 100% redundant mutex...

8<-----------------------------------------------------------
From 4187c816a15df12544ebcfa6b961fce96458e244 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Tue, 23 Oct 2012 13:51:58 -0400
Subject: [PATCH] LOCKD: fix races in nsm_client_get

Commit e9406db20fecbfcab646bad157b4cfdc7cadddfb (lockd: per-net
NSM client creation and destruction helpers introduced) contains
a nasty race on initialisation of the per-net NSM client because
it doesn't check whether or not the client is set after grabbing
the nsm_create_mutex.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
---
 fs/lockd/mon.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
index e4fb3ba..fe69560 100644
--- a/fs/lockd/mon.c
+++ b/fs/lockd/mon.c
@@ -85,29 +85,38 @@ static struct rpc_clnt *nsm_create(struct net *net)
 	return rpc_create(&args);
 }
 
+static struct rpc_clnt *nsm_client_set(struct lockd_net *ln,
+		struct rpc_clnt *clnt)
+{
+	spin_lock(&ln->nsm_clnt_lock);
+	if (ln->nsm_users == 0) {
+		if (clnt == NULL)
+			goto out;
+		ln->nsm_clnt = clnt;
+	}
+	clnt = ln->nsm_clnt;
+	ln->nsm_users++;
+out:
+	spin_unlock(&ln->nsm_clnt_lock);
+	return clnt;
+}
+
 static struct rpc_clnt *nsm_client_get(struct net *net)
 {
-	static DEFINE_MUTEX(nsm_create_mutex);
-	struct rpc_clnt	*clnt;
+	struct rpc_clnt	*clnt, *new;
 	struct lockd_net *ln = net_generic(net, lockd_net_id);
 
-	spin_lock(&ln->nsm_clnt_lock);
-	if (ln->nsm_users) {
-		ln->nsm_users++;
-		clnt = ln->nsm_clnt;
-		spin_unlock(&ln->nsm_clnt_lock);
+	clnt = nsm_client_set(ln, NULL);
+	if (clnt != NULL)
 		goto out;
-	}
-	spin_unlock(&ln->nsm_clnt_lock);
 
-	mutex_lock(&nsm_create_mutex);
-	clnt = nsm_create(net);
-	if (!IS_ERR(clnt)) {
-		ln->nsm_clnt = clnt;
-		smp_wmb();
-		ln->nsm_users = 1;
-	}
-	mutex_unlock(&nsm_create_mutex);
+	clnt = new = nsm_create(net);
+	if (IS_ERR(clnt))
+		goto out;
+
+	clnt = nsm_client_set(ln, new);
+	if (clnt != new)
+		rpc_shutdown_client(new);
 out:
 	return clnt;
 }
-- 
1.7.11.7


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug)
  2012-10-23 18:23                     ` Myklebust, Trond
@ 2012-10-23 19:49                       ` Nix
  2012-10-24 10:18                         ` [PATCH] lockd: fix races in per-net NSM client handling Stanislav Kinsbursky
  0 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-23 19:49 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: J. Bruce Fields, Ted Ts'o, linux-kernel, Schumaker, Bryan,
	Peng Tao, gregkh, linux-nfs, Stanislav Kinsbursky

On 23 Oct 2012, Trond Myklebust outgrape:

> On Tue, 2012-10-23 at 13:57 -0400, Trond Myklebust wrote:
>> On Tue, 2012-10-23 at 17:44 +0000, Myklebust, Trond wrote:
>> > You can't hold a spinlock while sleeping. Both mutex_lock() and nsm_create() can definitely sleep.
>> > 
>> > The correct way to do this is to grab the spinlock and recheck the value of ln->nsm_users inside the 'if (!IS_ERR())' condition. If it is still zero, bump it and set ln->nsm_clnt, otherwise bump it, get the existing ln->nsm_clnt and call rpc_shutdown_clnt() on the redundant nsm client after dropping the spinlock.
>> > 
>> > Cheers
>> >   Trond
>> 
>> Can you please check if the following patch fixes the issue?
>> 
>> Cheers
>>   Trond
>> 
> Meh... This one gets rid of the 100% redundant mutex...

No help, I'm afraid:

[  894.005699] ------------[ cut here ]------------
[  894.005929] kernel BUG at fs/lockd/mon.c:159!
[  894.006156] invalid opcode: 0000 [#1] SMP
[  894.006451] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
[  894.007005] CPU 1
[  894.007050] Pid: 1035, comm: lockd Not tainted 3.6.3-dirty #1 empty empty/S7010
[  894.007669] RIP: 0010:[<ffffffff8120fbbc>]  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
[  894.008126] RSP: 0018:ffff880620a23ce0  EFLAGS: 00010246
[  894.008355] RAX: ffff880620a23ce8 RBX: 0000000000000000 RCX: 0000000000000000
[  894.008591] RDX: ffff880620a23d58 RSI: 0000000000000002 RDI: ffff880620a23d30
[  894.008827] RBP: ffff880620a23d40 R08: 0000000000000000 R09: ffffea00188e4f00
[  894.009063] R10: ffffffff814d032f R11: 0000000000000020 R12: 0000000000000000
[  894.009300] R13: ffff88061f067e40 R14: ffff88061f067ee8 R15: ffff88062393dc00
[  894.009537] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000
[  894.009956] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  894.010187] CR2: 00007f056a9a6ff0 CR3: 0000000001a0b000 CR4: 00000000000027e0
[  894.010422] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  894.010659] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  894.010896] Process lockd (pid: 1035, threadinfo ffff880620a22000, task ffff8806208b5900)
[  894.011310] Stack:
[  894.011528]  0000000000000010 ffff8806102d3db1 00000003000186b5 ffffffff00000010
[  894.012083]  ffff8806102d3dc1 000000000000008c 0000000000000000 ffff880620a23ce8
[  894.012637]  ffff880620a23d58 0000000000000000 ffff88061f067ee8 ffff8806102d3d00
[  894.013190] Call Trace:
[  894.013413]  [<ffffffff8120ff07>] nsm_monitor+0x123/0x17e
[  894.013645]  [<ffffffff81211b72>] nlm4svc_retrieve_args+0x62/0xd7
[  894.013879]  [<ffffffff81211f71>] nlm4svc_proc_lock+0x3c/0xb5
[  894.014112]  [<ffffffff812116a3>] ? nlm4svc_decode_lockargs+0x47/0xb2
[  894.014349]  [<ffffffff814d89fa>] svc_process+0x3bf/0x6a1
[  894.014581]  [<ffffffff8120d5f0>] lockd+0x127/0x164
[  894.014810]  [<ffffffff8120d4c9>] ? set_grace_period+0x8a/0x8a
[  894.015046]  [<ffffffff8107bcbc>] kthread+0x8b/0x93
[  894.015277]  [<ffffffff81501334>] kernel_thread_helper+0x4/0x10
[  894.015511]  [<ffffffff8107bc31>] ? kthread_worker_fn+0xe1/0xe1
[  894.015744]  [<ffffffff81501330>] ? gs_change+0xb/0xb
[  894.015972] Code: b8 10 00 00 00 48 89 45 c0 48 8d 81 8c 00 00 00 b9 08 00 00 00 48 89 45 c8 89 d8 f3 ab 48 8d 45 a8 48 89 55 e0 48 89 45 d8 75 02 <0f> 0b 89 f6 48 c7 02 00 00 00 00 4c 89 c7 48 6b f6 38 ba 00 04
[  894.018895] RIP  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
[  894.019163]  RSP <ffff880620a23ce0>
[  894.019401] ---[ end trace b8ef5cb81bec72c8 ]---

Slightly different timing, but still boom.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 16:32       ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix
  2012-10-23 16:46         ` J. Bruce Fields
@ 2012-10-23 20:57         ` Nix
  2012-10-23 22:19           ` Theodore Ts'o
                             ` (3 more replies)
  1 sibling, 4 replies; 83+ messages in thread
From: Nix @ 2012-10-23 20:57 UTC (permalink / raw)
  To: Ted Ts'o, linux-ext4
  Cc: linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao,
	Trond.Myklebust, gregkh, linux-nfs

[Bruce, Trond, I fear it may be hard for me to continue chasing this NFS
 lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like
 this. Apologies.]

On 23 Oct 2012, nix@esperi.org.uk uttered the following:
> Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was
> rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss
> in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I
> have runs of NULs in my logs now, which keep eating the oopses):
>
> [while in 3.6.1]
> [   88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> [   88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> [   89.648152] ------------[ cut here ]------------
> [   89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42()
> [   89.648614] Hardware name: empty
> [   89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> [   89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1
> [   89.649610] Call Trace:
> [   89.649832]  [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b
> [   89.650063]  [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c
> [   89.650292]  [<ffffffff8112efbf>] drop_nlink+0x25/0x42
> [   89.650533]  [<ffffffff81187112>] ext4_dec_count+0x26/0x28
> [   89.650763]  [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4
> [   89.650993]  [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7
> [   89.651224]  [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7
> [   89.651454]  [<ffffffff81127a53>] sys_renameat+0x1b1/0x228
> [   89.651682]  [<ffffffff8114bda1>] ? fsnotify+0x226/0x249
> [   89.651911]  [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20
> [   89.652145]  [<ffffffff8111a240>] ? vfs_write+0x116/0x142
> [   89.652372]  [<ffffffff81127ae5>] sys_rename+0x1b/0x1e
> [   89.652601]  [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b
> [...]
> [while having just booted into 3.6.1 after some time in 3.6.3: the FS
>  was clean, and fscked on the previous boot into 3.6.3 after a previous
>  instance of this bug]
> Oct 23 17:18:26 spindle crit: [   67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828)
> 
> This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to
> tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now.

It is now quite clear that this is a bug introduced by one or more of
the post-3.6.1 ext4 patches (which have all been backported at least to
3.5, so the problem is probably there too). Rebooting from 3.6.3 back
into 3.6.1, I saw this within seconds of boot:

[   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
[   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[  116.508621] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 5, 17749 clusters in bitmap, 17700 in gd
[  116.509626] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 8, 15814 clusters in bitmap, 16073 in gd
[  116.510103] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 10, 3349 clusters in bitmap, 3493 in gd
[  116.510571] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 12, 1792 clusters in bitmap, 1648 in gd
[  116.511691] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[  116.512736] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 25, 14463 clusters in bitmap, 14462 in gd
[  116.513624] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[  359.538550] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10024
[  359.559220] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10572
[  366.113780] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10024
[  366.114837] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10572
[  456.013682] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  456.384454] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  457.508943] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  457.509422] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  457.509897] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  478.779574] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.780047] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.780516] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.780983] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.782010] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.782480] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  479.826974] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm lesspipe.sh: deleted inode referenced: 10303
[  479.834091] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm lesspipe.sh: deleted inode referenced: 10303
[  479.835966] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm file: deleted inode referenced: 10303
[  479.856946] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm less: deleted inode referenced: 10303
[  479.857431] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm less: deleted inode referenced: 10303
[  679.812704] EXT4-fs error (device dm-3): mb_free_blocks:1300: group 27, block 916489:freeing already freed block (bit 31753)

(I'd provide more sample errors, but this bug has been eating
newly-written logs in /var all day, so not much has survived.)

I rebooted into 3.6.1 rescue mode and fscked everything: lots of
orphans, block group corruption and cross-linked files. The problems did
not recur upon booting from 3.6.1 into 3.6.1 again. It is quite clear
that metadata changes made in 3.6.3 are not making it to disk reliably,
thus leading to corrupted filesystems marked clean on reboot into other
kernels: pretty much every file appended to in 3.6.3 loses some or all
of its appended data, and newly allocated blocks often end up
cross-linked between multiple files.

The curious thing is this doesn't affect every filesystem: for a while
it affected only /var, and now it's affecting only /var and /home. The
massive writes to the ext4 filesystem mounted on /usr/src seem to have
gone off without incident: fsck reports no problems.


The only unusual thing about the filesystems on this machine is that
they sit on hardware RAID-5 (using the Areca driver), so I'm mounting with
'nobarrier'; the full set of options for all my ext4 filesystems is:

rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

If there's anything I can do to help, I'm happy to do it, once I've
restored my home directory from backup :(


tune2fs output for one of the afflicted filesystems (after fscking):

tune2fs 1.42.2 (9-Apr-2012)
Filesystem volume name:   home
Last mounted on:          /home
Filesystem UUID:          95bd22c2-253c-456f-8e36-b6cfb9ecd4ef
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3276800
Block count:              13107200
Reserved block count:     655360
Free blocks:              5134852
Free inodes:              3174777
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      20
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stripe width:        16
Flex block group size:    64
Filesystem created:       Tue May 26 21:29:41 2009
Last mount time:          Tue Oct 23 21:32:07 2012
Last write time:          Tue Oct 23 21:32:07 2012
Mount count:              2
Maximum mount count:      20
Last checked:             Tue Oct 23 21:22:16 2012
Check interval:           15552000 (6 months)
Next check after:         Sun Apr 21 21:22:16 2013
Lifetime writes:          1092 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       1572907
Default directory hash:   half_md4
Directory Hash Seed:      a201983d-d8a3-460b-93ca-eb7804b62c23
Journal backup:           inode blocks

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
@ 2012-10-23 22:19           ` Theodore Ts'o
  2012-10-23 22:47             ` Nix
                               ` (4 more replies)
  2012-10-24  1:13           ` Eric Sandeen
                             ` (2 subsequent siblings)
  3 siblings, 5 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-23 22:19 UTC (permalink / raw)
  To: Nix
  Cc: linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker,
	Peng Tao, Trond.Myklebust, gregkh, Toralf Förster,
	Eric Sandeen, stable

On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
> 
> It is now quite clear that this is a bug introduced by one or more of
> the post-3.6.1 ext4 patches (which have all been backported at least to
> 3.5, so the problem is probably there too).
> 
> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> 

I think I've found the problem.  I believe the commit at fault is commit
14b4ed22a6 (upstream commit eeecef0af5e):

    jbd2: don't write superblock when if its empty

which first appeared in v3.6.2.

The reason why the problem happens rarely is that the effect of the
buggy commit is that if the journal's starting block is zero, we fail
to truncate the journal when we unmount the file system.  This can
happen if we mount and then unmount the file system fairly quickly,
before the log has a chance to wrap.  After the first time this has
happened, it's not a disaster, since when we replay the journal, we'll
just replay some extra transactions.  But if this happens twice, the
oldest valid transaction will still not have gotten updated, but some
of the newer transactions from the last mount session will have gotten
written by the very latest transactions, and when we then try to do
the extra transaction replays, the metadata blocks can end up getting
very scrambled indeed.
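[Editorial note: the failure sequence described above can be illustrated with a toy model of the journal superblock tail. This is a sketch under Ted's theory, not actual jbd2 code; the class, function, and sequence numbers merely mirror the kernel's fields for illustration.]

```python
# Toy model of jbd2_mark_journal_empty() with the buggy check.
# s_start == 0 means "clean"; s_sequence is the oldest transaction
# that journal replay would start from.

class JournalSB:
    def __init__(self):
        self.s_start = 0     # freshly created / cleanly unmounted journal
        self.s_sequence = 1  # replay would start at transaction 1

def mark_journal_empty(sb, tail_sequence, buggy):
    # Buggy version: skip the superblock write whenever s_start == 0,
    # even though s_sequence still needs to advance past the
    # session's committed transactions.
    if buggy and sb.s_start == 0:
        return
    sb.s_sequence = tail_sequence
    sb.s_start = 0

# A short mount session commits transactions 1..5; the log never
# wraps, so the superblock tail was never rewritten and s_start is
# still 0 on disk when we unmount.
buggy_sb, fixed_sb = JournalSB(), JournalSB()
mark_journal_empty(buggy_sb, tail_sequence=6, buggy=True)
mark_journal_empty(fixed_sb, tail_sequence=6, buggy=False)

print(buggy_sb.s_sequence)  # 1: replay would reapply stale transactions
print(fixed_sb.s_sequence)  # 6: journal correctly marked empty
```

In this model, a second quick mount/unmount cycle leaves the buggy superblock's tail sequence stale again, so the next recovery replays transactions from two sessions back over newer metadata.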

*Sigh*.  My apologies for not catching this when I reviewed this
patch.  I believe the following patch should fix the bug; once it's
reviewed by other ext4 developers, I'll push this to Linus ASAP.

						- Ted

commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Tue Oct 23 18:15:22 2012 -0400

    jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty
    
    Fix a potential file system corrupting bug which was introduced by
    commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
    superblock when if its empty.
    
    We should only skip writing the journal superblock if there is nothing
    to do --- not just when s_start is zero.
    
    This has caused users to report file system corruptions in ext4 that
    look like this:
    
    EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
    JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
    
    after the file system has been corrupted.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@vger.kernel.org

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 0f16edd..0064181 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1351,18 +1351,20 @@ void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
 static void jbd2_mark_journal_empty(journal_t *journal)
 {
 	journal_superblock_t *sb = journal->j_superblock;
+	__be32		new_tail_sequence;
 
 	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	read_lock(&journal->j_state_lock);
-	/* Is it already empty? */
-	if (sb->s_start == 0) {
+	new_tail_sequence = cpu_to_be32(journal->j_tail_sequence);
+	/* Nothing to do? */
+	if (sb->s_start == 0 && sb->s_sequence == new_tail_sequence) {
 		read_unlock(&journal->j_state_lock);
 		return;
 	}
 	jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
 		  journal->j_tail_sequence);
 
-	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
+	sb->s_sequence = new_tail_sequence;
 	sb->s_start    = cpu_to_be32(0);
 	read_unlock(&journal->j_state_lock);
 



^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 22:19           ` Theodore Ts'o
@ 2012-10-23 22:47             ` Nix
  2012-10-23 23:16               ` Theodore Ts'o
  2012-10-23 23:06             ` Nix
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-23 22:47 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker,
	Peng Tao, Trond.Myklebust, gregkh, Toralf Förster,
	Eric Sandeen, stable

On 23 Oct 2012, Theodore Ts'o said:

> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.

Oh dear oh dear.

>                                                           This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.

... which is quite likely if you're rebooting frequently to try to track
down some other kernel bug.

>                                       After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transactions, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.

Ow. OK, it's a good thing I rebooted fast. :) And only fses that got
written to, but not too much, will see this. Hence my /usr/src stayed
intact because it had lots of updates of lots of tiny files, more than
enough to cause the journal to wrap over and over again, even
journalling only metadata. But /home doesn't see so many updates, and
neither does /var...

This seems to explain everything.

It looks like fscking everything will fix it (it'll replay the buggered
journal, mangling the metadata, but then fix up the scrambled metadata
and fix the journal's starting block). So I probably don't need to worry
about latent corruption hiding waiting to pounce. Phew.

> *Sigh*.  My apologies for not catching this when I reviewed this
> patch.  I believe the following patch should fix the bug; once it's
> reviewed by other ext4 developers, I'll push this to Linus ASAP.

No problem. This is my first data-corruption bug in more than seventeen
years of ext* use (it even survived horribly faulty RAM). I call that a
good record. And it happened one day after a full backup, and was
immediately highlighted by corruption of .bash_history and input/output
errors when logging in -- and fsck pretty much fixed the problem, with only a
few missing files, one file full of garbage, and one high-ASCII filename
in a temporary directory to show for it. I call that luckier than I have
any right to be.

Plus, my faith in the amazingly fast bugfixing talents of ext4 devs is
undimmed! :)

> 						- Ted
>
> commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Tue Oct 23 18:15:22 2012 -0400
>
>     jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty

I'll apply this tomorrow (enough fun with filesystem restoration for
today) and see what happens. (What could *possibly* go wrong?)

But I might not upgrade to stable kernels quite so often in future :(
you know what they say: once burnt, twice not upgrading before doing a
full backup!

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 22:19           ` Theodore Ts'o
  2012-10-23 22:47             ` Nix
@ 2012-10-23 23:06             ` Nix
  2012-10-23 23:28               ` Theodore Ts'o
  2012-10-24  0:57             ` Eric Sandeen
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-23 23:06 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker,
	Peng Tao, Trond.Myklebust, gregkh, Toralf Förster,
	Eric Sandeen, stable

On 23 Oct 2012, Theodore Ts'o verbalised:
> *Sigh*.  My apologies for not catching this when I reviewed this
> patch.  I believe the following patch should fix the bug; once it's
> reviewed by other ext4 developers, I'll push this to Linus ASAP.

I note that the patch is in the latest stable releases of 3.4.x and
3.5.x, which are IIRC end-of-lifed. I'd say that if your patch does
indeed fix it, this justifies pushing out new releases of both these
stable kernels with just this patch in, just to make sure people taking
the latest stable kernel from those releases don't eat their
filesystems.

The bug did really quite a lot of damage to my /home fs in only a few
minutes of uptime, given how few files I wrote to it. What it could have
done to a more conventional distro install with everything including
/home on one filesystem, I shudder to think.

(I've posted a note to the LWN thread announcing 3.6.3, just in case
that saves someone's fs.)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 22:47             ` Nix
@ 2012-10-23 23:16               ` Theodore Ts'o
  0 siblings, 0 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-23 23:16 UTC (permalink / raw)
  To: Nix
  Cc: linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker,
	Peng Tao, Trond.Myklebust, gregkh, Toralf Förster,
	Eric Sandeen, stable, Jan Kara

Just to follow up (mostly for ext4 developers).  After talking to Eric
via irc, it appears he thought it was sufficient to check (s_start ==
0) from commit 24bcc89c7e, which was authored by Jan Kara.  (Which we
now need to audit very carefully, although it's been in the upstream
kernel since 3.4, so it's obviously not causing failures as
spectacularly or as easily as eeecef0af5e.)

And I suspect the reason why Jan thought this was OK is because of the
following totally bogus comment at fs/jbd2/recovery.c:259:

	/*
	 * The journal superblock's s_start field (the current log head)
	 * is always zero if, and only if, the journal was cleanly
	 * unmounted.
	 */

After doing some code archeology, I've found that this comment dates
back to the very first commit in the historic git tree when the fs/jbd
code was added to the 2.4.14 tree.  I suspect that s_start was
originally a physical block number (in the very early days when sct
was initially developing ext3, before it was submitted to the kernel),
but then when Stephen added the ability to store the journal in an
inode, it became a logical block number, and this comment became
incorrect, but no one noticed and/or decided to fix the comment in the
last ten years.  :-(

So now we know the root cause of the thought processes that led to
the bug, and now we need to double check the changes in commits
24bcc89c7e for jbd2, and 9754e39c7b for jbd (a similar change was also
added to ext3 in v3.5).

						- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 23:06             ` Nix
@ 2012-10-23 23:28               ` Theodore Ts'o
  2012-10-23 23:34                 ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-23 23:28 UTC (permalink / raw)
  To: Nix
  Cc: linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker,
	Peng Tao, Trond.Myklebust, gregkh, Toralf Förster,
	Eric Sandeen, stable

On Wed, Oct 24, 2012 at 12:06:21AM +0100, Nix wrote:
> I note that the patch is in the latest stable releases of 3.4.x and
> 3.5.x, which are IIRC end-of-lifed. I'd say that if your patch does
> indeed fix it, this justifies pushing out new releases of both these
> stable kernels with just this patch in, just to make sure people taking
> the latest stable kernel from those releases don't eat their
> filesystems.

Eric is in the process of reviewing the bug, and creating a repro case
so we can definitely show that my theory is sound, and that the bug
has been fixed by my proposed fix.  We know that my patch definitely
restores the behaviour previous to commit eeecef0af5, so it can't
hurt, but we do want to make 100% sure that it really fixes the
problem.  (I found the potential bug by desk checking all of the
commits between v3.6.1 and v3.6.3, and none of the other commits
triggered my WTF alarm, but we want to have an easy repro case so we
can be 100% sure it's been fixed.  It's always nice when theory is
backed up with empirical evidence.  :-)

Until then, it should also be fine to just revert that commit on the
other stable kernels.

> The bug did really quite a lot of damage to my /home fs in only a few
> minutes of uptime, given how few files I wrote to it. What it could have
> done to a more conventional distro install with everything including
> /home on one filesystem, I shudder to think.

Well, the problem won't show up if the journal has wrapped.  So it
will only show up if the system has been rebooted twice in fairly
quick succession.  A full conventional distro install probably
wouldn't have triggered a bug... although someone who habitually
reboots their laptop instead of using suspend/resume or hibernate, or
someone who is trying to bisect the kernel looking for some other bug
could easily trip over this --- which I guess is how you got hit by
it.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 23:28               ` Theodore Ts'o
@ 2012-10-23 23:34                 ` Nix
  0 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-23 23:34 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker,
	Peng Tao, Trond.Myklebust, gregkh, Toralf Förster,
	Eric Sandeen, stable

On 24 Oct 2012, Theodore Ts'o told this:

> hurt, but we do want to make 100% sure that it really fixes the
> problem.

Well, yes, that would be nice. I can certainly try to verify that it
stops my filesystems getting corrupted. (And if so, I owe you a
$BEVERAGE. Though I suspect I owe you about three million of those
already for other code written in the past.)

>> The bug did really quite a lot of damage to my /home fs in only a few
>> minutes of uptime, given how few files I wrote to it. What it could have
>> done to a more conventional distro install with everything including
>> /home on one filesystem, I shudder to think.
>
> Well, the problem won't show up if the journal has wrapped.  So it
> will only show up if the system has been rebooted twice in fairly
> quick succession.  A full conventional distro install probably
> wouldn't have triggered a bug...

A full *install* from scratch, no. I was more worried about the
possibility of someone running -stable kernels on an existing distro
installation, and shutting down every night (given what's been happening
to UK electricity prices in the last few years I suspect there are quite
a lot of people doing that in the UK to save power). If they happen not
to do much on one particular day other than a bit of light distro
updating, they could perfectly well end up roasting things touched
during the distro update. Things like glibc :(

>                                  although someone who habitually
> reboots their laptop instead of using suspend/resume or hibernate, or
> someone who is trying to bisect the kernel looking for some other bug
> could easily trip over this --- which I guess is how you got hit by
> it.

I was first hit by it in /var before I was even trying to bisect: I was
just rebooting to unwedge NFS lockd. It's true that in less than a week
probably not all that many people have rebooted often enough to trip
over this.

I hope.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 22:19           ` Theodore Ts'o
  2012-10-23 22:47             ` Nix
  2012-10-23 23:06             ` Nix
@ 2012-10-24  0:57             ` Eric Sandeen
  2012-10-24 20:17               ` Jan Kara
  2012-10-24 19:13             ` Jannis Achstetter
  2012-10-24 21:04             ` Jannis Achstetter
  4 siblings, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-24  0:57 UTC (permalink / raw)
  To: Theodore Ts'o, Nix, linux-ext4, linux-kernel,
	J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust,
	gregkh, Toralf Förster, stable, Jan Kara

On 10/23/12 5:19 PM, Theodore Ts'o wrote:
> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
>>
>> It is now quite clear that this is a bug introduced by one or more of
>> the post-3.6.1 ext4 patches (which have all been backported at least to
>> 3.5, so the problem is probably there too).
>>
>> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
>> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>>
> 
> I think I've found the problem.  I believe the commit at fault is commit
> 14b4ed22a6 (upstream commit eeecef0af5e):
> 
>     jbd2: don't write superblock when if its empty
> 
> which first appeared in v3.6.2.
> 
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.  After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transactions, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.

I'm stumped by this; maybe Ted can see if I'm missing something.

(and Nix, is there anything special about your fs?  Any nondefault
mkfs or mount options, external journal, inordinately large fs, or
anything like that?)

The suspect commit added this in jbd2_mark_journal_empty():

        /* Is it already empty? */
        if (sb->s_start == 0) {
                read_unlock(&journal->j_state_lock);
                return;
        }

thereby short circuiting the function.

But Ted's suggestion that mounting the fs, doing a little work, and
unmounting before we wrap would lead to this doesn't make sense to
me.  When I do a little work, s_start is at 1, not 0.  We start
the journal at s_first:

load_superblock()
	journal->j_first = be32_to_cpu(sb->s_first);

And when we wrap the journal, we wrap back to j_first:

jbd2_journal_next_log_block():
        if (journal->j_head == journal->j_last)
                journal->j_head = journal->j_first;

and j_first comes from s_first, which is set at journal creation
time to be "1" for an internal journal.
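[Editorial note: the wrap argument above can be checked with a few lines. This is a sketch of the quoted wrap logic only; the journal size and block numbers are made up for illustration.]

```python
# Mimic the quoted jbd2_journal_next_log_block() wrap: the head
# cycles within [j_first, j_last), and j_first is 1 for an internal
# journal (s_first == 1), so the head never lands on block 0.
def next_log_block(head, j_first, j_last):
    # Hand out the current head, then advance, wrapping to j_first.
    block = head
    head += 1
    if head == j_last:
        head = j_first
    return block, head

j_first, j_last = 1, 8          # toy internal journal: blocks 1..7
head = j_first
visited = []
for _ in range(20):             # allocate well past one full wrap
    block, head = next_log_block(head, j_first, j_last)
    visited.append(block)

print(sorted(set(visited)))     # [1, 2, 3, 4, 5, 6, 7]; 0 never appears
```

So in this model, as in Eric's reading of the code, normal log allocation alone cannot produce s_start == 0; only an explicit mark-empty path can.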

So s_start == 0 sure looks special to me; so far I can only see that
we get there if we've been through jbd2_mark_journal_empty() already,
though I'm eyeballing jbd2_journal_get_log_tail() as well.

Ted's proposed patch seems harmless but so far I don't understand
what problem it fixes, and I cannot recreate getting to
jbd2_mark_journal_empty() with a dirty log and s_start == 0.

-Eric

> *Sigh*.  My apologies for not catching this when I reviewed this
> patch.  I believe the following patch should fix the bug; once it's
> reviewed by other ext4 developers, I'll push this to Linus ASAP.
> 
> 						- Ted
> 
> commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Tue Oct 23 18:15:22 2012 -0400
> 
>     jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty
>     
>     Fix a potential file system corrupting bug which was introduced by
>     commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
>     superblock when if its empty.
>     
>     We should only skip writing the journal superblock if there is nothing
>     to do --- not just when s_start is zero.
>     
>     This has caused users to report file system corruptions in ext4 that
>     look like this:
>     
>     EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
>     JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>     
>     after the file system has been corrupted.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>     Cc: stable@vger.kernel.org
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 0f16edd..0064181 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1351,18 +1351,20 @@ void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
>  static void jbd2_mark_journal_empty(journal_t *journal)
>  {
>  	journal_superblock_t *sb = journal->j_superblock;
> +	__be32		new_tail_sequence;
>  
>  	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
>  	read_lock(&journal->j_state_lock);
> -	/* Is it already empty? */
> -	if (sb->s_start == 0) {
> +	new_tail_sequence = cpu_to_be32(journal->j_tail_sequence);
> +	/* Nothing to do? */
> +	if (sb->s_start == 0 && sb->s_sequence == new_tail_sequence) {
>  		read_unlock(&journal->j_state_lock);
>  		return;
>  	}
>  	jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
>  		  journal->j_tail_sequence);
>  
> -	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
> +	sb->s_sequence = new_tail_sequence;
>  	sb->s_start    = cpu_to_be32(0);
>  	read_unlock(&journal->j_state_lock);
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
  2012-10-23 22:19           ` Theodore Ts'o
@ 2012-10-24  1:13           ` Eric Sandeen
  2012-10-24  4:15             ` Nix
  2012-10-26 20:35           ` Eric Sandeen
  2012-10-28  4:23           ` [PATCH] ext4: fix unjournaled inode bitmap modification Eric Sandeen
  3 siblings, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-24  1:13 UTC (permalink / raw)
  To: Nix
  Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On 10/23/12 3:57 PM, Nix wrote:

<snip>

> (I'd provide more sample errors, but this bug has been eating
> newly-written logs in /var all day, so not much has survived.)
> 
> I rebooted into 3.6.1 rescue mode and fscked everything: lots of
> orphans, block group corruption and cross-linked files. The problems did
> not recur upon booting from 3.6.1 into 3.6.1 again. It is quite clear
> that metadata changes made in 3.6.3 are not making it to disk reliably,
> thus leading to corrupted filesystems marked clean on reboot into other
> kernels: pretty much every file appended to in 3.6.3 loses some or all
> of its appended data, and newly allocated blocks often end up
> cross-linked between multiple files.
> 
> The curious thing is this doesn't affect every filesystem: for a while
> it affected only /var, and now it's affecting only /var and /home. The
> massive writes to the ext4 filesystem mounted on /usr/src seem to have
> gone off without incident: fsck reports no problems.
> 
> 
> The only unusual thing about the filesystems on this machine are that
> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
> 'nobarrier': 

I should have read more.  :(  More questions follow:

* Does the Areca have a battery backed write cache?
* Are you crashing or rebooting cleanly?
* Do you see log recovery messages in the logs for this filesystem?

> the full set of options for all my ext4 filesystems are:
> 
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

ok journal_async_commit is off the reservation a bit; that's really not
tested, and Jan had serious reservations about its safety.

* Can you reproduce this w/o journal_async_commit?

-Eric

> If there's anything I can do to help, I'm happy to do it, once I've
> restored my home directory from backup :(
> 
> 
> tune2fs output for one of the afflicted filesystems (after fscking):
> 
> tune2fs 1.42.2 (9-Apr-2012)
> Filesystem volume name:   home
> Last mounted on:          /home
> Filesystem UUID:          95bd22c2-253c-456f-8e36-b6cfb9ecd4ef
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Default mount options:    (none)
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              3276800
> Block count:              13107200
> Reserved block count:     655360
> Free blocks:              5134852
> Free inodes:              3174777
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Reserved GDT blocks:      20
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         8192
> Inode blocks per group:   512
> RAID stripe width:        16
> Flex block group size:    64
> Filesystem created:       Tue May 26 21:29:41 2009
> Last mount time:          Tue Oct 23 21:32:07 2012
> Last write time:          Tue Oct 23 21:32:07 2012
> Mount count:              2
> Maximum mount count:      20
> Last checked:             Tue Oct 23 21:22:16 2012
> Check interval:           15552000 (6 months)
> Next check after:         Sun Apr 21 21:22:16 2013
> Lifetime writes:          1092 GB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               256
> Required extra isize:     28
> Desired extra isize:      28
> Journal inode:            8
> First orphan inode:       1572907
> Default directory hash:   half_md4
> Directory Hash Seed:      a201983d-d8a3-460b-93ca-eb7804b62c23
> Journal backup:           inode blocks


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  1:13           ` Eric Sandeen
@ 2012-10-24  4:15             ` Nix
  2012-10-24  4:27               ` Eric Sandeen
  2012-10-26  0:11               ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Ric Wheeler
  0 siblings, 2 replies; 83+ messages in thread
From: Nix @ 2012-10-24  4:15 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On 24 Oct 2012, Eric Sandeen uttered the following:

> On 10/23/12 3:57 PM, Nix wrote:
>> The only unusual thing about the filesystems on this machine are that
>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>> 'nobarrier': 
>
> I should have read more.  :(  More questions follow:
>
> * Does the Areca have a battery backed write cache?

Yes (though I'm not powering off, just rebooting). Battery at 100% and
happy, though the lack of power-off means it's not actually getting
used, since the cache is obviously mains-backed as well.

> * Are you crashing or rebooting cleanly?

Rebooting cleanly, everything umounted happily including /home and /var.

> * Do you see log recovery messages in the logs for this filesystem?

My memory says yes, but nothing seems to be logged when this happens
(though with my logs on the first filesystem damaged by this, this is
rather hard to tell, they're all quite full of NULs by now).

I'll double-reboot tomorrow via the faulty kernel and check, unless I
get asked not to in the interim. (And then double-reboot again to fsck
everything...)

>> the full set of options for all my ext4 filesystems are:
>> 
>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>
> ok journal_async_commit is off the reservation a bit; that's really not
> tested, and Jan had serious reservations about its safety.

OK, well, I've been 'testing' it for years :) No problems until now. (If
anything, I was more concerned about journal_checksum. I thought that
had actually been implicated in corruption before now...)

> * Can you reproduce this w/o journal_async_commit?

I can try!

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  4:15             ` Nix
@ 2012-10-24  4:27               ` Eric Sandeen
  2012-10-24  5:23                 ` Theodore Ts'o
  2012-10-26  0:11               ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Ric Wheeler
  1 sibling, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-24  4:27 UTC (permalink / raw)
  To: Nix
  Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On 10/23/12 11:15 PM, Nix wrote:
> On 24 Oct 2012, Eric Sandeen uttered the following:
> 
>> On 10/23/12 3:57 PM, Nix wrote:
>>> The only unusual thing about the filesystems on this machine are that
>>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>>> 'nobarrier': 
>>
>> I should have read more.  :(  More questions follow:
>>
>> * Does the Areca have a battery backed write cache?
> 
> Yes (though I'm not powering off, just rebooting). Battery at 100% and
> happy, though the lack of power-off means it's not actually getting
> used, since the cache is obviously mains-backed as well.
> 
>> * Are you crashing or rebooting cleanly?
> 
> Rebooting cleanly, everything umounted happily including /home and /var.
> 
>> * Do you see log recovery messages in the logs for this filesystem?
> 
> My memory says yes, but nothing seems to be logged when this happens
> (though with my logs on the first filesystem damaged by this, this is
> rather hard to tell, they're all quite full of NULs by now).
> 
> I'll double-reboot tomorrow via the faulty kernel and check, unless I
> get asked not to in the interim. (And then double-reboot again to fsck
> everything...)
> 
>>> the full set of options for all my ext4 filesystems are:
>>>
>>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>>
>> ok journal_async_commit is off the reservation a bit; that's really not
>> tested, and Jan had serious reservations about its safety.
> 
> OK, well, I've been 'testing' it for years :) No problems until now. (If
> anything, I was more concerned about journal_checksum. I thought that
> had actually been implicated in corruption before now...)

It had, but I fixed it AFAIK; OTOH, we turned it off by default
after that episode.

>> * Can you reproduce this w/o journal_async_commit?
> 
> I can try!

Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
journal_async_commit, but that doesn't mean this isn't a regression.

Thanks for the answers... onward.  :)

-Eric


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  4:27               ` Eric Sandeen
@ 2012-10-24  5:23                 ` Theodore Ts'o
  2012-10-24  7:00                   ` Hugh Dickins
                                     ` (3 more replies)
  0 siblings, 4 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-24  5:23 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Nix, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker,
	Peng Tao, Trond.Myklebust, gregkh, Toralf Förster

On Tue, Oct 23, 2012 at 11:27:09PM -0500, Eric Sandeen wrote:
> 
> Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
> journal_async_commit, but that doesn't mean this isn't a regression.

Note that Toralf has reported almost exactly the same set of symptoms,
but he's using an external USB stick --- and as far as I know he
wasn't using nobarrier and/or the journal_async_commit.  Toralf, can
you confirm what, if any, mount options you were using when you saw
it.

I've been looking at this some more, and there's one other thing that
the short circuit code does, which is that it neglects setting the
JBD2_FLUSHED flag, which is used by the commit code to know when it
needs to reset the s_start fields in the superblock when we make our
next commit.  However, this would only happen if the short circuit
code is getting hit some time other than when the file system is
getting unmounted --- and that's what Eric and I can't figure out how
it might be happening.  Journal flushes outside of an unmount do
happen as part of online resizing, the FIBMAP ioctl, or when the file
system is frozen.  But it didn't sound like Toralf or Nix was using
any of those features.  (Toralf, Nix, please correct me if my
assumptions here are wrong).

So here's a replacement patch which essentially restores the effects
of eeecef0af5e while still keeping the optimization and fixing the
read/only testing issue which eeecef0af5e is trying to fix up.  It
also has a debugging printk that will trigger so we can perhaps have
a better chance of figuring out what might be going on.

Toralf, Nix, if you could try applying this patch (at the end of this
message), and let me know how and when the WARN_ON triggers, and if it
does, please send the empty_bug_workaround plus the WARN_ON(1) report.
I know about the case where a file system is mounted and then
immediately unmounted, but we don't think that's the problematic case.
If you see any other cases where WARN_ON is triggering, it would be
really good to know....

      	    	       		      - Ted

P.S.  This is a list of all of the commits between v3.6.1 and v3.6.2
(there were no ext4-related changes between v3.6.2 and v3.6.3), and a
quick analysis of the patch.  The last commit, 14b4ed2, is the only
one that I could see as potentially being problematic, which is why
I've been pushing so hard on this one even though my original analysis
doesn't seem to be correct, and Eric and I can't see how the change in
14b4ed2 could be causing the fs corruption.


Online Defrag
=============
22a5672 ext4: online defrag is not supported for journaled files
ba57d9e ext4: move_extent code cleanup
   No behavioral change unless e4defrag has been used.

Online Resize
=============
5018ddd ext4: avoid duplicate writes of the backup bg descriptor blocks
256ae46 ext4: don't copy non-existent gdt blocks when resizing
416a688 ext4: ignore last group w/o enough space when resizing instead of BUG'ing
   No observable change unless online resizing (e2resize) has been used

Other Commits
=============
92b7722 ext4: fix mtime update in nodelalloc mode
   Changes where we call file_update_time()

34414b2 ext4: fix fdatasync() for files with only i_size changes
   Forces the inode changes to be committed if only i_size changes when
   fdatasync() is called.  No changes except performance impact
   to fdatasync() and correctness after a system crash.

12ebdf0 ext4: always set i_op in ext4_mknod()
   Fixes a bug if CONFIG_EXT4_FS_XATTR is not defined;
   no change if CONFIG_EXT4_FS_XATTR is defined

2fdb112 ext4: fix crash when accessing /proc/mounts concurrently
   Removes an erroneous "static" from a function-local variable so it is allocated on the stack;
   fixes a bug if two processes cat /proc/mounts at the same time

1638f1f ext4: fix potential deadlock in ext4_nonda_switch()
   Fixes a circular lock dependency

14b4ed2 jbd2: don't write superblock when if its empty
   If journal->s_start is zero, we may not update journal->s_sequence when
   it might be needed.  (But at the moment we can't see how this could
   lead to the reported fs corruptions.)


commit cb57108637e01ec2f02d9311cedc3013e96f25d4
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Wed Oct 24 01:01:41 2012 -0400

    jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty
    
    Fix a potential file system corrupting bug which was introduced by
    commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
    superblock when if its empty.
    
    We should only skip writing the journal superblock if there is nothing
    to do --- not just when s_start is zero.
    
    This has caused users to report file system corruptions in ext4 that
    look like this:
    
    EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
    JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
    
    after the file system has been corrupted.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@vger.kernel.org

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 0f16edd..26b2983 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1351,24 +1351,33 @@ void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
 static void jbd2_mark_journal_empty(journal_t *journal)
 {
 	journal_superblock_t *sb = journal->j_superblock;
+	__be32		new_tail_sequence;
 
 	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	read_lock(&journal->j_state_lock);
-	/* Is it already empty? */
+	new_tail_sequence = cpu_to_be32(journal->j_tail_sequence);
+	/* Nothing to do? */
 	if (sb->s_start == 0) {
+		pr_err("JBD2: jbd2_mark_journal_empty bug workaround (%u, %u)\n",
+		       (unsigned) be32_to_cpu(sb->s_sequence),
+		       (unsigned) be32_to_cpu(new_tail_sequence));
+		WARN_ON(1);
+	}
+	if (sb->s_start == 0 && sb->s_sequence == new_tail_sequence) {
 		read_unlock(&journal->j_state_lock);
-		return;
+		goto set_flushed;
 	}
 	jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
 		  journal->j_tail_sequence);
 
-	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
+	sb->s_sequence = new_tail_sequence;
 	sb->s_start    = cpu_to_be32(0);
 	read_unlock(&journal->j_state_lock);
 
 	jbd2_write_superblock(journal, WRITE_FUA);
 
-	/* Log is no longer empty */
+set_flushed:
+	/* Log is empty */
 	write_lock(&journal->j_state_lock);
 	journal->j_flags |= JBD2_FLUSHED;
 	write_unlock(&journal->j_state_lock);

^ permalink raw reply related	[flat|nested] 83+ messages in thread
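The two-condition test in Ted's replacement patch is the crux: the eeecef0af5e check skipped the superblock write whenever s_start was zero, while the patch above only skips it when the recorded tail sequence is also current. A toy model of just that predicate (plain C with the locking and I/O stripped out; names are illustrative, not actual jbd2 code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy journal superblock: s_start == 0 means "log is empty" on disk. */
struct jsb {
	uint32_t s_start;
	uint32_t s_sequence;
};

/* Old rule from eeecef0af5e: skip the superblock write whenever the
 * on-disk copy already says "empty" -- even if s_sequence is stale. */
static bool old_skips_write(const struct jsb *sb, uint32_t tail_seq)
{
	(void)tail_seq;
	return sb->s_start == 0;
}

/* Fixed rule from the patch above: skip only when there is truly
 * nothing to update, i.e. the tail sequence is current as well. */
static bool new_skips_write(const struct jsb *sb, uint32_t tail_seq)
{
	return sb->s_start == 0 && sb->s_sequence == tail_seq;
}
```

Under the old rule a journal left with s_start == 0 but a stale s_sequence never gets its superblock rewritten, which is the stale-transaction-replay scenario Ted describes.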

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  5:23                 ` Theodore Ts'o
@ 2012-10-24  7:00                   ` Hugh Dickins
  2012-10-24 11:46                     ` Nix
  2012-10-24 11:45                   ` Nix
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 83+ messages in thread
From: Hugh Dickins @ 2012-10-24  7:00 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, Nix, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On Wed, 24 Oct 2012, Theodore Ts'o wrote:
> Journal flushes outside of an unmount do
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.  (Toralf, Nix, please correct me if my
> assumptions here are wrong).

I believe it also happens at swapon of a swapfile on the filesystem.

Hugh

^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH] lockd: fix races in per-net NSM client handling
  2012-10-23 19:49                       ` Nix
@ 2012-10-24 10:18                         ` Stanislav Kinsbursky
  0 siblings, 0 replies; 83+ messages in thread
From: Stanislav Kinsbursky @ 2012-10-24 10:18 UTC (permalink / raw)
  To: Trond.Myklebust; +Cc: bfields, linux-nfs, linux-kernel, devel

This patch fixes two problems:
1) Removes races on NSM creation.
2) Fixes a silly misprint in NSM client destruction (the usage counter was
checked for a non-zero value instead of zero).

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
---
 fs/lockd/mon.c |   35 +++++++++++++++++++++++------------
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
index e4fb3ba..e3e59f6 100644
--- a/fs/lockd/mon.c
+++ b/fs/lockd/mon.c
@@ -85,30 +85,41 @@ static struct rpc_clnt *nsm_create(struct net *net)
 	return rpc_create(&args);
 }
 
-static struct rpc_clnt *nsm_client_get(struct net *net)
+static struct rpc_clnt *nsm_get_client(struct net *net)
 {
-	static DEFINE_MUTEX(nsm_create_mutex);
-	struct rpc_clnt	*clnt;
+	struct rpc_clnt *clnt = NULL;
 	struct lockd_net *ln = net_generic(net, lockd_net_id);
 
 	spin_lock(&ln->nsm_clnt_lock);
 	if (ln->nsm_users) {
 		ln->nsm_users++;
 		clnt = ln->nsm_clnt;
-		spin_unlock(&ln->nsm_clnt_lock);
-		goto out;
 	}
 	spin_unlock(&ln->nsm_clnt_lock);
+	return clnt;
+}
+
+static struct rpc_clnt *nsm_client_get(struct net *net)
+{
+	static DEFINE_MUTEX(nsm_create_mutex);
+	struct rpc_clnt	*clnt;
+	struct lockd_net *ln = net_generic(net, lockd_net_id);
+
+	clnt = nsm_get_client(net);
+	if (clnt)
+		return clnt;
 
 	mutex_lock(&nsm_create_mutex);
-	clnt = nsm_create(net);
-	if (!IS_ERR(clnt)) {
-		ln->nsm_clnt = clnt;
-		smp_wmb();
-		ln->nsm_users = 1;
+	clnt = nsm_get_client(net);
+	if (clnt == NULL) {
+		clnt = nsm_create(net);
+		if (!IS_ERR(clnt)) {
+			ln->nsm_clnt = clnt;
+			smp_wmb();
+			ln->nsm_users = 1;
+		}
 	}
 	mutex_unlock(&nsm_create_mutex);
-out:
 	return clnt;
 }
 
@@ -120,7 +131,7 @@ static void nsm_client_put(struct net *net)
 
 	spin_lock(&ln->nsm_clnt_lock);
 	if (ln->nsm_users) {
-		if (--ln->nsm_users)
+		if (--ln->nsm_users == 0)
 			ln->nsm_clnt = NULL;
 		shutdown = !ln->nsm_users;
 	}


^ permalink raw reply related	[flat|nested] 83+ messages in thread
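For readers skimming the patch, the restructuring above is the classic double-checked creation pattern: a fast path that takes a reference under the spinlock, and a slow path that re-checks under the creation mutex before allocating, so two racing callers can never each create a client. A rough userspace sketch using pthreads (names hypothetical; the kernel version uses a spinlock and smp_wmb() rather than a second mutex):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct client { int dummy; };

static pthread_mutex_t count_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t create_lock = PTHREAD_MUTEX_INITIALIZER;
static struct client *the_client;
static int users;

/* Fast path: reuse an existing client if one is live. */
static struct client *client_try_get(void)
{
	struct client *c = NULL;

	pthread_mutex_lock(&count_lock);
	if (users) {
		users++;
		c = the_client;
	}
	pthread_mutex_unlock(&count_lock);
	return c;
}

static struct client *client_get(void)
{
	struct client *c = client_try_get();

	if (c)
		return c;

	pthread_mutex_lock(&create_lock);
	c = client_try_get();	/* re-check: someone may have won the race */
	if (!c) {
		c = malloc(sizeof(*c));
		the_client = c;
		users = 1;
	}
	pthread_mutex_unlock(&create_lock);
	return c;
}

static void client_put(void)
{
	struct client *c = NULL;

	pthread_mutex_lock(&count_lock);
	if (users && --users == 0) {	/* the "== 0" is the misprint the patch fixes */
		c = the_client;
		the_client = NULL;
	}
	pthread_mutex_unlock(&count_lock);
	free(c);
}
```

The second client_try_get() under create_lock is what removes the race: without it, two callers that both miss on the fast path would both allocate, leaking one client.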

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  5:23                 ` Theodore Ts'o
  2012-10-24  7:00                   ` Hugh Dickins
@ 2012-10-24 11:45                   ` Nix
  2012-10-24 17:22                   ` Eric Sandeen
  2012-10-24 19:49                   ` Nix
  3 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-24 11:45 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 24 Oct 2012, Theodore Ts'o stated:

>                         Journal flushes outside of an unmount do
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.

Quite so -- the corrupted filesystems have space reserved for resizing,
and one of them has been resized, years ago, but I haven't resized
either of them with this kernel, or with any kernel numbered 3.x for
that matter.

> Toralf, Nix, if you could try applying this patch (at the end of this
> message), and let me know how and when the WARN_ON triggers, and if it
> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
> I know about the case where a file system is mounted and then
> immediately unmounted, but we don't think that's the problematic case.
> If you see any other cases where WARN_ON is triggering, it would be
> really good to know....

I'll give it a test later today, after another backup has finished.
Daily backups are normally overkill, but I don't think they are right
now.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  7:00                   ` Hugh Dickins
@ 2012-10-24 11:46                     ` Nix
  0 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-24 11:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Theodore Ts'o, Eric Sandeen, linux-ext4, linux-kernel,
	J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust,
	gregkh, Toralf Förster

On 24 Oct 2012, Hugh Dickins verbalised:

> On Wed, 24 Oct 2012, Theodore Ts'o wrote:
>> Journal flushes outside of an unmount do
>> happen as part of online resizing, the FIBMAP ioctl, or when the file
>> system is frozen.  But it didn't sound like Toralf or Nix was using
>> any of those features.  (Toralf, Nix, please correct me if my
>> assumptions here are wrong).
>
> I believe it also happens at swapon of a swapfile on the filesystem.

I'm not using swapfiles, only swap partitions (on separate LVM LVs).
So that's not it either.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  5:23                 ` Theodore Ts'o
  2012-10-24  7:00                   ` Hugh Dickins
  2012-10-24 11:45                   ` Nix
@ 2012-10-24 17:22                   ` Eric Sandeen
  2012-10-24 19:49                   ` Nix
  3 siblings, 0 replies; 83+ messages in thread
From: Eric Sandeen @ 2012-10-24 17:22 UTC (permalink / raw)
  To: Theodore Ts'o, Nix, linux-ext4, linux-kernel,
	J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust,
	gregkh, Toralf Förster

On 10/24/2012 12:23 AM, Theodore Ts'o wrote:
> On Tue, Oct 23, 2012 at 11:27:09PM -0500, Eric Sandeen wrote:
>>
>> Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
>> journal_async_commit, but that doesn't mean this isn't a regression.
> 
> Note that Toralf has reported almost exactly the same set of symptoms,
> but he's using an external USB stick --- and as far as I know he
> wasn't using nobarrier and/or the journal_async_commit.  Toralf, can
> you confirm what, if any, mount options you were using when you saw
> it.
> 
> I've been looking at this some more, and there's one other thing that
> the short circuit code does, which is that it neglects setting the
> JBD2_FLUSHED flag, which is used by the commit code to know when it
> needs to reset the s_start fields in the superblock when we make our
> next commit.  However, this would only happen if the short circuit
> code is getting hit some time other than when the file system is
> getting unmounted --- and that's what Eric and I can't figure out how
> it might be happening.  Journal flushes outside of an unmount do
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.  (Toralf, Nix, please correct me if my
> assumptions here are wrong).

If I freeze w/ anything in the log, then s_start !=0 and we proceed
normally.  If I re-freeze w/o anything in the log, it's already set to
FLUSHED (which makes sense) so not re-setting it doesn't matter.  So I
don't see that that's an issue.

As for FIBMAP I think we only do journal_flush if it's data=journal.

In other news, Phoronix is on the case, so expect escalating freakouts ;)

-Eric

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 22:19           ` Theodore Ts'o
                               ` (2 preceding siblings ...)
  2012-10-24  0:57             ` Eric Sandeen
@ 2012-10-24 19:13             ` Jannis Achstetter
  2012-10-24 21:31               ` Theodore Ts'o
  2012-10-24 21:04             ` Jannis Achstetter
  4 siblings, 1 reply; 83+ messages in thread
From: Jannis Achstetter @ 2012-10-24 19:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-ext4, stable

Am 24.10.2012 00:19, schrieb Theodore Ts'o:
> [...]
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.  After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transactions, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.
> [...]

As a "normal linux user" I'm interested in the practical things to do
now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
Fearing loss of data:
- Is there a way to see whether the journal of a specific partition has
been wrapped (since mounting) so that umounting and mounting (or doing a
reboot to downgrade the kernel) is safe?
- Is there a way to "force" a journal-wrap? Run any
filesystem-benchmark? Which one with what parameters? Or is it unwise
since I might even further corrupt data if I hit the case already?
- Is it wise to umount now and run e2fsck or might I corrupt my files
just by umounting now if the journal hasn't wrapped yet?
- How do you define "fairly quickly"? Of course servers run 24/7 but I
might be using my PC 2-5 hrs a day... Is that a "reboot too soon after
booting"?
- Any more advice you can give to the ordinary user to avoid
fs-corruption? Don't shut down machines for some days? Better down- or
upgrade the kernel?

Best regards,
	Jannis Achstetter



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  5:23                 ` Theodore Ts'o
                                     ` (2 preceding siblings ...)
  2012-10-24 17:22                   ` Eric Sandeen
@ 2012-10-24 19:49                   ` Nix
  2012-10-24 19:54                     ` Nix
                                       ` (3 more replies)
  3 siblings, 4 replies; 83+ messages in thread
From: Nix @ 2012-10-24 19:49 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 24 Oct 2012, Theodore Ts'o spake thusly:
> Toralf, Nix, if you could try applying this patch (at the end of this
> message), and let me know how and when the WARN_ON triggers, and if it
> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
> I know about the case where a file system is mounted and then
> immediately unmounted, but we don't think that's the problematic case.
> If you see any other cases where WARN_ON is triggering, it would be
> really good to know....

Confirmed, it triggers. Traceback below.


But first, a rather lengthy apology: I did indeed forget something
unusual about my system. In my defence, this is a change I made to my
shutdown scripts many years ago, when umount -l was first introduced
(early 2000s? something like that). So it's not surprising I forgot
about it until I needed to add sleeps to it to capture the tracebacks
below. It is really ugly. You may need a sick bag. In brief: some of my
filesystems will sometimes be uncleanly unmounted and experience journal
replay even on clean shutdowns, and which it is will vary unpredictably.


Some of my machines have fairly intricate webs of NFS-mounted and
non-NFS-mounted filesystems, and I expect them all to reboot
successfully if commanded remotely, because sometimes I'm hundreds of
miles away when I do it and can hardly hit the reset button.

Unfortunately, if I have a mount structure like this:

/usr         local
/usr/foo     NFS-mounted (may be loopback-NFS-mounted)
/usr/foo/bar local

and /usr/foo is down, any attempt to umount /usr/foo/bar will hang
indefinitely. Worse yet, if I umount the nfs filesystem, the local fs
isn't going to be reachable either -- but umounting nfs filesystems has
to happen first so I can killall everything (which would include e.g.
rpc.statd and rpc.nfsd) in order to free up the local filesystems for
umount.

The only way I could see to fix this is to umount -l everything rather
than umounting it (sure, I could do some sort of NFS-versus-non-NFS
analysis and only do this to some filesystems, but testing this
complexity for the -- for me -- rare case of system shutdown was too
annoying to consider). I consider a hang on shutdown much worse than an
occasional unclean umount, because all my filesystems are journalled so
journal recovery will make everything quite happy.

So I do

sync
umount -a -l -t nfs & sleep 2
killall5 -15
killall5 -9
exportfs -ua
quotaoff -a
swapoff -a
LANG=C sort -r -k 2 /proc/mounts | \
(DIRS=""
 while read DEV DIR TYPE REST; do
     case "$DIR" in
         /|/proc|/dev|/proc/*|/sys)
             continue;; # Ignoring virtual file systems needed later
     esac

     case $TYPE in
         proc|procfs|sysfs|usbfs|usbdevfs|devpts)
             continue;; # Ignoring non-tmpfs virtual file systems
     esac
     DIRS="$DIRS $DIR"
done
umount -l -r -d $DIRS) # rely on mount's toposort
sleep 2

The net effect of this being to cleanly umount everything whose mount
points are reachable and which unmounts cleanly in less than a couple of
seconds, and to leave the rest mounted and let journal recovery handle
them. This is clearly really horrible -- I'd far prefer to say 'sleep
until filesystems have finished doing I/O' or better have mount just not
return from mount(8) unless that is true. But this isn't available, and
even if it were, some fses would still be left to journal recovery, so I
kludged it -- and then forgot about doing anything to improve the
situation for many years.

So, the net effect of this is that normally I get no journal recovery on
anything at all -- but sometimes, if umounting takes longer than a few
seconds, I reboot with not everything unmounted, and journal recovery
kicks in on reboot. My post-test fscks this time suggest that only when
journal recovery kicks in after rebooting out of 3.6.3 do I see
corruption. So this is indeed an unclean shutdown journal-replay
situation: it just happens that I routinely have one or two fses
uncleanly unmounted when all the rest are cleanly unmounted. This
perhaps explains the scattershot nature of the corruption I see, and why
most of my ext4 filesystems get off scot-free.

I'll wait for a minute until you're finished projectile-vomiting. (And
if you have suggestions for making the case of nested local/remote
filesystems work without rebooting while umounts may still be in
progress, or even better suggestions to allow me to umount mounts that
happen to be mounted below NFS-mounted mounts with dead or nonresponsive
NFS server, I'd be glad to hear them! Distros appear to take the
opposite tack, and prefer to simply lock up forever waiting for a
nonresponsive NFS server in this situation. I could never accept that.)


[...]

OK. That umount of local filesystems sprayed your added
empty bug workaround and WARN_ONs so many times that nearly all of them
scrolled off the screen -- and because syslogd was dead by now and this
is where my netconsole logs go, they're lost. I suspect every single
umounted filesystem sprayed one of these (and this happened long before
any reboot-before-we're-done).

But I did the old trick of camera-capturing the last one (which was
probably /boot, which has never got corrupted because I hardly ever
write anything to it at all). I hope it's more useful than nothing. (I
can rearrange things to umount /var last, and try again, if you think
that a specific warning from an fs known to get corrupted is especially
likely to be valuable.)

So I see, for one umount at least (and the chunk of the previous one
that scrolled offscreen is consistent with this):

jbd2_mark_journal_empty bug workaround (21218, 21219)
[obscured by light] at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+0x6c/0xbd
...
[addresses omitted for sanity: traceback only]
warn_slowpath_common+0x83/0x9b
warn_slowpath_null+0x1a/0x1c
jbd2_mark_journal_empty+0x6c/0xbd
jbd2_journal_destroy+0x183/0x20c
? abort_exclusive_wait+0x8e/0x8e
ext4_put_super+0x6c/0x316
? evict_inodes+0xe6/0xf1
generic_shutdown_super+0x59/0xd1
? free_vfsmnt+0x18/0x3c
kill_block_super+0x27/0x6a
deactivate_locked_super+0x26/0x57
deactivate_super+0x3f/0x43
mntput_no_expire+0x134/0x13c
sys_umount+0x308/0x33a
system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 19:49                   ` Nix
@ 2012-10-24 19:54                     ` Nix
  2012-10-24 20:30                     ` Eric Sandeen
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-24 19:54 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 24 Oct 2012, nix@esperi.org.uk uttered the following:
> So, the net effect of this is that normally I get no journal recovery on
> anything at all -- but sometimes, if umounting takes longer than a few
> seconds, I reboot with not everything unmounted, and journal recovery
> kicks in on reboot. My post-test fscks this time suggest that only when
> journal recovery kicks in after rebooting out of 3.6.3 do I see
> corruption. So this is indeed an unclean shutdown journal-replay
> situation: it just happens that I routinely have one or two fses
> uncleanly unmounted when all the rest are cleanly unmounted. This
> perhaps explains the scattershot nature of the corruption I see, and why
> most of my ext4 filesystems get off scot-free.

Note that two umounts are not required: fsck found corruption on /var
after a single boot+shutdown round in 3.6.3+this patch. (It did do a
journal replay on /var first.)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  0:57             ` Eric Sandeen
@ 2012-10-24 20:17               ` Jan Kara
  2012-10-26 15:25                 ` Eric Sandeen
  0 siblings, 1 reply; 83+ messages in thread
From: Jan Kara @ 2012-10-24 20:17 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Ts'o, Nix, linux-ext4, linux-kernel,
	J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust,
	gregkh, Toralf Förster, stable, Jan Kara

On Tue 23-10-12 19:57:09, Eric Sandeen wrote:
> On 10/23/12 5:19 PM, Theodore Ts'o wrote:
> > On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
> >>
> >> It is now quite clear that this is a bug introduced by one or more of
> >> the post-3.6.1 ext4 patches (which have all been backported at least to
> >> 3.5, so the problem is probably there too).
> >>
> >> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
> >> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> >>
> > 
> > I think I've found the problem.  I believe the commit at fault is commit
> > 14b4ed22a6 (upstream commit eeecef0af5e):
> > 
> >     jbd2: don't write superblock when if its empty
> > 
> > which first appeared in v3.6.2.
> > 
> > The reason why the problem happens rarely is that the effect of the
> > buggy commit is that if the journal's starting block is zero, we fail
> > to truncate the journal when we unmount the file system.  This can
> > happen if we mount and then unmount the file system fairly quickly,
> > before the log has a chance to wrap.  After the first time this has
> > happened, it's not a disaster, since when we replay the journal, we'll
> > just replay some extra transactions.  But if this happens twice, the
> > oldest valid transaction will still not have gotten updated, but some
> > of the newer transactions from the last mount session will have gotten
> > written by the very latest transactions, and when we then try to do
> > the extra transaction replays, the metadata blocks can end up getting
> > very scrambled indeed.
> 
> I'm stumped by this; maybe Ted can see if I'm missing something.
> 
> (and Nix, is there anything special about your fs?  Any nondefault
> mkfs or mount options, external journal, inordinately large fs, or
> anything like that?)
> 
> The suspect commit added this in jbd2_mark_journal_empty():
> 
>         /* Is it already empty? */
>         if (sb->s_start == 0) {
>                 read_unlock(&journal->j_state_lock);
>                 return;
>         }
> 
> thereby short circuiting the function.
> 
> But Ted's suggestion that mounting the fs, doing a little work, and
> unmounting before we wrap would lead to this doesn't make sense to
> me.  When I do a little work, s_start is at 1, not 0.  We start
> the journal at s_first:
> 
> load_superblock()
> 	journal->j_first = be32_to_cpu(sb->s_first);
> 
> And when we wrap the journal, we wrap back to j_first:
> 
> jbd2_journal_next_log_block():
>         if (journal->j_head == journal->j_last)
>                 journal->j_head = journal->j_first;
> 
> and j_first comes from s_first, which is set at journal creation
> time to be "1" for an internal journal.
> 
> So s_start == 0 sure looks special to me; so far I can only see that
> we get there if we've been through jbd2_mark_journal_empty() already,
> though I'm eyeballing jbd2_journal_get_log_tail() as well.
> 
> Ted's proposed patch seems harmless but so far I don't understand
> what problem it fixes, and I cannot recreate getting to
> jbd2_mark_journal_empty() with a dirty log and s_start == 0.
  Agreed. I rather think we might miss journal->j_flags |= JBD2_FLUSHED
when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly
see how that would cause the corruption...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 83+ messages in thread
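Eric's observation that s_start never organically returns to zero follows from the wrap arithmetic he quotes: block allocation wraps back to j_first (1 for an internal journal), so only jbd2_mark_journal_empty() ever writes a zero. A standalone sketch of that arithmetic (field names echo jbd2, but this is an illustration, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Field names echo jbd2; this is an illustration, not kernel code. */
struct toy_journal {
	uint32_t j_first;	/* first usable log block; 1 for an internal journal */
	uint32_t j_last;	/* one past the last usable log block */
	uint32_t j_head;	/* next block to hand out */
};

static void toy_journal_init(struct toy_journal *j, uint32_t first, uint32_t last)
{
	j->j_first = first;
	j->j_last = last;
	j->j_head = first;
}

/* Mirrors the wrap in jbd2_journal_next_log_block(): the head goes back
 * to j_first, never to 0, so a log that has seen any work never ends up
 * with its start at block 0 by itself. */
static uint32_t toy_next_log_block(struct toy_journal *j)
{
	uint32_t blk = j->j_head++;

	if (j->j_head == j->j_last)
		j->j_head = j->j_first;
	return blk;
}
```

Running this for more blocks than the log holds shows every handed-out block stays in [j_first, j_last), which is why Eric sees s_start == 0 as special: it can only come from the mark-empty path.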

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 19:49                   ` Nix
  2012-10-24 19:54                     ` Nix
@ 2012-10-24 20:30                     ` Eric Sandeen
  2012-10-24 20:34                       ` Nix
  2012-10-24 20:45                     ` Nix
  2012-10-24 21:08                     ` Theodore Ts'o
  3 siblings, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-24 20:30 UTC (permalink / raw)
  To: Nix
  Cc: Theodore Ts'o, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 10/24/2012 02:49 PM, Nix wrote:
> On 24 Oct 2012, Theodore Ts'o spake thusly:
>> Toralf, Nix, if you could try applying this patch (at the end of this
>> message), and let me know how and when the WARN_ON triggers, and if it
>> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
>> I know about the case where a file system is mounted and then
>> immediately unmounted, but we don't think that's the problematic case.
>> If you see any other cases where WARN_ON is triggering, it would be
>> really good to know....
> 
> Confirmed, it triggers. Traceback below.
> 

<giant snip>

The WARN_ON triggers, but I can't tell -- did the corruption still occur
with Ted's patch?

-Eric

> 
> OK. That umount of local filesystems sprayed your added
> empty bug workaround and WARN_ONs so many times that nearly all of them
> scrolled off the screen -- and because syslogd was dead by now and this
> is where my netconsole logs go, they're lost. I suspect every single
> umounted filesystem sprayed one of these (and this happened long before
> any reboot-before-we're-done).
> 
> But I did the old trick of camera-capturing the last one (which was
> probably /boot, which has never got corrupted because I hardly ever
> write anything to it at all). I hope it's more useful than nothing. (I
> can rearrange things to umount /var last, and try again, if you think
> that a specific warning from an fs known to get corrupted is especially
> likely to be valuable.)
> 
> So I see, for one umount at least (and the chunk of the previous one
> that scrolled offscreen is consistent with this):
> 
> jbd2_mark_journal_empty bug workaround (21218, 21219)
> [obscured by light] at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+0x6c/0xbd
> ...
> [addresses omitted for sanity: traceback only]
> warn_slowpath_common+0x83/0x9b
> warn_slowpath_null+0x1a/0x1c
> jbd2_mark_journal_empty+0x6c/0xbd
> jbd2_journal_destroy+0x183/0x20c
> ? abort_exclusive_wait+0x8e/0x8e
> ext4_put_super+0x6c/0x316
> ? evict_inodes+0xe6/0xf1
> generic_shutdown_super+0x59/0xd1
> ? free_vfsmnt+0x18/0x3c
> kill_block_super+0x27/0x6a
> deactivate_locked_super+0x26/0x57
> deactivate_super+0x3f/0x43
> mntput_no_expire+0x134/0x13c
> sys_umount+0x308/0x33a
> system_call_fastpath+0x16/0x1b


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 20:30                     ` Eric Sandeen
@ 2012-10-24 20:34                       ` Nix
  0 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-24 20:34 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Ts'o, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 24 Oct 2012, Eric Sandeen uttered the following:

> On 10/24/2012 02:49 PM, Nix wrote:
>> On 24 Oct 2012, Theodore Ts'o spake thusly:
>>> Toralf, Nix, if you could try applying this patch (at the end of this
>>> message), and let me know how and when the WARN_ON triggers, and if it
>>> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
>>> I know about the case where a file system is mounted and then
>>> immediately unmounted, but we don't think that's the problematic case.
>>> If you see any other cases where WARN_ON is triggering, it would be
>>> really good to know....
>> 
>> Confirmed, it triggers. Traceback below.
>
> <giant snip>
>
> The WARN_ON triggers, but I can't tell -- did the corruption still occur
> with Ted's patch?

Yes. I fscked the filesystems in 3.6.1 after rebooting: /var had a
journal replay, and the usual varieties of corruption (free space bitmap
problems and multiply-claimed blocks). (The other filesystems for which
the warning triggered had neither a journal replay nor corruption.
At least one of them, /home, likely had a few writes but not enough to
cause a journal wrap.)

I note that the warning may well *not* have triggered for /var: if the
reason it had a journal replay was simply that it was still in use by
something that hadn't died, the umount -l will have avoided doing a full
umount for that filesystem alone.

Also, the corrupted filesystem was mounted in 3.6.3 exactly once.
Multiple umounts are not necessary, but an unclean umount apparently is.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 19:49                   ` Nix
  2012-10-24 19:54                     ` Nix
  2012-10-24 20:30                     ` Eric Sandeen
@ 2012-10-24 20:45                     ` Nix
  2012-10-24 21:08                     ` Theodore Ts'o
  3 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-24 20:45 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 24 Oct 2012, nix@esperi.org.uk spake thusly:
> So, the net effect of this is that normally I get no journal recovery on
> anything at all -- but sometimes, if umounting takes longer than a few
> seconds, I reboot with not everything unmounted, and journal recovery
> kicks in on reboot.

It occurs to me that it is possible that this bug hits only those
filesystems for which a umount has started but been unable to complete.
If so, this is a relatively rare and unimportant bug which probably hits
only me and users of slow removable filesystems in the whole world...

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 22:19           ` Theodore Ts'o
                               ` (3 preceding siblings ...)
  2012-10-24 19:13             ` Jannis Achstetter
@ 2012-10-24 21:04             ` Jannis Achstetter
  4 siblings, 0 replies; 83+ messages in thread
From: Jannis Achstetter @ 2012-10-24 21:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

Am 24.10.2012 00:19, schrieb Theodore Ts'o:
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.  After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transactions, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.
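The double-unmount scenario Ted describes above can be sketched as a toy simulation. This is purely illustrative (real jbd2 replay checks sequence numbers and commit blocks): here a "journal" is just a list of (block, value) records and the "disk" a dict, and the point is only to show how re-applying stale log records rolls newer metadata back.

```python
# Toy simulation of the stale-journal-replay theory.  Illustrative
# only: block names and values are made up for the demo.

def replay(disk, journal_records):
    # Journal replay blindly re-applies every logged record in order.
    for block, value in journal_records:
        disk[block] = value


def run_sessions(truncate_on_unmount):
    disk = {}
    journal = []

    # Session 1: write some metadata, journal it, checkpoint to disk.
    for blk in ("bitmap", "inode", "dir"):
        journal.append((blk, "old-" + blk))
        disk[blk] = "old-" + blk
    if truncate_on_unmount:
        journal.clear()          # clean unmount: log truncated

    # Session 2: newer metadata reaches disk, but in the buggy case
    # the stale session-1 records are still claimed by the log.
    for blk in ("inode", "dir"):
        disk[blk] = "new-" + blk

    # Unclean stop before the log is truncated again: the next mount
    # replays whatever the log still says is valid.
    replay(disk, journal)
    return disk
```

With truncation the final state keeps the session-2 values; without it, replay drags "inode" and "dir" back to their stale session-1 contents, which is the "scrambled" outcome in miniature.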

Repost. Sorry, I don't mean to spam, I just don't see my first mail
(sent via gmane.org) anywhere, so ...

As a "normal linux user" I'm interested in the practical things to do
now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
Fearing loss of data:
- Is there a way to see whether the journal of a specific partition has
been wrapped (since mounting) so that umounting and mounting (or doing a
reboot to downgrade the kernel) is safe?
- Is there a way to "force" a journal-wrap? Run any
filesystem-benchmark? Which one with what parameters? Or is it unwise
since I might even further corrupt data if I hit the case already?
- Is it wise to umount now and run e2fsck or might I corrupt my files
just by umounting now if the journal hasn't wrapped yet?
- How do you define "fairly quickly"? Of course servers run 24/7 but I
might be using my PC 2-5 hrs a day... Is that a "reboot too soon after
booting"?
- Any more advice you can give to the ordinary user to avoid
fs-corruption? Don't shut down machines for some days? Better down- or
upgrade the kernel?

Best regards,
	Jannis Achstetter


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 19:49                   ` Nix
                                       ` (2 preceding siblings ...)
  2012-10-24 20:45                     ` Nix
@ 2012-10-24 21:08                     ` Theodore Ts'o
  2012-10-24 23:27                       ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) Nix
  3 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-24 21:08 UTC (permalink / raw)
  To: Nix
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On Wed, Oct 24, 2012 at 09:45:47PM +0100, Nix wrote:
> 
> It occurs to me that it is possible that this bug hits only those
> filesystems for which a umount has started but been unable to complete.
> If so, this is a relatively rare and unimportant bug which probably hits
> only me and users of slow removable filesystems in the whole world...

Can you verify this?  Does the bug show up if you just hit the power
switch while the system is booted?

How about changing the "sleep 2" to "sleep 0.5"?  (Feel free to
unmount your other partitions, and just leave a test file system
mounted to minimize the chances that you lose partitions that require
hours and hours to restore...)

If you can get a very reliable repro, we might have to ask you to try
the following experiments:

0) Make sure the reliable repro does _not_ work with 3.6.1 booted

1) Try a 3.6.2 kernel

2) (If the problem shows up above) try a 3.6.2 kernel with 14b4ed2 reverted

3) (If the problem shows up above) try a 3.6.2 kernel with all of ext4
   related patches reverted:
92b7722 ext4: fix mtime update in nodelalloc mode
34414b2 ext4: fix fdatasync() for files with only i_size changes
12ebdf0 ext4: always set i_op in ext4_mknod()
22a5672 ext4: online defrag is not supported for journaled files
ba57d9e ext4: move_extent code cleanup
2fdb112 ext4: fix crash when accessing /proc/mounts concurrently
1638f1f ext4: fix potential deadlock in ext4_nonda_switch()
5018ddd ext4: avoid duplicate writes of the backup bg descriptor blocks
256ae46 ext4: don't copy non-existent gdt blocks when resizing
416a688 ext4: ignore last group w/o enough space when resizing instead of BUG'ing
14b4ed2 jbd2: don't write superblock when if its empty

4) (If the problem still shows up) then we may need to do a full
   bisect to figure out what is going on....

   	     	     	    	     	- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 19:13             ` Jannis Achstetter
@ 2012-10-24 21:31               ` Theodore Ts'o
  2012-10-24 22:05                 ` Jannis Achstetter
                                   ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-24 21:31 UTC (permalink / raw)
  To: Jannis Achstetter; +Cc: linux-ext4, linux-kernel, stable

On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote:
> 
> As a "normal linux user" I'm interested in the practical things to do
> now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
> Fearing loss of data:
> - Is there a way to see whether the journal of a specific partition has
> been wrapped (since mounting) so that umounting and mounting (or doing a
> reboot to downgrade the kernel) is safe?

My initial analysis of what had been causing the problem now looks
incorrect (or at least incomplete).  Both Eric and I have been unable
to reproduce the failure based on my initial theory of what had been
going on.  So the best information at this point is that it's probably
not related to the file system getting unmounted before the journal
has wrapped.

(Keep in mind this is why commercial software corporations like
Microsoft or Apple generally don't make discussions as they are trying
to root cause a problem public; sometimes the initial theories can be
incorrect, and it's unfortunate when misinformation ends up on
Phoronix or Slashdot, leading people to panic...  but this is open
source, so that means we do everything in the open, since that way we
can all work towards finding the best answer.)

At the *moment* it looks like it might be related to an unclean
shutdown (i.e., a forced reset or power failure while the file system
is mounted or is in the process of being unmounted).  That being said,
a simple kill -9 of kvm running a test kernel while the file system is
mounted but otherwise quiescent doesn't trigger the problem (I was
trying that last night).

It's a little bit too early for this meme:

    http://memegenerator.net/instance/28936247

But do please note that Fedora 17 users have been using 3.6.2 for
a while, so if this were an easily triggered bug, (a) Eric and I would
have managed to reproduce it by now, and (b) lots of people would be
complaining, since the symptoms of the bug are not subtle.

That's not to say we aren't treating this seriously; but people
shouldn't panic unduly.... (and if you are using a critical
enterprise/production server on bleeding edge kernels, may I suggest
that this might not be such a good idea; there is a *reason* why
enterprise Linux distros spend 6-9 months or more just stabilizing the
kernel, and being super paranoid about making changes afterwards for
years, and it's not because they enjoy backporting patches and working
with trailing edge kernel sources.  :-)

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 21:31               ` Theodore Ts'o
@ 2012-10-24 22:05                 ` Jannis Achstetter
  2012-10-24 23:47                 ` Nix
  2012-10-25 17:02                 ` Felipe Contreras
  2 siblings, 0 replies; 83+ messages in thread
From: Jannis Achstetter @ 2012-10-24 22:05 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel, linux-ext4, stable

Am 24.10.2012 23:31, schrieb Theodore Ts'o:
> On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote:
>>
>> As a "normal linux user" I'm interested in the practical things to do
>> now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
>> Fearing loss of data:
>> - Is there a way to see whether the journal of a specific partition has
>> been wrapped (since mounting) so that umounting and mounting (or doing a
>> reboot to downgrade the kernel) is safe?
> [...]
> (Keep in mind this is why commercial software corporations like
> [...]
> can all work towards finding the best answer.)

I really appreciate this and I like it since although the root-cause
hasn't been found for sure yet, it is a transparent process.
And it's a great thing that we can talk directly to the involved devs
without going through 200 layers of marketing and spokesmen (as it would
be with the two companies you mentioned).

> It's a little bit too early for this meme:
>     http://memegenerator.net/instance/28936247

That's a good one :)

> But do please note that Fedora 17 users have been using 3.6.2 for
> [...]
> with trailing edge kernel sources.  :-)

Yes, the downside of running Gentoo unstable. But even the "stable" tree
used 3.5.7 and this is the one my NAS uses where I do store my backups.
Nevertheless, your reply eased my mind to a great extent and I'm
thankful for it.
Time for bed now :)

	Jannis


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount)
  2012-10-24 21:08                     ` Theodore Ts'o
@ 2012-10-24 23:27                       ` Nix
  2012-10-24 23:42                         ` Nix
  2012-10-25  1:10                         ` Theodore Ts'o
  0 siblings, 2 replies; 83+ messages in thread
From: Nix @ 2012-10-24 23:27 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 24 Oct 2012, Theodore Ts'o verbalised:

> On Wed, Oct 24, 2012 at 09:45:47PM +0100, Nix wrote:
>> 
>> It occurs to me that it is possible that this bug hits only those
>> filesystems for which a umount has started but been unable to complete.
>> If so, this is a relatively rare and unimportant bug which probably hits
>> only me and users of slow removable filesystems in the whole world...
>
> Can you verify this?  Does the bug show up if you just hit the power
> switch while the system is booted?

Verified! You do indeed need to do passing strange things to trigger
this bug -- not surprising, really, or everyone and his dog would have
reported it by now. As it is, I'm sorry this hit slashdot, because it
reflects unnecessarily badly on a filesystem that is experiencing
problems only when people do rather insane things to it.

> How about changing the "sleep 2" to "sleep 0.5"?

I tried the following:

 - /sbin/reboot -f of running system
   -> Journal replay, no problems other than the expected free block
      count problems. This is not such a severe problem after all!

 - Normal shutdown, but a 60 second pause after lazy umount, more than
   long enough for all umounts to proceed to termination
   -> no corruption, but curiously /home experienced a journal replay
      before being fscked, even though a cat of /proc/mounts after
      umounting revealed that the only mounted filesystem was /,
      read-only, so /home should have been clean

 - Normal shutdown, a 60 second pause after lazy umount of everything
   other than /var, and then a umount of /var the instant before
   reboot, no sleep at all
   -> massive corruption just as seen before.

Unfortunately, the massive corruption in the last testcase was seen in
3.6.1 as well as 3.6.3: it appears that the only effect that superblock
change had in 3.6.3 was to make this problem easier to hit, and that the
bug itself was introduced probably somewhere between 3.5 and 3.6 (though
I only rebooted 3.5.x twice, and it's rare enough before 3.6.[23], at
~1/20 boots, that it may have been present for longer and I never
noticed).

So the problem is caused by rebooting or powering off or disconnecting
the device *while* umounting a filesystem with a dirty journal, and
might have been introduced by I/O scheduler changes or who knows what
other changes, not just ext4 changes, since the order of block writes by
umount is clearly at issue.

Even though my own system relies on the possibility of rebooting during
umount to reboot reliably, I'd be inclined to say 'not a bug, don't do
that then' -- except that this renders it unreliable to use umount -l to
unmount all the filesystems you can, skipping those that are not
reachable due to having unresponsive servers in the way. As far as I can
tell, there is *no* way to tell when a lazy umount has completed, except
perhaps for polling /proc/mounts: and there is no way at all to tell
when a lazy umount switches from 'waiting for the last process to stop
using me, you can reboot without incident' to 'doing umount, rebooting
is disastrous'. And unfortunately I want to reboot if we're in the
former state, but not in the latter.
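The /proc/mounts polling mentioned above can be sketched as follows. The parsing is factored out so it can be exercised on a captured snapshot; the wait loop itself is an untested sketch, and the timeout and poll interval are illustrative assumptions. As noted, even this cannot distinguish "waiting for the last user" from "mid-umount".

```python
# Sketch: poll /proc/mounts to detect when a lazy umount has actually
# completed -- the mount point disappearing is the only visible sign.
import time


def mounted_points(mounts_text):
    """Return the set of mount points in /proc/mounts-style text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # Field 2 is the mount point; octal escapes (\040 for
            # spaces, etc.) are left undecoded for simplicity.
            points.add(fields[1])
    return points


def wait_for_lazy_umount(mountpoint, timeout=60.0, poll=0.5):
    """Poll /proc/mounts until mountpoint is gone or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with open("/proc/mounts") as f:
            if mountpoint not in mounted_points(f.read()):
                return True
        time.sleep(poll)
    return False
```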

(It also makes it unreliable to use ext4 on devices like USB sticks that
might suddenly get disconnected during a umount.)


Further, it seems to me that this makes it dangerous to ever use umount
-l at all, even during normal system operation, since the real umount
might only start when all processes are killed at system shutdown, and
the reboot could well kick in before the umount has finished.

It also appears impossible for me to reliably shut my system down,
though a 60s timeout after lazy umount and before reboot is likely to
work in all but the most pathological of cases (where a downed NFS
server comes up at just the wrong instant): it is clear that the
previous 5s timeout eventually became insufficient simply because of the
amount of time it can take to do a umount on today's larger filesystems.

Truly, my joy is unbounded :(

> 0) Make sure the reliable repro does _not_ work with 3.6.1 booted

Oh dear. Sorry :(((

I can try to bisect this and track down which kernel release it appeared
in -- if it isn't expected behaviour, of course, which is perfectly
possible: rebooting during a umount is at best questionable. But I can't
do anything that lengthy before the weekend, I'm afraid.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount)
  2012-10-24 23:27                       ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) Nix
@ 2012-10-24 23:42                         ` Nix
  2012-10-25  1:10                         ` Theodore Ts'o
  1 sibling, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-24 23:42 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 25 Oct 2012, nix@esperi.org.uk said:
> Even though my own system relies on the possibility of rebooting during
> umount to reboot reliably, I'd be inclined to say 'not a bug, don't do
> that then' -- except that this renders it unreliable to use umount -l to
> unmount all the filesystems you can, skipping those that are not
> reachable due to having unresponsive servers in the way.

It's worse than that. If you're using filesystem namespaces, how can
*any* shell script loop, or anything in userspace, reliably unmount all
filesystems before reboot? It seems to me this is impossible. There is
no process that necessarily has access to all namespaces, and when you
bring PID namespaces into the picture there is no process that can even
kill all userspace processes in order to zap their filesystems.

I suspect we need a new blocking 'umountall' syscall and a command that
calls it, which umounts everything it can in every filesystem namespace
it can, skipping those that are (unreachable?) network mounts, and
returns only when everything is done. (Possibly it should first kill
every process it sees in every PID namespace other than that of the
caller, too.)

Then shutdown scripts can just call this, and get the right behaviour
immediately.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 21:31               ` Theodore Ts'o
  2012-10-24 22:05                 ` Jannis Achstetter
@ 2012-10-24 23:47                 ` Nix
  2012-10-25 17:02                 ` Felipe Contreras
  2 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-24 23:47 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jannis Achstetter, linux-ext4, linux-kernel, stable

On 24 Oct 2012, Theodore Ts'o uttered the following:
> (Keep in mind this is why commercial software corporations like
> Microsoft or Apple generally don't make discussions as they are trying
> to root cause a problem public; sometimes the initial theories can be
> incorrect, and it's unfortunate when misinformation ends up on
> Phoronix or Slashdot, leading people to panic...  but this is open
> source, so that means we do everything in the open, since that way we
> can all work towards finding the best answer.)

Quite. The first few days of any problem diagnosis are often a process
of taking something from 'oh my god it might be the end of the world' to
'oh look it's really obscure, no wonder nobody has ever seen it before'.

This is quite *definitely* such a problem.

> It's a little bit too early for this meme:
>
>     http://memegenerator.net/instance/28936247

It appears I have taken up a new post as the Iraqi Information Minister.
This is why I was disturbed to see the thing hitting Phoronix and then
Slashdot: as the guy whose FSes are being eaten, this is probably not an
easy bug to hit! If it hits, the consequences are serious, but it
doesn't seem to be easy to hit. (I should perhaps have phrased the
subject line better, but I'd just had my $HOME eaten and was rather
stressed out...)

> But do please note that Fedora 17 users have been using 3.6.2 for
> a while, so if this were an easily triggered bug, (a) Eric and I would
> have managed to reproduce it by now, and (b) lots of people would be
> complaining, since the symptoms of the bug are not subtle.

Quite.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount)
  2012-10-24 23:27                       ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) Nix
  2012-10-24 23:42                         ` Nix
@ 2012-10-25  1:10                         ` Theodore Ts'o
  2012-10-25  1:45                           ` Nix
                                             ` (2 more replies)
  1 sibling, 3 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-25  1:10 UTC (permalink / raw)
  To: Nix
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On Thu, Oct 25, 2012 at 12:27:02AM +0100, Nix wrote:
>
>  - /sbin/reboot -f of running system
>    -> Journal replay, no problems other than the expected free block
>       count problems. This is not such a severe problem after all!
> 
>  - Normal shutdown, but a 60 second pause after lazy umount, more than
>    long enough for all umounts to proceed to termination
>    -> no corruption, but curiously /home experienced a journal replay
>       before being fscked, even though a cat of /proc/mounts after
>       umounting revealed that the only mounted filesystem was /,
>       read-only, so /home should have been clean

Question: how are you doing the journal replay?  Is it happening as
part of running e2fsck, or are you mounting the file system and
letting kernel do the journal replay?

Also, can you reproduce the problem with the nobarrier and
journal_async_commit options *removed*?  Yes, I know you have battery
backup, but it would be interesting to see if the problem shows up in
the default configuration with none of the more specialist options.
(So it would probably be good to test with journal_checksum removed as
well.)

If that does make the problem go away, that will be a very interesting
data point....

> Unfortunately, the massive corruption in the last testcase was seen in
> 3.6.1 as well as 3.6.3: it appears that the only effect that superblock
> change had in 3.6.3 was to make this problem easier to hit, and that the
> bug itself was introduced probably somewhere between 3.5 and 3.6 (though
> I only rebooted 3.5.x twice, and it's rare enough before 3.6.[23], at
> ~1/20 boots, that it may have been present for longer and I never
> noticed).

Hmm.... ok.  Can you tell whether or not the 2nd patch I posted on
this thread made any difference to how frequently it happened?  The
main difference with 3.6.3 with 2nd patch applied compared to 3.6.1 is
that if it detects that the journal superblock update is a no-op, it
skips the write request.  With 3.6.1, it submits the journal
superblock write regardless of whether or not it would be a no-op.  So
if my patch isn't making a difference to the freqency to when you are
seeing the corruption, then it must be the write request itself which
is important.

When you say it's rare before 3.6.[23], how rare is it?  How reliably
can you trigger it under 3.6.1?  One in 3?  One in 5?  One in 20?

As far as bisecting, one experiment that I'd really appreciate your
doing is to check and see whether you can reproduce the problem using
the 3.4 kernel, and if you can, to see if it reproduces under the 3.3
kernel.

The reason why I ask this is there were not any major changes between
3.5 and 3.6, or between 3.4 and 3.5.  There *were* however, some
fairly major changes made by Jan Kara that were introduced between 3.3
and 3.4.  Among other things, this is where we started using FUA
(Force Unit Attention) writes to update the journal superblock instead
of just using REQ_FLUSH.  This is in fact the most likely place where
we might have introduced the regression, since it wouldn't surprise me
if Jan didn't test the case of using nobarrier with a storage array
with battery backup (I certainly didn't, since I don't have easy
access to such fancy toys :-).

> It also appears impossible for me to reliably shut my system down,
> though a 60s timeout after lazy umount and before reboot is likely to
> work in all but the most pathological of cases (where a downed NFS
> server comes up at just the wrong instant): it is clear that the
> previous 5s timeout eventually became insufficient simply because of the
> amount of time it can take to do a umount on today's larger filesystems.

Something that you might want to consider trying is after you kill all
of the processes, remount all of the local disk file systems
read-only, then kick off the unmount of the NFS file systems (just to
be nice to the NFS servers, so they are notified of the unmount), and
then just force the reboot.  If the file systems have been remounted
r/o, that will cause the journal to be shutdown cleanly, and all of
the write flushed out.  (Modulo issues with nobarrier, but that's a
separate issue.  I'm now thinking that a smart thing to do might be
force a flush on an unmount or remount r/o, regardless of whether
nobarrier is specified, just to make sure everything is written out
before the poweroff, battery backup or no.)
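The remount-read-only shutdown ordering suggested above can be sketched as a function that builds the command sequence rather than running it (reboot commands are obviously not something to execute in a demo). The mount lists and exact command spellings are illustrative assumptions, not a tested shutdown script.

```python
# Sketch of a shutdown ordering: remount locals read-only (cleanly
# shutting down their journals), notify NFS servers, then force
# the reboot.  Builds the command list; does not execute anything.

def shutdown_plan(local_mounts, nfs_mounts):
    cmds = []
    # 1. Remount local filesystems read-only: shuts the journal down
    #    cleanly and flushes outstanding writes.
    for m in local_mounts:
        cmds.append("mount -o remount,ro " + m)
    # 2. Kick off NFS unmounts so the servers are notified; lazy so a
    #    dead server cannot hang the shutdown.
    for m in nfs_mounts:
        cmds.append("umount -l " + m)
    # 3. Force the reboot; the local journals are already clean.
    cmds.append("reboot -f")
    return cmds
```

The design point is simply the ordering: once the local filesystems are read-only, a forced reboot can no longer catch a journal mid-truncation.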

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount)
  2012-10-25  1:10                         ` Theodore Ts'o
@ 2012-10-25  1:45                           ` Nix
  2012-10-25 14:12                             ` Theodore Ts'o
  2012-10-25 11:06                           ` Nix
  2012-10-26  0:22                           ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) (possibly blockdev / arcmsr at fault??) Nix
  2 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-25  1:45 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 25 Oct 2012, Theodore Ts'o stated:

> On Thu, Oct 25, 2012 at 12:27:02AM +0100, Nix wrote:
>>
>>  - /sbin/reboot -f of running system
>>    -> Journal replay, no problems other than the expected free block
>>       count problems. This is not such a severe problem after all!
>> 
>>  - Normal shutdown, but a 60 second pause after lazy umount, more than
>>    long enough for all umounts to proceed to termination
>>    -> no corruption, but curiously /home experienced a journal replay
>>       before being fscked, even though a cat of /proc/mounts after
>>       umounting revealed that the only mounted filesystem was /,
>>       read-only, so /home should have been clean
>
> Question: how are you doing the journal replay?  Is it happening as
> part of running e2fsck, or are you mounting the file system and
> letting kernel do the journal replay?

This most recent instance was e2fsck. Normally, it's mount. Both
seem able to yield the same corruption.

> Also, can you reproduce the problem with the nobarrier and
> journal_async_commit options *removed*?  Yes, I know you have battery
> backup, but it would be interesting to see if the problem shows up in
> the default configuration with none of the more specialist options.
> (So it would probably be good to test with journal_checksum removed as
> well.)

I'll try that, hopefully tomorrow sometime. It's 2:30am now and probably
time to sleep.

>> Unfortunately, the massive corruption in the last testcase was seen in
>> 3.6.1 as well as 3.6.3: it appears that the only effect that superblock
>> change had in 3.6.3 was to make this problem easier to hit, and that the
>> bug itself was introduced probably somewhere between 3.5 and 3.6 (though
>> I only rebooted 3.5.x twice, and it's rare enough before 3.6.[23], at
>> ~1/20 boots, that it may have been present for longer and I never
>> noticed).
>
> Hmm.... ok.  Can you tell whether or not the 2nd patch I posted on
> this thread made any difference to how frequently it happened?  The

Well, I had a couple of reboots without corruption with that patch
applied, and /home was only ever corrupted with it not applied -- but
that could perfectly well be chance, since I only had two or three
instances of /home corruption so far, thank goodness.

> When you say it's rare before 3.6.[23], how rare is it?  How reliably
> can you trigger it under 3.6.1?  One in 3?  One in 5?  One in 20?

I've rebooted out of 3.6.1 about fifteen times so far. I've seen one
instance of corruption. I've never seen it before 3.6, but I only
rebooted 3.5.x or 3.4.x once or twice in total, so that too could be
chance.

> As far as bisecting, one experiment that I'd really appreciate your
> doing is to check and see whether you can reproduce the problem using
> the 3.4 kernel, and if you can, to see if it reproduces under the 3.3
> kernel.

Will try. It might be the weekend before I can find the time though :(

> The reason why I ask this is there were not any major changes between
> 3.5 and 3.6, or between 3.4 and 3.5.  There *were* however, some
> fairly major changes made by Jan Kara that were introduced between 3.3
> and 3.4.  Among other things, this is where we started using FUA
> (Force Unit Attention) writes to update the journal superblock instead
> of just using REQ_FLUSH.  This is in fact the most likely place where
> we might have introduced the regression, since it wouldn't surprise me
> if Jan didn't test the case of using nobarrier with a storage array
> with battery backup (I certainly didn't, since I don't have easy
> access to such fancy toys :-).

Hm. At boot, I see this for both volumes on the Areca controller:

[    0.855376] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    0.855465] sd 0:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

So it looks to me like FUA changes themselves could have little effect.

(btw, the controller cost only about £150... if it was particularly
fancy I certainly couldn't have afforded it.)

>> It also appears impossible for me to reliably shut my system down,
>> though a 60s timeout after lazy umount and before reboot is likely to
>> work in all but the most pathological of cases (where a downed NFS
>> server comes up at just the wrong instant): it is clear that the
>> previous 5s timeout eventually became insufficient simply because of the
>> amount of time it can take to do a umount on today's larger filesystems.
>
> Something that you might want to consider trying is after you kill all
> of the processes, remount all of the local disk file systems
> read-only, then kick off the unmount of the NFS file systems (just to
> be nice to the NFS servers, so they are notified of the unmount), and

Actually I umount NFS first of all, because if I kill the processes
first, this causes trouble with the NFS unmounts, particularly if I'm
doing self-mounting (which I do sometimes, though not at the moment).

I will certainly try a readonly remount instead.

> force a flush on an unmount or remount r/o, regardless of whether
> nobarrier is specified, just to make sure everything is written out
> before the poweroff, battery backup or no.)

I'm rather surprised that doesn't happen anyway. I always thought it
did.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount)
  2012-10-25  1:10                         ` Theodore Ts'o
  2012-10-25  1:45                           ` Nix
@ 2012-10-25 11:06                           ` Nix
  2012-10-26  0:22                           ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) (possibly blockdev / arcmsr at fault??) Nix
  2 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-25 11:06 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 25 Oct 2012, Theodore Ts'o stated:
> Also, can you reproduce the problem with the nobarrier and
> journal_async_commit options *removed*?  Yes, I know you have battery
> backup, but it would be interesting to see if the problem shows up in
> the default configuration with none of the more specialist options.
> (So it would probably be good to test with journal_checksum removed as
> well.)

I'm going to spend some time after work today trying to reproduce this
in a virtual machine. If rebooting while umounting is truly all it
takes, as it seems, this should eliminate a bunch of variables and make
it a lot easier to reproduce. (Assuming Eric hasn't already done just
that, that is.)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount)
  2012-10-25  1:45                           ` Nix
@ 2012-10-25 14:12                             ` Theodore Ts'o
  2012-10-25 14:15                               ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-25 14:12 UTC (permalink / raw)
  To: Nix
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

I've been thinking about this some more, and if you don't have a lot
of time, perhaps the most important test to do is this.  Does the
chance of your seeing corrupted files in v3.6.3 go down if you run
3.6.3 with commit 14b4ed22a6 reverted?  Keep your current
configuration, using nobarrier, et. al, constant.  If reverting the
commit makes things better, then that's what would be most useful to
know as soon as possible, since the correct short-term solution is to
revert that commit for 3.7-rcX, as well as the 3.6 and 3.5 stable
kernels.

We can investigate later whether nobarrier, journal_async_commit seem
to make the problem worse, and whether the less common corruption case
that you were seeing with 3.6.1 was actually a change which was
introduced between 3.3 and 3.4.

But most importantly, even if the bug doesn't show up with the default
mount options at all (which explains why Eric and I weren't able to
reproduce it), there are probably other users using nobarrier, so if
the frequency with which you were seeing corruptions went up
significantly between 3.6.1 and 3.6.3, and reverting 14b4ed22a6 brings
the frequency back down to what you were seeing with 3.6.1, we should
do that ASAP.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount)
  2012-10-25 14:12                             ` Theodore Ts'o
@ 2012-10-25 14:15                               ` Nix
  2012-10-25 17:39                                 ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-25 14:15 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 25 Oct 2012, Theodore Ts'o stated:

> I've been thinking about this some more, and if you don't have a lot
> of time,

I've got time, but it's this weekend, not during the week :)

>          perhaps the most important test to do is this.  Does the
> chance of your seeing corrupted files in v3.6.3 go down if you run
> 3.6.3 with commit 14b4ed22a6 reverted?

This I can verify, sometime this evening. (I presume what we're really
interested in is whether the window in which files get corrupted has
narrowed such that my 5s sleep after umount is now long enough to have a
lower likelihood of corruption, since we know that a near-0s sleep after
umount causes corruption almost every time on 3.6.1 as well: I've now
done that three times and got corruption every time.)

> But most importantly, even if the bug doesn't show up with the default
> mount options at all (which explains why Eric and I weren't able to
> reproduce it), there are probably other users using nobarrier, so if
> the frequency with which you were seeing corruptions went up
> significantly between 3.6.1 and 3.6.3, and reverting 14b4ed22a6 brings
> the frequency back down to what you were seeing with 3.6.1, we should
> do that ASAP.

Agreed.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 21:31               ` Theodore Ts'o
  2012-10-24 22:05                 ` Jannis Achstetter
  2012-10-24 23:47                 ` Nix
@ 2012-10-25 17:02                 ` Felipe Contreras
  2 siblings, 0 replies; 83+ messages in thread
From: Felipe Contreras @ 2012-10-25 17:02 UTC (permalink / raw)
  To: Theodore Ts'o, Jannis Achstetter, linux-ext4, linux-kernel, stable

On Wed, Oct 24, 2012 at 11:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:

> But do please note that Fedora 17 users have been using 3.6.2 for
> a while,

Same in Arch Linux.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount)
  2012-10-25 14:15                               ` Nix
@ 2012-10-25 17:39                                 ` Nix
  0 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-25 17:39 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster

On 25 Oct 2012, nix@esperi.org.uk said:

> This I can verify, sometime this evening.

Sometime *tomorrow* evening. This has been quite stressful and I can
hardly keep my eyes open. I'm not doing anything risky in this state.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24  4:15             ` Nix
  2012-10-24  4:27               ` Eric Sandeen
@ 2012-10-26  0:11               ` Ric Wheeler
  2012-10-26  0:43                 ` Theodore Ts'o
  1 sibling, 1 reply; 83+ messages in thread
From: Ric Wheeler @ 2012-10-26  0:11 UTC (permalink / raw)
  To: Eric Sandeen, Ted Ts'o; +Cc: linux-kernel, J. Bruce Fields, Bryan Schumaker

On 10/24/2012 12:15 AM, Nix wrote:
> On 24 Oct 2012, Eric Sandeen uttered the following:
>
>> On 10/23/12 3:57 PM, Nix wrote:
>>> The only unusual thing about the filesystems on this machine are that
>>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>>> 'nobarrier':
>> I should have read more.  :(  More questions follow:
>>
>> * Does the Areca have a battery backed write cache?
> Yes (though I'm not powering off, just rebooting). Battery at 100% and
> happy, though the lack of power-off means it's not actually getting
> used, since the cache is obviously mains-backed as well.

Sending this just to you two to avoid embarrassing myself if I misread the 
thread, but....

Can we reproduce this with any other hardware RAID card? Or with MD?

If we cannot reproduce this in other machines, why assume this is an ext4 issue 
and not a hardware firmware bug?

As an ex-storage guy, this really smells like the hardware raid card might be 
misleading us....

ric




^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) (possibly blockdev / arcmsr at fault??)
  2012-10-25  1:10                         ` Theodore Ts'o
  2012-10-25  1:45                           ` Nix
  2012-10-25 11:06                           ` Nix
@ 2012-10-26  0:22                           ` Nix
  2 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-26  0:22 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh,
	Toralf Förster, nick.cheng

On 25 Oct 2012, Theodore Ts'o told this:
> If that does make the problem go away, that will be a very interesting
> data point....

I'll be looking at this tomorrow, but as sod's law would have it I have
another user on this machine who didn't want it mega-rebooted tonight,
so I was reduced to trying to reproduce the problem in virtualization
under qemu.

I failed, for one very simple reason: on 3.6.3, even with a umount -l
still in the process of unmounting the fs and flushing changes, even on
an fs mounted nobarrier,journal_async_commit, even when mounted atop
LVM, reboot(2) will block until umount's writeout is complete (and lvm
vgchange refuses to deactivate the volume group while that is happening,
but I don't bother deactivating volume groups on the afflicted machine
so I know that can't be related). Obviously, this suffices to ensure
that a reboot is not possible while umounts are underway -- though a
power cut is still possible, I suppose.

On the afflicted machine (with a block device stack running LVM, then
libata/arcmsr), as far as I can tell reboot(8) is *not* blocking if a
unmount is underway: it shoots down everything and restarts at once. I
have no direct proof of this yet, but during the last week I've
routinely seen it reboot with lots of writes underway and umount -l log
messages streaming up the screen: it certainly doesn't wait for all the
umount -l's to be done the way it does in virtualization. I have no idea
how this can be possible: I thought fses on a block device had to be
quiesced (thus, in the case of an ongoing umount, unmounted and flushed)
before any attempt at all was made to shut the underlying block device
down, and I'd be fairly surprised if a flush wasn't done even if
nobarrier was active (it certainly seems to be for virtio-blk, but that
may well be a special case). But arcmsr (or libata? I should test with a
simulated libata rather than virtio-blk next) appears to be getting
around that somehow. This would probably explain all sorts of horrible
corruption if umounting during a reboot, right?

So maybe it's the stack of block devices that's at fault, and not the
filesystem at all! I'll admit I don't really understand what happens at
system halt time well enough to be sure, and getting log info from a
machine in the middle of reboot(8) appears likely to be a complete sod
(maybe halt(8) would be better: at least I could take a photo of the
screen then). If that's true, it would *certainly* explain why nobody
else can see this problem: only arcmsr users who also do umount -l's
would have a chance, and that population probably has a size of one.


I'll try to prove this tomorrow by writing a few gigs of junk to a temp
filesystem held open by a temporary cat /dev/null, umount -l'ing it and
killing off the cat the instant before the reboot -f call. If I don't
see the reboot call blocking, the hypothesis is proved. (This is much
what I did in virtualization, where I observe reboot blocking.)

(Another blockdev-related possibility, if reboot *is* observed to block,
is that arcmsr may be throwing away very-recently-written data when the
adapter is shut down right before reboot.)


Argh. How can rebooting a system be so damn complicated. Bring back the
C64 or BBC Master where I could just pull the power lead out and stick
it back in. :)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26  0:11               ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Ric Wheeler
@ 2012-10-26  0:43                 ` Theodore Ts'o
  2012-10-26 12:12                   ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-26  0:43 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Eric Sandeen, linux-kernel, J. Bruce Fields, Bryan Schumaker

On Thu, Oct 25, 2012 at 08:11:12PM -0400, Ric Wheeler wrote:
> 
> Sending this just to you two to avoid embarrassing myself if I
> misread the thread, but....
> 
> Can we reproduce this with any other hardware RAID card? Or with MD?

There was another user who reported very similar corruption using
3.6.2 using a USB thumb drive.  I can't be certain that it's the same
bug that's being triggered, but the symptoms were identical.

> If we cannot reproduce this in other machines, why assume this is an
> ext4 issue and not a hardware firmware bug?
> 
> As an ex-storage guy, this really smells like the hardware raid card
> might be misleading us....

It's possible.  The main reason why I took this so seriously was
because of the 2nd, apparently confirming report, with very different
hardware.  That was what was so scary to me, at least at first.

	   	    	     	      - Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26  0:43                 ` Theodore Ts'o
@ 2012-10-26 12:12                   ` Nix
  0 siblings, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-26 12:12 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ric Wheeler, Eric Sandeen, linux-kernel, J. Bruce Fields,
	Bryan Schumaker

On 26 Oct 2012, Theodore Ts'o spake thusly:

> On Thu, Oct 25, 2012 at 08:11:12PM -0400, Ric Wheeler wrote:
>> 
>> Sending this just to you two to avoid embarrassing myself if I
>> misread the thread, but....
>> 
>> Can we reproduce this with any other hardware RAID card? Or with MD?
>
> There was another user who reported very similar corruption using
> 3.6.2 using USB thumb drive.  I can't be certain that it's the same
> bug that's being triggered, but the symptoms were identical.

I now suspect it's the same bug, triggered in a different way, but also
by a block-layer problem -- instead of the block device driver not
blocking while the umount finishes (or throwing some of the data umount
writes away, whichever it is, not yet known), the block device goes away
because someone pulled it out of the USB socket. In any case, it appears
that an ext4 umount being interrupted while data is being written does
bad, bad things to the filesystem.

>> If we cannot reproduce this in other machines, why assume this is an
>> ext4 issue and not a hardware firmware bug?

A tad unlikely. Why would a firmware bug show up only at the instant of
reboot? Why would it show up as a lack of blocking on the kernel side? I
assure you that if you write lots of data to this controller normally,
you will end up blocking :) I can completely believe that it's an arcmsr
driver bug though. If it was an ext4 bug, it would surely be
reproducible in virtualization, or on different hardware, or something
like that.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-24 20:17               ` Jan Kara
@ 2012-10-26 15:25                 ` Eric Sandeen
  0 siblings, 0 replies; 83+ messages in thread
From: Eric Sandeen @ 2012-10-26 15:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Nix, linux-ext4, linux-kernel,
	J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust,
	gregkh, Toralf Förster, stable

On 10/24/12 3:17 PM, Jan Kara wrote:
> On Tue 23-10-12 19:57:09, Eric Sandeen wrote:
>> On 10/23/12 5:19 PM, Theodore Ts'o wrote:
>>> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
>>>>
>>>> It is now quite clear that this is a bug introduced by one or more of
>>>> the post-3.6.1 ext4 patches (which have all been backported at least to
>>>> 3.5, so the problem is probably there too).
>>>>
>>>> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
>>>> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
>>>>
>>>
>>> I think I've found the problem.  I believe the commit at fault is commit
>>> 14b4ed22a6 (upstream commit eeecef0af5e):
>>>
>>>     jbd2: don't write superblock when if its empty
>>>
>>> which first appeared in v3.6.2.
>>>
>>> The reason why the problem happens rarely is that the effect of the
>>> buggy commit is that if the journal's starting block is zero, we fail
>>> to truncate the journal when we unmount the file system.  This can
>>> happen if we mount and then unmount the file system fairly quickly,
>>> before the log has a chance to wrap.  After the first time this has
>>> happened, it's not a disaster, since when we replay the journal, we'll
>>> just replay some extra transactions.  But if this happens twice, the
>>> oldest valid transaction will still not have gotten updated, but some
>>> of the newer transactions from the last mount session will have gotten
>>> written by the very latest transactions, and when we then try to do
>>> the extra transaction replays, the metadata blocks can end up getting
>>> very scrambled indeed.
>>
>> I'm stumped by this; maybe Ted can see if I'm missing something.
>>
>> (and Nix, is there anything special about your fs?  Any nondefault
>> mkfs or mount options, external journal, inordinately large fs, or
>> anything like that?)
>>
>> The suspect commit added this in jbd2_mark_journal_empty():
>>
>>         /* Is it already empty? */
>>         if (sb->s_start == 0) {
>>                 read_unlock(&journal->j_state_lock);
>>                 return;
>>         }
>>
>> thereby short circuiting the function.
>>
>> But Ted's suggestion that mounting the fs, doing a little work, and
>> unmounting before we wrap would lead to this doesn't make sense to
>> me.  When I do a little work, s_start is at 1, not 0.  We start
>> the journal at s_first:
>>
>> load_superblock()
>> 	journal->j_first = be32_to_cpu(sb->s_first);
>>
>> And when we wrap the journal, we wrap back to j_first:
>>
>> jbd2_journal_next_log_block():
>>         if (journal->j_head == journal->j_last)
>>                 journal->j_head = journal->j_first;
>>
>> and j_first comes from s_first, which is set at journal creation
>> time to be "1" for an internal journal.
>>
>> So s_start == 0 sure looks special to me; so far I can only see that
>> we get there if we've been through jbd2_mark_journal_empty() already,
>> though I'm eyeballing jbd2_journal_get_log_tail() as well.
>>
>> Ted's proposed patch seems harmless but so far I don't understand
>> what problem it fixes, and I cannot recreate getting to
>> jbd2_mark_journal_empty() with a dirty log and s_start == 0.
>   Agreed. I rather think we might miss journal->j_flags |= JBD2_FLUSHED
> when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly
> see how that would cause the corruption...

Agreed, except so far I cannot see any way to get here with s_start == 0
without ALREADY having JBD2_FLUSHED set.  Can you?

Anyway, I think the problem is still poorly understood; lots of random facts
floating about, and a pretty weird usecase with nonstandard/dangerous mount
options.  I do want to figure out what regressed (if anything) but so far
this investigation doesn't seem very methodical.

-Eric

> 								Honza
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
  2012-10-23 22:19           ` Theodore Ts'o
  2012-10-24  1:13           ` Eric Sandeen
@ 2012-10-26 20:35           ` Eric Sandeen
  2012-10-26 20:37             ` Nix
  2012-10-28  4:23           ` [PATCH] ext4: fix unjournaled inode bitmap modification Eric Sandeen
  3 siblings, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-26 20:35 UTC (permalink / raw)
  To: Nix
  Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On 10/23/12 3:57 PM, Nix wrote:
> [Bruce, Trond, I fear it may be hard for me to continue chasing this NFS
>  lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like
>  this. Apologies.]

<big snip>

> The only unusual thing about the filesystems on this machine are that
> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
> 'nobarrier': the full set of options for all my ext4 filesystems are:
> 
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

Out of curiosity, when I test log replay with the journal_checksum option, I
almost always get something like:

[  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
[  999.923904] EXT4-fs (dm-1): error loading journal

after a simulated crash & log replay.

Do you see anything like that in your logs?

<big snip>

Thanks,
-Eric


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26 20:35           ` Eric Sandeen
@ 2012-10-26 20:37             ` Nix
  2012-10-26 20:56               ` Theodore Ts'o
  0 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-26 20:37 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On 26 Oct 2012, Eric Sandeen outgrape:

> On 10/23/12 3:57 PM, Nix wrote:
>> The only unusual thing about the filesystems on this machine are that
>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>> 'nobarrier': the full set of options for all my ext4 filesystems are:
>> 
>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>
> Out of curiosity, when I test log replay with the journal_checksum option, I
> almost always get something like:
>
> [  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
> [  999.923904] EXT4-fs (dm-1): error loading journal
>
> after a simulated crash & log replay.
>
> Do you see anything like that in your logs?

I'm not seeing any corrupt journals or abort messages at all. The
journal claims to be fine, but plainly isn't.

I can reproduce this on a small filesystem and stick the image somewhere
if that would be of any use to anyone. (If I'm very lucky, merely making
this offer will make the problem go away. :} )

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26 20:37             ` Nix
@ 2012-10-26 20:56               ` Theodore Ts'o
  2012-10-26 20:59                 ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-26 20:56 UTC (permalink / raw)
  To: Nix
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote:
> 
> I can reproduce this on a small filesystem and stick the image somewhere
> if that would be of any use to anyone. (If I'm very lucky, merely making
> this offer will make the problem go away. :} )

I'm not sure the image is going to be that useful.  What we really
need to do is to get a reliable reproduction of what _you_ are seeing.

It's clear from Eric's experiments that journal_checksum is dangerous.
In fact, I will likely put it under an #ifdef EXT4_EXPERIMENTAL to try
to discourage people from using it in the future.  There are things
I've been planning on doing to make it be safer, but there's a very
good *reason* that both journal_checksum and journal_async_commit are
not on by default.

That's why one of the things I asked you to do when you had time was
to see if you could reproduce the problem you are seeing w/o
nobarrier,journal_checksum,journal_async_commit.

The other experiment that would be really useful if you could do is to
try to apply these two patches which I sent earlier this week:

[PATCH 1/2] ext4: revert "jbd2: don't write superblock when if its empty
[PATCH 2/2] ext4: fix I/O error when unmounting an ro file system

... and see if they make a difference.

If they don't make a difference, I don't want to apply patches just
for placebo/PR reasons.  And for Eric at least, he can reproduce the
journal checksum error followed by fairly significant corruption
reported by e2fsck with journal_checksum, and the presence or absence
of these patches make no difference for him.  So I really don't want
to push these patches to Linus until I get confirmation that they make
a difference to *somebody*.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26 20:56               ` Theodore Ts'o
@ 2012-10-26 20:59                 ` Nix
  2012-10-26 21:15                   ` Theodore Ts'o
  0 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-26 20:59 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On 26 Oct 2012, Theodore Ts'o stated:

> On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote:
>> 
>> I can reproduce this on a small filesystem and stick the image somewhere
>> if that would be of any use to anyone. (If I'm very lucky, merely making
>> this offer will make the problem go away. :} )
>
> I'm not sure the image is going to be that useful.  What we really
> need to do is to get a reliable reproduction of what _you_ are seeing.
>
> It's clear from Eric's experiments that journal_checksum is dangerous.
> 
> That's why one of the things I asked you to do when you had time was
> to see if you could reproduce the problem you are seeing w/o
> nobarrier,journal_checksum,journal_async_commit.

OK. Will do tomorrow.

> The other experiment that would be really useful if you could do is to
> try to apply these two patches which I sent earlier this week:
>
> [PATCH 1/2] ext4: revert "jbd2: don't write superblock when if its empty
> [PATCH 2/2] ext4: fix I/O error when unmounting an ro file system
>
> ... and see if they make a difference.

As of tomorrow I'll be able to reboot without causing a riot: I'll test
it then. (Sorry for the delay :( )

>                                               So I really don't want
> to push these patches to Linus until I get confirmation that they make
> a difference to *somebody*.

Agreed.

This isn't the first time that journal_checksum has proven problematic.
It's a shame that we're stuck between two error-inducing stools here...

-- 
NULL && (void)


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26 20:59                 ` Nix
@ 2012-10-26 21:15                   ` Theodore Ts'o
  2012-10-26 21:19                     ` Nix
  2012-10-27  3:11                     ` Jim Rees
  0 siblings, 2 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-26 21:15 UTC (permalink / raw)
  To: Nix
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

> This isn't the first time that journal_checksum has proven problematic.
> It's a shame that we're stuck between two error-inducing stools here...

The problem is that it currently bails out by aborting the entire
journal replay, and the file system will get left in a mess when it
does that.  It's actually safer today to just be blissfully ignorant
of a corrupted block in the journal, than to have the journal getting
aborted mid-replay when we detect a corrupted commit.

The plan is that eventually, we will have checksums on a
per-journalled block basis, instead of a per-commit basis, and when we
get a failed checksum, we skip the replay of that block, but we keep
going and replay all of the other blocks and commits.  We'll then set
the "file system corrupted" bit and force an e2fsck check.
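(A toy model of that planned replay behaviour, for anyone following
along -- nothing here is real jbd2 layout or naming; the struct, the
checksum, and the function are all invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical per-block journal record; the real jbd2 on-disk
 * format is different. */
struct jrec {
	uint32_t csum;	/* checksum stored when the block was journalled */
	uint32_t data;	/* stand-in for the journalled block contents    */
};

/* Toy checksum standing in for crc32c. */
static uint32_t toy_csum(uint32_t data)
{
	return data * 2654435761u;	/* Knuth multiplicative hash */
}

/* Replay every record whose checksum verifies; skip (but note) bad
 * ones instead of aborting mid-replay, and report whether the
 * "fs corrupted" bit should be set to force an e2fsck check.
 * Returns the number of blocks actually replayed. */
static int replay(const struct jrec *j, size_t n,
		  uint32_t *fs, int *need_fsck)
{
	size_t i, replayed = 0;

	*need_fsck = 0;
	for (i = 0; i < n; i++) {
		if (toy_csum(j[i].data) != j[i].csum) {
			*need_fsck = 1;	/* keep going, don't abort */
			continue;
		}
		fs[i] = j[i].data;	/* apply this block */
		replayed++;
	}
	return (int)replayed;
}
```

i.e. one bad commit no longer throws away everything after it; you
replay what you can and hand the rest to e2fsck.)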

The problem is this code isn't done yet, and journal_checksum is
really not ready for prime time.  When it is ready, my plan is to wire
it up so it is enabled by default; at the moment, it was intended for
developer experimentation only.  As I said, it's my fault for not
clearly labelling it "Not for you!", or putting it under an #ifdef to
prevent unwary civilians from coming across the feature and saying,
"oooh, shiny!" and turning it on.  :-(

					- Ted


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26 21:15                   ` Theodore Ts'o
@ 2012-10-26 21:19                     ` Nix
  2012-10-27  0:22                       ` Theodore Ts'o
  2012-10-27  3:11                     ` Jim Rees
  1 sibling, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-26 21:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On 26 Oct 2012, Theodore Ts'o uttered the following:

> The plan is that eventually, we will have checksums on a
> per-journalled block basis, instead of a per-commit basis, and when we
> get a failed checksum, we skip the replay of that block,

But not of everything it implies, since that's quite tricky to track
down (it's basically the same work needed for softupdates, but in
reverse). Hence the e2fsck check, I suppose.

> prevent unwary civilians from coming across the feature and saying,
> "oooh, shiny!" and turning it on.  :-(

Or having it turned on by default either, which seems to be the case
now.

-- 
NULL && (void)


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26 21:19                     ` Nix
@ 2012-10-27  0:22                       ` Theodore Ts'o
  2012-10-27 12:45                         ` Nix
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-27  0:22 UTC (permalink / raw)
  To: Nix
  Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields,
	Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs

On Fri, Oct 26, 2012 at 10:19:21PM +0100, Nix wrote:
> > prevent unwary civilians from coming across the feature and saying,
> > "oooh, shiny!" and turning it on.  :-(
> 
> Or having it turned on by default either, which seems to be the case
> now.

Huh?  It's not turned on by default.  If you mount with no mount
options, journal checksums are *not* turned on.

	 	 	       	     - Ted


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-26 21:15                   ` Theodore Ts'o
  2012-10-26 21:19                     ` Nix
@ 2012-10-27  3:11                     ` Jim Rees
  1 sibling, 0 replies; 83+ messages in thread
From: Jim Rees @ 2012-10-27  3:11 UTC (permalink / raw)
  To: Theodore Ts'o, Nix, Eric Sandeen, linux-ext4, linux-kernel,
	J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust,
	gregkh, linux-nfs

Theodore Ts'o wrote:

  The problem is this code isn't done yet, and journal_checksum is
  really not ready for prime time.  When it is ready, my plan is to wire
  it up so it is enabled by default; at the moment, it was intended for
  developer experimentation only.  As I said, it's my fault for not
  clearly labelling it "Not for you!", or putting it under an #ifdef to
  prevent unwary civilians from coming across the feature and saying,
  "oooh, shiny!" and turning it on.  :-(

Perhaps a word or two in the mount man page would be appropriate?


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-27  0:22                       ` Theodore Ts'o
@ 2012-10-27 12:45                         ` Nix
  2012-10-27 17:55                           ` Theodore Ts'o
  2012-10-27 18:30                           ` Eric Sandeen
  0 siblings, 2 replies; 83+ messages in thread
From: Nix @ 2012-10-27 12:45 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Eric Sandeen, linux-ext4, linux-kernel, gregkh

[nfs people purged from Cc]

On 27 Oct 2012, Theodore Ts'o verbalised:

> Huh?  It's not turned on by default.  If you mount with no mount
> options, journal checksums are *not* turned on.

?! it's turned on for me, and though I use weird mount options I don't
use that one:

/dev/main/var   /var        ext4        defaults,nobarrier,usrquota,grpquota,nosuid,nodev,relatime,journal_async_commit,commit=30,user_xattr,acl 1  2
Default mount options:    (none)
/dev/mapper/main-var /var ext4 rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota 0 0

...

Ah! it's turned on by journal_async_commit. OK, that alone argues
against use of journal_async_commit, tested or not, and I'd not have
turned it on if I'd noticed that.

(So, the combinations I'll be trying for effect on this bug are:

 journal_async_commit (as now)
 journal_checksum
 none

Technically to investigate all possibilities we should try
journal_async_commit,no_journal_checksum, but this seems so unlikely to
have ever been tested by anyone that it's not worth looking into...)

-- 
NULL && (void)


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-27 12:45                         ` Nix
@ 2012-10-27 17:55                           ` Theodore Ts'o
  2012-10-27 18:47                             ` Nix
  2012-10-27 18:30                           ` Eric Sandeen
  1 sibling, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-27 17:55 UTC (permalink / raw)
  To: Nix; +Cc: Eric Sandeen, linux-ext4, linux-kernel, gregkh

On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
> Ah! it's turned on by journal_async_commit. OK, that alone argues
> against use of journal_async_commit, tested or not, and I'd not have
> turned it on if I'd noticed that.
> 
> (So, the combinations I'll be trying for effect on this bug are:
> 
>  journal_async_commit (as now)
>  journal_checksum
>  none

Can you also check and see whether the presence or absence of
"nobarrier" makes a difference?

Thanks,

						- Ted


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-27 12:45                         ` Nix
  2012-10-27 17:55                           ` Theodore Ts'o
@ 2012-10-27 18:30                           ` Eric Sandeen
  1 sibling, 0 replies; 83+ messages in thread
From: Eric Sandeen @ 2012-10-27 18:30 UTC (permalink / raw)
  To: Nix; +Cc: Theodore Ts'o, linux-ext4, linux-kernel, gregkh

On 10/27/12 7:45 AM, Nix wrote:
> [nfs people purged from Cc]
> 
> On 27 Oct 2012, Theodore Ts'o verbalised:
> 
>> Huh?  It's not turned on by default.  If you mount with no mount
>> options, journal checksums are *not* turned on.
> 
> ?! it's turned on for me, and though I use weird mount options I don't
> use that one:

journal_async_commit implies journal_checksum:

        {Opt_journal_async_commit, (EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
                                    EXT4_MOUNT_JOURNAL_CHECKSUM), MOPT_SET},

journal_checksum seems to have broken, at least for me, between 3.3 and
3.4.  I think I've narrowed down the commit but am not sure yet what the
flaw is; will investigate and report back later.  Busy Saturday.

So anyway, turning on journal_async_commit (notionally unsafe) enables journal_checksum (apparently broken).
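(Sketched out, with invented flag values -- the real EXT4_MOUNT_* masks
live in fs/ext4/ext4.h and differ from these -- the table entry quoted
above amounts to:

```c
/* Invented values for illustration only; not the real masks. */
#define MOUNT_JOURNAL_CHECKSUM      0x01u
#define MOUNT_JOURNAL_ASYNC_COMMIT  0x02u

/* Model of the MOPT_SET entry above: selecting journal_async_commit
 * ORs in *both* bits, which is why journal_checksum shows up in
 * /proc/mounts without ever being asked for. */
static unsigned apply_async_commit(unsigned mount_flags)
{
	return mount_flags | (MOUNT_JOURNAL_ASYNC_COMMIT |
			      MOUNT_JOURNAL_CHECKSUM);
}
```

which is the implication Nix tripped over.)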

-Eric

> /dev/main/var   /var        ext4        defaults,nobarrier,usrquota,grpquota,nosuid,nodev,relatime,journal_async_commit,commit=30,user_xattr,acl 1  2
> Default mount options:    (none)
> /dev/mapper/main-var /var ext4 rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota 0 0
> 
> ...
> 
> Ah! it's turned on by journal_async_commit. OK, that alone argues
> against use of journal_async_commit, tested or not, and I'd not have
> turned it on if I'd noticed that.
> 
> (So, the combinations I'll be trying for effect on this bug are:
> 
>  journal_async_commit (as now)
>  journal_checksum
>  none
> 
> Technically to investigate all possibilities we should try
> journal_async_commit,no_journal_checksum, but this seems so unlikely to
> have ever been tested by anyone that it's not worth looking into...)
> 



* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-27 17:55                           ` Theodore Ts'o
@ 2012-10-27 18:47                             ` Nix
  2012-10-27 21:19                               ` Eric Sandeen
  0 siblings, 1 reply; 83+ messages in thread
From: Nix @ 2012-10-27 18:47 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Eric Sandeen, linux-ext4, linux-kernel, gregkh

On 27 Oct 2012, Theodore Ts'o said:

> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>> against use of journal_async_commit, tested or not, and I'd not have
>> turned it on if I'd noticed that.
>> 
>> (So, the combinations I'll be trying for effect on this bug are:
>> 
>>  journal_async_commit (as now)
>>  journal_checksum
>>  none
>
> Can you also check and see whether the presence or absence of
> "nobarrier" makes a difference?

Done. (Also checked the effect of your patches posted earlier this week:
no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
I was carrying out, umount -l'ing /var as the very last thing I did
before /sbin/reboot -f.)

nobarrier makes a difference that I, at least, did not expect:

[no options]                    No corruption

nobarrier                       No corruption

          journal_checksum      Corruption
                                Corrupted transaction, journal aborted
                                
nobarrier,journal_checksum      Corruption
                                Corrupted transaction, journal aborted

          journal_async_commit  Corruption
                                Corrupted transaction, journal aborted

nobarrier,journal_async_commit  Corruption
                                No corrupted transaction or aborted journal

I didn't expect the last case at all, and it adequately explains why you
are mostly seeing corrupted journal messages in your tests but I was
not. It also explains why when I saw this for the first time I was able
to mount the resulting corrupted filesystem read-write and corrupt it
further before I noticed that anything was wrong.

It is also clear that journal_checksum and all that relies on it is
worse than useless right now, as Eric reported while I was testing this.
It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
kernels, if CONFIG_BROKEN existed anymore, which it doesn't.

It's a shame journal_async_commit depends on a broken feature: it might
be notionally unsafe but on some of my systems (without nobarrier or
flashy caching controllers) it was associated with a noticeable speedup
of metadata-heavy workloads -- though that was way back in 2009...
however, "safety first" definitely applies in this case.

-- 
NULL && (void)


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-27 18:47                             ` Nix
@ 2012-10-27 21:19                               ` Eric Sandeen
  2012-10-27 22:42                                 ` Eric Sandeen
  0 siblings, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-27 21:19 UTC (permalink / raw)
  To: Nix; +Cc: Theodore Ts'o, linux-ext4, linux-kernel, gregkh

On 10/27/12 1:47 PM, Nix wrote:
> On 27 Oct 2012, Theodore Ts'o said:
> 
>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>>> against use of journal_async_commit, tested or not, and I'd not have
>>> turned it on if I'd noticed that.
>>>
>>> (So, the combinations I'll be trying for effect on this bug are:
>>>
>>>  journal_async_commit (as now)
>>>  journal_checksum
>>>  none
>>
>> Can you also check and see whether the presence or absence of
>> "nobarrier" makes a difference?
> 
> Done. (Also checked the effect of your patches posted earlier this week:
> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
> I was carrying out, umount -l'ing /var as the very last thing I did
> before /sbin/reboot -f.)
> 
> nobarrier makes a difference that I, at least, did not expect:
> 
> [no options]                    No corruption
> 
> nobarrier                       No corruption
> 
>           journal_checksum      Corruption
>                                 Corrupted transaction, journal aborted
>                                 
> nobarrier,journal_checksum      Corruption
>                                 Corrupted transaction, journal aborted
> 
>           journal_async_commit  Corruption
>                                 Corrupted transaction, journal aborted
> 
> nobarrier,journal_async_commit  Corruption
>                                 No corrupted transaction or aborted journal

That's what we needed.  Woulda been great a few days ago ;)

In my testing journal_checksum is broken, and my bisection seems to
implicate

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Feb 6 20:12:03 2012 -0500

    ext4: fold ext4_claim_inode into ext4_new_inode
    
as the culprit.  I haven't had time to look into why, yet.

-Eric

> I didn't expect the last case at all, and it adequately explains why you
> are mostly seeing corrupted journal messages in your tests but I was
> not. It also explains why when I saw this for the first time I was able
> to mount the resulting corrupted filesystem read-write and corrupt it
> further before I noticed that anything was wrong.
> 
> It is also clear that journal_checksum and all that relies on it is
> worse than useless right now, as Eric reported while I was testing this.
> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
> kernels, if CONFIG_BROKEN existed anymore, which it doesn't.
> 
> It's a shame journal_async_commit depends on a broken feature: it might
> be notionally unsafe but on some of my systems (without nobarrier or
> flashy caching controllers) it was associated with a noticeable speedup
> of metadata-heavy workloads -- though that was way back in 2009...
> however, "safety first" definitely applies in this case.
> 



* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-27 21:19                               ` Eric Sandeen
@ 2012-10-27 22:42                                 ` Eric Sandeen
  2012-10-29  1:00                                   ` Theodore Ts'o
  0 siblings, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-27 22:42 UTC (permalink / raw)
  To: Nix; +Cc: Theodore Ts'o, linux-ext4, linux-kernel, gregkh

On 10/27/12 4:19 PM, Eric Sandeen wrote:
> On 10/27/12 1:47 PM, Nix wrote:
>> On 27 Oct 2012, Theodore Ts'o said:
>>
>>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>>>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>>>> against use of journal_async_commit, tested or not, and I'd not have
>>>> turned it on if I'd noticed that.
>>>>
>>>> (So, the combinations I'll be trying for effect on this bug are:
>>>>
>>>>  journal_async_commit (as now)
>>>>  journal_checksum
>>>>  none
>>>
>>> Can you also check and see whether the presence or absence of
>>> "nobarrier" makes a difference?
>>
>> Done. (Also checked the effect of your patches posted earlier this week:
>> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
>> I was carrying out, umount -l'ing /var as the very last thing I did
>> before /sbin/reboot -f.)
>>
>> nobarrier makes a difference that I, at least, did not expect:
>>
>> [no options]                    No corruption
>>
>> nobarrier                       No corruption
>>
>>           journal_checksum      Corruption
>>                                 Corrupted transaction, journal aborted
>>                                 
>> nobarrier,journal_checksum      Corruption
>>                                 Corrupted transaction, journal aborted
>>
>>           journal_async_commit  Corruption
>>                                 Corrupted transaction, journal aborted
>>
>> nobarrier,journal_async_commit  Corruption
>>                                 No corrupted transaction or aborted journal
> 
> That's what we needed.  Woulda been great a few days ago ;)
> 
> In my testing journal_checksum is broken, and my bisection seems to
> implicate
> 
> commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Mon Feb 6 20:12:03 2012 -0500
> 
>     ext4: fold ext4_claim_inode into ext4_new_inode
>     
> as the culprit.  I haven't had time to look into why, yet.

It looks like the inode_bitmap_bh is being modified outside a transaction:

                ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);

It needs a get_write_access / handle_dirty_metadata pair around it.

A hacky patch like this seems to work but it was done 5mins before I have
to be out the door to dinner so it probably needs cleanup or at least
scrutiny.

[PATCH] ext4: fix unjournaled inode bitmap modification

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
such that the inode bitmap was being modified outside a transaction,
which could lead to corruption, and was discovered when journal_checksum
found a bad checksum in the journal.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

--- ialloc.c.reverted2	2012-10-27 17:31:20.351537073 -0500
+++ ialloc.c	2012-10-27 17:40:18.643553576 -0500
@@ -669,6 +669,10 @@
 		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
 		if (!inode_bitmap_bh)
 			goto fail;
+		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+		if (err)
+			goto fail;
 
 repeat_in_this_group:
 		ino = ext4_find_next_zero_bit((unsigned long *)
@@ -690,6 +694,10 @@
 		ino++;		/* the inode bitmap is zero-based */
 		if (!ret2)
 			goto got; /* we grabbed the inode! */
+		BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+		err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+		if (err)
+			goto fail;
 		if (ino < EXT4_INODES_PER_GROUP(sb))
 			goto repeat_in_this_group;
 	}




> -Eric
> 
>> I didn't expect the last case at all, and it adequately explains why you
>> are mostly seeing corrupted journal messages in your tests but I was
>> not. It also explains why when I saw this for the first time I was able
>> to mount the resulting corrupted filesystem read-write and corrupt it
>> further before I noticed that anything was wrong.
>>
>> It is also clear that journal_checksum and all that relies on it is
>> worse than useless right now, as Eric reported while I was testing this.
>> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
>> kernels, if CONFIG_BROKEN existed anymore, which it doesn't.
>>
>> It's a shame journal_async_commit depends on a broken feature: it might
>> be notionally unsafe but on some of my systems (without nobarrier or
>> flashy caching controllers) it was associated with a noticeable speedup
>> of metadata-heavy workloads -- though that was way back in 2009...
>> however, "safety first" definitely applies in this case.
>>
> 



* [PATCH] ext4: fix unjournaled inode bitmap modification
  2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
                             ` (2 preceding siblings ...)
  2012-10-26 20:35           ` Eric Sandeen
@ 2012-10-28  4:23           ` Eric Sandeen
  2012-10-28 13:59             ` Nix
  2012-10-29  2:30             ` [PATCH -v3] " Theodore Ts'o
  3 siblings, 2 replies; 83+ messages in thread
From: Eric Sandeen @ 2012-10-28  4:23 UTC (permalink / raw)
  To: Nix; +Cc: Ted Ts'o, linux-ext4, linux-kernel, gregkh

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed
ext4_new_inode() such that the inode bitmap was being modified
outside a transaction, which could lead to corruption, and was
discovered when journal_checksum found a bad checksum in the
journal during log replay.

Nix ran into this when using the journal_async_commit mount
option, which enables journal checksumming.  The ensuing
journal replay failures due to the bad checksums led to
filesystem corruption reported as the now infamous
"Apparent serious progressive ext4 data corruption bug"

I've tested this by mounting with journal_checksum and
running fsstress then dropping power; I've also tested by
hacking DM to create snapshots w/o first quiescing, which
allows me to test journal replay repeatedly w/o actually
power-cycling the box.  Without the patch I hit a journal
checksum error every time.  With this fix it survives
many iterations.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
cc: Nix <nix@esperi.org.uk>
---

A little more going on here to try to properly handle error
cases & moving to the next group; despite
ext4_handle_release_buffer being a no-op, I've tried
to sprinkle it in at the right places.  Double checking
on review is probably a fine idea ;)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4facdd2..1d18fba 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -706,10 +706,17 @@ got_group:
 			continue;
 		}
 
-		brelse(inode_bitmap_bh);
+		if (inode_bitmap_bh) {
+			ext4_handle_release_buffer(handle, inode_bitmap_bh);
+			brelse(inode_bitmap_bh);
+		}
 		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
 		if (!inode_bitmap_bh)
 			goto fail;
+		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+		if (err)
+			goto fail;
 
 repeat_in_this_group:
 		ino = ext4_find_next_zero_bit((unsigned long *)
@@ -734,10 +741,16 @@ repeat_in_this_group:
 		if (ino < EXT4_INODES_PER_GROUP(sb))
 			goto repeat_in_this_group;
 	}
+	ext4_handle_release_buffer(handle, inode_bitmap_bh);
 	err = -ENOSPC;
 	goto out;
 
 got:
+	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+	if (err)
+		goto fail;
+
 	/* We may have to initialize the block bitmap if it isn't already */
 	if (ext4_has_group_desc_csum(sb) &&
 	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
@@ -771,11 +784,6 @@ got:
 			goto fail;
 	}
 
-	BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
-	err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
-	if (err)
-		goto fail;
-
 	BUFFER_TRACE(group_desc_bh, "get_write_access");
 	err = ext4_journal_get_write_access(handle, group_desc_bh);
 	if (err)
@@ -823,11 +831,6 @@ got:
 	}
 	ext4_unlock_group(sb, group);
 
-	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
-	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
-	if (err)
-		goto fail;
-
 	BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
 	err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);
 	if (err)



* Re: [PATCH] ext4: fix unjournaled inode bitmap modification
  2012-10-28  4:23           ` [PATCH] ext4: fix unjournaled inode bitmap modification Eric Sandeen
@ 2012-10-28 13:59             ` Nix
  2012-10-29  2:30             ` [PATCH -v3] " Theodore Ts'o
  1 sibling, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-28 13:59 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ted Ts'o, linux-ext4, linux-kernel, gregkh

On 28 Oct 2012, Eric Sandeen outgrape:

> I've tested this by mounting with journal_checksum and
> running fsstress then dropping power; I've also tested by
> hacking DM to create snapshots w/o first quiescing, which
> allows me to test journal replay repeatedly w/o actually
> power-cycling the box.  Without the patch I hit a journal
> checksum error every time.  With this fix it survives
> many iterations.

Works for me too. Looks like you fixed it.


btw, I dug back through my old notes and found the reason I turned on
journal_async_commit in the first place, back in 2009. It's got nothing
to do with performance: I was trying to save power.

The server that's been having all this trouble has four WD GreenPower
disks. These are rightly maligned for leaving the way they save power
undocumented, changing it without notice, and having most of those
power-saving methods be extremely silly (e.g. complete spindowns, with a
nonresponsive disk for N seconds until it spins up again). But I was
lucky and got disks that actually did save power without being damaged.

By observing the machine's power draw I was able to guess that they spin
partway down (rumour says to ~2500rpm) after about eight seconds with no
activity at all, following which it takes a fairly large burst of
activity to get them to spin up again: they can service low loads
without spinning up.

Unfortunately back in 2009 ext4's journalling was preventing them from
ever spinning down, since even on idle rw-mounted filesystems it was
touching the disk with what blktrace said was something journal-related
once every five seconds, so the disks decided they should never spin
down, costing me about 10W of power. Now laptop_mode would have solved
this problem, but laptop_mode makes other changes that I didn't want
(e.g. changing the way dirty writeout is done so that all writeout is
lumpy: smooth dirty writeout is fine, I just don't want the disks
touched all the time when the system is actually idle).

So I turned on commit=30... and nothing changed, a steady pulse of
commits every five seconds. Only when I threw caution to the wind and
tried turning on journal_async_commit (even though the description of
its function seemed quite unrelated) did the commit rate slow to one
every 30s as the commit interval suggested.

I now suspect this was a bug, or multiple bugs, and I should have
reported it rather than flailing around to try to get things working --
but whatever the problem it has by now been fixed: journal committing is
now working rather better, one commit every 15s, even with async commit
turned off. (It is peculiar that I'm seeing one commit every 15s when I
asked for commit=30, but it's less often than once every 8s and that's
what matters to me.)

-- 
NULL && (void)


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-27 22:42                                 ` Eric Sandeen
@ 2012-10-29  1:00                                   ` Theodore Ts'o
  2012-10-29  1:04                                     ` Nix
  2012-10-29  2:24                                     ` Eric Sandeen
  0 siblings, 2 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-29  1:00 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Nix, linux-ext4, linux-kernel, gregkh

On Sat, Oct 27, 2012 at 05:42:07PM -0500, Eric Sandeen wrote:
> 
> It looks like the inode_bitmap_bh is being modified outside a transaction:
> 
>                 ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
> 
> It needs a get_write_access / handle_dirty_metadata pair around it.

Oops.   Nice catch!!

The patch isn't quite right, though.  We only want to call
ext4_journal_get_write_access() when we know that there is an available
bit in the bitmap.  (We could still lose the race, but in that case
the winner of the race probably grabbed the bitmap block first.)

Also, we only need to call ext4_handle_dirty_metadata() if we
successfully grab the bit in the bitmap.

So I suggest this patch instead:

commit 087eda81f1ac6a6a0394f781b44f1d555d8f64c6
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Sun Oct 28 20:59:57 2012 -0400

    ext4: fix unjournaled inode bitmap modification
    
    commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
    such that the inode bitmap was being modified outside a transaction,
    which could lead to corruption, and was discovered when journal_checksum
    found a bad checksum in the journal.
    
    Reported-by: Nix <nix@esperi.org.uk>
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@vger.kernel.org

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4facdd2..575afac 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -725,6 +725,10 @@ repeat_in_this_group:
 				   "inode=%lu", ino + 1);
 			continue;
 		}
+		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+		if (err)
+			goto fail;
 		ext4_lock_group(sb, group);
 		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
 		ext4_unlock_group(sb, group);
@@ -738,6 +742,11 @@ repeat_in_this_group:
 	goto out;
 
 got:
+	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+	if (err)
+		goto fail;
+
 	/* We may have to initialize the block bitmap if it isn't already */
 	if (ext4_has_group_desc_csum(sb) &&
 	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
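(For the archives, the discipline this patch restores can be modelled
in miniature -- all names and the struct below are invented, not jbd2
internals: a buffer must be joined to the running transaction via
get_write_access *before* its bits are modified, then marked dirty:

```c
#include <assert.h>

/* Toy stand-in for a journalled buffer head. */
struct buf {
	unsigned long bits;
	int in_transaction;	/* did get_write_access run first?      */
	int journalled_ok;	/* was every modification inside it?     */
};

static void buf_init(struct buf *b)
{
	b->bits = 0;
	b->in_transaction = 0;
	b->journalled_ok = 1;
}

/* Models ext4_journal_get_write_access(): join buffer to the handle. */
static void get_write_access(struct buf *b)
{
	b->in_transaction = 1;
}

/* Models ext4_test_and_set_bit() on the inode bitmap; flags the bug
 * if the buffer is modified outside a transaction. */
static int test_and_set_bit(int nr, struct buf *b)
{
	unsigned long mask = 1ul << nr;
	int old = (b->bits & mask) != 0;

	if (!b->in_transaction)
		b->journalled_ok = 0;	/* the pre-fix bug */
	b->bits |= mask;
	return old;
}
```

The pre-119c0d44 code effectively always called get_write_access()
before the bit flip; the fix puts that ordering back.)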


* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-29  1:00                                   ` Theodore Ts'o
@ 2012-10-29  1:04                                     ` Nix
  2012-10-29  2:24                                     ` Eric Sandeen
  1 sibling, 0 replies; 83+ messages in thread
From: Nix @ 2012-10-29  1:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Eric Sandeen, linux-ext4, linux-kernel, gregkh

On 29 Oct 2012, Theodore Ts'o spake thusly:

>     commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
>     such that the inode bitmap was being modified outside a transaction,
>     which could lead to corruption, and was discovered when journal_checksum
>     found a bad checksum in the journal.

Hm. If this could have caused corruption for non-checksum users, it must
be a pretty rare case if nobody's hit it in six months -- or maybe, I
suppose, they hit it and never noticed. (But, hey, this makes me happier
to have reported this despite all the flap, if it's found a genuine bug
that could have hit people not using weirdo mount options.)

Thanks for spending so much time on this fix. Much appreciated.

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-29  1:00                                   ` Theodore Ts'o
  2012-10-29  1:04                                     ` Nix
@ 2012-10-29  2:24                                     ` Eric Sandeen
  2012-10-29  2:34                                       ` Theodore Ts'o
  1 sibling, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-29  2:24 UTC (permalink / raw)
  To: Theodore Ts'o, Nix, linux-ext4, linux-kernel, gregkh

On 10/28/12 8:00 PM, Theodore Ts'o wrote:
> On Sat, Oct 27, 2012 at 05:42:07PM -0500, Eric Sandeen wrote:
>>
>> It looks like the inode_bitmap_bh is being modified outside a transaction:
>>
>>                 ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>>
>> It needs a get_write_access / handle_dirty_metadata pair around it.
> 
> Oops.   Nice catch!!
> 
> The patch isn't quite right, though.  

Yeah, I knew it wasn't ;)  I did resend 
[PATCH] ext4: fix unjournaled inode bitmap modification
which is a bit more involved.

> We only want to call
> ext4_journal_get_write_access() when we know that there is an available
> bit in the bitmap.  (We could still lose the race, but in that case
> the winner of the race probably grabbed the bitmap block first.)
> 
> Also, we only need to call ext4_handle_dirty_metadata() if we
> successfully grab the bit in the bitmap.
> 
> So I suggest this patch instead:

That'll get_write_access on the same buffer over and over, I suppose
it's ok, but the patch I sent tries to minimize that, and call
ext4_handle_release_buffer if we're not going to use it (which is
a no-op today anyway and not normally used I guess...)

If ext4_handle_release_buffer() is dead code now, and repeated calls
via repeat_in_this_group: are no big deal, then your version looks fine.

-Eric

> commit 087eda81f1ac6a6a0394f781b44f1d555d8f64c6
> Author: Eric Sandeen <sandeen@redhat.com>
> Date:   Sun Oct 28 20:59:57 2012 -0400
> 
>     ext4: fix unjournaled inode bitmap modification
>     
>     commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
>     such that the inode bitmap was being modified outside a transaction,
>     which could lead to corruption, and was discovered when journal_checksum
>     found a bad checksum in the journal.
>     
>     Reported-by: Nix <nix@esperi.org.uk>
>     Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>     Cc: stable@vger.kernel.org
> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 4facdd2..575afac 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -725,6 +725,10 @@ repeat_in_this_group:
>  				   "inode=%lu", ino + 1);
>  			continue;
>  		}
> +		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
> +		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
> +		if (err)
> +			goto fail;
>  		ext4_lock_group(sb, group);
>  		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>  		ext4_unlock_group(sb, group);
> @@ -738,6 +742,11 @@ repeat_in_this_group:
>  	goto out;
>  
>  got:
> +	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> +	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
> +	if (err)
> +		goto fail;
> +
>  	/* We may have to initialize the block bitmap if it isn't already */
>  	if (ext4_has_group_desc_csum(sb) &&
>  	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH -v3] ext4: fix unjournaled inode bitmap modification
  2012-10-28  4:23           ` [PATCH] ext4: fix unjournaled inode bitmap modification Eric Sandeen
  2012-10-28 13:59             ` Nix
@ 2012-10-29  2:30             ` Theodore Ts'o
  2012-10-29  3:24               ` Eric Sandeen
  2012-10-29 17:08               ` Darrick J. Wong
  1 sibling, 2 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-29  2:30 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Nix, linux-ext4, linux-kernel, gregkh

On Sat, Oct 27, 2012 at 11:23:57PM -0500, Eric Sandeen wrote:
> A little more going on here to try to properly handle error
> cases & moving to the next group; despite
> ext4_handle_release_buffer being a no-op, I've tried
> to sprinkle it in at the right places.  Double checking
> on review is probably a fine idea ;)

Sorry, I didn't see your newer version of your patch.  I'm not
convinced it's worth it to try to get the calls to
ext4_handle_release_buffer() right.  There are plenty of other places
where we're not calling ext4_handle_release_buffer(), and I'm not
convinced it would ever be useful to make it be something other than a
no-op.  In order to make it be useful, we'd have to enforce a rule
that every single get_write_access() was matched with either a
handle_dirty_metadata() or a handle_release_buffer().  That would be
tricky; worse, we'd have to keep track of a refcount on each bh, which
would cost us on the scalability front.  The main benefit would be
that we might be able to reclaim bh's where we called
get_write_access() and then changed our mind, but that's relatively
rare, and I think it's easier to simply defer calling
get_write_access() until we're sure we're going to need write access.

Hence in my version of the patch, I've waited until right before the
call to ext4_lock_group() before calling get_write_access().  Note
that it's safe to call get_write_access() on a bh twice; the second
time the jbd2 layer will notice that the bh is already a part of the
transaction.

Also, leaving out the calls to ext4_handle_release_buffer() makes the
patch easier to understand and reason about.

What do you think of this version?

						- Ted

commit 67d725143e9e7ea458a0c1c4a6625657c3dc7ba2
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Sun Oct 28 22:24:57 2012 -0400

    ext4: fix unjournaled inode bitmap modification
    
    commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed
    ext4_new_inode() such that the inode bitmap was being modified
    outside a transaction, which could lead to corruption, and was
    discovered when journal_checksum found a bad checksum in the
    journal during log replay.
    
    Nix ran into this when using the journal_async_commit mount
    option, which enables journal checksumming.  The ensuing
    journal replay failures due to the bad checksums led to
    filesystem corruption reported as the now infamous
    "Apparent serious progressive ext4 data corruption bug"
    
    [ Changed by tytso to call ext4_journal_get_write_access() only
      when we're fairly certain that we're going to allocate the inode. ]
    
    I've tested this by mounting with journal_checksum and
    running fsstress then dropping power; I've also tested by
    hacking DM to create snapshots w/o first quiescing, which
    allows me to test journal replay repeatedly w/o actually
    power-cycling the box.  Without the patch I hit a journal
    checksum error every time.  With this fix it survives
    many iterations.
    
    Reported-by: Nix <nix@esperi.org.uk>
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@vger.kernel.org

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4facdd2..3a100e7 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -725,6 +725,10 @@ repeat_in_this_group:
 				   "inode=%lu", ino + 1);
 			continue;
 		}
+		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+		if (err)
+			goto fail;
 		ext4_lock_group(sb, group);
 		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
 		ext4_unlock_group(sb, group);
@@ -738,6 +742,11 @@ repeat_in_this_group:
 	goto out;
 
 got:
+	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+	if (err)
+		goto fail;
+
 	/* We may have to initialize the block bitmap if it isn't already */
 	if (ext4_has_group_desc_csum(sb) &&
 	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
@@ -771,11 +780,6 @@ got:
 			goto fail;
 	}
 
-	BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
-	err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
-	if (err)
-		goto fail;
-
 	BUFFER_TRACE(group_desc_bh, "get_write_access");
 	err = ext4_journal_get_write_access(handle, group_desc_bh);
 	if (err)
@@ -823,11 +827,6 @@ got:
 	}
 	ext4_unlock_group(sb, group);
 
-	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
-	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
-	if (err)
-		goto fail;
-
 	BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
 	err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);
 	if (err)



^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-29  2:24                                     ` Eric Sandeen
@ 2012-10-29  2:34                                       ` Theodore Ts'o
  2012-10-29  2:35                                         ` Eric Sandeen
  0 siblings, 1 reply; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-29  2:34 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Nix, linux-ext4, linux-kernel, gregkh

On Sun, Oct 28, 2012 at 09:24:19PM -0500, Eric Sandeen wrote:
> Yeah, I knew it wasn't ;)  I did resend 
> [PATCH] ext4: fix unjournaled inode bitmap modification
> which is a bit more involved.

Yeah, sorry, I didn't see your updated patch at first, since this mail
thread is one complicated tangle.  :-(

> That'll get_write_access on the same buffer over and over, I suppose
> it's ok, but the patch I sent tries to minimize that, and call
> ext4_handle_release_buffer if we're not going to use it (which is
> a no-op today anyway and not normally used I guess...)

Well, it's really rare that we will go through that loop more than
once; it only happens if we have multiple processes race against each
other trying to grab the same inode.

> If ext4_handle_release_buffer() is dead code now, and repeated calls
> via repeat_in_this_group: are no big deal, then your version looks fine.

Yeah, I think it's pretty much dead code.  At least, I can't think of
a good reason why we would want to actually try to handle
ext4_handle_release_buffer() to claw back the transaction credit.  And
if we do, we'll have to do a sweep through the entire ext4 codebase
anyway.

						- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-29  2:34                                       ` Theodore Ts'o
@ 2012-10-29  2:35                                         ` Eric Sandeen
  2012-10-29  2:42                                           ` Theodore Ts'o
  0 siblings, 1 reply; 83+ messages in thread
From: Eric Sandeen @ 2012-10-29  2:35 UTC (permalink / raw)
  To: Theodore Ts'o, Nix, linux-ext4, linux-kernel, gregkh

On 10/28/12 9:34 PM, Theodore Ts'o wrote:
> On Sun, Oct 28, 2012 at 09:24:19PM -0500, Eric Sandeen wrote:
>> Yeah, I knew it wasn't ;)  I did resend 
>> [PATCH] ext4: fix unjournaled inode bitmap modification
>> which is a bit more involved.
> 
> Yeah, sorry, I didn't see your updated patch at first, since this mail
> thread is one complicated tangle.  :-(
> 
>> That'll get_write_access on the same buffer over and over, I suppose
>> it's ok, but the patch I sent tries to minimize that, and call
>> ext4_handle_release_buffer if we're not going to use it (which is
>> a no-op today anyway and not normally used I guess...)
> 
> Well, it's really rare that we will go through that loop more than
> once; it only happens if we have multiple processes race against each
> other trying to grab the same inode.
> 
>> If ext4_handle_release_buffer() is dead code now, and repeated calls
>> via repeat_in_this_group: are no big deal, then your version looks fine.
> 
> Yeah, I think it's pretty much dead code.  At least, I can't think of
> a good reason why we would want to actually try to handle
> ext4_handle_release_buffer() to claw back the transaciton credit.  And
> if we do, we'll have to do a sweep through the entire ext4 codebase
> anyway.

Yeah, seems that way.

Then your simpler version is probably the way to go.

Thanks,
-Eric

> 						- Ted
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
  2012-10-29  2:35                                         ` Eric Sandeen
@ 2012-10-29  2:42                                           ` Theodore Ts'o
  0 siblings, 0 replies; 83+ messages in thread
From: Theodore Ts'o @ 2012-10-29  2:42 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Nix, linux-ext4, linux-kernel, gregkh

On Sun, Oct 28, 2012 at 09:35:58PM -0500, Eric Sandeen wrote:
> Yeah, seems that way.
> 
> Then your simpler version is probably the way to go.

If you have a chance, could you do me a favor and test my -v3 version
of the patch?  It should work just as well as yours, but I'm getting
paranoid in my old age, and you seem to have a reliable way of testing
for this failure.  I still need to figure out why my kvm-based
approach isn't showing the problem....

Thanks,

-- Ted

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH -v3] ext4: fix unjournaled inode bitmap modification
  2012-10-29  2:30             ` [PATCH -v3] " Theodore Ts'o
@ 2012-10-29  3:24               ` Eric Sandeen
  2012-10-29 17:08               ` Darrick J. Wong
  1 sibling, 0 replies; 83+ messages in thread
From: Eric Sandeen @ 2012-10-29  3:24 UTC (permalink / raw)
  To: Theodore Ts'o, Nix, linux-ext4, linux-kernel, gregkh

On 10/28/12 9:30 PM, Theodore Ts'o wrote:
> On Sat, Oct 27, 2012 at 11:23:57PM -0500, Eric Sandeen wrote:
>> A little more going on here to try to properly handle error
>> cases & moving to the next group; despite
>> ext4_handle_release_buffer being a no-op, I've tried
>> to sprinkle it in at the right places.  Double checking
>> on review is probably a fine idea ;)
> 
> Sorry, I didn't see your newer version of your patch.  I'm not
> convinced it's worth it to try to get the calls to
> ext4_handle_release_buffer() right.  There are plenty of other places
> where we're not calling ext4_handle_release_buffer(), and I'm not
> convinced it would ever be useful to make it be something other than a
> no-op.  

Fair enough, I went a little overboard.

> In order to make it be useful, we'd have to enforce a rule
> that every single get_write_access() was matched with either a
> handle_dirty_metadata() or a handle_release_buffer().  That would be
> tricky; worse, we'd have to keep track of a refcount on each bh, which
> would cost us on the scalability front.  The main benefit would be
> that we might be able to reclaim bh's where we called
> get_write_access() and then changed our mind, but that's relatively
> rare, and I think it's easier to simply defer calling
> get_write_access() until we're sure we're going to need write access.
> 
> Hence in my version of the patch, I've waited until right before the
> call to ext4_lock_group() before calling get_write_access().  Note
> that it's safe to call get_write_access() on a bh twice; the second
> time the jbd2 layer will notice that the bh is already a part of the
> transaction.

Yeah, I guess that's the norm.

So on the one hand you delay calling it until we're sure we need
it; OTOH it's no big deal if it does get called twice :)

> Also, leaving out the calls to ext4_handle_release_buffer() makes the
> patch easier to understand and reason about.

Fair enough.

> What do you think of this version?

Looks fine, tests fine.  Ship it ;)

-Eric

> 						- Ted
> 
> commit 67d725143e9e7ea458a0c1c4a6625657c3dc7ba2
> Author: Eric Sandeen <sandeen@redhat.com>
> Date:   Sun Oct 28 22:24:57 2012 -0400
> 
>     ext4: fix unjournaled inode bitmap modification
>     
>     commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed
>     ext4_new_inode() such that the inode bitmap was being modified
>     outside a transaction, which could lead to corruption, and was
>     discovered when journal_checksum found a bad checksum in the
>     journal during log replay.
>     
>     Nix ran into this when using the journal_async_commit mount
>     option, which enables journal checksumming.  The ensuing
>     journal replay failures due to the bad checksums led to
>     filesystem corruption reported as the now infamous
>     "Apparent serious progressive ext4 data corruption bug"
>     
>     [ Changed by tytso to call ext4_journal_get_write_access() only
>       when we're fairly certain that we're going to allocate the inode. ]
>     
>     I've tested this by mounting with journal_checksum and
>     running fsstress then dropping power; I've also tested by
>     hacking DM to create snapshots w/o first quiescing, which
>     allows me to test journal replay repeatedly w/o actually
>     power-cycling the box.  Without the patch I hit a journal
>     checksum error every time.  With this fix it survives
>     many iterations.
>     
>     Reported-by: Nix <nix@esperi.org.uk>
>     Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>     Cc: stable@vger.kernel.org
> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 4facdd2..3a100e7 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -725,6 +725,10 @@ repeat_in_this_group:
>  				   "inode=%lu", ino + 1);
>  			continue;
>  		}
> +		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
> +		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
> +		if (err)
> +			goto fail;
>  		ext4_lock_group(sb, group);
>  		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>  		ext4_unlock_group(sb, group);
> @@ -738,6 +742,11 @@ repeat_in_this_group:
>  	goto out;
>  
>  got:
> +	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> +	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
> +	if (err)
> +		goto fail;
> +
>  	/* We may have to initialize the block bitmap if it isn't already */
>  	if (ext4_has_group_desc_csum(sb) &&
>  	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> @@ -771,11 +780,6 @@ got:
>  			goto fail;
>  	}
>  
> -	BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
> -	err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
> -	if (err)
> -		goto fail;
> -
>  	BUFFER_TRACE(group_desc_bh, "get_write_access");
>  	err = ext4_journal_get_write_access(handle, group_desc_bh);
>  	if (err)
> @@ -823,11 +827,6 @@ got:
>  	}
>  	ext4_unlock_group(sb, group);
>  
> -	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> -	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
> -	if (err)
> -		goto fail;
> -
>  	BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
>  	err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);
>  	if (err)
> 
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH -v3] ext4: fix unjournaled inode bitmap modification
  2012-10-29  2:30             ` [PATCH -v3] " Theodore Ts'o
  2012-10-29  3:24               ` Eric Sandeen
@ 2012-10-29 17:08               ` Darrick J. Wong
  1 sibling, 0 replies; 83+ messages in thread
From: Darrick J. Wong @ 2012-10-29 17:08 UTC (permalink / raw)
  To: Theodore Ts'o, Eric Sandeen, Nix, linux-ext4, linux-kernel, gregkh

On Sun, Oct 28, 2012 at 10:30:34PM -0400, Theodore Ts'o wrote:
> On Sat, Oct 27, 2012 at 11:23:57PM -0500, Eric Sandeen wrote:
> > A little more going on here to try to properly handle error
> > cases & moving to the next group; despite
> > ext4_handle_release_buffer being a no-op, I've tried
> > to sprinkle it in at the right places.  Double checking
> > on review is probably a fine idea ;)
> 
> Sorry, I didn't see your newer version of your patch.  I'm not
> convinced it's worth it to try to get the calls to
> ext4_handle_release_buffer() right.  There are plenty of other places
> where we're not calling ext4_handle_release_buffer(), and I'm not
> convinced it would ever be useful to make it be something other than a
> no-op.  In order to make it be useful, we'd have to enforce a rule
> that every single get_write_access() was matched with either a
> handle_dirty_metadata() or a handle_release_buffer().  That would be
> tricky; worse, we'd have to keep track of a refcount on each bh, which
> would cost us on the scalability front.  The main benefit would be
> that we might be able to reclaim bh's where we called
> get_write_access() and then changed our mind, but that's relatively
> rare, and I think it's easier to simply defer calling
> get_write_access() until we're sure we're going to need write access.
> 
> Hence in my version of the patch, I've waited until right before the
> call to ext4_lock_group() before calling get_write_access().  Note
> that it's safe to call get_write_access() on a bh twice; the second
> time the jbd2 layer will notice that the bh is already a part of the
> transaction.
> 
> Also, leaving out the calls to ext4_handle_release_buffer() makes the
> patch easier to understand and reason about.
> 
> What do you think of this version?

I _think_ it looks ok in terms of making sure we call
ext4_inode_bitmap_csum_set() before calling ext4_handle_dirty_metadata() on the
group descriptor, but this function is a bit tricky. :)

--D
> 
> 						- Ted
> 
> commit 67d725143e9e7ea458a0c1c4a6625657c3dc7ba2
> Author: Eric Sandeen <sandeen@redhat.com>
> Date:   Sun Oct 28 22:24:57 2012 -0400
> 
>     ext4: fix unjournaled inode bitmap modification
>     
>     commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed
>     ext4_new_inode() such that the inode bitmap was being modified
>     outside a transaction, which could lead to corruption, and was
>     discovered when journal_checksum found a bad checksum in the
>     journal during log replay.
>     
>     Nix ran into this when using the journal_async_commit mount
>     option, which enables journal checksumming.  The ensuing
>     journal replay failures due to the bad checksums led to
>     filesystem corruption reported as the now infamous
>     "Apparent serious progressive ext4 data corruption bug"
>     
>     [ Changed by tytso to call ext4_journal_get_write_access() only
>       when we're fairly certain that we're going to allocate the inode. ]
>     
>     I've tested this by mounting with journal_checksum and
>     running fsstress then dropping power; I've also tested by
>     hacking DM to create snapshots w/o first quiescing, which
>     allows me to test journal replay repeatedly w/o actually
>     power-cycling the box.  Without the patch I hit a journal
>     checksum error every time.  With this fix it survives
>     many iterations.
>     
>     Reported-by: Nix <nix@esperi.org.uk>
>     Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>     Cc: stable@vger.kernel.org
> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 4facdd2..3a100e7 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -725,6 +725,10 @@ repeat_in_this_group:
>  				   "inode=%lu", ino + 1);
>  			continue;
>  		}
> +		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
> +		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
> +		if (err)
> +			goto fail;
>  		ext4_lock_group(sb, group);
>  		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>  		ext4_unlock_group(sb, group);
> @@ -738,6 +742,11 @@ repeat_in_this_group:
>  	goto out;
>  
>  got:
> +	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> +	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
> +	if (err)
> +		goto fail;
> +
>  	/* We may have to initialize the block bitmap if it isn't already */
>  	if (ext4_has_group_desc_csum(sb) &&
>  	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> @@ -771,11 +780,6 @@ got:
>  			goto fail;
>  	}
>  
> -	BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
> -	err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
> -	if (err)
> -		goto fail;
> -
>  	BUFFER_TRACE(group_desc_bh, "get_write_access");
>  	err = ext4_journal_get_write_access(handle, group_desc_bh);
>  	if (err)
> @@ -823,11 +827,6 @@ got:
>  	}
>  	ext4_unlock_group(sb, group);
>  
> -	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> -	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
> -	if (err)
> -		goto fail;
> -
>  	BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
>  	err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);
>  	if (err)
> 
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2012-10-29 17:08 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-22 16:17 Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression? Nix
2012-10-23  1:33 ` J. Bruce Fields
2012-10-23 14:07   ` Nix
2012-10-23 14:30     ` J. Bruce Fields
2012-10-23 16:32       ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix
2012-10-23 16:46         ` J. Bruce Fields
2012-10-23 16:54           ` J. Bruce Fields
2012-10-23 16:56           ` Myklebust, Trond
2012-10-23 17:05             ` Nix
2012-10-23 17:36               ` Nix
2012-10-23 17:43                 ` J. Bruce Fields
2012-10-23 17:44                 ` Myklebust, Trond
2012-10-23 17:57                   ` Myklebust, Trond
     [not found]                   ` <1351015039.4622.23.camel@lade.trondhjem.org>
2012-10-23 18:23                     ` Myklebust, Trond
2012-10-23 19:49                       ` Nix
2012-10-24 10:18                         ` [PATCH] lockd: fix races in per-net NSM client handling Stanislav Kinsbursky
2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
2012-10-23 22:19           ` Theodore Ts'o
2012-10-23 22:47             ` Nix
2012-10-23 23:16               ` Theodore Ts'o
2012-10-23 23:06             ` Nix
2012-10-23 23:28               ` Theodore Ts'o
2012-10-23 23:34                 ` Nix
2012-10-24  0:57             ` Eric Sandeen
2012-10-24 20:17               ` Jan Kara
2012-10-26 15:25                 ` Eric Sandeen
2012-10-24 19:13             ` Jannis Achstetter
2012-10-24 21:31               ` Theodore Ts'o
2012-10-24 22:05                 ` Jannis Achstetter
2012-10-24 23:47                 ` Nix
2012-10-25 17:02                 ` Felipe Contreras
2012-10-24 21:04             ` Jannis Achstetter
2012-10-24  1:13           ` Eric Sandeen
2012-10-24  4:15             ` Nix
2012-10-24  4:27               ` Eric Sandeen
2012-10-24  5:23                 ` Theodore Ts'o
2012-10-24  7:00                   ` Hugh Dickins
2012-10-24 11:46                     ` Nix
2012-10-24 11:45                   ` Nix
2012-10-24 17:22                   ` Eric Sandeen
2012-10-24 19:49                   ` Nix
2012-10-24 19:54                     ` Nix
2012-10-24 20:30                     ` Eric Sandeen
2012-10-24 20:34                       ` Nix
2012-10-24 20:45                     ` Nix
2012-10-24 21:08                     ` Theodore Ts'o
2012-10-24 23:27                       ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) Nix
2012-10-24 23:42                         ` Nix
2012-10-25  1:10                         ` Theodore Ts'o
2012-10-25  1:45                           ` Nix
2012-10-25 14:12                             ` Theodore Ts'o
2012-10-25 14:15                               ` Nix
2012-10-25 17:39                                 ` Nix
2012-10-25 11:06                           ` Nix
2012-10-26  0:22                           ` Apparent serious progressive ext4 data corruption bug in 3.6 (when rebooting during umount) (possibly blockdev / arcmsr at fault??) Nix
2012-10-26  0:11               ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Ric Wheeler
2012-10-26  0:43                 ` Theodore Ts'o
2012-10-26 12:12                   ` Nix
2012-10-26 20:35           ` Eric Sandeen
2012-10-26 20:37             ` Nix
2012-10-26 20:56               ` Theodore Ts'o
2012-10-26 20:59                 ` Nix
2012-10-26 21:15                   ` Theodore Ts'o
2012-10-26 21:19                     ` Nix
2012-10-27  0:22                       ` Theodore Ts'o
2012-10-27 12:45                         ` Nix
2012-10-27 17:55                           ` Theodore Ts'o
2012-10-27 18:47                             ` Nix
2012-10-27 21:19                               ` Eric Sandeen
2012-10-27 22:42                                 ` Eric Sandeen
2012-10-29  1:00                                   ` Theodore Ts'o
2012-10-29  1:04                                     ` Nix
2012-10-29  2:24                                     ` Eric Sandeen
2012-10-29  2:34                                       ` Theodore Ts'o
2012-10-29  2:35                                         ` Eric Sandeen
2012-10-29  2:42                                           ` Theodore Ts'o
2012-10-27 18:30                           ` Eric Sandeen
2012-10-27  3:11                     ` Jim Rees
2012-10-28  4:23           ` [PATCH] ext4: fix unjournaled inode bitmap modification Eric Sandeen
2012-10-28 13:59             ` Nix
2012-10-29  2:30             ` [PATCH -v3] " Theodore Ts'o
2012-10-29  3:24               ` Eric Sandeen
2012-10-29 17:08               ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).