* Re: ext4 metadata corruption bug?
       [not found] ` <CAGagf4eEzY4+3cfNWSEENTo1PKe40nq1Ne6ZzOLGm-O78W7RcA@mail.gmail.com>
@ 2014-04-10  5:04   ` Nathaniel W Filardo
  2014-04-10 14:03     ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Nathaniel W Filardo @ 2014-04-10  5:04 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Mike Rubin, Frank Mayhar, admins, linux-ext4

On Wed, Apr 09, 2014 at 10:55:48PM -0400, Theodore Tso wrote:
> Hi Nathaniel,
> 
> In general, it's best if you send these sorts of requests for help to the
> linux-ext4@vger.kernel.org mailing list.

Added to CC.

> The fact that we see the "error count" line early in the boot message
> suggests to me that your VM is not running fsck to fix up the errors before
> mounting the file system.  (Well, either that or you're using a really
> ancient version of e2fsck, but given that you're using a bleeding-edge
> kernel, I'm guessing you're using a reasonably recent version of
> e2fsck.  But that would be good for you to check.)

e2fsck version is 1.42.9 using the same library version.
 
> The ext4 error message is due to the file system getting corrupted.  How
> the file system got corrupted isn't 100% clear, but one potential cause is
> how the disk is configured with qemu.
>[snip]

We use QEMU directives like

        -drive format=raw,file=rbd:rbdafs-mirror/mirror-0,id=drive5,if=none,cache=writeback \
        -device driver=ide-hd,drive=drive5,discard_granularity=512,bus=ahci0.3

We've never had, so far as I know, an unexpected shutdown of the QEMU
process, so I don't think that unexpected loss of cache contents is to
blame.

Perhaps the dmesg I sent was not representative; some days ago, only
(comparatively!) late in the machine's uptime, we saw:

[309894.428685] EXT4-fs (sdd): pa ffff88000d9f9440: logic 832, phys.  957458972, len 192
[309894.430023] EXT4-fs error (device sdd): ext4_mb_release_inode_pa:3729: group 29219, free 192, pa_free 191
[309894.431822] Aborting journal on device sdd-8.
[309894.442913] EXT4-fs (sdd): Remounting filesystem read-only

with Debian kernel 3.13.5-1; sdd here is the same filesystem as in the
earlier dmesg.

I'll capture any subsequent crashes and follow up.

Thanks much!
--nwf;

* Re: ext4 metadata corruption bug?
  2014-04-10  5:04   ` ext4 metadata corruption bug? Nathaniel W Filardo
@ 2014-04-10 14:03     ` Theodore Ts'o
  2014-04-10 16:33       ` Nathaniel W Filardo
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-04-10 14:03 UTC (permalink / raw)
  To: Nathaniel W Filardo
  Cc: Theodore Tso, Mike Rubin, Frank Mayhar, admins, linux-ext4

On Thu, Apr 10, 2014 at 01:04:28AM -0400, Nathaniel W Filardo wrote:
> We use QEMU directives like
> 
>         -drive format=raw,file=rbd:rbdafs-mirror/mirror-0,id=drive5,if=none,cache=writeback \
>         -device driver=ide-hd,drive=drive5,discard_granularity=512,bus=ahci0.3
> 
> We've never had, so far as I know, an unexpected shutdown of the QEMU
> process, so I don't think that unexpected loss of cache contents is to
> blame.
> 
> Perhaps the dmesg I sent was not representative; some days ago, we saw, only
> (comparatively!) late in the machine's uptime:
> 
> [309894.428685] EXT4-fs (sdd): pa ffff88000d9f9440: logic 832, phys.  957458972, len 192
> [309894.430023] EXT4-fs error (device sdd): ext4_mb_release_inode_pa:3729: group 29219, free 192, pa_free 191
> [309894.431822] Aborting journal on device sdd-8.
> [309894.442913] EXT4-fs (sdd): Remounting filesystem read-only
> 
> with Debian kernel 3.13.5-1; sdd here is the same filesystem as in the
> earlier dmesg.

What is your workload?  Can you reproduce this easily?  And can you
try using a local disk to see if the problem goes away, so we can
start to bisect which software components might be at fault?

I'm not aware of any corruption problem with a 3.13-based kernel which
matches your signature, and the ext4 errors that you are showing
(minor accounting discrepancies in the number of free blocks and number
of free inodes between the allocation bitmap and the summary
statistics in the block group descriptors) very closely match the
signature of some part of the storage stack not honoring FLUSH CACHE
("barrier") operations, either by ignoring them completely, or
reordering writes across a barrier / flush cache request.

Cheers,

					- Ted

* Re: ext4 metadata corruption bug?
  2014-04-10 14:03     ` Theodore Ts'o
@ 2014-04-10 16:33       ` Nathaniel W Filardo
  2014-04-10 22:17         ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Nathaniel W Filardo @ 2014-04-10 16:33 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Mike Rubin, Frank Mayhar, admins, linux-ext4

On Thu, Apr 10, 2014 at 10:03:16AM -0400, Theodore Ts'o wrote:
> On Thu, Apr 10, 2014 at 01:04:28AM -0400, Nathaniel W Filardo wrote:
> >[snip]
> What is your workload?  Can you reproduce this easily?  And can you
> try using a local disk to see if the problem goes away, so we can
> start to bisect which software components might be at fault?

We're running an OpenAFS fileserver; the partition that most often causes us
trouble is the one containing our mirrors (Debian, Fedora, etc.), which has
~5T of its 10T capacity filled.  At least every few days, we trip over one
of these errors.

I will see what I can do about getting some local storage hooked up to the
VM, but my fear is that this is related to the size of the volume and the
amount of data therein.  If that's true, any amount of local storage I could
muster will not shake this out.

> I'm not aware of any corruption problem with a 3.13 based kernel which
> matches your signature, and the ext4 errors that you are showing
> (minor accounting discrepancies in the number free blocks and number
> of free inodes between the allocation bitmap and the summary
> statistics in the block group descriptors) is very closely matches the
> signature of some part of the storage stack not honoring FLUSH CACHE
> ("barrier") operations, either by ignoring them completely, or
> reordring writes across a barrier / flush cache request.

Shouldn't cache reordering or a failure to flush correctly only matter if the
machine is crashing or otherwise losing power?  I suppose it's possible
there's a bug that would cause the cache to fail to write a block at all,
rather than simply writing it "too late".  But as I said before, we've not
had any crashes or otherwise lost uptime anywhere: host, guest, storage
providers, etc.

That said, we do occasionally (though much less often than we get reports of
corrupted metadata) get messages from the ATA stack that I don't know how to
decode, though naively they all seemed to be successfully resolved
transients.  One of our VMs, nearly identically configured, though not the
one that's been reporting corruption on its filesystem, spat this out the
other day, for example:

[532625.888251] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[532625.888762] ata1.00: failed command: FLUSH CACHE
[532625.889128] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[532625.889128]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (time out)
[532625.889945] ata1.00: status: { DRDY }
[532630.928064] ata1: link is slow to respond, please be patient (ready=0)
[532635.912178] ata1: device not ready (errno=-16), forcing hardreset
[532635.912220] ata1: soft resetting link
[532636.070087] ata1.00: configured for MWDMA2
[532636.070701] ata1.01: configured for MWDMA2
[532636.070705] ata1.00: retrying FLUSH 0xe7 Emask 0x4
[532651.068208] ata1.00: qc timeout (cmd 0xe7)
[532651.068216] ata1.00: FLUSH failed Emask 0x4
[532651.236146] ata1: soft resetting link
[532651.393918] ata1.00: configured for MWDMA2
[532651.394533] ata1.01: configured for MWDMA2
[532651.394537] ata1.00: retrying FLUSH 0xe7 Emask 0x4
[532651.395550] ata1.00: device reported invalid CHS sector 0
[532651.395564] ata1: EH complete

This appears to have been during a stall in Ceph's ability to write data to
the OSDs.  I don't know what caused that, and the machine has been happy
ever since (though maybe it's all about to go south?).  I'll keep my eyes
open, especially for any problems that show up between an fsck and the next
time block bitmap corruption is detected.

Would I be better off attaching the RBDs via virtio rather than AHCI?  Do I
need to do anything special on the guest to make that work?
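
I imagine the switch would be something like the following, reusing the
drive definition from before (an untested sketch on my part, so please
correct me if that's not the right device type):

        -drive format=raw,file=rbd:rbdafs-mirror/mirror-0,id=drive5,if=none,cache=writeback \
        -device virtio-blk-pci,drive=drive5

with the disk then presumably showing up as /dev/vdX in the guest.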

Thanks very much, again,
--nwf;

* Re: ext4 metadata corruption bug?
  2014-04-10 16:33       ` Nathaniel W Filardo
@ 2014-04-10 22:17         ` Theodore Ts'o
  2014-04-20 16:32           ` Nathaniel W Filardo
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-04-10 22:17 UTC (permalink / raw)
  To: Nathaniel W Filardo; +Cc: Mike Rubin, Frank Mayhar, admins, linux-ext4

On Thu, Apr 10, 2014 at 12:33:51PM -0400, Nathaniel W Filardo wrote:
> 
> Shouldn't cache reordering or fail to flush correctly only matter if the
> machine is crashing or otherwise losing power?  I suppose it's possible
> there's a bug that would cause the cache to fail to write a block at all,
> rather than simply "too late".  But as I said before, we've not had any
> crashes or otherwise lost uptime anywhere: host, guest, storage providers,
> etc.

If it's a cache flush problem, yes, it would only matter if there had
been a crash.  Knowing that what you are doing is an AFS mirror, this
seems even stranger, since writes would be very rare, and it's not
like there would be a whole lot of opportunities for races ---
when you mirror an FTP site, you write a new file sequentially, and
it's not like there are multiple CPUs trying to modify the file at the
same time, etc.

And if you are just seeing the results of random bit flips, one would
expect other types of corruption to be reported as well.  So I don't know.
This is a mystery so far...

> That said, we do occasionally, though much less often than we get reports of
> corrupted metadata, get messages that I don't know how to decode from the
> ATA stack (though naively they all seemed to be successfully resolved
> transients)?  One of our VMs, nearly identically configured, though not the
> one that's been reporting corruption on its filesystem, spat this out the
> other day, for example:
> 
> [532625.888251] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> [532625.888762] ata1.00: failed command: FLUSH CACHE
> [532625.889128] ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> [532625.889128]          res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (time out)
> [532625.889945] ata1.00: status: { DRDY }
> [532630.928064] ata1: link is slow to respond, please be patient (ready=0)
> [532635.912178] ata1: device not ready (errno=-16), forcing hardreset
> [532635.912220] ata1: soft resetting link
> [532636.070087] ata1.00: configured for MWDMA2
> [532636.070701] ata1.01: configured for MWDMA2
> [532636.070705] ata1.00: retrying FLUSH 0xe7 Emask 0x4
> [532651.068208] ata1.00: qc timeout (cmd 0xe7)
> [532651.068216] ata1.00: FLUSH failed Emask 0x4
> [532651.236146] ata1: soft resetting link
> [532651.393918] ata1.00: configured for MWDMA2
> [532651.394533] ata1.01: configured for MWDMA2
> [532651.394537] ata1.00: retrying FLUSH 0xe7 Emask 0x4
> [532651.395550] ata1.00: device reported invalid CHS sector 0
> [532651.395564] ata1: EH complete

Yeah, that doesn't look good, but you're using some kind of remote
block device here, right?  I'm not sure how qemu is translating that
into pseudo ATA commands.  Maybe that corresponds with a broken
network connection which required creating a new TCP connection or
some such?  I'm not really that familiar with the remote block device
code.  So I also can't really give you any advice about whether it
would be better to use virtio versus AHCI.  I would expect that virtio
will probably be faster, but it might not matter for your application.

Cheers,

						- Ted

* Re: ext4 metadata corruption bug?
  2014-04-10 22:17         ` Theodore Ts'o
@ 2014-04-20 16:32           ` Nathaniel W Filardo
  2014-04-20 17:57             ` Theodore Ts'o
  2014-04-23  7:23             ` Sander Smeenk
  0 siblings, 2 replies; 19+ messages in thread
From: Nathaniel W Filardo @ 2014-04-20 16:32 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Mike Rubin, Frank Mayhar, admins, linux-ext4

We just got

> [817576.492013] EXT4-fs (vdd): pa ffff88000dea9b90: logic 0, phys.  1934464544, len 32
> [817576.492468] EXT4-fs error (device vdd): ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12
> [817576.492987] Aborting journal on device vdd-8.
> [817576.493919] EXT4-fs (vdd): Remounting filesystem read-only

Upon unmount, further

> [825457.072206] EXT4-fs error (device vdd): ext4_put_super:791: Couldn't clean up the journal

fscking generated

> fsck from util-linux 2.20.1
> e2fsck 1.42.9 (4-Feb-2014)
> /dev/vdd: recovering journal
> /dev/vdd contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences:  +(1934464544--1934464545)
> Fix<y>? yes
> Free blocks count wrong (1379876836, counted=1386563079).
> Fix<y>? yes
> Free inodes count wrong (331897442, counted=331912336).
> Fix<y>? yes
>
> /dev/vdd: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/vdd: 3631984/335544320 files (1.6% non-contiguous), 1297791481/2684354560 blocks

The particular error reported by the kernel seems to correspond to the first
of the three fixes, but the other two look like leaks?  A huge number of
inodes (14894) and blocks (6686243, or about 25GiB of storage!) were marked
busy in a way that fsck didn't believe, if I am reading that right?
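(For reference, those are just the deltas from the fsck output above:
331912336 - 331897442 = 14894 inodes and 1386563079 - 1379876836 = 6686243
blocks, which at this filesystem's 4KiB block size works out to roughly
25GiB.)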

/dev/vdd is virtio on Ceph RBD, using write-through caching.  We have had a
crash on one of the Ceph OSDs recently in a way that seems to have generated
inconsistent data in Ceph, but subsequent repair commands seem to have made
everything happy again, at least so far as Ceph tells us.

The guest `uname -a` sayeth

> Linux afsscratch-kvm 3.13-1-amd64 #1 SMP Debian 3.13.7-1 (2014-03-25) x86_64 GNU/Linux

And in case it's relevant, host QEMU emulator is version 1.7.0 (Debian
1.7.0+dfsg-3) [modified locally to include rbd]; guest ceph, librbd, etc.
are Debian package 0.72.2-1~bpo70+1 .

Cheers,
--nwf;

* Re: ext4 metadata corruption bug?
  2014-04-20 16:32           ` Nathaniel W Filardo
@ 2014-04-20 17:57             ` Theodore Ts'o
  2014-04-23  7:23             ` Sander Smeenk
  1 sibling, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-04-20 17:57 UTC (permalink / raw)
  To: Nathaniel W Filardo; +Cc: Mike Rubin, Frank Mayhar, admins, linux-ext4

On Sun, Apr 20, 2014 at 12:32:12PM -0400, Nathaniel W Filardo wrote:
> We just got
> 
> > [817576.492013] EXT4-fs (vdd): pa ffff88000dea9b90: logic 0, phys.  1934464544, len 32
> > [817576.492468] EXT4-fs error (device vdd): ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12

OK, so what this means is that ext4 had preallocated 32 blocks (starting at
logical block #0 of the file) for a file that was being written.  When we
are done writing the file, and the file is closed (or truncated, or in a
number of other cases), ext4 releases the unwritten blocks back to the file
system so they can be used for some other file.

According to the preallocation accounting data, there should have been
12 leftover blocks to be released to the file system.  However,
when the function looked at the on-disk bitmap, it found 14 leftover
blocks.  The only way this could happen is (a) a memory hardware error,
(b) a storage device error, or (c) a programming error.

> > [817576.492987] Aborting journal on device vdd-8.
> > [817576.493919] EXT4-fs (vdd): Remounting filesystem read-only

So at this point we abort the journal and remount the file system
read-only in order to avoid potential further corruption.

> Upon unmount, further
> 
> > [825457.072206] EXT4-fs error (device vdd): ext4_put_super:791: Couldn't clean up the journal

That's an error message which should be expected, because the journal
was aborted due to the fs error.  So that's not a big deal.

(Yes, some of the error messages could be improved to be less
confusing; sorry about that.  Something we should fix....)

> fscking generated
> 
> > fsck from util-linux 2.20.1
> > e2fsck 1.42.9 (4-Feb-2014)
> > /dev/vdd: recovering journal
> > /dev/vdd contains a file system with errors, check forced.
> > Pass 1: Checking inodes, blocks, and sizes
> > Pass 2: Checking directory structure
> > Pass 3: Checking directory connectivity
> > Pass 4: Checking reference counts
> > Pass 5: Checking group summary information
> > Block bitmap differences:  +(1934464544--1934464545)
> > Fix<y>? yes

These two blocks were actually in use (i.e., referenced by some inode)
but not marked as in use by the bitmap.  That matches up with the
ext4_error message described above.  Somehow, either the storage
device flipped the bits associated with blocks 1934464544 and
1934464545 on disk, or the request to set those bits never made it to disk.

So fortunately, the file system was marked read-only, because
otherwise these two blocks could have gotten allocated and assigned to
some other file, and that would have meant two different files trying
to use the same blocks, which of course means at least one of the
files would have suffered data loss.
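
If you're curious which file ended up owning those blocks, debugfs can map
them back for you (it opens the device read-only by default); something like

        debugfs -R "icheck 1934464544 1934464545" /dev/vdd
        debugfs -R "ncheck <inode number printed by icheck>" /dev/vdd

should get you from block numbers to an inode and then to a pathname.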

> > Free blocks count wrong (1379876836, counted=1386563079).
> > Fix<y>? yes
> > Free inodes count wrong (331897442, counted=331912336).
> > Fix<y>? yes

These two messages are harmless; you don't need to worry about them.
We no longer update the total number of free blocks and free inodes
except when the file system is cleanly unmounted.  Otherwise, every
single CPU that tried to allocate or release blocks or inodes would end
up taking a global lock on these fields, which would be a massive
scalability bottleneck.  Instead, we just maintain per-block-group
counts for the free blocks and free inodes, and we generate the total
number of free blocks and inodes on demand when the user executes the
statfs(2) system call (for commands like df), or when the file system
is unmounted cleanly.

Since the file system was forcibly remounted read-only due to the
problem that we had found, the summary free block/inode counts never
got updated.
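
(You can actually see this on a live file system: the free blocks number in
the on-disk superblock, which is what dumpe2fs prints, can lag behind what
statfs(2) reports, so something like

        dumpe2fs -h /dev/vdd | grep -i 'free blocks'
        df -B 4096 <mountpoint>

may show two different numbers until the next clean unmount or fsck; the
<mountpoint> is whatever path vdd is mounted on, of course.)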

> /dev/vdd is virtio on Ceph RBD, using write-through caching.  We have had a
> crash on one of the Ceph OSDs recently in a way that seems to have generated
> inconsistent data in Ceph, but subsequent repair commands seem to have made
> everything happy again, at least so far as Ceph tells us.
> 
> The guest `uname -a` sayeth
> 
> > Linux afsscratch-kvm 3.13-1-amd64 #1 SMP Debian 3.13.7-1 (2014-03-25) x86_64 GNU/Linux
> 
> And in case it's relevant, host QEMU emulator is version 1.7.0 (Debian
> 1.7.0+dfsg-3) [modified locally to include rbd]; guest ceph, librbd, etc.
> are Debian package 0.72.2-1~bpo70+1 .

No one else has reported any bugs like this, nor has anything like
this turned up in our stress tests.  It's possible that your workload
is doing something strange that no one else would experience, and
which isn't getting picked up by our stress tests, but it's also just
as likely (and possibly more so) that the problem is caused by some
portion of the storage stack below ext4 --- i.e., virtio, qemu, the
remote block device, etc.  So if you can find a way to substitute a
local disk for the rbd, that would be a really good first step in
trying to bisect which portion of the system might be causing the fs
corruption.
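
(For such a test I'd keep everything else identical and change only the
backing store, e.g. reuse the -drive/-device lines you posted earlier and
swap just the file= target for a local LVM volume or raw image --- the path
below is only a placeholder:

        -drive format=raw,file=/dev/mapper/vg0-ext4test,id=drive5,if=none,cache=writeback

That way, if the errors stop we can point the finger at the rbd side, and if
they don't, we get to look harder at qemu and ext4.)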

Regards,

						- Ted

* Re: ext4 metadata corruption bug?
  2014-04-20 16:32           ` Nathaniel W Filardo
  2014-04-20 17:57             ` Theodore Ts'o
@ 2014-04-23  7:23             ` Sander Smeenk
  2014-04-23 14:36               ` Theodore Ts'o
  1 sibling, 1 reply; 19+ messages in thread
From: Sander Smeenk @ 2014-04-23  7:23 UTC (permalink / raw)
  To: linux-ext4

On Sun, Apr 20, 2014 at 12:32:12PM -0400, Nathaniel W Filardo wrote:

> We just got
> > [817576.492013] EXT4-fs (vdd): pa ffff88000dea9b90: logic 0, phys. 1934464544, len 32
> > [817576.492468] EXT4-fs error (device vdd):
> > ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12

I'm jumping in on this thread as a co-worker pointed me to it after i
reported somewhere else about a similar ext4 metadata corruption issue
i'm running into.

Quite a similar situation to the OP's in this thread:
QEMU 2.0.0~rc1+dfsg-0ubuntu3, with guest and host both running Linux 3.13.0;
the guest has one 'big' volume (10T, 5.5T in use) as /dev/vdb, which in its
entirety is used as an ext4 filesystem.  No partitioning.

The trouble starts with:

| EXT4-fs (vdb): pa ffff880078754c30: logic 274378, phys. 1617823779, len 54
| EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3729: group 49372, free 52, pa_free 50
| Aborting journal on device vdb-8.

After which the system remounts the disk read-only and logs some more
ext4 trouble which i've pasted here: https://8n1.org/9763/4928

The system doesn't crash as this isn't the "OS disk". Running fsck on
the disk reports bitmap corruption and some incorrect free block/inode
counts after which the FS seems to work again.

It only happens to the guest with the large disk. All the other guests
on the same hypervisor and the same backend storage have no issues at
all. No IO errors are logged other than the ext4 errors from the guest.
The workload on this guest is not at all spectacular. A bit of random IO
on a set of files, the rest is mostly archive and rarely touched.

HTH,
-Sander.
-- 
| You dig around for a while but you fail to find any treasure.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7  FBD6 F3A9 9442 20CC 6CD2

* Re: ext4 metadata corruption bug?
  2014-04-23  7:23             ` Sander Smeenk
@ 2014-04-23 14:36               ` Theodore Ts'o
  2014-04-23 15:30                 ` Nathaniel W Filardo
                                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-04-23 14:36 UTC (permalink / raw)
  To: Sander Smeenk, Nathaniel W Filardo; +Cc: linux-ext4

OK, with the two of you reporting this problem, can you do me the
following so we can try to seriously dig into this:

First of all, can you go through your log files and find me as many
instances as possible of this pair of ext4 error messages:

EXT4-fs (vdd): pa ffff88000dea9b90: logic 0, phys. 1934464544, len 32
EXT4-fs error (device vdd): ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12

I want to see if there's any pattern in the physical block number (in
the two samples I have, they are always fairly large numbers), and in
the difference between the free and pa_free numbers.  
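
(Something like

        zgrep -e 'ext4_mb_release_inode_pa' -e 'EXT4-fs (.*): pa ' /var/log/kern.log*

should pull them all out, rotated and compressed logs included; adjust the
path for wherever your kernel messages end up.)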

Secondly, can you send me the output of dumpe2fs -h for the file
systems in question.

Finally, since the both of you are seeing these messages fairly
frequently, would you be willing to run with a patched kernel?
Specifically, can you add a WARN_ON(1) to fs/ext4/mballoc.c here:

	if (free != pa->pa_free) {
		ext4_msg(e4b->bd_sb, KERN_CRIT,
			 "pa %p: logic %lu, phys. %lu, len %lu",
			 pa, (unsigned long) pa->pa_lstart,
			 (unsigned long) pa->pa_pstart,
			 (unsigned long) pa->pa_len);
		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
					free, pa->pa_free);
		WARN_ON(1); <---------------- add this line			
		/*
		 * pa is already deleted so we use the value obtained
		 * from the bitmap and continue.
		 */
	}

Then when it triggers, can you send me the stack trace that gets
printed by the WARN_ON.
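
(The WARN_ON splat lands in the kernel log like any other warning, so
something along the lines of

        dmesg | grep -B 2 -A 30 'WARNING: CPU'

run shortly afterwards should capture the whole trace.)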

The two really interesting commonalities which I've seen so far are:

1)  You are both using virtualization via qemu/kvm

2)  You are both using file systems > 8TB.

Yes?  And Sander, you're not using a remote block device, correct?
You're using a local disk to back the large filesystem on the host OS
side?

Cheers,

					- Ted

* Re: ext4 metadata corruption bug?
  2014-04-23 14:36               ` Theodore Ts'o
@ 2014-04-23 15:30                 ` Nathaniel W Filardo
  2014-04-23 18:05                 ` Sander Smeenk
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Nathaniel W Filardo @ 2014-04-23 15:30 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: admins, linux-ext4

On Wed, Apr 23, 2014 at 10:36:42AM -0400, Theodore Ts'o wrote:
> OK, with the two of you reporting this problem, can you do me the
> following so we can try to seriously dig into this:
>
> First, of all, can you go through your log files and find me as many
> instances of these two pairs of ext4 error messges:
>
> EXT4-fs (vdd): pa ffff88000dea9b90: logic 0, phys. 1934464544, len 32
> EXT4-fs error (device vdd): ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12
>
> I want to see if there's any pattern in the physical block number (in
> the two samples I have, they are always fairly large numbers), and in
> the difference between the free and pa_free numbers.

The current set of logs on the machine contains only two instances of that
pair, which may not be enough to extract a pattern from:

kern.log.1:Apr 20 09:11:13 afsscratch-kvm kernel: [817576.492013] EXT4-fs (vdd): pa ffff88000dea9b90: logic 0, phys. 1934464544, len 32
kern.log.1:Apr 20 09:11:13 afsscratch-kvm kernel: [817576.492468] EXT4-fs error (device vdd): ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12

kern.log.3.gz:Apr  3 14:01:04 afsscratch-kvm kernel: [309894.428685] EXT4-fs (sdd): pa ffff88000d9f9440: logic 832, phys. 957458972, len 192
kern.log.3.gz:Apr  3 14:01:04 afsscratch-kvm kernel: [309894.430023] EXT4-fs error (device sdd): ext4_mb_release_inode_pa:3729: group 29219, free 192, pa_free 191

A good many of our errors come from the allocation side, rather than the
release side; I don't know if this is helpful or useless, but here is
everything in the logs from that side:

kern.log.2.gz:Apr  9 02:10:21 afsscratch-kvm kernel: [384951.911190] EXT4-fs error (device sdd): ext4_mb_generate_buddy:756: group 51014, 2813 clusters in bitmap, 2811 in gd; block bitmap corrupt.
kern.log.2.gz:Apr  9 13:01:23 afsscratch-kvm kernel: [11422.362996] EXT4-fs error (device sdd): ext4_mb_generate_buddy:756: group 42947, 10106 clusters in bitmap, 10105 in gd; block bitmap corrupt.
kern.log.2.gz:Apr  9 17:31:18 afsscratch-kvm kernel: [   24.426020] EXT4-fs error (device sdb): ext4_mb_generate_buddy:756: group 42947, 11128 clusters in bitmap, 11127 in gd; block bitmap corrupt.
kern.log.2.gz:Apr  9 17:36:07 afsscratch-kvm kernel: [  313.312122] EXT4-fs (sdb): error count: 3
kern.log.2.gz:Apr  9 17:36:07 afsscratch-kvm kernel: [  313.312895] EXT4-fs (sdb): initial error at 1397062883: ext4_mb_generate_buddy:756
kern.log.2.gz:Apr  9 17:36:07 afsscratch-kvm kernel: [  313.314256] EXT4-fs (sdb): last error at 1397079078: ext4_mb_generate_buddy:756
kern.log.2.gz:Apr 12 04:41:30 afsscratch-kvm kernel: [110192.817447] EXT4-fs error (device vdd): ext4_mb_generate_buddy:756: group 53425, 84 clusters in bitmap, 82 in gd; block bitmap corrupt.
kern.log.3.gz:Apr  3 21:08:51 afsscratch-kvm kernel: [25112.853350] EXT4-fs error (device sdf): ext4_mb_generate_buddy:756: group 29219, 22572 clusters in bitmap, 22571 in gd; block bitmap corrupt.
kern.log.3.gz:Apr  4 07:44:37 afsscratch-kvm kernel: [34909.921245] EXT4-fs error (device sdd): ext4_mb_generate_buddy:756: group 29219, 22572 clusters in bitmap, 22571 in gd; block bitmap corrupt.
kern.log.3.gz:Apr  4 12:29:47 afsscratch-kvm kernel: [  238.509158] EXT4-fs error (device sdd): ext4_mb_generate_buddy:756: group 29219, 22572 clusters in bitmap, 22571 in gd; block bitmap corrupt.
kern.log.4.gz:Mar 25 14:10:04 afsscratch-kvm kernel: [1801025.178704] EXT4-fs error (device sdf): ext4_mb_generate_buddy:756: group 50994, 3915 clusters in bitmap, 3913 in gd; block bitmap corrupt.
kern.log.4.gz:Mar 30 22:52:28 afsscratch kernel: [2264368.806787] EXT4-fs error (device sdf): ext4_mb_generate_buddy:756: group 52439, 3034 clusters in bitmap, 3032 in gd; block bitmap corrupt.
kern.log.4.gz:Mar 30 23:42:36 afsscratch-kvm kernel: [ 2603.487997] EXT4-fs error (device sdd): ext4_mb_generate_buddy:756: group 52439, 3034 clusters in bitmap , 3032 in gd; block bitmap corrupt.

> Secondly, can you send me the output of dumpe2fs -h for the file
> systems in question.

Filesystem volume name:   <none>
Last mounted on:          /vicepm
Filesystem UUID:          cd47ccd7-92e7-4155-87b2-772828019d52
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              335544320
Block count:              2684354560
Reserved block count:     134217728
Free blocks:              1386563079
Free inodes:              331912336
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      384
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         4096
Inode blocks per group:   256
RAID stride:              1024
RAID stripe width:        1024
Flex block group size:    16
Filesystem created:       Tue Feb 25 18:06:24 2014
Last mount time:          Sun Apr 20 14:08:10 2014
Last write time:          Sun Apr 20 14:08:10 2014
Mount count:              2
Maximum mount count:      -1
Last checked:             Sun Apr 20 11:22:42 2014
Check interval:           0 (<none>)
Lifetime writes:          4834 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      3c362c29-250b-421a-bdc2-472c611b219d
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x000e5c4b
Journal start:            25891

> Finally, since the both of you are seeing these messages fairly
> frequently, would you be willing to run with a patched kernel?
> Specifically, can you add a WARN_ON(1) to fs/ext4/mballoc.c here:
>
> 	if (free != pa->pa_free) {
> 		ext4_msg(e4b->bd_sb, KERN_CRIT,
> 			 "pa %p: logic %lu, phys. %lu, len %lu",
> 			 pa, (unsigned long) pa->pa_lstart,
> 			 (unsigned long) pa->pa_pstart,
> 			 (unsigned long) pa->pa_len);
> 		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
> 					free, pa->pa_free);
> 		WARN_ON(1); <---------------- add this line			
> 		/*
> 		 * pa is already deleted so we use the value obtained
> 		 * from the bitmap and continue.
> 		 */
> 	}
>
> Then when it triggers, can you send me the stack trace that will be
> triggered by the WARN_ON.

I will not be able to roll that kernel immediately, but I can at some point
(possibly this weekend).

> The two really interesting commonalities which I've seen so far is:
>
> 1)  You are both using virtualization via qemu/kvm
>
> 2)  You are both using file systems > 8TB.
>
> Yes?  And Sander, you're not using a remote block device, correct?
> You're using a local disk to back the large fileystem on the host OS
> side?

Yes.

Cheers,
--nwf;

* Re: ext4 metadata corruption bug?
  2014-04-23 14:36               ` Theodore Ts'o
  2014-04-23 15:30                 ` Nathaniel W Filardo
@ 2014-04-23 18:05                 ` Sander Smeenk
  2014-04-29 15:22                 ` Nathaniel W Filardo
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Sander Smeenk @ 2014-04-23 18:05 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Nathaniel W Filardo, linux-ext4

Quoting Theodore Ts'o (tytso@mit.edu):
> First, of all, can you go through your log files and find me as many
> instances of these two pairs of ext4 error messges:
> EXT4-fs (vdd): pa ffff88000dea9b90: logic 0, phys. 1934464544, len 32
> EXT4-fs error (device vdd): ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12

I've got quite a few of them (yay, remote syslog) and i will keep them
pasted on https://8n1.org/9765/e6d5


> Secondly, can you send me the output of dumpe2fs -h for the file
> systems in question.

The FS was created with 'mkfs.ext4 -m 0 /dev/vdb', iirc.
Dumpe2fs output:
| Filesystem volume name:   <none>
| Last mounted on:          /srv/storage
| Filesystem UUID:          02acfb89-2752-4b82-8604-72b035933f8c
| Filesystem magic number:  0xEF53
| Filesystem revision #:    1 (dynamic)
| Filesystem features:      has_journal ext_attr resize_inode dir_index
|     filetype needs_recovery extent flex_bg sparse_super large_file
|     huge_file uninit_bg dir_nlink extra_isize
| Filesystem flags:         signed_directory_hash 
| Default mount options:    user_xattr acl
| Filesystem state:         clean
| Errors behavior:          Continue
| Filesystem OS type:       Linux
| Inode count:              671088640
| Block count:              2684354560
| Reserved block count:     0
| Free blocks:              1158458306
| Free inodes:              670928082
| First block:              0
| Block size:               4096
| Fragment size:            4096
| Reserved GDT blocks:      384
| Blocks per group:         32768
| Fragments per group:      32768
| Inodes per group:         8192
| Inode blocks per group:   512
| Flex block group size:    16
| Filesystem created:       Sat Jul 20 19:24:38 2013
| Last mount time:          Wed Apr 23 08:59:15 2014
| Last write time:          Wed Apr 23 08:59:15 2014
| Mount count:              1
| Maximum mount count:      -1
| Last checked:             Wed Apr 23 08:53:15 2014
| Check interval:           0 (<none>)
| Lifetime writes:          3444 GB
| Reserved blocks uid:      0 (user root)
| Reserved blocks gid:      0 (group root)
| First inode:              11
| Inode size:           256
| Required extra isize:     28
| Desired extra isize:      28
| Journal inode:            8
| Default directory hash:   half_md4
| Directory Hash Seed:      4e54f4fb-479e-464c-80ba-1478cc56181a
| Journal backup:           inode blocks
| Journal features:         journal_incompat_revoke
| Journal size:             128M
| Journal length:           32768
| Journal sequence:         0x000ada49
| Journal start:            1


> Finally, since the both of you are seeing these messages fairly
> frequently, would you be willing to run with a patched kernel?
> Specifically, can you add a WARN_ON(1) to fs/ext4/mballoc.c here:

I can test away on this box. As long as my data stays safe. :-)
I have to admit i haven't compiled my own *kernel* since 2.4.x so i took
the Ubuntu package and patched that with the WARN_ON(1) call. Building
takes ages, but i will report my findings.
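
(For anyone wanting to do the same, the recipe is roughly the standard
Ubuntu one -- from memory, so treat it as a sketch rather than gospel:

| apt-get build-dep linux-image-$(uname -r)
| apt-get source linux-image-$(uname -r)
| cd linux-*    # then add the WARN_ON(1) to fs/ext4/mballoc.c
| fakeroot debian/rules clean
| fakeroot debian/rules binary-headers binary-generic

and then install the resulting .deb in the guest and reboot.)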


> The two really interesting commonalities which I've seen so far is:
> 1)  You are both using virtualization via qemu/kvm
> 2)  You are both using file systems > 8TB.
> Yes? And Sander, you're not using a remote block device, correct?
> You're using a local disk to back the large fileystem on the host OS
> side?

This is all correct.  The host runs LVM; one logical volume is
'exported' to the guest through qemu (2.0) with the virtio driver.


-Sndr.
-- 
| A bicycle can't stand alone; it is two tired. 
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7  FBD6 F3A9 9442 20CC 6CD2

* Re: ext4 metadata corruption bug?
  2014-04-23 14:36               ` Theodore Ts'o
  2014-04-23 15:30                 ` Nathaniel W Filardo
  2014-04-23 18:05                 ` Sander Smeenk
@ 2014-04-29 15:22                 ` Nathaniel W Filardo
  2014-05-01 16:25                 ` Nathaniel W Filardo
  2014-05-01 17:02                 ` Sander Smeenk
  4 siblings, 0 replies; 19+ messages in thread
From: Nathaniel W Filardo @ 2014-04-29 15:22 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, admins

On Wed, Apr 23, 2014 at 10:36:42AM -0400, Theodore Ts'o wrote:
> OK, with the two of you reporting this problem, can you do me the
> following so we can try to seriously dig into this:
>[snip]

Just got another instance of corruption; this time on the allocation side:
[715863.874936] EXT4-fs error (device vdb): ext4_mb_generate_buddy:756: group 50737, 485 clusters in bitmap, 484 in gd; block bitmap corrupt.

This time it is from a different, though still >8TB, filesystem, whose
dumpe2fs information sayeth:

Filesystem volume name:   <none>
Last mounted on:          /viceps
Filesystem UUID:          5bd1df60-53df-43f9-9785-5a35da485c81
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              335544320
Block count:              2684354560
Reserved block count:     134217728
Free blocks:              1767616367
Free inodes:              328163234
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      384
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         4096
Inode blocks per group:   256
RAID stride:              1024
RAID stripe width:        1024
Flex block group size:    16
Filesystem created:       Fri Jan 24 18:01:34 2014
Last mount time:          Sun Apr 20 14:08:10 2014
Last write time:          Tue Apr 29 11:16:22 2014
Mount count:              5
Maximum mount count:      -1
Last checked:             Wed Apr  9 17:39:32 2014
Check interval:           0 (<none>)
Lifetime writes:          3574 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      d51561e1-73f7-40c6-bfac-61d8e46a4da4
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x000ab65b
Journal start:            0

Cheers,
--nwf;

* Re: ext4 metadata corruption bug?
  2014-04-23 14:36               ` Theodore Ts'o
                                   ` (2 preceding siblings ...)
  2014-04-29 15:22                 ` Nathaniel W Filardo
@ 2014-05-01 16:25                 ` Nathaniel W Filardo
  2014-05-06 15:42                   ` Theodore Ts'o
  2014-05-01 17:02                 ` Sander Smeenk
  4 siblings, 1 reply; 19+ messages in thread
From: Nathaniel W Filardo @ 2014-05-01 16:25 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, admins

Here's another kernel report, this time from /dev/sda1, which is a QEMU-IDE
view of a local LVM volume and is only 4060864 blocks big, so it falls into
neither the "Ceph's fault" nor "8TB is special" bins:

[922646.672586] EXT4-fs error (device sda1): ext4_mb_generate_buddy:756: group 17, 24652 clusters in bitmap, 24651 in gd; block bitmap corrupt.
[922646.673295] Aborting journal on device sda1-8.
[922646.673904] EXT4-fs (sda1): Remounting filesystem read-only
[922646.684017] ------------[ cut here ]------------
[922646.685564] WARNING: CPU: 0 PID: 10001 at /build/linux-oxWk_8/linux-3.13.7/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x17e/0x190 [ext4]()
[922646.685566] Modules linked in: openafs(PO) loop ttm drm_kms_helper drm evdev psmouse processor parport_pc pcspkr serio_raw parport i2c_piix4 thermal_sys button i2c_core ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ata_generic virtio_blk floppy ata_piix ahci libahci virtio_pci virtio_ring virtio libata e1000 scsi_mod
[922646.692926] CPU: 0 PID: 10001 Comm: logrotate Tainted: P           O 3.13-1-amd64 #1 Debian 3.13.7-1
[922646.692929] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[922646.692931] 0000000000000009 ffffffff814a1327 0000000000000000 ffffffff8105ba72
[922646.692935] ffff88000356c2a0 00000000ffffffe2 0000000000000000 ffff88000435f408
[922646.692938] ffffffffa017aa80 ffffffffa015b30e ffff8800083784c8 ffff88001f4ba000
[922646.692941] Call Trace:
[922646.692977] [<ffffffff814a1327>] ? dump_stack+0x41/0x51
[922646.696017] [<ffffffff8105ba72>] ? warn_slowpath_common+0x72/0x90
[922646.699226] [<ffffffffa015b30e>] ? __ext4_handle_dirty_metadata+0x17e/0x190 [ext4]
[922646.699234] [<ffffffffa0135c50>] ? ext4_dirty_inode+0x20/0x50 [ext4]
[922646.699243] [<ffffffffa0163667>] ? ext4_free_blocks+0x5e7/0xb90 [ext4]
[922646.699260] [<ffffffff81096b9c>] ? wake_up_bit+0xc/0x20
[922646.699269] [<ffffffffa015779c>] ? ext4_ext_remove_space+0x7bc/0xff0 [ext4]
[922646.699278] [<ffffffffa0159e48>] ? ext4_ext_truncate+0x98/0xc0 [ext4]
[922646.699284] [<ffffffffa0134169>] ? ext4_truncate+0x379/0x3c0 [ext4]
[922646.699291] [<ffffffffa0134cc9>] ? ext4_evict_inode+0x459/0x4b0 [ext4]
[922646.699312] [<ffffffff811929c3>] ? evict+0xa3/0x190
[922646.699316] [<ffffffff8118ece8>] ? dentry_kill+0x1e8/0x230
[922646.699319] [<ffffffff8118ed84>] ? dput+0x54/0xf0
[922646.699334] [<ffffffff8117afe8>] ? __fput+0x148/0x210
[922646.699346] [<ffffffff81078ea7>] ? task_work_run+0x97/0xd0
[922646.699356] [<ffffffff81012969>] ? do_notify_resume+0x59/0x90
[922646.699366] [<ffffffff814ae7b2>] ? int_signal+0x12/0x17
[922646.699385] ---[ end trace bfc86d6bd5d1e863 ]---
[922646.712017] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[922646.712017] IP: [<ffffffffa014a7cc>] __ext4_error_inode+0x2c/0x150 [ext4]
[922646.712017] PGD 1bcf1067 PUD 1f3bc067 PMD 0
[922646.712017] Oops: 0000 [#1] SMP
[922646.712017] Modules linked in: openafs(PO) loop ttm drm_kms_helper drm evdev psmouse processor parport_pc pcspkr serio_raw parport i2c_piix4 thermal_sys button i2c_core ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ata_generic virtio_blk floppy ata_piix ahci libahci virtio_pci virtio_ring virtio libata e1000 scsi_mod
[922646.712017] CPU: 0 PID: 10001 Comm: logrotate Tainted: P        W  O 3.13-1-amd64 #1 Debian 3.13.7-1
[922646.712017] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[922646.712017] task: ffff880000f11800 ti: ffff880000e90000 task.ti: ffff880000e90000
[922646.712017] RIP: 0010:[<ffffffffa014a7cc>]  [<ffffffffa014a7cc>] __ext4_error_inode+0x2c/0x150 [ext4]
[922646.712017] RSP: 0018:ffff880000e91b58  EFLAGS: 00010292
[922646.712017] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000080001
[922646.712017] RDX: 00000000000012ea RSI: ffffffffa018291d RDI: 0000000000000000
[922646.712017] RBP: ffff880000e91be8 R08: ffffffffa017fb38 R09: 0000000000000005
[922646.712017] R10: 0000000000000000 R11: ffff880000e9192e R12: 0000000000080001
[922646.712017] R13: ffffffffa017aa80 R14: 00000000000012ea R15: ffffffffa017fb38
[922646.712017] FS:  00007f4357fb0800(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[922646.712017] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[922646.712017] CR2: 00000000001f8000 CR3: 000000001bd00000 CR4: 00000000000006f0
[922646.712017] Stack:
[922646.712017] ffffffffa015b30e 00000000000012ea ffffffff8149efa2 0000000000000010
[922646.712017] ffff880000e91bc8 ffff880000e91b88 0000000000000103 bfc86d6bd5d1e863
[922646.712017] 0000000000003232 ffff88000435f408 ffffffffa015ae88 0000000000000227
[922646.712017] Call Trace:
[922646.712017] [<ffffffffa015b30e>] ? __ext4_handle_dirty_metadata+0x17e/0x190 [ext4]
[922646.712017] [<ffffffff8149efa2>] ? printk+0x4f/0x51
[922646.712017] [<ffffffffa015ae88>] ? ext4_journal_abort_handle+0x38/0xb0 [ext4]
[922646.712017] [<ffffffffa015b285>] ? __ext4_handle_dirty_metadata+0xf5/0x190 [ext4]
[922646.712017] [<ffffffffa0163667>] ? ext4_free_blocks+0x5e7/0xb90 [ext4]
[922646.712017] [<ffffffff81096b9c>] ? wake_up_bit+0xc/0x20
[922646.712017] [<ffffffffa015779c>] ? ext4_ext_remove_space+0x7bc/0xff0 [ext4]
[922646.712017] [<ffffffffa0159e48>] ? ext4_ext_truncate+0x98/0xc0 [ext4]
[922646.712017] [<ffffffffa0134169>] ? ext4_truncate+0x379/0x3c0 [ext4]
[922646.712017] [<ffffffffa0134cc9>] ? ext4_evict_inode+0x459/0x4b0 [ext4]
[922646.712017] [<ffffffff811929c3>] ? evict+0xa3/0x190
[922646.712017] [<ffffffff8118ece8>] ? dentry_kill+0x1e8/0x230
[922646.712017] [<ffffffff8118ed84>] ? dput+0x54/0xf0
[922646.712017] [<ffffffff8117afe8>] ? __fput+0x148/0x210
[922646.712017] [<ffffffff81078ea7>] ? task_work_run+0x97/0xd0
[922646.712017] [<ffffffff81012969>] ? do_notify_resume+0x59/0x90
[922646.712017] [<ffffffff814ae7b2>] ? int_signal+0x12/0x17
[922646.712017] Code: 48 89 e5 41 57 4d 89 c7 41 56 41 89 d6 41 55 49 89 f5 48 c7 c6 1d 29 18 a0 41 54 49 89 cc 53 48 89 fb 48 83 ec 68 4c 89 4c 24 60 <48> 8b 47 28 48 8b 57 40 48 8b 80 f8 02 00 00 48 8b 40 68 89 90
[922646.712017] RIP  [<ffffffffa014a7cc>] __ext4_error_inode+0x2c/0x150 [ext4]
[922646.712017] RSP <ffff880000e91b58>
[922646.712017] CR2: 0000000000000028
[922646.753519] ---[ end trace bfc86d6bd5d1e864 ]---

dumpe2fs on /dev/sda1 reports:

Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          680c70f2-6ff5-49f9-9823-4f1d48062a0c
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              262144
Block count:              1048064
Reserved block count:     10480
Free blocks:              808533
Free inodes:              220750
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      255
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Fri Feb 21 02:30:29 2014
Last mount time:          Sun Apr 20 14:08:06 2014
Last write time:          Sun Apr 20 14:08:06 2014
Mount count:              29
Maximum mount count:      -1
Last checked:             Fri Feb 21 02:30:29 2014
Check interval:           0 (<none>)
Lifetime writes:          15 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       131314
Default directory hash:   half_md4
Directory Hash Seed:      ef50dd1a-28f7-4ef3-baa2-b8e579d852a3
Journal backup:           inode blocks
FS Error count:           1
First error time:         Thu May  1 06:25:13 2014
First error function:     ext4_mb_generate_buddy
First error line #:       756
First error inode #:      0
First error block #:      0
Last error time:          Thu May  1 06:25:13 2014
Last error function:      ext4_mb_generate_buddy
Last error line #:        756
Last error inode #:       0
Last error block #:       0
Journal features:         journal_incompat_revoke
Journal size:             64M
Journal length:           16384
Journal sequence:         0x0003f20f
Journal start:            175
Journal errno:            -5

Cheers,
--nwf;

* Re: ext4 metadata corruption bug?
  2014-04-23 14:36               ` Theodore Ts'o
                                   ` (3 preceding siblings ...)
  2014-05-01 16:25                 ` Nathaniel W Filardo
@ 2014-05-01 17:02                 ` Sander Smeenk
  2014-05-06 14:22                   ` Sander Smeenk
  4 siblings, 1 reply; 19+ messages in thread
From: Sander Smeenk @ 2014-05-01 17:02 UTC (permalink / raw)
  To: linux-ext4

Quoting Theodore Ts'o (tytso@mit.edu):

> Finally, since the both of you are seeing these messages fairly
> frequently, would you be willing to run with a patched kernel?

JFYI, i am running the patched kernel but i haven't had any 'luck' yet.

-Sndr.
-- 
| Happiness isn't enough for me. I demand EUPHORIA!
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7  FBD6 F3A9 9442 20CC 6CD2

* Re: ext4 metadata corruption bug?
  2014-05-01 17:02                 ` Sander Smeenk
@ 2014-05-06 14:22                   ` Sander Smeenk
  2014-05-26 14:59                     ` Sander Smeenk
  0 siblings, 1 reply; 19+ messages in thread
From: Sander Smeenk @ 2014-05-06 14:22 UTC (permalink / raw)
  To: linux-ext4

Quoting Sander Smeenk (ssmeenk@freshdot.net):

> > Finally, since the both of you are seeing these messages fairly
> > frequently, would you be willing to run with a patched kernel?
> JFYI, i am running the patched kernel but i havent had any 'luck' yet.

FWIW, the VM crashed this morning.
Alas, nothing was logged or visible on the console. :(

-- 
| I can can-can! Can you?
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7  FBD6 F3A9 9442 20CC 6CD2

* Re: ext4 metadata corruption bug?
  2014-05-01 16:25                 ` Nathaniel W Filardo
@ 2014-05-06 15:42                   ` Theodore Ts'o
  2014-05-06 15:51                     ` Nathaniel W Filardo
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-05-06 15:42 UTC (permalink / raw)
  To: Nathaniel W Filardo; +Cc: linux-ext4, admins

On Thu, May 01, 2014 at 12:25:03PM -0400, Nathaniel W Filardo wrote:
> Here's another kernel report, this time from /dev/sda1, which is a QEMU-IDE
> view of a local LVM volume and is only 4060864 blocks big, so it falls into
> neither the "Ceph's fault" nor "8TB is special" bins:
> 
> [922646.672586] EXT4-fs error (device sda1): ext4_mb_generate_buddy:756: group 17, 24652 clusters in bitmap, 24651 in gd; block bitmap corrupt.

So this is a different report from the ones where we see this error:

[817576.492468] EXT4-fs error (device vdd): ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12

Have you seen any more of these errors?

> [922646.712017] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> [922646.712017] IP: [<ffffffffa014a7cc>] __ext4_error_inode+0x2c/0x150 [ext4]

FYI, this BUG (which can happen after certain jbd2 errors, which in
your case happened after the journal was aborted) is fixed with commit
66a4cb187b9, which will be in v3.15.
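
(If you want to check whether a given tree already has it, something like

        git describe --contains 66a4cb187b9

in a kernel git checkout should name the first tag containing the fix.)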

	    	  				- Ted



* Re: ext4 metadata corruption bug?
  2014-05-06 15:42                   ` Theodore Ts'o
@ 2014-05-06 15:51                     ` Nathaniel W Filardo
  2014-07-31  2:37                       ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Nathaniel W Filardo @ 2014-05-06 15:51 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, admins

On Tue, May 06, 2014 at 11:42:39AM -0400, Theodore Ts'o wrote:
> On Thu, May 01, 2014 at 12:25:03PM -0400, Nathaniel W Filardo wrote:
> > Here's another kernel report, this time from /dev/sda1, which is a QEMU-IDE
> > view of a local LVM volume and is only 4060864 blocks big, so it falls into
> > neither the "Ceph's fault" nor "8TB is special" bins:

Ack, oops; my bad.  So I just checked the configuration and realized that,
while /dev/sda1 was in fact once upon a time a local view of LVM, it is now
in Ceph.  So it does eliminate the "8TB is special" bin but "Ceph's fault"
is still in play.

> > [922646.672586] EXT4-fs error (device sda1): ext4_mb_generate_buddy:756: group 17, 24652 clusters in bitmap, 24651 in gd; block bitmap corrupt.
> 
> So this is a different report from the ones where we see this error:
> 
> [817576.492468] EXT4-fs error (device vdd): ext4_mb_release_inode_pa:3729: group 59035, free 14, pa_free 12
> 
> Have you seen any more of these errors?

I think so, yes; I recall seeing bugs in both the allocation and the free
side of things, but I will keep an eye out.

> > [922646.712017] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> > [922646.712017] IP: [<ffffffffa014a7cc>] __ext4_error_inode+0x2c/0x150 [ext4]
> 
> FYI, this BUG (which can happens after certain jbd2 errors, which in
> your case happened after the journal was aborted) is fixed with commit
> 66a4cb187b9 which will be in v3.15.

Excellent; I look forward to the new release and will stop nagging you with
these. :)

--nwf;

* Re: ext4 metadata corruption bug?
  2014-05-06 14:22                   ` Sander Smeenk
@ 2014-05-26 14:59                     ` Sander Smeenk
  0 siblings, 0 replies; 19+ messages in thread
From: Sander Smeenk @ 2014-05-26 14:59 UTC (permalink / raw)
  To: linux-ext4

Quoting Sander Smeenk (ssmeenk@freshdot.net):

> > > Finally, since the both of you are seeing these messages fairly
> > > frequently, would you be willing to run with a patched kernel?
> > JFYI, i am running the patched kernel but i havent had any 'luck' yet.
> FWIW, the VM crashed this morning.
> Alas, nothing was logged or visible on the console. :(

'Crashed' today, but no traces were produced because apparently the xfs and
btrfs modules (neither of which is even used) have 'bad taint':

| xfs: module has bad taint, not creating trace events
| btrfs: module has bad taint, not creating trace events
| EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 75, 11202 clusters in bitmap, 11200 in gd; block bitmap corrupt. 
| Aborting journal on device vda1-8.
| EXT4-fs (vda1): Remounting filesystem read-only

Also, this is vda1, not vdb. vda1 is my root fs and is only 10GB in
size.

-- 
| Bakers trade bread recipes on a knead to know basis.  
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7  FBD6 F3A9 9442 20CC 6CD2

* Re: ext4 metadata corruption bug?
  2014-05-06 15:51                     ` Nathaniel W Filardo
@ 2014-07-31  2:37                       ` Theodore Ts'o
  2014-08-06  8:53                         ` Sander Smeenk
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-07-31  2:37 UTC (permalink / raw)
  To: Nathaniel W Filardo; +Cc: linux-ext4, admins, Sander Smeenk

Hi,

Are you folks still seeing ext4_mb_generate_buddy or
ext4_mb_release_inode_pa EXT4-fs errors?

I think I may have found a fix for this problem.  Or at least, I've
found one of the causes.  If we get a memory allocation failure from
ext4_mb_new_inode_pa(), it can cause these errors.  Actually
triggering it is a bit tricky, but it looks like we saw it hit when a
task which was running under high memory pressure (because it was
running right up against its cgroup memory limit) dumped core.

Anyway, if you are still seeing these issues, this patch might help you out:

	http://patchwork.ozlabs.org/patch/375106/
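
If it's easier to apply, patchwork can usually hand you an mbox you can feed
straight to "git am" (append mbox/ to that URL, assuming the usual patchwork
layout):

        wget -O ext4-pa-fix.mbox http://patchwork.ozlabs.org/patch/375106/mbox/
        git am ext4-pa-fix.mbox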

If it works out for you, please let me know.

Cheers,

							- Ted

* Re: ext4 metadata corruption bug?
  2014-07-31  2:37                       ` Theodore Ts'o
@ 2014-08-06  8:53                         ` Sander Smeenk
  0 siblings, 0 replies; 19+ messages in thread
From: Sander Smeenk @ 2014-08-06  8:53 UTC (permalink / raw)
  To: linux-ext4

Quoting Theodore Ts'o (tytso@mit.edu):

> Are you folks still seeing ext4_mb_generate_buddy or
> ext4_mb_release_inode_pa EXT4-fs errors?
> 
> I think I may have found a fix for this problem.  Or at least, I've
> found one of the causes.  If we get a memory allocation failure from
> ext4_mb_new_inode_pa(), it can cause these errors.

Good to hear a probable cause was found. Unfortunately for this bug's
progress, i've switched to a newer kernel altogether, since the 3.13
tree in Ubuntu has had a lot of other problems besides this
EXT4 ext4_mb_generate_buddy issue.

With 3.15.0 i have not experienced any ext4 issues at all.

If you want me to, i would be willing to boot that VM in 3.13 with this
patch, but as said, it can take ages before the bug is triggered...

Thanks for your hard work.

-Sndr.
-- 
| > Because we read from top to bottom.
| > > Why should i start my email reply *below* the quoted text?
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7  FBD6 F3A9 9442 20CC 6CD2
