* [Bug 16081] New: Data loss after crash during heavy I/O
@ 2010-05-31 15:19 bugzilla-daemon
  2010-05-31 15:21 ` [Bug 16081] " bugzilla-daemon
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-05-31 15:19 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081

           Summary: Data loss after crash during heavy I/O
           Product: File System
           Version: 2.5
    Kernel Version: 2.6.32.12 (Debian-Version 2.6.32-12)
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: ext4
        AssignedTo: fs_ext4@kernel-bugs.osdl.org
        ReportedBy: lkolbe@techfak.uni-bielefeld.de
        Regression: No


Created an attachment (id=26590)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=26590)
end of trace 

The machine is a Supermicro X7DWN+ (Intel 5400 chipset, Xeon E5420, 8GB RAM)
with an Adaptec 52445 RAID controller and an LSI SAS1068E controller. We have
two 9TB ext4 filesystems on LVM on a 20TB RAID50 spanning 24 disks, used as a
disk pool for Bacula. After writing about 10TB of data (8.5TB to the first
filesystem, 1.5TB to the second), the machine crashed hard (screenshot
attached). Afterwards, both filesystems were broken, even after e2fsck 1.41.9
had run over them:

shepherd:~# mount /dev/data/badp1 /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/mapper/data-badp1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

shepherd:~# dmesg | tail
[ 8720.688682] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1
failed (49189!=48621)
[ 8720.688708] EXT4-fs (dm-1): group descriptors corrupted!
[14726.691071] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1
failed (49189!=48621)
[14726.691097] EXT4-fs (dm-1): group descriptors corrupted!
[14737.262709] EXT4-fs (dm-2): mounted filesystem with ordered data mode
[15315.441515] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1
failed (49189!=48621)
[15315.441540] EXT4-fs (dm-1): group descriptors corrupted!

shepherd:~# mount /dev/data/badp2 /mnt/
shepherd:~# ls -la /mnt/
total 80
drwxr-xr-x   3 root root  4096 2010-05-31 13:10 .
drwxr-xr-x  23 root root  4096 2010-05-31 13:01 ..
drwx------ 250 root root 69632 2010-05-31 13:10 lost+found
shepherd:~# ls -la /mnt/lost+found/ | head -n 20
total 216936
drwx------ 250 root       root          69632 2010-05-31 13:10 .
drwxr-xr-x   3 root       root           4096 2010-05-31 13:10 ..
c----wxr--   1  774037444  162299347 237, 210 1957-02-23 13:50 #1000
brwx-----T   1 1954511736 3121970260 249, 121 1922-08-12 15:08 #10021
b-w---xrwt   1  543753214 3130053982 234, 213 2012-06-01 07:58 #10027
c--S--sr-T   1 3871079531 3443641576   2, 232 2036-01-31 13:12 #10036
-r-S-w-r-T   1 2298731406  344458386    32768 2035-05-22 08:46 #10046
brw---Srw-   1 2052225653 4012639896 218, 196 1912-06-23 18:14 #10067
prwS-wSr-x   1 2235883341 1302567651        0 1927-10-10 00:51 #10086
s-wS--x-wt   1 2286828425 2999490124        0 1949-08-22 22:50 #10109
crw--wSrwt   1 3083778288 3882824206 148, 212 2003-07-28 08:32 #10126
s-wS--sr-x   1  874900871   80451928        0 1977-11-28 01:52 #10130
s--sr-x---   1 1903432768    1059722        0 2013-07-05 00:55 #10131
c-w-r-Sr-T   1 3259732952 2590389953   9,  22 2012-06-19 14:56 #10147
pr-x-w--wt   1 1627318825 1016384218        0 1956-12-27 06:01 #10160
srw-r-SrwT   1 2603486838 3240878817        0 1954-11-16 08:43 #10177
srw---srwt   1  458009213  951782573        0 2023-12-03 18:43 #10184
brwxr--rwx   1 2423698452 2252742920  44, 231 1956-07-25 07:28 #10197
brwS-wS-w-   1 3480615060 1244965598  44, 189 2006-10-21 17:03 #1020

This is the second or third time the machine has crashed after writing ca.
10TB of data, but it is the first time we have seen this kind of data
corruption.

Any hints on how to debug/reproduce such a thing? For the moment, we are
keeping the broken filesystem for further analysis (if that's necessary), but
sadly this is our primary backup disk pool and we need to have it running
again rather soon ...
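
(For reference, one way to cross-check the complaining group descriptors
against a backup superblock - assuming the default 4KiB block size, which puts
the first backup superblock at block 32768 - is a read-only pass such as:)

  # print just the superblock/feature summary without touching the fs
  dumpe2fs -h /dev/mapper/data-badp1

  # read-only e2fsck using the first backup superblock (4KiB blocks assumed)
  e2fsck -n -b 32768 -B 4096 /dev/mapper/data-badp1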


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
@ 2010-05-31 15:21 ` bugzilla-daemon
  2010-05-31 15:22 ` bugzilla-daemon
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-05-31 15:21 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #1 from lkolbe@techfak.uni-bielefeld.de  2010-05-31 15:21:24 ---
Created an attachment (id=26591)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=26591)
lsscsi of host


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
  2010-05-31 15:21 ` [Bug 16081] " bugzilla-daemon
@ 2010-05-31 15:22 ` bugzilla-daemon
  2010-05-31 15:24 ` bugzilla-daemon
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-05-31 15:22 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #2 from lkolbe@techfak.uni-bielefeld.de  2010-05-31 15:22:26 ---
Created an attachment (id=26592)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=26592)
lspci -vvv of host


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
  2010-05-31 15:21 ` [Bug 16081] " bugzilla-daemon
  2010-05-31 15:22 ` bugzilla-daemon
@ 2010-05-31 15:24 ` bugzilla-daemon
  2010-05-31 15:28 ` bugzilla-daemon
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-05-31 15:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #3 from lkolbe@techfak.uni-bielefeld.de  2010-05-31 15:24:17 ---
One thing I forgot: with Supermicro's current BIOS 1.2b, the machine exhibits
machine check exceptions that Linux thinks are the hardware's fault. With
their BIOS 1.1b, they do not happen.
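
(As an aside, a rough sketch of how such machine check events could be
captured for later decoding on a 2.6.32-era system, assuming the mcelog
package is installed:)

  # decode any pending machine check records from /dev/mcelog
  mcelog

  # look for MCE-related messages the kernel has already printed
  dmesg | grep -i "machine check"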


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (2 preceding siblings ...)
  2010-05-31 15:24 ` bugzilla-daemon
@ 2010-05-31 15:28 ` bugzilla-daemon
  2010-06-01 16:38 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-05-31 15:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081


lkolbe@techfak.uni-bielefeld.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sfrey@techfak.uni-bielefeld.de
           Severity|normal                      |high





* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (3 preceding siblings ...)
  2010-05-31 15:28 ` bugzilla-daemon
@ 2010-06-01 16:38 ` bugzilla-daemon
  2010-06-02 12:02 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-01 16:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081


Eric Sandeen <sandeen@redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sandeen@redhat.com




--- Comment #4 from Eric Sandeen <sandeen@redhat.com>  2010-06-01 16:38:04 ---
Getting the whole original oops would be great (since it seems like you can
reproduce it)...

Did e2fsck find anything?  (e2fsck -f?)

Were filesystem barriers left on, and does the storage support them?
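
(A quick, non-authoritative way to check both points on the running system -
note that ext4 only lists barrier= in /proc/mounts when the option was given
explicitly, so an empty result is not conclusive:)

  # barriers show up here only if they were set explicitly at mount time
  grep barrier /proc/mounts

  # jbd2/ext4 logs a warning if a barrier write fails and barriers get disabled
  dmesg | grep -i barrier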


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (4 preceding siblings ...)
  2010-06-01 16:38 ` bugzilla-daemon
@ 2010-06-02 12:02 ` bugzilla-daemon
  2010-06-02 12:10 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 12:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #5 from lkolbe@techfak.uni-bielefeld.de  2010-06-02 12:02:18 ---
e2fsck segfaulted after about an hour on the first volume and had gazillions of
questions for the second.

I don't know about barriers; mount says:
/dev/mapper/data-badp2 on /var/bacula/diskpool/fs2 type ext4 (rw,nosuid,nodev)

Whether the Adaptec 52445 supports barriers, I really don't know. The serial
console is now finally working, so if this happens again we'll get a full
stack trace. It will most likely take a few days of backup runs to trigger,
though. Thanks for looking into this!


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (5 preceding siblings ...)
  2010-06-02 12:02 ` bugzilla-daemon
@ 2010-06-02 12:10 ` bugzilla-daemon
  2010-06-02 15:57 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 12:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #6 from Eric Sandeen <sandeen@redhat.com>  2010-06-02 12:10:00 ---
If a barrier failed outright you'd see a note in dmesg / logs shortly after the
mount, FWIW.  Anyway you didn't explicitly disable it.  Barrier support in
dm/lvm is rather new, as well.  Just a thought...

Capturing a core from the segfaulted e2fsck would help fix -that- bug ... and
attaching the output of the fscks might yield a clue as to what is damaged.

-Eric
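
(A minimal sketch of how that core and log could be captured - the log path is
only an example, and the core file's name/location depends on the system's
core_pattern setting:)

  # allow core dumps in this shell, then re-run the check read-only, keeping a log
  ulimit -c unlimited
  e2fsck -fn /dev/mapper/data-badp1 2>&1 | tee /root/e2fsck-badp1.log

  # if it segfaults again, pull a backtrace out of the core file
  gdb $(which e2fsck) core
  (gdb) bt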


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (6 preceding siblings ...)
  2010-06-02 12:10 ` bugzilla-daemon
@ 2010-06-02 15:57 ` bugzilla-daemon
  2010-06-02 16:44 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 15:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081


lkolbe@techfak.uni-bielefeld.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #26590|0                           |1
        is obsolete|                            |




--- Comment #7 from lkolbe@techfak.uni-bielefeld.de  2010-06-02 15:57:20 ---
Created an attachment (id=26618)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=26618)
oops after writing 1TB 

This came rather sooner than expected, after writing about 1TB of data to the
two ext4 filesystems at approx. 150MB/sec.

Hopefully this trace means something?


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (7 preceding siblings ...)
  2010-06-02 15:57 ` bugzilla-daemon
@ 2010-06-02 16:44 ` bugzilla-daemon
  2010-06-02 16:44 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 16:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #8 from lkolbe@techfak.uni-bielefeld.de  2010-06-02 16:44:04 ---
Created an attachment (id=26619)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=26619)
root-filesystem borkage

Hm. After a reset, both 9TB filesystems were normal (journal replayed). But a
few minutes after the boot, we got really strange errors (see attachment) and
could only resurrect the root filesystem with a live CD and its fsck, as GRUB
wouldn't detect a filesystem anymore. fsck fixed it, though (broken superblock
and some minor fixes). The system boots as I write this, and I'll continue the
same backup tests, but this time without barriers on both filesystems.


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (8 preceding siblings ...)
  2010-06-02 16:44 ` bugzilla-daemon
@ 2010-06-02 16:44 ` bugzilla-daemon
  2010-06-02 17:53 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 16:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081


lkolbe@techfak.uni-bielefeld.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #26619|application/octet-stream    |text/plain
          mime type|                            |





* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (9 preceding siblings ...)
  2010-06-02 16:44 ` bugzilla-daemon
@ 2010-06-02 17:53 ` bugzilla-daemon
  2010-06-02 18:06 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 17:53 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #9 from Eric Sandeen <sandeen@redhat.com>  2010-06-02 17:53:52 ---
(In reply to comment #8)
> ... The system boots as I write this, and I'll continue the
> same backup-tests but this time without barriers on both filesystems.

no... turning barriers -off- certainly won't help anything.

Whenever I see bad metadata corruption post-crash-and-reset I worry about
missing barriers.  My mention of them was only to see whether they are properly
in use, as they should be on any storage w/ a volatile write cache.
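
(For completeness, a sketch of how barriers would be forced on explicitly
instead of relying on the default - the mount point is the one shown in
comment #5:)

  # remount with barriers explicitly enabled (barrier=0 would disable them)
  mount -o remount,barrier=1 /var/bacula/diskpool/fs2

  # confirm the option took effect
  grep /var/bacula/diskpool/fs2 /proc/mounts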


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (10 preceding siblings ...)
  2010-06-02 17:53 ` bugzilla-daemon
@ 2010-06-02 18:06 ` bugzilla-daemon
  2010-06-02 18:24 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 18:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #10 from Eric Sandeen <sandeen@redhat.com>  2010-06-02 18:06:54 ---
(In reply to comment #8)
> Created an attachment (id=26619)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=26619) [details]
> root-filesystem borkage

How confident are you in your storage?

> [  765.812082] attempt to access beyond end of device
> [  765.812088] dm-6: rw=256, want=18808645176, limit=8388608

the "want" value (in sectors) is ~9T.

The limit is oddly (?) 2^23 - 8388608, that many sectors comes out to exactly
4T.

IOW, now your block device appears to be much smaller than your filesystem....


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (11 preceding siblings ...)
  2010-06-02 18:06 ` bugzilla-daemon
@ 2010-06-02 18:24 ` bugzilla-daemon
  2010-06-02 18:28 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 18:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #11 from Eric Sandeen <sandeen@redhat.com>  2010-06-02 18:24:29 ---
As for the trace (attachment #26618), it looks like we've found a page w/o
buffers.

There seems to be rather a lot going wrong with this machine; I'm having a
hard time getting a feel for what the root cause might be...


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (12 preceding siblings ...)
  2010-06-02 18:24 ` bugzilla-daemon
@ 2010-06-02 18:28 ` bugzilla-daemon
  2010-06-02 21:57 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 18:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #12 from Eric Sandeen <sandeen@redhat.com>  2010-06-02 18:28:32 ---
I don't know if it's at all possible, but testing on block devices and
filesystems just smaller than 8T would be an interesting data point, if that
yields success...  We really should be perfectly safe at 9T, but this is
looking like maybe a write has wrapped somewhere and corrupted things.

A resident dm expert also requested the output of "dmsetup table" for the
machine that yielded the "access beyond end of device" message.
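
(If carving out a smaller test volume is feasible, a rough sketch - the LV
name here is made up purely for illustration:)

  # create a test LV just under 8T in the "data" VG (after freeing up space)
  lvcreate -L 7900G -n badp_test data
  mkfs.ext4 /dev/data/badp_test

  # and the table the dm folks asked for
  dmsetup table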


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (13 preceding siblings ...)
  2010-06-02 18:28 ` bugzilla-daemon
@ 2010-06-02 21:57 ` bugzilla-daemon
  2010-06-02 22:07 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 21:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #13 from lkolbe@techfak.uni-bielefeld.de  2010-06-02 21:56:56 ---
Funny thing is, dm-6 is the root filesystem, and it's 4GB in size. It lives on
a VG consisting of one 100GB RAID-50 over 24 disks. Some relevant data:

shepherd:~# lvm lvs -a -o+devices
  LV          VG     Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices
  badp1       data   -wi-ao   9.00T                                       /dev/sdb(25600)
  badp2       data   -wi-ao   9.00T                                       /dev/sdb(2384896)
  baspool     data   -wi-ao   1.00T                                       /dev/sdb(4769792)
  bawork      data   -wi-ao 100.00G                                       /dev/sdb(0)
  db1_srv     data   -wi-ao 100.00G                                       /dev/sdb(4744192)
  dir1_bawork data   -wi-ao 100.00G                                       /dev/sdb(5031936)
  db1_log     system -wi-ao   4.00G                                       /dev/sda1(7168)
  db1_root    system -wi-ao   4.00G                                       /dev/sda1(6144)
  db1_swap    system -wi-ao   4.00G                                       /dev/sda1(8192)
  dir1_log    system -wi-ao   4.00G                                       /dev/sda1(4096)
  dir1_root   system -wi-ao   4.00G                                       /dev/sda1(3072)
  dir1_swap   system -wi-ao   4.00G                                       /dev/sda1(5120)
  log         system -wi-ao   4.00G                                       /dev/sda1(1024)
  root        system -wi-ao   4.00G                                       /dev/sda1(0)
  swap        system -wi-ao   4.00G                                       /dev/sda1(2048)

The requested dmsetup table:
shepherd:~# dmsetup table
data-dir1_bawork: 0 209715200 linear 8:16 41221620096
system-db1_log: 0 8388608 linear 8:1 58720640
system-db1_swap: 0 8388608 linear 8:1 67109248
system-db1_root: 0 8388608 linear 8:1 50332032
data-bawork: 0 209715200 linear 8:16 384
data-db1_srv: 0 209715200 linear 8:16 38864421248
data-baspool: 0 2147483648 linear 8:16 39074136448
system-dir1_swap: 0 8388608 linear 8:1 41943424
system-dir1_root: 0 8388608 linear 8:1 25166208
data-badp2: 0 19327352832 linear 8:16 19537068416
data-badp1: 0 19327352832 linear 8:16 209715584
system-swap: 0 8388608 linear 8:1 16777600
system-root: 0 8388608 linear 8:1 384
system-dir1_log: 0 8388608 linear 8:1 33554816
system-log: 0 8388608 linear 8:1 8388992

Adaptec version numbers are: BIOS, firmware and boot flash 17899; aacraid
driver 2461 (the version shipped with 2.6.32).

I have no reason (yet) not to trust our storage - it's one 100GB RAID-50 and
one ~19TB RAID-50 on 24 Hitachi HDE721010SLA330 drives with firmware ST6OA3AA,
if that means anything to anyone.

Since the last crash, Bacula has written 3.2TiB to data-badp1 and it's still
running (when all backups are done, it should have written ~12TiB). We'll see
if it survives tomorrow.

If it crashes again, I'll try 8TiB filesystems.

Thanks for taking the time!
Lukas
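
(For anyone reading along: each dmsetup line above for these linear targets is
"name: start length linear device offset", everything counted in 512-byte
sectors. A quick sanity check of the sizes:)

  # data-badp1 / data-badp2: 19327352832 sectors * 512 = 9 TiB, matching lvs
  echo $(( 19327352832 * 512 ))   # 9895604649984

  # system-root (presumably dm-6): 8388608 sectors * 512 = 4 GiB
  echo $(( 8388608 * 512 ))       # 4294967296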


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (14 preceding siblings ...)
  2010-06-02 21:57 ` bugzilla-daemon
@ 2010-06-02 22:07 ` bugzilla-daemon
  2010-06-02 22:09 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 22:07 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #14 from Eric Sandeen <sandeen@redhat.com>  2010-06-02 22:07:52 ---
(In reply to comment #13)
> Funny thing is, dm-6 is the root-filesystem, and it's 4GB big. 

Whoops, you're right - I missed a unit there :(  The reported limit was indeed
4G, not 4T.  Still, why was it trying to read a block out at 9T...?  Hmmm.
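
(Spelling out the arithmetic, with both values in 512-byte sectors: the limit
of 8388608 sectors is the 4 GiB root LV computed above, while the rejected
request lands roughly 8.76 TiB into the device - plausible for the 9T data
LVs, but nowhere near valid for dm-6:)

  # byte offset of the failed request
  echo $(( 18808645176 * 512 ))   # 9630026330112, i.e. ~8.76 TiB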


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (15 preceding siblings ...)
  2010-06-02 22:07 ` bugzilla-daemon
@ 2010-06-02 22:09 ` bugzilla-daemon
  2010-06-03  6:02 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-02 22:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #15 from Eric Sandeen <sandeen@redhat.com>  2010-06-02 22:09:55 ---
Just realized all the root fs errors were on ext3, too - not ext4.  This gives
me more reason to be worried about things outside the filesystem itself, I'm
afraid.


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (16 preceding siblings ...)
  2010-06-02 22:09 ` bugzilla-daemon
@ 2010-06-03  6:02 ` bugzilla-daemon
  2010-06-03 14:19 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-03  6:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #16 from lkolbe@techfak.uni-bielefeld.de  2010-06-03 06:02:28 ---
Thanks, I'll do another round of memtest then. Do you have any idea what else
to look for/what to test?


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (17 preceding siblings ...)
  2010-06-03  6:02 ` bugzilla-daemon
@ 2010-06-03 14:19 ` bugzilla-daemon
  2010-06-05 14:32 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-03 14:19 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #17 from Eric Sandeen <sandeen@redhat.com>  2010-06-03 14:19:42 ---
I'd just review the storage configuration as well, I guess, though I'm not
sure of any specifics to look for.
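
(A few generic things that could be checked, sketched under the assumption
that Adaptec's arcconf CLI is installed for the 52445:)

  # controller, logical-drive and cache/battery status from the Adaptec tool
  arcconf getconfig 1 ad
  arcconf getconfig 1 ld

  # kernel-side view of aacraid and any I/O errors logged so far
  dmesg | grep -i -e aacraid -e "i/o error"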


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (18 preceding siblings ...)
  2010-06-03 14:19 ` bugzilla-daemon
@ 2010-06-05 14:32 ` bugzilla-daemon
  2011-02-28  1:23 ` bugzilla-daemon
  2011-02-28  1:24 ` bugzilla-daemon
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2010-06-05 14:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081





--- Comment #18 from lkolbe@techfak.uni-bielefeld.de  2010-06-05 14:32:46 ---
Thanks, though. After working flawlessly for more than 13TiB, we hit another
crash today - a colleague called 'lsscsi', after which all commands quit with
'Bus error' for a while and the machine got stuck with no messages on the
serial line. Before that, cat /proc/interrupts still worked and showed a
massive ERR count:

shepherd:/etc# cat /proc/interrupts 
             CPU0       CPU1       CPU2       CPU3       
[...]
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:         28         28         28         28   Machine check polls
 ERR:   37567046
 MIS:          0

I suppose this means it's not Linux's fault?
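
(The ERR line in /proc/interrupts generally counts APIC/spurious error
interrupts - nothing filesystem-specific; one way to see whether it keeps
climbing is simply:)

  # re-sample the error counters every second
  watch -n1 'grep -E "ERR|MCE|MCP" /proc/interrupts'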


* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (19 preceding siblings ...)
  2010-06-05 14:32 ` bugzilla-daemon
@ 2011-02-28  1:23 ` bugzilla-daemon
  2011-02-28  1:24 ` bugzilla-daemon
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2011-02-28  1:23 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081


Theodore Tso <tytso@mit.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tytso@mit.edu
     Kernel Version|2.6.32.12 (Debian-Version 2.6.32-12)|2.6.32.12 (Debian)





* [Bug 16081] Data loss after crash during heavy I/O
  2010-05-31 15:19 [Bug 16081] New: Data loss after crash during heavy I/O bugzilla-daemon
                   ` (20 preceding siblings ...)
  2011-02-28  1:23 ` bugzilla-daemon
@ 2011-02-28  1:24 ` bugzilla-daemon
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2011-02-28  1:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=16081


Theodore Tso <tytso@mit.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |UNREPRODUCIBLE




--- Comment #19 from Theodore Tso <tytso@mit.edu>  2011-02-28 01:24:43 ---
Closing this bug as it looks pretty clear it was caused by hardware problems.

