* possible raid6 corruption
@ 2015-06-02  1:24 Christoph Anton Mitterer
  2015-06-02  2:38 ` Chris Murphy
  2015-06-02  7:26 ` Sander
  0 siblings, 2 replies; 3+ messages in thread
From: Christoph Anton Mitterer @ 2015-06-02  1:24 UTC (permalink / raw)
  To: linux-btrfs


Hi.

The following is a possible corruption of a btrfs filesystem with RAID6... it
may, however, also just be an issue with the megasas driver or the PERC
controller behind it.
Anyway, since RAID56 is quite new in btrfs, an expert may want to have a
look at whether it's something that needs to be focused on.

I cannot mount the btrfs since the incident:
I.e.
# mount /dev/sd[any disk of the btrfs raid] /mnt/

gives a:
[358466.484374] BTRFS info (device sda): disk space caching is enabled
[358466.484426] BTRFS: has skinny extents
[358466.485421] BTRFS: failed to read the system array on sda
[358466.543422] BTRFS: open_ctree failed

But no valuable data was on these devices, and I haven't really tried any
of the recovery methods yet.
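As a first non-destructive check, one could at least verify whether each member device still carries the btrfs superblock magic ("_BHRfS_M", which sits 64 bytes into the primary superblock at offset 65536 per the btrfs on-disk format). A minimal sketch, run here against a scratch file rather than the real /dev/sd* devices (the file name is illustrative):

```shell
# Stand-in for a member device; on the real box, point dd at /dev/sdX instead.
truncate -s 1M /tmp/scratch.img
# Read 8 bytes at 65536 + 64 = 65600, where the btrfs magic lives.
magic=$(dd if=/tmp/scratch.img bs=1 skip=$((65536 + 64)) count=8 2>/dev/null)
if [ "$magic" = "_BHRfS_M" ]; then echo PRESENT; else echo MISSING; fi
# The scratch file has no filesystem, so this prints MISSING.
```

A device where this prints MISSING (or where the superblocks disagree across members) would point at the "failed to read the system array" error above.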



What I did:
At the university we run a Tier-2 for the LHC computing grid (i.e. we
have loads of storage).
Recently we bought a number of Dell nodes, each with 16 6TB SATA disks;
the disks are connected via a Dell PERC H730P controller (which is based
on some LSI Mega*-whatever, AFAICT).

Since I had 10 new nodes, I wanted to use the opportunity to do some
extensive benchmarking, i.e. HW RAID vs. MD RAID vs. btrfs RAID, with
btrfs and ext4, in all reasonable combinations.
The nodes used for MD/btrfs RAID obviously had the PERC in
pass-through mode.

As said, the nodes are brand new and during the tests the one with
btrfs-raid6 had a fs crash (all others continued to run fine).

The system is Debian jessie, except for the kernel (4.0.0, from sid or
experimental at the time) and btrfs-progs 4.0.

The fs was created pretty much standard: 
# mkfs.btrfs -L data-test -d raid6 -m raid6 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp

And then there came some heavy iozone stressing:
# iozone -Rb $(hostname)_1.xls -s 128g -i 0 -i 1 -i 2 -i 5 -j 12 -r 64 -t 1 -F /mnt/iozone


Some excerpts from the kernel.log which might be of interest:


May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.479387] Btrfs loaded
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.479680] BTRFS: device label data-test devid 1 transid 3 /dev/sda
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.482080] BTRFS: device label data-test devid 2 transid 3 /dev/sdb
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.484047] BTRFS: device label data-test devid 3 transid 3 /dev/sdc
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.486021] BTRFS: device label data-test devid 4 transid 3 /dev/sdd
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.487892] BTRFS: device label data-test devid 5 transid 3 /dev/sde
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.489849] BTRFS: device label data-test devid 6 transid 3 /dev/sdf
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.491819] BTRFS: device label data-test devid 7 transid 3 /dev/sdg
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.493919] BTRFS: device label data-test devid 8 transid 3 /dev/sdh
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.495761] BTRFS: device label data-test devid 9 transid 3 /dev/sdi
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.497645] BTRFS: device label data-test devid 10 transid 3 /dev/sdj
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.499477] BTRFS: device label data-test devid 11 transid 3 /dev/sdk
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.501307] BTRFS: device label data-test devid 12 transid 3 /dev/sdl
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.503208] BTRFS: device label data-test devid 13 transid 3 /dev/sdm
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.505037] BTRFS: device label data-test devid 14 transid 3 /dev/sdn
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.506837] BTRFS: device label data-test devid 15 transid 3 /dev/sdo
May 10 00:26:39 lcg-lrz-dc10 kernel: [115511.508800] BTRFS: device label data-test devid 16 transid 3 /dev/sdp
May 10 00:27:34 lcg-lrz-dc10 kernel: [115566.351260] BTRFS info (device sdp): disk space caching is enabled
May 10 00:27:34 lcg-lrz-dc10 kernel: [115566.351307] BTRFS: has skinny extents
May 10 00:27:34 lcg-lrz-dc10 kernel: [115566.351333] BTRFS: flagging fs with big metadata feature
May 10 00:27:34 lcg-lrz-dc10 kernel: [115566.354089] BTRFS: creating UUID tree



Literally gazillions of these: 
May 19 02:39:19 lcg-lrz-dc10 kernel: [900318.402678] megasas:span 0 rowDataSize 1
May 19 02:39:19 lcg-lrz-dc10 kernel: [900318.402705] megasas:span 0 rowDataSize 1

While I saw the above lines on all the other nodes as well, there were
only around 30 of them in total there, and that's it.
But on the one node with btrfs, the log file was flooded to 1.6 GB with
these.
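For scale, counting such a flood is a one-liner; a hedged sketch over a fabricated three-line sample (the real input would be the ~1.6 GB kern.log, and the sample path here is illustrative):

```shell
# Tiny fabricated sample in the same format as the log excerpts above:
cat > /tmp/kern_sample.log <<'EOF'
May 19 02:39:19 lcg-lrz-dc10 kernel: [900318.402678] megasas:span 0 rowDataSize 1
May 19 02:39:19 lcg-lrz-dc10 kernel: [900318.402705] megasas:span 0 rowDataSize 1
May 19 03:25:50 lcg-lrz-dc10 kernel: [903106.581205] sd 0:0:14:0: Device offlined - not ready after error recovery
EOF
# Count how often the repeated driver message occurs:
grep -c 'megasas:span' /tmp/kern_sample.log    # prints 2
```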


At some point I've had this: 
May 19 03:25:19 lcg-lrz-dc10 kernel: [903075.511076] megasas: [ 0]waiting for 1 commands to complete for scsi0
May 19 03:25:24 lcg-lrz-dc10 kernel: [903080.526184] megasas: [ 5]waiting for 1 commands to complete for scsi0
May 19 03:25:29 lcg-lrz-dc10 kernel: [903085.541375] megasas: [10]waiting for 1 commands to complete for scsi0
May 19 03:25:34 lcg-lrz-dc10 kernel: [903090.556566] megasas: [15]waiting for 1 commands to complete for scsi0
May 19 03:25:39 lcg-lrz-dc10 kernel: [903095.571755] megasas: [20]waiting for 1 commands to complete for scsi0
May 19 03:25:39 lcg-lrz-dc10 kernel: [903095.585150] megasas: megasas_aen_polling waiting for controller reset to finish for scsi0
May 19 03:25:50 lcg-lrz-dc10 kernel: [903106.581205] sd 0:0:14:0: Device offlined - not ready after error recovery

but after that, things seemed to continue for quite a while (apart from
millions of "megasas:span 0 rowDataSize 1" messages)... of course I cannot
tell whether that is just because iozone only read during that time, and a
write would have triggered further errors.

First real errors start here: 
May 28 16:38:01 lcg-lrz-dc10 kernel: [1727446.475425] bash (127422): drop_caches: 3
May 28 16:38:43 lcg-lrz-dc10 kernel: [1727488.984810] sd 0:0:14:0: rejecting I/O to offline device
May 28 16:38:43 lcg-lrz-dc10 kernel: [1727488.985389] sd 0:0:14:0: rejecting I/O to offline device
May 28 16:38:43 lcg-lrz-dc10 kernel: [1727488.985707] sd 0:0:14:0: rejecting I/O to offline device
May 28 16:38:43 lcg-lrz-dc10 kernel: [1727488.986482] sd 0:0:14:0: rejecting I/O to offline device

Again, gazillions of the "rejecting I/O to offline device" messages. As one
can see, this is the very disk that went offline before.

The drop_caches may be just coincidence; that was me, but it implies that
iozone wasn't running at that time and that I only started another round of
it afterwards.


In between there were many of these: 
May 28 16:39:19 lcg-lrz-dc10 kernel: [1727524.067182] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:39:19 lcg-lrz-dc10 kernel: [1727524.067426] BTRFS: bdev /dev/sdm errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
May 28 16:39:19 lcg-lrz-dc10 kernel: [1727524.067985] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:39:19 lcg-lrz-dc10 kernel: [1727524.068282] BTRFS: bdev /dev/sdm errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
May 28 16:39:19 lcg-lrz-dc10 kernel: [1727524.068992] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:39:19 lcg-lrz-dc10 kernel: [1727524.069370] BTRFS: bdev /dev/sdm errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
May 28 16:39:50 lcg-lrz-dc10 kernel: [1727555.332553] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:39:50 lcg-lrz-dc10 kernel: [1727555.332767] BTRFS: bdev /dev/sdm errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
May 28 16:39:50 lcg-lrz-dc10 kernel: [1727555.333256] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:39:50 lcg-lrz-dc10 kernel: [1727555.333517] BTRFS: bdev /dev/sdm errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
May 28 16:39:50 lcg-lrz-dc10 kernel: [1727555.334111] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:39:50 lcg-lrz-dc10 kernel: [1727555.334432] BTRFS: bdev /dev/sdm errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
May 28 16:40:21 lcg-lrz-dc10 kernel: [1727586.739347] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:40:21 lcg-lrz-dc10 kernel: [1727586.739349] BTRFS: bdev /dev/sdm errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
May 28 16:40:21 lcg-lrz-dc10 kernel: [1727586.739363] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:40:21 lcg-lrz-dc10 kernel: [1727586.739364] BTRFS: bdev /dev/sdm errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
May 28 16:40:21 lcg-lrz-dc10 kernel: [1727586.739372] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:40:21 lcg-lrz-dc10 kernel: [1727586.739373] BTRFS: bdev /dev/sdm errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
May 28 16:40:43 lcg-lrz-dc10 kernel: [1727608.168996] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:40:43 lcg-lrz-dc10 kernel: [1727608.169171] BTRFS: bdev /dev/sdm errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
May 28 16:40:43 lcg-lrz-dc10 kernel: [1727608.169605] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:40:43 lcg-lrz-dc10 kernel: [1727608.169842] BTRFS: bdev /dev/sdm errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
May 28 16:40:43 lcg-lrz-dc10 kernel: [1727608.170401] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:40:43 lcg-lrz-dc10 kernel: [1727608.170703] BTRFS: bdev /dev/sdm errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
May 28 16:40:50 lcg-lrz-dc10 kernel: [1727615.608552] BTRFS: bdev /dev/sdm errs: wr 12, rd 1, flush 0, corrupt 0, gen 0
May 28 16:41:17 lcg-lrz-dc10 kernel: [1727641.928445] BTRFS: bdev /dev/sdm errs: wr 12, rd 2, flush 0, corrupt 0, gen 0
May 28 16:41:20 lcg-lrz-dc10 kernel: [1727645.692650] BTRFS: bdev /dev/sdm errs: wr 12, rd 3, flush 0, corrupt 0, gen 0
May 28 16:41:23 lcg-lrz-dc10 kernel: [1727647.999097] BTRFS: bdev /dev/sdm errs: wr 12, rd 4, flush 0, corrupt 0, gen 0
May 28 16:41:23 lcg-lrz-dc10 kernel: [1727648.227013] BTRFS: bdev /dev/sdm errs: wr 12, rd 5, flush 0, corrupt 0, gen 0
May 28 16:41:30 lcg-lrz-dc10 kernel: [1727654.974354] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:41:30 lcg-lrz-dc10 kernel: [1727654.974512] BTRFS: bdev /dev/sdm errs: wr 13, rd 5, flush 0, corrupt 0, gen 0
May 28 16:41:30 lcg-lrz-dc10 kernel: [1727654.974888] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:41:30 lcg-lrz-dc10 kernel: [1727654.975083] BTRFS: bdev /dev/sdm errs: wr 14, rd 5, flush 0, corrupt 0, gen 0
May 28 16:41:30 lcg-lrz-dc10 kernel: [1727654.975546] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:41:30 lcg-lrz-dc10 kernel: [1727654.975793] BTRFS: bdev /dev/sdm errs: wr 15, rd 5, flush 0, corrupt 0, gen 0
May 28 16:42:00 lcg-lrz-dc10 kernel: [1727685.438868] BTRFS: bdev /dev/sdm errs: wr 15, rd 6, flush 0, corrupt 0, gen 0
May 28 16:42:00 lcg-lrz-dc10 kernel: [1727685.816052] BTRFS: bdev /dev/sdm errs: wr 15, rd 7, flush 0, corrupt 0, gen 0
May 28 16:42:02 lcg-lrz-dc10 kernel: [1727686.886506] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:42:02 lcg-lrz-dc10 kernel: [1727686.886854] BTRFS: bdev /dev/sdm errs: wr 16, rd 7, flush 0, corrupt 0, gen 0
May 28 16:42:02 lcg-lrz-dc10 kernel: [1727686.887694] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:42:02 lcg-lrz-dc10 kernel: [1727686.888158] BTRFS: bdev /dev/sdm errs: wr 17, rd 7, flush 0, corrupt 0, gen 0
May 28 16:42:02 lcg-lrz-dc10 kernel: [1727686.889257] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:42:02 lcg-lrz-dc10 kernel: [1727686.889847] BTRFS: bdev /dev/sdm errs: wr 18, rd 7, flush 0, corrupt 0, gen 0
May 28 16:42:32 lcg-lrz-dc10 kernel: [1727716.910404] BTRFS: bdev /dev/sdm errs: wr 18, rd 8, flush 0, corrupt 0, gen 0
May 28 16:42:32 lcg-lrz-dc10 kernel: [1727717.004055] BTRFS: bdev /dev/sdm errs: wr 18, rd 9, flush 0, corrupt 0, gen 0
May 28 16:42:32 lcg-lrz-dc10 kernel: [1727717.019085] BTRFS: bdev /dev/sdm errs: wr 18, rd 10, flush 0, corrupt 0, gen 0
May 28 16:42:32 lcg-lrz-dc10 kernel: [1727717.043690] BTRFS: bdev /dev/sdm errs: wr 18, rd 11, flush 0, corrupt 0, gen 0
May 28 16:42:34 lcg-lrz-dc10 kernel: [1727719.121839] BTRFS: bdev /dev/sdm errs: wr 18, rd 12, flush 0, corrupt 0, gen 0
May 28 16:42:35 lcg-lrz-dc10 kernel: [1727720.029509] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:42:35 lcg-lrz-dc10 kernel: [1727720.029606] BTRFS: bdev /dev/sdm errs: wr 19, rd 12, flush 0, corrupt 0, gen 0
May 28 16:42:35 lcg-lrz-dc10 kernel: [1727720.029868] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:42:35 lcg-lrz-dc10 kernel: [1727720.029993] BTRFS: bdev /dev/sdm errs: wr 20, rd 12, flush 0, corrupt 0, gen 0
May 28 16:42:35 lcg-lrz-dc10 kernel: [1727720.030405] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:42:35 lcg-lrz-dc10 kernel: [1727720.030651] BTRFS: bdev /dev/sdm errs: wr 21, rd 12, flush 0, corrupt 0, gen 0
May 28 16:43:05 lcg-lrz-dc10 kernel: [1727750.366637] BTRFS: bdev /dev/sdm errs: wr 21, rd 13, flush 0, corrupt 0, gen 0
May 28 16:43:05 lcg-lrz-dc10 kernel: [1727750.526410] BTRFS: bdev /dev/sdm errs: wr 21, rd 14, flush 0, corrupt 0, gen 0
May 28 16:43:05 lcg-lrz-dc10 kernel: [1727750.683487] BTRFS: bdev /dev/sdm errs: wr 21, rd 15, flush 0, corrupt 0, gen 0
May 28 16:43:06 lcg-lrz-dc10 kernel: [1727751.683162] BTRFS: bdev /dev/sdm errs: wr 21, rd 16, flush 0, corrupt 0, gen 0
May 28 16:43:08 lcg-lrz-dc10 kernel: [1727753.642839] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:43:08 lcg-lrz-dc10 kernel: [1727753.643009] BTRFS: bdev /dev/sdm errs: wr 22, rd 16, flush 0, corrupt 0, gen 0
May 28 16:43:08 lcg-lrz-dc10 kernel: [1727753.643421] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:43:08 lcg-lrz-dc10 kernel: [1727753.643646] BTRFS: bdev /dev/sdm errs: wr 23, rd 16, flush 0, corrupt 0, gen 0
May 28 16:43:08 lcg-lrz-dc10 kernel: [1727753.644159] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:43:08 lcg-lrz-dc10 kernel: [1727753.644420] BTRFS: bdev /dev/sdm errs: wr 24, rd 16, flush 0, corrupt 0, gen 0
May 28 16:43:08 lcg-lrz-dc10 kernel: [1727753.736568] BTRFS: bdev /dev/sdm errs: wr 24, rd 17, flush 0, corrupt 0, gen 0
May 28 16:43:08 lcg-lrz-dc10 kernel: [1727753.751826] BTRFS: bdev /dev/sdm errs: wr 24, rd 18, flush 0, corrupt 0, gen 0
May 28 16:43:09 lcg-lrz-dc10 kernel: [1727753.803959] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:43:09 lcg-lrz-dc10 kernel: [1727753.803962] BTRFS: bdev /dev/sdm errs: wr 25, rd 18, flush 0, corrupt 0, gen 0
May 28 16:43:09 lcg-lrz-dc10 kernel: [1727754.027756] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:43:09 lcg-lrz-dc10 kernel: [1727754.029053] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:43:09 lcg-lrz-dc10 kernel: [1727754.030351] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.068101] btrfs_dev_stat_print_on_error: 3 callbacks suppressed
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.068730] BTRFS: bdev /dev/sdm errs: wr 28, rd 19, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.070009] BTRFS: bdev /dev/sdm errs: wr 28, rd 20, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.071399] BTRFS: bdev /dev/sdm errs: wr 28, rd 21, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.072835] BTRFS: bdev /dev/sdm errs: wr 28, rd 22, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.074251] BTRFS: bdev /dev/sdm errs: wr 28, rd 23, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.075748] BTRFS: bdev /dev/sdm errs: wr 28, rd 24, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.077288] BTRFS: bdev /dev/sdm errs: wr 28, rd 25, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.078899] BTRFS: bdev /dev/sdm errs: wr 28, rd 26, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.080550] BTRFS: bdev /dev/sdm errs: wr 28, rd 27, flush 0, corrupt 0, gen 0
May 28 16:43:11 lcg-lrz-dc10 kernel: [1727756.082246] BTRFS: bdev /dev/sdm errs: wr 28, rd 28, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.066100] btrfs_dev_stat_print_on_error: 21558 callbacks suppressed
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.066369] BTRFS: bdev /dev/sdm errs: wr 28, rd 21587, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.067067] BTRFS: bdev /dev/sdm errs: wr 28, rd 21588, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.067741] BTRFS: bdev /dev/sdm errs: wr 28, rd 21589, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.068568] BTRFS: bdev /dev/sdm errs: wr 28, rd 21590, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.069722] BTRFS: bdev /dev/sdm errs: wr 28, rd 21591, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.070814] BTRFS: bdev /dev/sdm errs: wr 28, rd 21592, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.071788] BTRFS: bdev /dev/sdm errs: wr 28, rd 21593, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.073280] BTRFS: bdev /dev/sdm errs: wr 28, rd 21594, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.075350] BTRFS: bdev /dev/sdm errs: wr 28, rd 21595, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.077607] BTRFS: bdev /dev/sdm errs: wr 28, rd 21596, flush 0, corrupt 0, gen 0
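Rather than eyeballing thousands of these counter lines, the final error counters per device can be pulled out with awk; a sketch over a fabricated two-line sample in the same format (the sample path is illustrative, and the field positions are taken from the messages above):

```shell
cat > /tmp/btrfs_errs_sample.log <<'EOF'
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.075350] BTRFS: bdev /dev/sdm errs: wr 28, rd 21595, flush 0, corrupt 0, gen 0
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.077607] BTRFS: bdev /dev/sdm errs: wr 28, rd 21596, flush 0, corrupt 0, gen 0
EOF
# Remember the last counter line seen for each block device:
awk '/BTRFS: bdev/ {
    dev = $9                  # field 9 is the device, e.g. /dev/sdm
    sub(/.*errs: /, "")       # strip everything up to the counters
    last[dev] = $0
} END { for (d in last) print d, last[d] }' /tmp/btrfs_errs_sample.log
# prints: /dev/sdm wr 28, rd 21596, flush 0, corrupt 0, gen 0
```

On a still-mounted filesystem the same counters are kept persistently and can be read back with `btrfs device stats`; here, of course, the fs no longer mounts.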


Later it finally said goodbye: 
May 28 21:03:06 lcg-lrz-dc10 kernel: [1743336.347191] sd 0:0:14:0: rejecting I/O to offline device
May 28 21:03:06 lcg-lrz-dc10 kernel: [1743336.369204] sd 0:0:14:0: rejecting I/O to offline device
May 28 21:03:06 lcg-lrz-dc10 kernel: [1743336.369569] BTRFS: lost page write due to I/O error on /dev/sdm
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.093299] sd 0:0:14:0: rejecting I/O to offline device
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.094348] BTRFS (device sdp): bad tree block start 3328214216270427953 3448651776
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.095019] BTRFS (device sdp): bad tree block start 3328214216270427953 3448651776
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.095354] sd 0:0:14:0: rejecting I/O to offline device
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.095872] BTRFS (device sdp): bad tree block start 3328214216270427953 3448651776
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.096551] BTRFS (device sdp): bad tree block start 3328214216270427953 3448651776
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.096927] BTRFS: error -5 while searching for dev_stats item for device /dev/sdm!
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.097314] BTRFS warning (device sdp): Skipping commit of aborted transaction.
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.097715] ------------[ cut here ]------------
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.098160] WARNING: CPU: 1 PID: 128693 at /build/linux-cJtoh5/linux-4.0/fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120 [btrfs]()
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.099170] BTRFS: Transaction aborted (error -5)
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.099172] Modules linked in: btrfs xor raid6_pq udp_diag tcp_diag inet_diag nls_utf8 nls_cp437 vfat fat binfmt_misc cpufreq_userspace cpufreq_conservative cpufreq_stats cpufreq_powersave deflate ctr twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 xts serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic cast_common des_generic cbc cmac xcbc rmd160 sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key xfrm_algo ip6table_filter ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables x_tables ipmi_devintf evdev iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul dcdbas crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr mgag200 ttm drm_kms_helper drm sg ipmi_si 8250_fintek ipmi_msghandler processor thermal_sys sb_edac edac_core wmi acpi_power_meter ixgbe mdio igb ptp pps_core dca i2c_algo_bit i2c_core button xhci_pci xhci_hcd ehci_pci mei_me mei ehci_hcd lpc_ich mfd_core usbcore usb_common coretemp fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod md_mod sd_mod megaraid_sas scsi_mod shpchp
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.110677] CPU: 1 PID: 128693 Comm: iozone Not tainted 4.0.0-trunk-amd64 #1 Debian 4.0-1~exp1
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.111893] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 08/28/2014
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.113136]  0000000000000000 ffffffffa0859550 ffffffff8155b12e ffff880de4fdfda8
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.114412]  ffffffff8106d2a1 ffff8806ad8680c8 00000000fffffffb ffff880856f41000
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.115714]  ffffffffa0855910 0000000000000696 ffffffff8106d31a ffffffffa0859628
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.117050] Call Trace:
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.118385]  [<ffffffff8155b12e>] ? dump_stack+0x40/0x50
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.119749]  [<ffffffff8106d2a1>] ? warn_slowpath_common+0x81/0xb0
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.121060]  [<ffffffff8106d31a>] ? warn_slowpath_fmt+0x4a/0x50
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.122362]  [<ffffffffa07a8d2b>] ? __btrfs_abort_transaction+0x4b/0x120 [btrfs]
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.123696]  [<ffffffffa07d625f>] ? cleanup_transaction+0x6f/0x2c0 [btrfs]
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.125050]  [<ffffffff810aab30>] ? wait_woken+0x90/0x90
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.126400]  [<ffffffff810aa724>] ? __wake_up+0x34/0x50
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.127777]  [<ffffffffa07d6f8e>] ? btrfs_commit_transaction+0x2ae/0xa00 [btrfs]
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.129180]  [<ffffffffa07d7d7c>] ? btrfs_attach_transaction_barrier+0x1c/0x50 [btrfs]
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.130604]  [<ffffffff811f3540>] ? do_fsync+0x70/0x70
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.132039]  [<ffffffff811c6130>] ? iterate_supers+0xb0/0x110
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.133409]  [<ffffffff811f3665>] ? sys_sync+0x55/0xa0
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.134768]  [<ffffffff815614cd>] ? system_call_fast_compare_end+0xc/0x11
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.136150] ---[ end trace 8019cf83241ac956 ]---
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.137542] BTRFS: error (device sdp) in cleanup_transaction:1686: errno=-5 IO failure
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.138973] BTRFS info (device sdp): forced readonly
May 28 22:54:48 lcg-lrz-dc10 kernel: [1750032.363543] megasas:span 0 rowDataSize 1
May 28 22:54:48 lcg-lrz-dc10 kernel: [1750032.365440] megasas:span 0 rowDataSize 1
May 28 22:54:48 lcg-lrz-dc10 kernel: [1750032.367233] megasas:span 0 rowDataSize 1
May 28 22:54:48 lcg-lrz-dc10 kernel: [1750032.369001] megasas:span 0 rowDataSize 1

...

May 28 22:55:25 lcg-lrz-dc10 kernel: [1750069.147728] megasas:span 0 rowDataSize 1
May 28 22:55:25 lcg-lrz-dc10 kernel: [1750069.147885] megasas:span 0 rowDataSize 1
May 28 22:55:25 lcg-lrz-dc10 kernel: [1750069.148041] megasas:span 0 rowDataSize 1
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.558937] ------------[ cut here ]------------
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.559193] WARNING: CPU: 4 PID: 134844 at /build/linux-cJtoh5/linux-4.0/fs/btrfs/extent-tree.c:4890 btrfs_free_block_groups+0x379/0x460 [btrfs]()
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.559747] Modules linked in: btrfs xor raid6_pq udp_diag tcp_diag inet_diag nls_utf8 nls_cp437 vfat fat binfmt_misc cpufreq_userspace cpufreq_conservative cpufreq_stats cpufreq_powersave deflate ctr twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 xts serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic cast_common des_generic cbc cmac xcbc rmd160 sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key xfrm_algo ip6table_filter ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables x_tables ipmi_devintf evdev iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul dcdbas crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr mgag200 ttm drm_kms_helper drm sg ipmi_si 8250_fintek ipmi_msghandler processor thermal_sys sb_edac edac_core wmi acpi_power_meter ixgbe mdio igb ptp pps_core dca i2c_algo_bit i2c_core button xhci_pci xhci_hcd ehci_pci mei_me mei ehci_hcd lpc_ich mfd_core usbcore usb_common coretemp fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod md_mod sd_mod megaraid_sas scsi_mod shpchp
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.568590] CPU: 4 PID: 134844 Comm: umount Tainted: G        W       4.0.0-trunk-amd64 #1 Debian 4.0-1~exp1
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.569665] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 08/28/2014
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.570776]  0000000000000000 ffffffffa0859bf8 ffffffff8155b12e 0000000000000000
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.571933]  ffffffff8106d2a1 0000000000000000 ffff880857551800 ffff88105796c080
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.573114]  ffff88105796c000 ffff88105796c090 ffffffffa07c5d39 ffff88105796c000
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.574316] Call Trace:
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.575546]  [<ffffffff8155b12e>] ? dump_stack+0x40/0x50
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.576712]  [<ffffffff8106d2a1>] ? warn_slowpath_common+0x81/0xb0
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.577886]  [<ffffffffa07c5d39>] ? btrfs_free_block_groups+0x379/0x460 [btrfs]
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.579100]  [<ffffffffa07d2cb4>] ? close_ctree+0x154/0x350 [btrfs]
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.580313]  [<ffffffff811df95c>] ? evict_inodes+0xfc/0x110
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.581542]  [<ffffffff811c4aee>] ? generic_shutdown_super+0x6e/0xf0
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.582787]  [<ffffffff811c4dee>] ? kill_anon_super+0xe/0x20
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.584057]  [<ffffffffa07a8927>] ? btrfs_kill_super+0x17/0x100 [btrfs]
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.585338]  [<ffffffff811c5175>] ? deactivate_locked_super+0x45/0x80
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.586660]  [<ffffffff811e2b1b>] ? cleanup_mnt+0x3b/0x90
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.587986]  [<ffffffff81089ab7>] ? task_work_run+0xb7/0xf0
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.589249]  [<ffffffff81014079>] ? do_notify_resume+0x69/0x90
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.590499]  [<ffffffff8156172b>] ? int_signal+0x12/0x17
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.591765] ---[ end trace 8019cf83241ac957 ]---
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.593039] ------------[ cut here ]------------
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.594333] WARNING: CPU: 4 PID: 134844 at /build/linux-cJtoh5/linux-4.0/fs/btrfs/extent-tree.c:4891 btrfs_free_block_groups+0x398/0x460 [btrfs]()
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.597086] Modules linked in: btrfs xor raid6_pq udp_diag tcp_diag inet_diag nls_utf8 nls_cp437 vfat fat binfmt_misc cpufreq_userspace cpufreq_conservative cpufreq_stats cpufreq_powersave deflate ctr twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 xts serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic cast_common des_generic cbc cmac xcbc rmd160 sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key xfrm_algo ip6table_filter ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables x_tables ipmi_devintf evdev iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul dcdbas crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr mgag200 ttm drm_kms_helper drm sg ipmi_si 8250_fintek ipmi_msghandler processor thermal_sys sb_edac edac_core wmi acpi_power_meter ixgbe mdio igb ptp pps_core dca i2c_algo_bit i2c_core button xhci_pci xhci_hcd ehci_pci mei_me mei ehci_hcd lpc_ich mfd_core usbcore usb_common coretemp fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod md_mod sd_mod megaraid_sas scsi_mod shpchp
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.616565] CPU: 4 PID: 134844 Comm: umount Tainted: G        W       4.0.0-trunk-amd64 #1 Debian 4.0-1~exp1
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.618213] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 1.0.4 08/28/2014
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.619847]  0000000000000000 ffffffffa0859bf8 ffffffff8155b12e 0000000000000000
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.621470]  ffffffff8106d2a1 0000000000000000 ffff880857551800 ffff88105796c080
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.623068]  ffff88105796c000 ffff88105796c090 ffffffffa07c5d58 ffff88105796c000
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.624561] Call Trace:
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.625994]  [<ffffffff8155b12e>] ? dump_stack+0x40/0x50
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.627431]  [<ffffffff8106d2a1>] ? warn_slowpath_common+0x81/0xb0
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.628829]  [<ffffffffa07c5d58>] ? btrfs_free_block_groups+0x398/0x460 [btrfs]
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.630207]  [<ffffffffa07d2cb4>] ? close_ctree+0x154/0x350 [btrfs]
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.631558]  [<ffffffff811df95c>] ? evict_inodes+0xfc/0x110
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.632873]  [<ffffffff811c4aee>] ? generic_shutdown_super+0x6e/0xf0
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.634163]  [<ffffffff811c4dee>] ? kill_anon_super+0xe/0x20
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.635440]  [<ffffffffa07a8927>] ? btrfs_kill_super+0x17/0x100 [btrfs]
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.636662]  [<ffffffff811c5175>] ? deactivate_locked_super+0x45/0x80
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.637836]  [<ffffffff811e2b1b>] ? cleanup_mnt+0x3b/0x90
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.639003]  [<ffffffff81089ab7>] ? task_work_run+0xb7/0xf0
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.640162]  [<ffffffff81014079>] ? do_notify_resume+0x69/0x90
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.641312]  [<ffffffff8156172b>] ? int_signal+0x12/0x17
May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.642454] ---[ end trace 8019cf83241ac958 ]---
May 28 22:56:39 lcg-lrz-dc10 kernel: [1750142.765837] megasas:span 0 rowDataSize 1
May 28 22:56:39 lcg-lrz-dc10 kernel: [1750142.767627] megasas:span 0 rowDataSize 1
May 28 22:56:39 lcg-lrz-dc10 kernel: [1750142.769308] megasas:span 0 rowDataSize 1

...


(at some point iozone had also reported that it could not write anymore)



Well, as I've said, maybe it's not an issue at all, but it's at least
strange that this happens on brand-new hardware only on the
btrfs-raid56 node, especially the gazillions of megasas messages.
The full log (at least what's left of it; logrotate has already
taken its tribute) is available for some time at:
http://christoph.anton.mitterer.name/tmp/public/a8bcf4a6-08c4-11e5-a513-0019dbacbbbf/kern.log.xz
(beware, it's some 1.6 G unpacked).
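In case anyone wants to sift the log without reading all of it, a small
helper like the following pulls out just the storage-related lines in
time order (a sketch; the SCSI address 0:0:14:0 is an assumption, taken
from the "Device offlined" messages for the disk that dropped out):

```shell
# storage_events FILE.xz -- stream the xz-compressed kernel log and
# keep only the megasas, BTRFS and SCSI-target lines, in time order.
# (sd 0:0:14:0 is assumed to be the target that later went offline.)
storage_events() {
    xz -dc "$1" | grep -E 'megasas|BTRFS|sd 0:0:14:0'
}
```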


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: possible raid6 corruption
  2015-06-02  1:24 possible raid6 corruption Christoph Anton Mitterer
@ 2015-06-02  2:38 ` Chris Murphy
  2015-06-02  7:26 ` Sander
  1 sibling, 0 replies; 3+ messages in thread
From: Chris Murphy @ 2015-06-02  2:38 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

I'm seeing three separate problems:

May 19 03:25:39 lcg-lrz-dc10 kernel: [903095.585150] megasas:
megasas_aen_polling waiting for controller reset to finish for scsi0
May 19 03:25:50 lcg-lrz-dc10 kernel: [903106.581205] sd 0:0:14:0:
Device offlined - not ready after error recovery

I don't know if that's controller related or drive related. In either
case it's hardware related. And then:

May 28 16:40:43 lcg-lrz-dc10 kernel: [1727608.170703] BTRFS: bdev
/dev/sdm errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
May 28 16:40:50 lcg-lrz-dc10 kernel: [1727615.608552] BTRFS: bdev
/dev/sdm errs: wr 12, rd 1, flush 0, corrupt 0, gen 0
...
May 28 16:43:16 lcg-lrz-dc10 kernel: [1727761.077607] BTRFS: bdev
/dev/sdm errs: wr 28, rd 21596, flush 0, corrupt 0, gen 0

This is just the fs saying it can't write to one particular drive, and
then also reporting many read failures. And then:


May 28 21:03:06 lcg-lrz-dc10 kernel: [1743336.369569] BTRFS: lost page
write due to I/O error on /dev/sdm
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.093299] sd 0:0:14:0:
rejecting I/O to offline device
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.094348] BTRFS (device
sdp): bad tree block start 3328214216270427953 3448651776

So another lost write to the same drive, sdm, and then a new problem,
which is a bad tree block on a different drive, sdp. And then:

May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.096927] BTRFS: error -5
while searching for dev_stats item for device /dev/sdm!
May 28 21:03:07 lcg-lrz-dc10 kernel: [1743337.097314] BTRFS warning
(device sdp): Skipping commit of aborted transaction.

It still hasn't given up on sdm (which seems kind of odd given that by
now there are thousands of read errors and the kernel considers the
device offline anyway), but now it also has to deal with problems on
sdp. The resulting stack trace, though, suggests a umount was in
progress?


May 28 22:55:56 lcg-lrz-dc10 kernel: [1750099.616565] CPU: 4 PID:
134844 Comm: umount Tainted: G        W       4.0.0-trunk-amd64 #1
Debian 4.0-1~exp1
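As an aside, since the wr/rd counters in the "errs:" lines are
cumulative, the quickest way to see where each device ended up is to
keep only the last such line per device. A small awk helper (a sketch;
it assumes the log lines are unwrapped, as they are in the actual
kern.log rather than in the excerpts quoted above) would be:

```shell
# btrfs_err_summary FILE -- for each device, print the last (i.e.
# cumulative) "BTRFS: bdev ... errs:" counter line seen in FILE.
btrfs_err_summary() {
    awk '/BTRFS: bdev .* errs:/ {
        # The token after "bdev" is the device node, e.g. /dev/sdm.
        for (i = 1; i <= NF; i++) if ($i == "bdev") dev = $(i + 1)
        last[dev] = $0
    }
    END { for (d in last) print last[d] }' "$1"
}
```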



https://bugs.launchpad.net/ubuntu/+source/linux/+bug/891115
That's an old bug, kernel 3.2 era. But ultimately it looks like it was
hardware related.


Chris Murphy


* Re: possible raid6 corruption
  2015-06-02  1:24 possible raid6 corruption Christoph Anton Mitterer
  2015-06-02  2:38 ` Chris Murphy
@ 2015-06-02  7:26 ` Sander
  1 sibling, 0 replies; 3+ messages in thread
From: Sander @ 2015-06-02  7:26 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

Christoph Anton Mitterer wrote (ao):
> May 19 03:25:50 lcg-lrz-dc10 kernel: [903106.581205] sd 0:0:14:0: Device offlined - not ready after error recovery

> May 28 16:38:43 lcg-lrz-dc10 kernel: [1727488.984810] sd 0:0:14:0: rejecting I/O to offline device

> May 28 16:39:19 lcg-lrz-dc10 kernel: [1727524.067182] BTRFS: lost page write due to I/O error on /dev/sdm
> May 28 16:39:19 lcg-lrz-dc10 kernel: [1727524.067426] BTRFS: bdev /dev/sdm errs: wr 1, rd 0, flush 0, corrupt 0, gen 0

> May 28 21:03:06 lcg-lrz-dc10 kernel: [1743336.347191] sd 0:0:14:0: rejecting I/O to offline device
> May 28 21:03:06 lcg-lrz-dc10 kernel: [1743336.369569] BTRFS: lost page write due to I/O error on /dev/sdm

> Well as I've said,.. maybe it's not an issue at all, but at least it's
> strange that this happens on brand new hardware only with the
> btrfs-raid56 node, especially the gazillions of megasas messages.

Brand-new hardware is the most likely to show (hardware) issues: it has
no proven track record yet, and it was subjected to all kinds of abuse
during transport. I'm sure you will see the same if you put sw raid +
ext4 on this server.

Nice hardware btw, please share your findings.

	Sander


