* data loss+inode recovery using RAID6 write journal
From: Nick Black @ 2016-10-24 23:55 UTC (permalink / raw)
  To: linux-raid

Hey there, everyone! I've been using and admiring mdadm for over a decade;
thanks for all the awesome work.

I recently put together a new build, and wanted to try out the
--write-journal capability of recent Linux md. My write journal is a
Samsung 840 PRO SSD, fronting a RAID6 of eight 4TB spinning disks. All 9 SATA3
devices are plugged into the onboard SATA3 ports of my ASUS X-99 Deluxe II
motherboard. Summary description:

md126 : active raid6 sde1[4] sdg1[6] sdd1[3] sdc1[2] sdf1[5] sdi1[8] sdh1[7] sdb1[1] sda1[0](J)
      23441316864 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/30 pages [0KB], 65536KB chunk
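
(For reference, an array like this would be created with something along these
lines -- illustrative only, not my exact invocation; /dev/sda1 is the journal
SSD partition:)

  sudo mdadm --create /dev/md126 --level=6 --raid-devices=8 --chunk=512 \
       --write-journal=/dev/sda1 \
       /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1
  sudo mkfs.ext4 /dev/md126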

All filesystems are ext4. ~14TB of ~22TB are in use on the filesystem built
directly atop md126:

 /dev/md126       22T   14T  7.4T  65% /media/trap

Kernel version is 4.8.3 (the array was built under 4.7.5), and mdadm reports
v3.4. Distro is debian unstable, running a custom (but fairly orthodox)
kernel.

I moved a ~20GB tarball from my home directory (located on another device, a
NVMe md RAID1) to /media/trap/backups. The mv completed successfully. A
short time after that, I hard rebooted the machine due to X lockup (I'm
experimenting with compiz). By "short time", I mean "possibly within the
time window before 20GB could be written out to the backing store, but I'm
unsure about that". Upon restart, the machine engaged in minutes of disk
activity, spat out some fsck inode recovery messages (I'm trying to find
these in my logs), and finally mounted the filesystem. The moved file is
nowhere to be found.
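
(In hindsight, something like this before the reboot would have told me whether
the data had actually reached the array -- the filename here is illustrative:)

  grep -E 'Dirty|Writeback' /proc/meminfo    # pages still waiting for writeback
  sync                                       # block until dirty data is flushed to the devices
  ls -l /media/trap/backups/backup.tar       # confirm the moved file is there afterwards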

It's no big loss to me -- I can recreate that data -- but I thought I'd
report this. As noted, I'm looking for logs or other hard details, but not
seeing them in journalctl output. I can probably reproduce the problem if
someone needs me to, though otherwise I will likely disable the write
journal for now (I've not yet done so). Please let me know how I might help
you track this problem down, if a problem does indeed exist. Thanks!

-- 
nick black -=- http://www.nick-black.com
to make an apple pie from scratch, you need first invent a universe.

* Re: data loss+inode recovery using RAID6 write journal
From: Wols Lists @ 2016-10-25 12:36 UTC (permalink / raw)
  To: Nick Black, linux-raid

On 25/10/16 00:55, Nick Black wrote:
> I moved a ~20GB tarball from my home directory (located on another
> device, a NVMe md RAID1) to /media/trap/backups. The mv completed
> successfully. A short time after that, I hard rebooted the machine
> due to X lockup (I'm experimenting with compiz). By "short time", I
> mean "possibly within the time window before 20GB could be written
> out to the backing store, but I'm unsure about that". Upon restart,
> the machine engaged in minutes of disk activity, spat out some fsck
> inode recovery messages (I'm trying to find these in my logs), and
> finally mounted the filesystem. The moved file is nowhere to be
> found.

I can't see what filesystem you're using. It could easily be down to that.

If the reboot interrupted the "write to disk" before the directory
containing the i-node had been flushed, that would explain your
observations, I believe.

Personally, I think that explanation is actually unlikely, as the
kernel devs go to great lengths to preserve metadata, so you're more
likely to get the situation where the file exists but is empty.
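
(Roughly the sequence I have in mind, with made-up names:)

  mv ~/big.tar /media/trap/backups/    # mv returns once the data is in the page cache
  # ... hard reset here, before writeback has completed ...
  ls -l /media/trap/backups/big.tar    # after journal replay: often present but empty/short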

This ties in with my impression of the kernel devs - especially the
file system guys - placing great emphasis on protecting the computer
at the expense of the data the user stores there. IMHO that's daft,
but hey, they're system guys, they protect the system. "We can reboot
the system in a clean state in one hour instead of 24 now that we no
longer need an fsck." They forget that those 24 hours gave the user a
usable system; now the admins need to run a 72-hour user-space
integrity check before they hand the system back ... :-(

I guess what I'm saying is, don't assume it's the raid, as it could
well be something else entirely (although there are probably plenty of
people here who could help you with that).

Cheers,
Wol

* Re: data loss+inode recovery using RAID6 write journal
From: Nick Black @ 2016-10-25 13:16 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid

Wols Lists left as an exercise for the reader:
> I can't see what filesystem you're using. It could easily be down to that.

As noted, both source and destination filesystems were ext4 without any
unorthodox options. I wouldn't normally suspect mdraid, as I've never had a
problem with it before, but the new write-journal code does seem a plausible
culprit, especially as it appeared to be in the middle of flushing data from
the SSD when the interruption happened.
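
(That's only an impression from the disk activity, though; if it happens again,
something like the following during the recovery would show whether the journal
SSD is actually being read back:)

  iostat -x sda 1    # per-second extended stats for the journal SSD; sustained
                     # reads during recovery would point at the journal being replayed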

If people don't think it's down to the RAID, I'm not going to insist it was.
I'll probably refrain from using the write journal for a bit, though, just on
a hunch.

Destination array details:

[schwarzgerat](0) $ sudo mdadm --detail /dev/md126
/dev/md126:
        Version : 1.2
  Creation Time : Fri Oct 14 21:31:31 2016
     Raid Level : raid6
     Array Size : 23441316864 (22355.38 GiB 24003.91 GB)
  Used Dev Size : 3906886144 (3725.90 GiB 4000.65 GB)
   Raid Devices : 8
  Total Devices : 9
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Oct 24 18:27:09 2016
          State : clean 
 Active Devices : 8
Working Devices : 9
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : schwarzgerat:126  (local to host schwarzgerat)
           UUID : e16ec7ec:77f523c8:243e7ba0:08c5a591
         Events : 15092

    Number   Major   Minor   RaidDevice State
       0       8        1        -      journal   /dev/sda1
       1       8       17        0      active sync   /dev/sdb1
       2       8       33        1      active sync   /dev/sdc1
       3       8       49        2      active sync   /dev/sdd1
       4       8       65        3      active sync   /dev/sde1
       5       8       81        4      active sync   /dev/sdf1
       6       8       97        5      active sync   /dev/sdg1
       7       8      113        6      active sync   /dev/sdh1
       8       8      129        7      active sync   /dev/sdi1
[schwarzgerat](0) $ 

Destination filesystem details:

[schwarzgerat](0) $ sudo dumpe2fs /dev/md126
dumpe2fs 1.43.3 (04-Sep-2016)
Filesystem volume name:   <none>
Last mounted on:          /media/trap
Filesystem UUID:          7f7404cc-9a9c-4a92-98b7-159646f5b355
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              366272512
Block count:              5860329216
Reserved block count:     293016460
Free blocks:              2255006201
Free inodes:              363288750
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         2048
Inode blocks per group:   128
RAID stride:              128
RAID stripe width:        768
Flex block group size:    16
Filesystem created:       Tue Oct 18 01:37:04 2016
Last mount time:          Mon Oct 24 18:26:39 2016
Last write time:          Mon Oct 24 18:26:39 2016
Mount count:              13
Maximum mount count:      -1
Last checked:             Tue Oct 18 01:37:04 2016
Check interval:           0 (<none>)
Lifetime writes:          16 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      88fcabd6-4bf3-4817-9eaa-92746a2d4295
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xeaeb1fb3
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x0000bb9e
Journal start:            1
Journal checksum type:    crc32c
Journal checksum:         0xe8d0e4bd

Source array details:

[schwarzgerat](0) $ sudo mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Wed Oct 12 20:56:39 2016
     Raid Level : raid1
     Array Size : 369607744 (352.49 GiB 378.48 GB)
  Used Dev Size : 369607744 (352.49 GiB 378.48 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Oct 25 09:11:04 2016
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : debian:intel 750 nvme
           UUID : 6a826e91:5bf8c1de:56be717d:e55077d9
         Events : 149

    Number   Major   Minor   RaidDevice State
       0     259        6        0      active sync   /dev/nvme0n1p3
       1     259        2        1      active sync   /dev/nvme1n1p3
[schwarzgerat](0) $ 

Source filesystem details:

[schwarzgerat](0) $ sudo dumpe2fs /dev/md127p2
dumpe2fs 1.43.3 (04-Sep-2016)
Filesystem volume name:   <none>
Last mounted on:          /home
Filesystem UUID:          2509ff02-8976-4051-a074-4c8457512e9e
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              18907136
Block count:              75624459
Reserved block count:     3781222
Free blocks:              67197022
Free inodes:              18742837
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Fri Oct 14 19:47:30 2016
Last mount time:          Mon Oct 24 18:26:38 2016
Last write time:          Mon Oct 24 18:26:38 2016
Mount count:              16
Maximum mount count:      -1
Last checked:             Fri Oct 14 19:47:30 2016
Check interval:           0 (<none>)
Lifetime writes:          1123 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
First orphan inode:       15729268
Default directory hash:   half_md4
Directory Hash Seed:      67c455a5-d48e-4244-9acd-0b01dbd67730
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xabd72e7b
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x00001886
Journal start:            1
Journal checksum type:    crc32c
Journal checksum:         0x1519be37


-- 
nick black -=- http://www.nick-black.com
to make an apple pie from scratch, you need first invent a universe.

* Re: data loss+inode recovery using RAID6 write journal
From: Shaohua Li @ 2016-10-26 18:43 UTC (permalink / raw)
  To: Nick Black; +Cc: linux-raid

On Mon, Oct 24, 2016 at 07:55:05PM -0400, Nick Black wrote:
> Hey there, everyone! I've been using and admiring mdadm for over a decade;
> thanks for all the awesome work.
> 
> I recently put together a new build, and wanted to try out the
> --write-journal capability of recent Linux md. My write journal is a
> Samsung 840 PRO SSD, fronting a RAID6 of eight 4TB spinning disks. All 9 SATA3
> devices are plugged into the onboard SATA3 ports of my ASUS X-99 Deluxe II
> motherboard. Summary description:
> 
> md126 : active raid6 sde1[4] sdg1[6] sdd1[3] sdc1[2] sdf1[5] sdi1[8] sdh1[7] sdb1[1] sda1[0](J)
>       23441316864 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
>       bitmap: 0/30 pages [0KB], 65536KB chunk
> 
> All filesystems are ext4. ~14TB of ~22TB are in use on the filesystem built
> directly atop md126:
> 
>  /dev/md126       22T   14T  7.4T  65% /media/trap
> 
> Kernel version is 4.8.3 (the array was built under 4.7.5), and mdadm reports
> v3.4. Distro is debian unstable, running a custom (but fairly orthodox)
> kernel.
> 
> I moved a ~20GB tarball from my home directory (located on another device, a
> NVMe md RAID1) to /media/trap/backups. The mv completed successfully. A
> short time after that, I hard rebooted the machine due to X lockup (I'm
> experimenting with compiz). By "short time", I mean "possibly within the
> time window before 20GB could be written out to the backing store, but I'm
> unsure about that". Upon restart, the machine engaged in minutes of disk
> activity, spat out some fsck inode recovery messages (I'm trying to find
> these in my logs), and finally mounted the filesystem. The moved file is
> nowhere to be found.
> 
> It's no big loss to me -- I can recreate that data -- but I thought I'd
> report this. As said, I'm looking for logs or other hard details, but not
> seeing them in journalctl output. I can probably reproduce the problem if
> someone needs me to, though otherwise I will likely disable the write
> journal for now (I've not yet done so). Please let me know how I might help
> you track this problem down, if a problem does indeed exist. Thanks!

Thanks for testing. We can't improve the quality of the new feature if nobody
tests it. Yes, the write journal isn't mature yet, but I can't see how it would
cause this data loss. With the write journal, data is written to the SSD first,
then to the raid disks, and the IO is completed at that point. So if the IO
completed, the data should be on the raid disks. The only place data loss could
come from is the recovery path. But it's also possible the filesystem/writeback
hadn't flushed the data to disk yet. I'm wondering if you can reproduce it both
with and without the journal, so we can narrow it down a bit.
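
A rough sketch of the sort of test I mean, if you have time (paths and
filenames are illustrative; the sysrq write needs root):

  # with the journal still configured:
  cp /home/you/big.tar /media/trap/backups/
  grep -E 'Dirty|Writeback' /proc/meminfo   # confirm writeback is still in flight
  echo b > /proc/sysrq-trigger              # immediate hard reset, like your X lockup reboot
  # after reboot: check whether the file survived and has the right size,
  # then remove the journal from the array and repeat the same steps
  ls -l /media/trap/backups/big.tar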

Thanks,
Shaohua

* Re: data loss+inode recovery using RAID6 write journal
From: Nick Black @ 2016-10-26 18:51 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

Shaohua Li left as an exercise for the reader:
> Thanks for testing. We can't improve the quality of the new feature if nobody
> tests it. Yes, the write journal isn't mature yet, but I can't see how it would
> cause this data loss. With the write journal, data is written to the SSD first,
> then to the raid disks, and the IO is completed at that point. So if the IO
> completed, the data should be on the raid disks. The only place data loss could
> come from is the recovery path. But it's also possible the filesystem/writeback
> hadn't flushed the data to disk yet. I'm wondering if you can reproduce it both
> with and without the journal, so we can narrow it down a bit.

I doubt it can be reproduced without the journal -- like I said, I've been
using mdadm RAID[56] for over a decade and have never seen such a problem.

I'll attempt to reproduce with the journal enabled. Assuming I can, I'll then
try to reproduce without it, though I doubt that will be fruitful. Are there
any debugging options or flags I should enable prior to reproducing, in order
to get a more complete report? Some state I should dump from my array and
filesystems? Feel free to be technical.
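
(For instance, I'm assuming the sort of thing that would help is along these
lines:)

  cat /proc/mdstat
  sudo mdadm --detail /dev/md126
  sudo mdadm --examine /dev/sda1     # superblock of the journal device
  sudo dumpe2fs -h /dev/md126        # ext4 superblock / journal state
  dmesg | grep -i -e md -e raid      # kernel messages from assembly and recovery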

I owe a lot to Linux MD RAID, and am happy to put some effort into running
this down.

I'll report whether I can at least reproduce ASAP.

--nick

-- 
nick black -=- http://www.nick-black.com
to make an apple pie from scratch, you need first invent a universe.
