* Growing RAID10 with active XFS filesystem
@ 2018-01-08 19:08 xfs.pkoch
2018-01-08 19:26 ` Darrick J. Wong
0 siblings, 1 reply; 37+ messages in thread
From: xfs.pkoch @ 2018-01-08 19:08 UTC (permalink / raw)
To: linux-xfs
Dear Linux-Raid and Linux-XFS experts:
I'm posting this on both the linux-raid and linux-xfs
mailing lists, as it's not clear at this point whether
this is an MD or XFS problem.
I have described my problem in a recent posting on
linux-raid and Wol's conclusion was:
> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs
>
> Cheers,
> Wol
There's very important data on our RAID10 device but I doubt
it's important enough for God to take a hand in our storage.
But let me first summarize what happened and why I believe that
this is an XFS-problem:
Machine running Linux 3.14.69 with no kernel-patches.
XFS filesystem was created with XFS userutils 3.1.11.
I did a fresh compile of xfsprogs-4.9.0 yesterday when
I realized that the 3.1.11 xfs_repair did not help.
mdadm is V3.3
/dev/md5 is a RAID10-device that was created in Feb 2013
with 10 2TB disks and an ext3 filesystem on it. Once in a
while I added two more 2TB disks. Reshaping was done
while the ext3 filesystem was mounted. Then the ext3
filesystem was unmounted, resized, and mounted again. That
worked until I resized the RAID10 from 16 to 20 disks and
realized that ext3 does not support filesystems >16TB.
I switched to XFS and created a 20TB filesystem. Here are
the details:
# xfs_info /dev/md5
meta-data=/dev/md5 isize=256 agcount=32, agsize=152608128 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=4883457280, imaxpct=5
= sunit=128 swidth=1280 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Please notice: This XFS filesystem has a size of
4883457280*4K = 19,533,829,120K
On Saturday I tried to add two more 2TB disks to the RAID10
and the XFS filesystem was mounted (and in medium use) at that
time. Commands were:
# mdadm /dev/md5 --add /dev/sdo
# mdadm --grow /dev/md5 --raid-devices=21
# mdadm -D /dev/md5
/dev/md5:
Version : 1.2
Creation Time : Sun Feb 10 16:58:10 2013
Raid Level : raid10
Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
Raid Devices : 21
Total Devices : 21
Persistence : Superblock is persistent
Update Time : Sat Jan 6 15:08:37 2018
State : clean, reshaping
Active Devices : 21
Working Devices : 21
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Reshape Status : 1% complete
Delta Devices : 1, (20->21)
Name : backup:5 (local to host backup)
UUID : 9030ff07:6a292a3c:26589a26:8c92a488
Events : 86002
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 65 48 1 active sync /dev/sdt
2 8 64 2 active sync /dev/sde
3 65 96 3 active sync /dev/sdw
4 8 112 4 active sync /dev/sdh
5 65 144 5 active sync /dev/sdz
6 8 160 6 active sync /dev/sdk
7 65 192 7 active sync /dev/sdac
8 8 208 8 active sync /dev/sdn
9 65 240 9 active sync /dev/sdaf
10 65 0 10 active sync /dev/sdq
11 66 32 11 active sync /dev/sdai
12 8 32 12 active sync /dev/sdc
13 65 64 13 active sync /dev/sdu
14 8 80 14 active sync /dev/sdf
15 65 112 15 active sync /dev/sdx
16 8 128 16 active sync /dev/sdi
17 65 160 17 active sync /dev/sdaa
18 8 176 18 active sync /dev/sdl
19 65 208 19 active sync /dev/sdad
20 8 224 20 active sync /dev/sdo
Please notice: This RAID10 device has a size of 19,533,829,120K,
exactly the same size as the contained XFS filesystem.
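As a quick cross-check of the two figures, the XFS data size in 1 KiB units can be computed with shell arithmetic:

```shell
# XFS size = data blocks * 4 KiB block size, expressed in KiB.
# This matches the md "Array Size" reported by mdadm -D.
echo $(( 4883457280 * 4 ))    # prints 19533829120
```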
Immediately after the RAID10 reshape operation started, the
XFS filesystem reported I/O errors and was severely damaged.
I waited for the reshape operation to finish and tried to repair
the filesystem with xfs_repair (version 3.1.11), but xfs_repair
crashed, so I tried the 4.9.0 version of xfs_repair with no luck
either.
/dev/md5 is now mounted ro,norecovery with an overlay filesystem
on top of it (thanks very much to Andreas for that idea) and I have
set up a new server today. Rsyncing the data to the new server will
take a while and I'm sure I will stumble on lots of corrupted files.
I proceeded from XFS to ZFS (skipped YFS) so lengthy reshape
operations won't happen in the future anymore.
Here are the relevant log messages:
> Jan 6 14:45:00 backup kernel: md: reshape of RAID array md5
> Jan 6 14:45:00 backup kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> Jan 6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Jan 6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
> Jan 6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> Jan 6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> Jan 6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> Jan 6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> ... hundreds of the above XFS-messages deleted
> Jan 6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected. Shutting down filesystem
> Jan 6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)
Please notice: no error message about hardware-problems.
All 21 disks are fine and the next messages from the
md-driver was:
> Jan 7 02:28:02 backup kernel: md: md5: reshape done.
> Jan 7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680
I'm wondering about one thing: the first xfs message is about a
metadata I/O error on block 0x12c08f360. Since the xfs filesystem
has a blocksize of 4K this block is located at position 20135005568K
which is beyond the end of the RAID10 device. No wonder that the
xfs driver receives an I/O error. And also no wonder that the
filesystem is severely corrupted right now.
Question 1: How did the xfs driver know on Jan 6 that the RAID10
device was about to be increased from 20TB to 21TB on Jan 7?
Question 2: Why did the xfs driver start to use additional
space that was not yet there, without me executing xfs_growfs?
This looks like a severe XFS-problem to me.
But my hope is that all the data that was within the filesystem
before Jan 6 14:45 is not involved in the corruption. If xfs
started to use space beyond the end of the underlying raid
device this should have affected only data that was created,
modified or deleted after Jan 6 14:45.
If that was true we could clearly distinguish between data
that we must dump and data that we can keep. The machine is
our backup system (as you may have guessed from its name)
and I would like to keep old backup-files.
I remember that mkfs.xfs is clever enough to adapt the
filesystem parameters to the underlying hardware of the
block device that the xfs filesystem is created on. Hence
from the xfs drivers point of view the underlying block
device is not just a sequence of data blocks, but the xfs
driver knows something about the layout of the underlying
hardware.
If that was true - how does the xfs driver react if that
information about the layout of the underlying hardware
changes while the xfs-filesystem is mounted?
Seems to be an interesting problem
Kind regards
Peter Koch
* Re: Growing RAID10 with active XFS filesystem
2018-01-08 19:08 Growing RAID10 with active XFS filesystem xfs.pkoch
@ 2018-01-08 19:26 ` Darrick J. Wong
2018-01-08 22:01 ` Dave Chinner
0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2018-01-08 19:26 UTC (permalink / raw)
To: xfs.pkoch; +Cc: linux-xfs
On Mon, Jan 08, 2018 at 08:08:09PM +0100, xfs.pkoch@dfgh.net wrote:
> Dear Linux-Raid and Linux-XFS experts:
>
> I'm posting this on both the linux-raid and linux-xfs
> mailing lists, as it's not clear at this point whether
> this is an MD or XFS problem.
>
> I have described my problem in a recent posting on
> linux-raid and Wol's conclusion was:
>
> >In other words, one or more of the following three are true :-
> >1) The OP has been caught by some random act of God
> >2) There's a serious flaw in "mdadm --grow"
> >3) There's a serious flaw in xfs
> >
> >Cheers,
> >Wol
>
> There's very important data on our RAID10 device but I doubt
> it's important enough for God to take a hand in our storage.
>
> But let me first summarize what happened and why I believe that
> this is an XFS-problem:
>
> Machine running Linux 3.14.69 with no kernel-patches.
>
> XFS filesystem was created with XFS userutils 3.1.11.
> I did a fresh compile of xfsprogs-4.9.0 yesterday when
> I realized that the 3.1.11 xfs_repair did not help.
>
> mdadm is V3.3
>
> /dev/md5 is a RAID10-device that was created in Feb 2013
> with 10 2TB disks and an ext3 filesystem on it. Once in a
> while I added two more 2TB disks. Reshaping was done
> while the ext3 filesystem was mounted. Then the ext3
> filesystem was unmounted, resized, and mounted again. That
> worked until I resized the RAID10 from 16 to 20 disks and
> realized that ext3 does not support filesystems >16TB.
>
> I switched to XFS and created a 20TB filesystem. Here are
> the details:
>
> # xfs_info /dev/md5
> meta-data=/dev/md5 isize=256 agcount=32, agsize=152608128 blks
> = sectsz=512 attr=2
> data = bsize=4096 blocks=4883457280, imaxpct=5
> = sunit=128 swidth=1280 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal bsize=4096 blocks=521728, version=2
> = sectsz=512 sunit=8 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> Please notice: This XFS filesystem has a size of
> 4883457280*4K = 19,533,829,120K
>
> On Saturday I tried to add two more 2TB disks to the RAID10
> and the XFS filesystem was mounted (and in medium use) at that
> time. Commands were:
>
> # mdadm /dev/md5 --add /dev/sdo
> # mdadm --grow /dev/md5 --raid-devices=21
>
> # mdadm -D /dev/md5
> /dev/md5:
> Version : 1.2
> Creation Time : Sun Feb 10 16:58:10 2013
> Raid Level : raid10
> Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
> Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
> Raid Devices : 21
> Total Devices : 21
> Persistence : Superblock is persistent
>
> Update Time : Sat Jan 6 15:08:37 2018
> State : clean, reshaping
> Active Devices : 21
> Working Devices : 21
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : near=2
> Chunk Size : 512K
>
> Reshape Status : 1% complete
> Delta Devices : 1, (20->21)
>
> Name : backup:5 (local to host backup)
> UUID : 9030ff07:6a292a3c:26589a26:8c92a488
> Events : 86002
>
> Number Major Minor RaidDevice State
> 0 8 16 0 active sync /dev/sdb
> 1 65 48 1 active sync /dev/sdt
> 2 8 64 2 active sync /dev/sde
> 3 65 96 3 active sync /dev/sdw
> 4 8 112 4 active sync /dev/sdh
> 5 65 144 5 active sync /dev/sdz
> 6 8 160 6 active sync /dev/sdk
> 7 65 192 7 active sync /dev/sdac
> 8 8 208 8 active sync /dev/sdn
> 9 65 240 9 active sync /dev/sdaf
> 10 65 0 10 active sync /dev/sdq
> 11 66 32 11 active sync /dev/sdai
> 12 8 32 12 active sync /dev/sdc
> 13 65 64 13 active sync /dev/sdu
> 14 8 80 14 active sync /dev/sdf
> 15 65 112 15 active sync /dev/sdx
> 16 8 128 16 active sync /dev/sdi
> 17 65 160 17 active sync /dev/sdaa
> 18 8 176 18 active sync /dev/sdl
> 19 65 208 19 active sync /dev/sdad
> 20 8 224 20 active sync /dev/sdo
>
> Please notice: This RAID10 device has a size of 19,533,829,120K,
> exactly the same size as the contained XFS filesystem.
>
> Immediately after the RAID10 reshape operation started, the
> XFS filesystem reported I/O errors and was severely damaged.
> I waited for the reshape operation to finish and tried to repair
> the filesystem with xfs_repair (version 3.1.11), but xfs_repair
> crashed, so I tried the 4.9.0 version of xfs_repair with no luck
> either.
>
> /dev/md5 is now mounted ro,norecovery with an overlay filesystem
> on top of it (thanks very much to Andreas for that idea) and I have
> set up a new server today. Rsyncing the data to the new server will
> take a while and I'm sure I will stumble on lots of corrupted files.
> I proceeded from XFS to ZFS (skipped YFS) so lengthy reshape
> operations won't happen in the future anymore.
>
> Here are the relevant log messages:
>
> >Jan 6 14:45:00 backup kernel: md: reshape of RAID array md5
> >Jan 6 14:45:00 backup kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> >Jan 6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> >Jan 6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
> >Jan 6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> >Jan 6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> >Jan 6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> >Jan 6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> >... hundreds of the above XFS-messages deleted
> >Jan 6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected. Shutting down filesystem
> >Jan 6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)
>
> Please notice: no error message about hardware-problems.
> All 21 disks are fine and the next messages from the
> md-driver was:
>
> >Jan 7 02:28:02 backup kernel: md: md5: reshape done.
> >Jan 7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680
>
> I'm wondering about one thing: the first xfs message is about a
> metadata I/O error on block 0x12c08f360. Since the xfs filesystem
I'm sure Dave will have more to say about this, but...
"block 0x12c08f360" == units of sectors, not fs blocks.
IOWs, this IO error happened at offset 2,577,280,712,704 (~2.5TB)
XFS doesn't change the fs size until you tell it to (via growfs);
even if the underlying storage geometry changes, XFS won't act on it
until the admin tells it to.
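The arithmetic behind that offset can be reproduced with shell arithmetic; the numbers come from the error message quoted above:

```shell
# The block number in the XFS error message is a count of 512-byte
# sectors, so the byte offset of the failing metadata read is:
echo $(( 0x12c08f360 * 512 ))   # prints 2577280712704, i.e. ~2.5 TB
```

That offset is well inside the 20 TB device, so the I/O was not past the end of the array.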
What did xfs_repair do?
--D
> has a blocksize of 4K this block is located at position 20135005568K
> which is beyond the end of the RAID10 device. No wonder that the
> xfs driver receives an I/O error. And also no wonder that the
> filesystem is severely corrupted right now.
>
> Question 1: How did the xfs driver know on Jan 6 that the RAID10
> device was about to be increased from 20TB to 21TB on Jan 7?
>
> Question 2: Why did the xfs driver start to use additional
> space that was not yet there, without me executing xfs_growfs?
>
> This looks like a severe XFS-problem to me.
>
> But my hope is that all the data that was within the filesystem
> before Jan 6 14:45 is not involved in the corruption. If xfs
> started to use space beyond the end of the underlying raid
> device this should have affected only data that was created,
> modified or deleted after Jan 6 14:45.
>
> If that was true we could clearly distinguish between data
> that we must dump and data that we can keep. The machine is
> our backup system (as you may have guessed from its name)
> and I would like to keep old backup-files.
>
> I remember that mkfs.xfs is clever enough to adapt the
> filesystem parameters to the underlying hardware of the
> block device that the xfs filesystem is created on. Hence
> from the xfs drivers point of view the underlying block
> device is not just a sequence of data blocks, but the xfs
> driver knows something about the layout of the underlying
> hardware.
>
> If that was true - how does the xfs driver react if that
> information about the layout of the underlying hardware
> changes while the xfs-filesystem is mounted?
>
> Seems to be an interesting problem
>
> Kind regards
>
> Peter Koch
>
* Re: Growing RAID10 with active XFS filesystem
2018-01-08 19:26 ` Darrick J. Wong
@ 2018-01-08 22:01 ` Dave Chinner
2018-01-08 23:44 ` xfs.pkoch
2018-01-09 9:36 ` Wols Lists
0 siblings, 2 replies; 37+ messages in thread
From: Dave Chinner @ 2018-01-08 22:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: xfs.pkoch, linux-xfs, linux-raid
[cc linux-raid, like the OP intended to do]
[For XFS folk, the original linux-raid thread is here:
https://marc.info/?l=linux-raid&m=151525346428531&w=2 ]
On Mon, Jan 08, 2018 at 11:26:07AM -0800, Darrick J. Wong wrote:
> On Mon, Jan 08, 2018 at 08:08:09PM +0100, xfs.pkoch@dfgh.net wrote:
> > Dear Linux-Raid and Linux-XFS experts:
> >
> > I'm posting this on both the linux-raid and linux-xfs
> > mailing lists, as it's not clear at this point whether
> > this is an MD or XFS problem.
> >
> > I have described my problem in a recent posting on
> > linux-raid and Wol's conclusion was:
> >
> > >In other words, one or more of the following three are true :-
> > >1) The OP has been caught by some random act of God
> > >2) There's a serious flaw in "mdadm --grow"
> > >3) There's a serious flaw in xfs
> > >
> > >Cheers,
> > >Wol
> >
> > There's very important data on our RAID10 device but I doubt
> > it's important enough for God to take a hand in our storage.
> >
> > But let me first summarize what happened and why I believe that
> > this is an XFS-problem:
The evidence doesn't support that claim.
tl;dr: block device IO errors occurred immediately after a MD
reshape started and the filesystem simply reported and responded
appropriately to those MD device IO errors.
> > Machine running Linux 3.14.69 with no kernel-patches.
So really old kernel....
> > XFS filesystem was created with XFS userutils 3.1.11.
And a really old userspace, too.
> > I did a fresh compile of xfsprogs-4.9.0 yesterday when
> > I realized that the 3.1.11 xfs_repair did not help.
> >
> > mdadm is V3.3
> >
> > /dev/md5 is a RAID10-device that was created in Feb 2013
> > with 10 2TB disks and an ext3 filesystem on it. Once in a
> > while I added two more 2TB disks. Reshaping was done
> > while the ext3 filesystem was mounted. Then the ext3
> > filesystem was unmounted, resized, and mounted again. That
> > worked until I resized the RAID10 from 16 to 20 disks and
> > realized that ext3 does not support filesystems >16TB.
> >
> > I switched to XFS and created a 20TB filesystem. Here are
> > the details:
> >
> > # xfs_info /dev/md5
> > meta-data=/dev/md5 isize=256 agcount=32, agsize=152608128 blks
> > = sectsz=512 attr=2
> > data = bsize=4096 blocks=4883457280, imaxpct=5
> > = sunit=128 swidth=1280 blks
> > naming =version 2 bsize=4096 ascii-ci=0
> > log =internal bsize=4096 blocks=521728, version=2
> > = sectsz=512 sunit=8 blks, lazy-count=1
> > realtime =none extsz=4096 blocks=0, rtextents=0
> >
> > Please notice: This XFS filesystem has a size of
> > 4883457280*4K = 19,533,829,120K
> >
> > On Saturday I tried to add two more 2TB disks to the RAID10
> > and the XFS filesystem was mounted (and in medium use) at that
> > time. Commands were:
> >
> > # mdadm /dev/md5 --add /dev/sdo
> > # mdadm --grow /dev/md5 --raid-devices=21
You added one device, not two. That's a recipe for a reshape that
moves every block of data in the device to a different location.
> > # mdadm -D /dev/md5
> > /dev/md5:
> > Version : 1.2
> > Creation Time : Sun Feb 10 16:58:10 2013
> > Raid Level : raid10
> > Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
> > Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
> > Raid Devices : 21
> > Total Devices : 21
> > Persistence : Superblock is persistent
> >
> > Update Time : Sat Jan 6 15:08:37 2018
> > State : clean, reshaping
> > Active Devices : 21
> > Working Devices : 21
> > Failed Devices : 0
> > Spare Devices : 0
> >
> > Layout : near=2
> > Chunk Size : 512K
> >
> > Reshape Status : 1% complete
> > Delta Devices : 1, (20->21)
Yup, 21 devices in a RAID 10. That's a really nasty config for
RAID10 which requires an even number of disks to mirror correctly.
Why does MD even allow this sort of whacky, sub-optimal
configuration?
[....]
> > Immediately after the RAID10 reshape operation started, the
> > XFS filesystem reported I/O errors and was severely damaged.
> > I waited for the reshape operation to finish and tried to repair
> > the filesystem with xfs_repair (version 3.1.11), but xfs_repair
> > crashed, so I tried the 4.9.0 version of xfs_repair with no luck
> > either.
[...]
> > Here are the relevant log messages:
> >
> > >Jan 6 14:45:00 backup kernel: md: reshape of RAID array md5
> > >Jan 6 14:45:00 backup kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> > >Jan 6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> > >Jan 6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
> > >Jan 6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> > >Jan 6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> > >Jan 6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> > >Jan 6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> > >... hundreds of the above XFS-messages deleted
> > >Jan 6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected. Shutting down filesystem
> > >Jan 6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)
IOWs, within /a second/ of the reshape starting, the active, error
free XFS filesystem received hundreds of IO errors on both read and
write IOs from the MD device and shut down the filesystem.
XFS is just the messenger here - something has gone badly wrong at
the MD layer when the reshape kicked off.
> > Please notice: no error message about hardware-problems.
> > All 21 disks are fine and the next messages from the
> > md-driver was:
> >
> > >Jan 7 02:28:02 backup kernel: md: md5: reshape done.
> > >Jan 7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680
Ok, so the reshape took about 12 hours to run, and it grew to 21TB.
A 12 hour long operation is what I'd expect for a major
rearrangement of every block in the MD device....
> > I'm wondering about one thing: the first xfs message is about a
> > metadata I/O error on block 0x12c08f360. Since the xfs filesystem
>
> I'm sure Dave will have more to say about this, but...
>
> "block 0x12c08f360" == units of sectors, not fs blocks.
>
> IOWs, this IO error happened at offset 2,577,280,712,704 (~2.5TB)
That's correct, Darrick - it's well within the known filesystem
bounds.
> XFS doesn't change the fs size until you tell it to (via growfs);
> even if the underlying storage geometry changes, XFS won't act on it
> until the admin tells it to.
>
> What did xfs_repair do?
Yeah, I'd like to see that output (from 4.9.0) too, but experience
tells me it did nothing helpful w.r.t data recovery from a badly
corrupted device.... :/
> > This looks like a severe XFS-problem to me.
I'll say this again: the evidence does not support that conclusion.
XFS has done exactly the right thing to protect the filesystem when
fatal IO errors started occurring at the block layer: it shut down
and stopped trying to modify the filesystem. What caused those
errors and any filesystem and/or data corruption to occur, OTOH, has
nothing to do with XFS.
> > But my hope is that all the data that was within the filesystem
> > before Jan 6 14:45 is not involved in the corruption. If xfs
> > started to use space beyond the end of the underlying raid
> > device this should have affected only data that was created,
> > modified or deleted after Jan 6 14:45.
Experience tells me that you cannot trust a single byte of data in
that block device now, regardless of its age and when it was last
modified. The MD reshape may have completed, but what it did is
highly questionable and you need to verify the contents of every
single directory and file.
When this sort of things happens, often the best data recovery
strategy (i.e. fastest and most reliable) is to simply throw
everything away and restore from known good backups...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Growing RAID10 with active XFS filesystem
2018-01-08 22:01 ` Dave Chinner
@ 2018-01-08 23:44 ` xfs.pkoch
2018-01-09 9:36 ` Wols Lists
1 sibling, 0 replies; 37+ messages in thread
From: mdraid.pkoch @ 2018-01-08 23:44 UTC (permalink / raw)
Cc: linux-raid, xfs.pkoch.f85f873813.linux-xfs#vger.linux-raid
Hi Dave and Darrick:
Thanks for the answers - seems like my interpretation of the
block number was wrong.
So the culprit is the md driver again: it's producing I/O errors
without any hardware errors.
The machine was setup in 2013 so everything is 5 years old
besides the xfsprogs which I compiled yesterday.
xfs_repair output is very long and my impression is that things
were getting worse with every invocation. xfs_repair itself seemed
to have problems. I don't remember the exact message, but
xfs_repair was complaining a lot about a failed write verifier test.
I will copy as much data as I can from the corrupt filesystem to
our new system. For most files we have md5 checksums so I
can test whether their contents are OK or not.
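A minimal sketch of that verification step, assuming a checksum list was saved earlier with md5sum (the file names and mount point here are hypothetical):

```shell
# Compare restored files against a previously saved checksum list.
# 'backup.md5' is a hypothetical list created earlier with:
#   md5sum path/to/files... > backup.md5
cd /mnt/restored-backup                     # hypothetical mount point
md5sum -c backup.md5 2>/dev/null \
  | grep -v ': OK$' > corrupted-files.txt   # keep only failing entries
wc -l < corrupted-files.txt                 # count of corrupted files
```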
I started xfs_repair -n 20 minutes ago and it has already printed
1165088 lines of messages.
Here are some of these lines:
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
block (30,18106993-18106993) multiply claimed by cnt space tree, state - 2
block (30,18892669-18892669) multiply claimed by cnt space tree, state - 2
block (30,18904839-18904839) multiply claimed by cnt space tree, state - 2
block (30,19815542-19815542) multiply claimed by cnt space tree, state - 2
block (30,15440783-15440783) multiply claimed by cnt space tree, state - 2
block (30,17658438-17658438) multiply claimed by cnt space tree, state - 2
block (30,18749167-18749167) multiply claimed by cnt space tree, state - 2
block (30,19778684-19778684) multiply claimed by cnt space tree, state - 2
block (30,19951864-19951864) multiply claimed by cnt space tree, state - 2
block (30,19816441-19816441) multiply claimed by cnt space tree, state - 2
block (30,18742154-18742154) multiply claimed by cnt space tree, state - 2
block (30,18132613-18132613) multiply claimed by cnt space tree, state - 2
block (30,15502870-15502870) multiply claimed by cnt space tree, state - 2
agf_freeblks 12543116, counted 12543086 in ag 9
block (30,18168170-18168170) multiply claimed by cnt space tree, state - 2
agf_freeblks 6317001, counted 6316991 in ag 25
agf_freeblks 8962131, counted 8962128 in ag 0
block (1,6142-6142) multiply claimed by cnt space tree, state - 2
block (1,6150-6150) multiply claimed by cnt space tree, state - 2
agf_freeblks 8043945, counted 8043942 in ag 21
agf_freeblks 6833504, counted 6833499 in ag 24
block (1,5777-5777) multiply claimed by cnt space tree, state - 2
agf_freeblks 9032166, counted 9032109 in ag 19
agf_freeblks 16877231, counted 16874747 in ag 30
agf_freeblks 6645873, counted 6645861 in ag 27
block (1,8388992-8388992) multiply claimed by cnt space tree, state - 2
agf_freeblks 21229271, counted 21234873 in ag 1
agf_freeblks 11090766, counted 11090638 in ag 14
agf_freeblks 8424280, counted 8424279 in ag 13
agf_freeblks 1618763, counted 1618764 in ag 16
agf_freeblks 5380834, counted 5380831 in ag 15
agf_freeblks 11211636, counted 11211543 in ag 12
agf_freeblks 14135461, counted 14135434 in ag 11
sb_fdblocks 344528311, counted 344530989
- 00:51:27: scanning filesystem freespace - 32 of 32 allocation
groups done
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- 00:51:27: scanning agi unlinked lists - 32 of 32 allocation
groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 30
- agno = 15
bad nblocks 17 for inode 64425222202, would reset to 18
bad nextents 12 for inode 64425222202, would reset to 13
Invalid inode number 0xfeffffffffffffff
xfs_dir_ino_validate: XFS_ERROR_REPORT
Metadata corruption detected at xfs_dir3_data block 0x4438f5c60/0x1000
entry "/463380382.M621183P10446.mail,S=2075,W=2116" at block 12 offset
2192 in directory inode 64425222202 references invalid inode
18374686479671623679
would clear inode number in entry at offset 2192...
entry at block 12 offset 2192 in directory inode 64425222202 has illegal
name "/463380382.M621183P10446.mail,S=2075,W=2116": would clear entry
entry "/463466963.M420615P6276.mail,S=2202,W=2261" at block 12 offset
2472 in directory inode 64425222202 references invalid inode
18374686479671623679
would clear inode number in entry at offset 2472...
entry at block 12 offset 2472 in directory inode 64425222202 has illegal
name "/463466963.M420615P6276.mail,S=2202,W=2261": would clear entry
entry "/463980159.M342359P4014.mail,S=3285,W=3378" at block 12 offset
3376 in directory inode 64425222202 references invalid inode
18374686479671623679
would clear inode number in entry at offset 3376...
entry at block 12 offset 3376 in directory inode 64425222202 has illegal
name "/463980159.M342359P4014.mail,S=3285,W=3378": would clear entry
entry "/463984373.M513992P19720.mail,S=10818,W=11143" at block 12 offset
3432 in directory inode 64425222202 references invalid inode
18374686479671623679
.....
..... thousands of messages about directory inodes referencing inode
0xfeffffffffffffff
..... and illegal names where the first character has been replaced by /
..... most agnos have these messages, but some agnos are fine
.....
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 01:10:03: setting up duplicate extent list - 32 of 32
allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 15
- agno = 30
- agno = 0
entry ".." at block 0 offset 32 in directory inode 128849025043
references non-existent inode 124835665944
entry ".." at block 0 offset 32 in directory inode 128849348634
references non-existent inode 124554268735
entry ".." at block 0 offset 32 in directory inode 128849348643
references non-existent inode 124554274826
entry ".." at block 0 offset 32 in directory inode 128849350697
references non-existent inode 4295153945
entry ".." at block 0 offset 32 in directory inode 128849352738
references non-existent inode 124554268679
entry ".." at block 0 offset 32 in directory inode 128849352744
references non-existent inode 124554268687
entry ".." at block 0 offset 32 in directory inode 128849393697
references non-existent inode 124554315786
entry ".." at block 0 offset 32 in directory inode 128849397786
references non-existent inode 124678412289
entry ".." at block 0 offset 32 in directory inode 128849397815
references non-existent inode 124678412340
entry ".." at block 0 offset 32 in directory inode 128849397821
references non-existent inode 4295878668
entry ".." at block 0 offset 32 in directory inode 128849399852
references non-existent inode 124554274851
entry ".." at block 0 offset 32 in directory inode 128849399867
references non-existent inode 4295020775
entry ".." at block 0 offset 32 in directory inode 128849403936
references non-existent inode 124554340368
entry ".." at block 0 offset 32 in directory inode 128849412109
references non-existent inode 124554403877
entry ".." at block 0 offset 32 in directory inode 64425142305
references non-existent inode 4295153925
bad nblocks 17 for inode 64425222202, would reset to 18
bad nextents 12 for inode 64425222202, would reset to 13
Invalid inode number 0xfeffffffffffffff
xfs_dir_ino_validate: XFS_ERROR_REPORT
Metadata corruption detected at xfs_dir3_data block 0x4438f5c60/0x1000
would clear entry
would clear entry
would clear entry
.....
..... entry ".." at block 0 offset 32 - messages repeat over and over
with differnt inodes
.....
Phase 5 which produced a lot of messages as well is missing
when the -n option is used.
> You added one device, not two. That's a recipe for a reshape that
> moves every block of data in the device to a different location.
Of course I was planning to add another one. If I add both in one
step I cannot predict which disk will end up in disk set-A and which
will end up in disk set-B. Since both disk sets are at different location
I have to add the additional disk at location-A first and then the second
disk at location B. Adding two disks in one step does move every
piece of data as well.
> IOWs, within /a second/ of the reshape starting, the active, error
> free XFS filesystem received hundreds of IO errors on both read and
> write IOs from the MD device and shut down the filesystem.
>
> XFS is just the messenger here - something has gone badly wrong at
> the MD layer when the reshape kicked off.
You are right - and this has happened without hardware-problems.
> Yeah, I'd like to see that output (from 4.9.0) too, but experience
> tells me it did nothing helpful w.r.t data recovery from a badly
> corrupted device.... :/
You are right again.
>> This looks like a severe XFS-problem to me.
> I'll say this again: tHe evidence does not support that conclusion.
So let's see what the MD-experts have to say.
Kind regards
Peter
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
@ 2018-01-08 23:44 ` xfs.pkoch
0 siblings, 0 replies; 37+ messages in thread
From: xfs.pkoch @ 2018-01-08 23:44 UTC (permalink / raw)
Cc: mdraid.pkoch.3c0485e297.linux-raid#vger.linux-xfs
Hi Dave and Darrick:
Thanks for your answers - it seems my interpretation of the
block number was wrong.
So the culprit is the md-driver again. It's producing I/O-errors
without any hardware-errors.
The machine was set up in 2013 so everything is 5 years old
except for the xfsprogs, which I compiled yesterday.
xfs_repair output is very long and my impression is that things
were getting worse with every invocation. xfs_repair itself seemed
to have problems. I don't remember the exact message but
xfs_repair was complaining a lot about a failed write verifier test.
I will copy as much data as I can from the corrupt filesystem to
our new system. For most files we have md5 checksums so I
can test whether their contents are OK or not.
I started xfs_repair -n 20 minutes ago and it has already printed
1165088 lines of messages.
Here are some of these lines:
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
block (30,18106993-18106993) multiply claimed by cnt space tree, state - 2
block (30,18892669-18892669) multiply claimed by cnt space tree, state - 2
block (30,18904839-18904839) multiply claimed by cnt space tree, state - 2
block (30,19815542-19815542) multiply claimed by cnt space tree, state - 2
block (30,15440783-15440783) multiply claimed by cnt space tree, state - 2
block (30,17658438-17658438) multiply claimed by cnt space tree, state - 2
block (30,18749167-18749167) multiply claimed by cnt space tree, state - 2
block (30,19778684-19778684) multiply claimed by cnt space tree, state - 2
block (30,19951864-19951864) multiply claimed by cnt space tree, state - 2
block (30,19816441-19816441) multiply claimed by cnt space tree, state - 2
block (30,18742154-18742154) multiply claimed by cnt space tree, state - 2
block (30,18132613-18132613) multiply claimed by cnt space tree, state - 2
block (30,15502870-15502870) multiply claimed by cnt space tree, state - 2
agf_freeblks 12543116, counted 12543086 in ag 9
block (30,18168170-18168170) multiply claimed by cnt space tree, state - 2
agf_freeblks 6317001, counted 6316991 in ag 25
agf_freeblks 8962131, counted 8962128 in ag 0
block (1,6142-6142) multiply claimed by cnt space tree, state - 2
block (1,6150-6150) multiply claimed by cnt space tree, state - 2
agf_freeblks 8043945, counted 8043942 in ag 21
agf_freeblks 6833504, counted 6833499 in ag 24
block (1,5777-5777) multiply claimed by cnt space tree, state - 2
agf_freeblks 9032166, counted 9032109 in ag 19
agf_freeblks 16877231, counted 16874747 in ag 30
agf_freeblks 6645873, counted 6645861 in ag 27
block (1,8388992-8388992) multiply claimed by cnt space tree, state - 2
agf_freeblks 21229271, counted 21234873 in ag 1
agf_freeblks 11090766, counted 11090638 in ag 14
agf_freeblks 8424280, counted 8424279 in ag 13
agf_freeblks 1618763, counted 1618764 in ag 16
agf_freeblks 5380834, counted 5380831 in ag 15
agf_freeblks 11211636, counted 11211543 in ag 12
agf_freeblks 14135461, counted 14135434 in ag 11
sb_fdblocks 344528311, counted 344530989
- 00:51:27: scanning filesystem freespace - 32 of 32 allocation
groups done
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- 00:51:27: scanning agi unlinked lists - 32 of 32 allocation
groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 30
- agno = 15
bad nblocks 17 for inode 64425222202, would reset to 18
bad nextents 12 for inode 64425222202, would reset to 13
Invalid inode number 0xfeffffffffffffff
xfs_dir_ino_validate: XFS_ERROR_REPORT
Metadata corruption detected at xfs_dir3_data block 0x4438f5c60/0x1000
entry "/463380382.M621183P10446.mail,S=2075,W=2116" at block 12 offset
2192 in directory inode 64425222202 references invalid inode
18374686479671623679
would clear inode number in entry at offset 2192...
entry at block 12 offset 2192 in directory inode 64425222202 has illegal
name "/463380382.M621183P10446.mail,S=2075,W=2116": would clear entry
entry "/463466963.M420615P6276.mail,S=2202,W=2261" at block 12 offset
2472 in directory inode 64425222202 references invalid inode
18374686479671623679
would clear inode number in entry at offset 2472...
entry at block 12 offset 2472 in directory inode 64425222202 has illegal
name "/463466963.M420615P6276.mail,S=2202,W=2261": would clear entry
entry "/463980159.M342359P4014.mail,S=3285,W=3378" at block 12 offset
3376 in directory inode 64425222202 references invalid inode
18374686479671623679
would clear inode number in entry at offset 3376...
entry at block 12 offset 3376 in directory inode 64425222202 has illegal
name "/463980159.M342359P4014.mail,S=3285,W=3378": would clear entry
entry "/463984373.M513992P19720.mail,S=10818,W=11143" at block 12 offset
3432 in directory inode 64425222202 references invalid inode
18374686479671623679
.....
..... thousands of messages about directory inodes referencing inode
0xfeffffffffffffff
..... and illegal names where the first character has been replaced by /
..... most agnos have these messages, but some agnos are fine
.....
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 01:10:03: setting up duplicate extent list - 32 of 32
allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 15
- agno = 30
- agno = 0
entry ".." at block 0 offset 32 in directory inode 128849025043
references non-existent inode 124835665944
entry ".." at block 0 offset 32 in directory inode 128849348634
references non-existent inode 124554268735
entry ".." at block 0 offset 32 in directory inode 128849348643
references non-existent inode 124554274826
entry ".." at block 0 offset 32 in directory inode 128849350697
references non-existent inode 4295153945
entry ".." at block 0 offset 32 in directory inode 128849352738
references non-existent inode 124554268679
entry ".." at block 0 offset 32 in directory inode 128849352744
references non-existent inode 124554268687
entry ".." at block 0 offset 32 in directory inode 128849393697
references non-existent inode 124554315786
entry ".." at block 0 offset 32 in directory inode 128849397786
references non-existent inode 124678412289
entry ".." at block 0 offset 32 in directory inode 128849397815
references non-existent inode 124678412340
entry ".." at block 0 offset 32 in directory inode 128849397821
references non-existent inode 4295878668
entry ".." at block 0 offset 32 in directory inode 128849399852
references non-existent inode 124554274851
entry ".." at block 0 offset 32 in directory inode 128849399867
references non-existent inode 4295020775
entry ".." at block 0 offset 32 in directory inode 128849403936
references non-existent inode 124554340368
entry ".." at block 0 offset 32 in directory inode 128849412109
references non-existent inode 124554403877
entry ".." at block 0 offset 32 in directory inode 64425142305
references non-existent inode 4295153925
bad nblocks 17 for inode 64425222202, would reset to 18
bad nextents 12 for inode 64425222202, would reset to 13
Invalid inode number 0xfeffffffffffffff
xfs_dir_ino_validate: XFS_ERROR_REPORT
Metadata corruption detected at xfs_dir3_data block 0x4438f5c60/0x1000
would clear entry
would clear entry
would clear entry
.....
..... entry ".." at block 0 offset 32 - messages repeat over and over
with different inodes
.....
Phase 5, which produced a lot of messages as well, is missing
when the -n option is used.
> You added one device, not two. That's a recipe for a reshape that
> moves every block of data in the device to a different location.
Of course I was planning to add another one. If I add both in one
step I cannot predict which disk will end up in disk set-A and which
will end up in disk set-B. Since both disk sets are at different
locations I have to add the additional disk at location-A first and
then the second disk at location-B. Adding two disks in one step moves
every piece of data as well.
> IOWs, within /a second/ of the reshape starting, the active, error
> free XFS filesystem received hundreds of IO errors on both read and
> write IOs from the MD device and shut down the filesystem.
>
> XFS is just the messenger here - something has gone badly wrong at
> the MD layer when the reshape kicked off.
You are right - and this has happened without hardware-problems.
> Yeah, I'd like to see that output (from 4.9.0) too, but experience
> tells me it did nothing helpful w.r.t data recovery from a badly
> corrupted device.... :/
You are right again.
>> This looks like a severe XFS-problem to me.
> I'll say this again: the evidence does not support that conclusion.
So let's see what the MD-experts have to say.
Kind regards
Peter
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-08 22:01 ` Dave Chinner
2018-01-08 23:44 ` xfs.pkoch
@ 2018-01-09 9:36 ` Wols Lists
2018-01-09 21:47 ` IMAP-FCC:Sent
2018-01-09 22:25 ` Dave Chinner
1 sibling, 2 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-09 9:36 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, linux-raid
On 08/01/18 22:01, Dave Chinner wrote:
> Yup, 21 devices in a RAID 10. That's a really nasty config for
> RAID10 which requires an even number of disks to mirror correctly.
> Why does MD even allow this sort of whacky, sub-optimal
> configuration?
Just to point out - if this is raid-10 (and not raid-1+0, which is a
completely different beast) this is actually a normal linux config. I'm
planning to set up a raid-10 across 3 devices. What happens is that
raid-10 writes X copies across Y devices. If X = Y then it's a
normal mirror config, if X < Y it makes good use of space (and if X > Y
it doesn't make sense :-)
SDA: 1, 2, 4, 5
SDB: 1, 3, 4, 6
SDC: 2, 3, 5, 6
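The table above can be reproduced with a small sketch of md's raid10
"near" placement rule - consecutive copy slots are dealt round-robin
across the devices. This is a simplified illustration (chunk numbers
and device names as in the table, not md's actual code):

```python
def near_layout(num_chunks, num_devices, copies=2):
    """Simplified md raid10 'near' layout: copy slots go round-robin
    across devices, so each chunk's copies land on adjacent devices."""
    per_dev = {d: [] for d in range(num_devices)}
    for chunk in range(num_chunks):
        for copy in range(copies):
            slot = chunk * copies + copy
            per_dev[slot % num_devices].append(chunk + 1)  # 1-indexed as above
    return per_dev

for name, chunks in zip(("SDA", "SDB", "SDC"), near_layout(6, 3).values()):
    print(name, chunks)
# SDA [1, 2, 4, 5]
# SDB [1, 3, 4, 6]
# SDC [2, 3, 5, 6]
```

Each chunk exists on exactly two of the three devices, which is how
raid-10 keeps mirror-style redundancy on an odd device count.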
Cheers,
Wol
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-09 9:36 ` Wols Lists
@ 2018-01-09 21:47 ` IMAP-FCC:Sent
2018-01-09 22:25 ` Dave Chinner
1 sibling, 0 replies; 37+ messages in thread
From: IMAP-FCC:Sent @ 2018-01-09 21:47 UTC (permalink / raw)
To: Wols Lists; +Cc: Dave Chinner, linux-xfs, linux-raid
>>>>> "Wols" == Wols Lists <antlists@youngman.org.uk> writes:
Wols> On 08/01/18 22:01, Dave Chinner wrote:
>> Yup, 21 devices in a RAID 10. That's a really nasty config for
>> RAID10 which requires an even number of disks to mirror correctly.
>> Why does MD even allow this sort of whacky, sub-optimal
>> configuration?
Wols> Just to point out - if this is raid-10 (and not raid-1+0, which is a
Wols> completely different beast) this is actually a normal linux config. I'm
Wols> planning to set up a raid-10 across 3 devices. What happens is that
Wols> raid-10 writes X copies across Y devices. If X = Y then it's a
Wols> normal mirror config, if X < Y it makes good use of space (and if X > Y
Wols> it doesn't make sense :-)
Wols> SDA: 1, 2, 4, 5
Wols> SDB: 1, 3, 4, 6
Wols> SDC: 2, 3, 5, 6
This is a nice idea, but honestly, I think it's just asking for
trouble down the line. It's almost more like RAID4 in some ways, but
without parity, just copies.
So I suspect that the problem that's happened here is that some bug in
RAID10 has been found when you do a re-shape (on an old kernel,
RHEL6? Debian? Not clear...) with a large number of devices. Since
you have to re-balance the data as new disks are added... it might get
problematic.
In any case, I would recommend that you simply set up RAID1 pairs, then
pull them all into a VG, then create an LV which spans all the RAID1
pairs. Then you can add new pairs to the system and grow/shrink the
array easily.
This also lets you replace the 2tb disks with 4tb or larger disks more
easily as time goes on. And of course I'd *also* put in some hot
spares.
But then again, if this is just a dumping ground for data with mostly
reads, or just large sequential writes (say for media, images, video,
etc) then going to RAID6 sets (say 10 or so disks per set) which you
THEN stripe over using LVM is a better way to go.
I'll see if I can find some time to try setting up a bunch of test
loop devices on my own to see what happens here. But I'm also running
a newer kernel and the Debian Jessie distribution.
But it will probably be Neil who needs to debug the real issue, I
don't know the code well at all.
John
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-09 9:36 ` Wols Lists
2018-01-09 21:47 ` IMAP-FCC:Sent
@ 2018-01-09 22:25 ` Dave Chinner
2018-01-09 22:32 ` Reindl Harald
` (2 more replies)
1 sibling, 3 replies; 37+ messages in thread
From: Dave Chinner @ 2018-01-09 22:25 UTC (permalink / raw)
To: Wols Lists; +Cc: linux-xfs, linux-raid
On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
> On 08/01/18 22:01, Dave Chinner wrote:
> > Yup, 21 devices in a RAID 10. That's a really nasty config for
> > RAID10 which requires an even number of disks to mirror correctly.
> > Why does MD even allow this sort of whacky, sub-optimal
> > configuration?
>
> Just to point out - if this is raid-10 (and not raid-1+0, which is a
> completely different beast) this is actually a normal linux config. I'm
> planning to set up a raid-10 across 3 devices. What happens is that
> raid-10 writes X copies across Y devices. If X = Y then it's a
> normal mirror config, if X < Y it makes good use of space (and if X > Y
> it doesn't make sense :-)
>
> SDA: 1, 2, 4, 5
>
> SDB: 1, 3, 4, 6
>
> SDC: 2, 3, 5, 6
It's nice to know that MD has redefined RAID-10 to be different to
the industry standard definition that has been used for 20 years and
optimised filesystem layouts for. Rotoring data across odd numbers
of disks like this is going to really, really suck on filesystems
that are stripe layout aware..
For example, XFS has hot-spot prevention algorithms in its
internal physical layout for striped devices. It aligns AGs across
different stripe units so that metadata and data doesn't all get
aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
rotoring across disks themselves, then we're going to end up back in
the same position we started with - multiple AGs aligned to the
same disk.
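A toy model of that AG placement (hypothetical numbers: 10 data disks,
an AG size that is an exact multiple of the stripe width, and the
rotation approximated as one extra stripe unit per AG - not XFS's
actual allocator code):

```python
def ag_start_disk(ag, agsize_chunks, data_disks, rotate):
    """Which data disk does an AG's first chunk land on?"""
    start = ag * agsize_chunks
    if rotate:
        start += ag  # shift each AG by one extra stripe unit (chunk)
    return start % data_disks

AGSIZE = 40  # chunks; an exact multiple of the 10-disk stripe width
print([ag_start_disk(a, AGSIZE, 10, False) for a in range(8)])
# [0, 0, 0, 0, 0, 0, 0, 0]  -> every AG header hits the same disk
print([ag_start_disk(a, AGSIZE, 10, True) for a in range(8)])
# [0, 1, 2, 3, 4, 5, 6, 7]  -> AG starts spread across the disks
```

The rotation only spreads load as intended when the chunk-to-disk
mapping stays a fixed, predictable stride.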
The result is that many XFS workloads are going to hotspot disks and
result in unbalanced load when there are an odd number of disks in a
RAID-10 array. Actually, it's probably worse than having no
alignment, because it makes hotspot occurrence and behaviour very
unpredictable.
Worse is the fact that there's absolutely nothing we can do to
optimise allocation alignment or IO behaviour at the filesystem
level. We'll have to make mkfs.xfs aware of this clusterfuck and
turn off stripe alignment when we detect such a layout, but that
doesn't help all the existing user installations out there right
now.
IMO, odd-numbered disks in RAID-10 should be considered harmful and
never used....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-09 22:25 ` Dave Chinner
@ 2018-01-09 22:32 ` Reindl Harald
2018-01-10 6:17 ` Wols Lists
2018-01-10 14:10 ` Phil Turmel
2 siblings, 0 replies; 37+ messages in thread
From: Reindl Harald @ 2018-01-09 22:32 UTC (permalink / raw)
To: Dave Chinner, Wols Lists; +Cc: linux-xfs, linux-raid
Am 09.01.2018 um 23:25 schrieb Dave Chinner:
> On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
>> Just to point out - if this is raid-10 (and not raid-1+0, which is a
>> completely different beast) this is actually a normal linux config. I'm
>> planning to set up a raid-10 across 3 devices. What happens is that
>> raid-10 writes X copies across Y devices. If X = Y then it's a
>> normal mirror config, if X < Y it makes good use of space (and if X > Y
>> it doesn't make sense :-)
>>
> IMO, odd-numbered disks in RAID-10 should be considered harmful and
> never used....
agreed and then "writemostly" could work without the lame excuses that
one could have a crazy RAID10 layout.....
https://www.spinics.net/lists/raid/msg55797.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-09 22:25 ` Dave Chinner
2018-01-09 22:32 ` Reindl Harald
@ 2018-01-10 6:17 ` Wols Lists
2018-01-11 2:14 ` Dave Chinner
2018-01-10 14:10 ` Phil Turmel
2 siblings, 1 reply; 37+ messages in thread
From: Wols Lists @ 2018-01-10 6:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, linux-raid
On 09/01/18 22:25, Dave Chinner wrote:
> On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
>> On 08/01/18 22:01, Dave Chinner wrote:
>>> Yup, 21 devices in a RAID 10. That's a really nasty config for
>>> RAID10 which requires an even number of disks to mirror correctly.
>>> Why does MD even allow this sort of whacky, sub-optimal
>>> configuration?
>>
>> Just to point out - if this is raid-10 (and not raid-1+0, which is a
>> completely different beast) this is actually a normal linux config. I'm
>> planning to set up a raid-10 across 3 devices. What happens is that
>> raid-10 writes X copies across Y devices. If X = Y then it's a
>> normal mirror config, if X < Y it makes good use of space (and if X > Y
>> it doesn't make sense :-)
>>
>> SDA: 1, 2, 4, 5
>>
>> SDB: 1, 3, 4, 6
>>
>> SDC: 2, 3, 5, 6
>
> It's nice to know that MD has redefined RAID-10 to be different to
> the industry standard definition that has been used for 20 years and
> optimised filesystem layouts for. Rotoring data across odd numbers
> of disks like this is going to really, really suck on filesystems
> that are stripe layout aware..
Actually, I thought that the industry standard definition referred to
Raid-1+0. It's just colloquially referred to as raid-10.
>
> For example, XFS has hot-spot prevention algorithms in its
> internal physical layout for striped devices. It aligns AGs across
> different stripe units so that metadata and data doesn't all get
> aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> rotoring across disks themselves, then we're going to end up back in
> the same position we started with - multiple AGs aligned to the
> same disk.
Are you telling me that xfs is aware of the internal structure of an
md-raid array? Given that md-raid is an abstraction layer, this seems
rather dangerous to me - you're breaking the abstraction and this could
explain the OP's problem. Md-raid changed underneath the filesystem, on
the assumption that the filesystem wouldn't notice, and the filesystem
*did*. BANG!
>
> The result is that many XFS workloads are going to hotspot disks and
> result in unbalanced load when there are an odd number of disks in a
> RAID-10 array. Actually, it's probably worse than having no
> alignment, because it makes hotspot occurrence and behaviour very
> unpredictable.
>
> Worse is the fact that there's absolutely nothing we can do to
> optimise allocation alignment or IO behaviour at the filesystem
> level. We'll have to make mkfs.xfs aware of this clusterfuck and
> turn off stripe alignment when we detect such a layout, but that
> doesn't help all the existing user installations out there right
> now.
So you're telling me that mkfs.xfs *IS* aware of the underlying raid
structure. OOPS! What happens when that structure changes, for instance a
raid-5 is converted to raid-6, or another disk is added? If you have to
have special code to deal with md-raid and changes in said raid, where's
the problem with more code for raid-10?
>
> IMO, odd-numbered disks in RAID-10 should be considered harmful and
> never used....
>
What about when you have an odd number of mirrors? :-)
Seriously, can't you just make sure that xfs rotates the stripe units
using a number that is relatively prime to the number of disks? If you
have to notice and adjust for changes in the underlying raid structure
anyway, surely that's no greater hardship?
(Just so's you know who I am, I've taken over editorship of the raid
wiki. This is exactly the stuff that belongs on there, so as soon as I
understand what's going on I'll write it up, and I'm happy to be
educated :-) But I do like to really grasp what's going on, so expect
lots of naive questions ... There's not a lot of information on how raid
and filesystems interact, and I haven't really got to grips with any of
that at the moment, and I don't use xfs. I use ext4 on gentoo, and the
default btrfs on SUSE.)
Cheers,
Wol
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-09 22:25 ` Dave Chinner
2018-01-09 22:32 ` Reindl Harald
2018-01-10 6:17 ` Wols Lists
@ 2018-01-10 14:10 ` Phil Turmel
2018-01-10 21:57 ` Wols Lists
2018-01-11 3:07 ` Dave Chinner
2 siblings, 2 replies; 37+ messages in thread
From: Phil Turmel @ 2018-01-10 14:10 UTC (permalink / raw)
To: Dave Chinner, Wols Lists; +Cc: linux-xfs, linux-raid
On 01/09/2018 05:25 PM, Dave Chinner wrote:
> It's nice to know that MD has redefined RAID-10 to be different to
> the industry standard definition that has been used for 20 years and
> optimised filesystem layouts for. Rotoring data across odd numbers
> of disks like this is going to really, really suck on filesystems
> that are stripe layout aware..
You're a bit late to this party, Dave. MD has implemented raid10 like
this as far back as I can remember, and it is especially valuable when
running more than two copies. Running raid10,n3 across four or five
devices is a nice capacity boost without giving up triple copies (when
multiples of three aren't available) or giving up the performance of
mirrored raid.
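A quick sketch of what raid10,n3 across four devices looks like under
the "near" placement rule (simplified model, hypothetical device
numbering - not md's actual code):

```python
def near_copies(num_chunks, devices, copies):
    """Simplified md raid10 'near' placement: consecutive copy slots
    are dealt round-robin over the devices."""
    return {c: [(c * copies + k) % devices for k in range(copies)]
            for c in range(num_chunks)}

placement = near_copies(8, 4, 3)  # raid10,n3 across 4 devices
print(placement[0], placement[1])
# [0, 1, 2] [3, 0, 1]
```

Every chunk keeps three copies, each on a distinct device, while the
array still offers 4/3 disks' worth of capacity - the case Phil
describes where multiples of three aren't available.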
> For example, XFS has hot-spot prevention algorithms in its
> internal physical layout for striped devices. It aligns AGs across
> different stripe units so that metadata and data doesn't all get
> aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> rotoring across disks themselves, then we're going to end up back in
> the same position we started with - multiple AGs aligned to the
> same disk.
All of MD's default raid5 and raid6 layouts rotate stripes, too, so that
parity and syndrome are distributed uniformly.
> The result is that many XFS workloads are going to hotspot disks and
> result in unbalanced load when there are an odd number of disks in a
> RAID-10 array. Actually, it's probably worse than having no
> alignment, because it makes hotspot occurrence and behaviour very
> unpredictable.
>
> Worse is the fact that there's absolutely nothing we can do to
> optimise allocation alignment or IO behaviour at the filesystem
> level. We'll have to make mkfs.xfs aware of this clusterfuck and
> turn off stripe alignment when we detect such a layout, but that
> doesn't help all the existing user installations out there right
> now.
>
> IMO, odd-numbered disks in RAID-10 should be considered harmful and
> never used....
Users are perfectly able to layer raid1+0 or raid0+1 if they don't want
the features of raid10. Given the advantages of MD's raid10, a pedant
could say XFS's lack of support for it should be considered harmful and
XFS never used. (-:
FWIW, while I'm sometimes a pendant, I'm not in this case. I use both
MD raid10 and xfs.
Phil
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-10 14:10 ` Phil Turmel
@ 2018-01-10 21:57 ` Wols Lists
2018-01-11 3:07 ` Dave Chinner
1 sibling, 0 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-10 21:57 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid
On 10/01/18 14:10, Phil Turmel wrote:
> FWIW, while I'm sometimes a pendant, I'm not in this case. I use both
> MD raid10 and xfs.
So you sometimes like being left hanging ... :-)
Cheers,
Wol
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-10 6:17 ` Wols Lists
@ 2018-01-11 2:14 ` Dave Chinner
2018-01-12 2:16 ` Guoqing Jiang
0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2018-01-11 2:14 UTC (permalink / raw)
To: Wols Lists; +Cc: linux-xfs, linux-raid
On Wed, Jan 10, 2018 at 06:17:11AM +0000, Wols Lists wrote:
> On 09/01/18 22:25, Dave Chinner wrote:
> > On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
> >> On 08/01/18 22:01, Dave Chinner wrote:
> >>> Yup, 21 devices in a RAID 10. That's a really nasty config for
> >>> RAID10 which requires an even number of disks to mirror correctly.
> >>> Why does MD even allow this sort of whacky, sub-optimal
> >>> configuration?
> >>
> >> Just to point out - if this is raid-10 (and not raid-1+0, which is a
> >> completely different beast) this is actually a normal linux config. I'm
> >> planning to set up a raid-10 across 3 devices. What happens is that
> >> raid-10 writes X copies across Y devices. If X = Y then it's a
> >> normal mirror config, if X < Y it makes good use of space (and if X > Y
> >> it doesn't make sense :-)
> >>
> >> SDA: 1, 2, 4, 5
> >>
> >> SDB: 1, 3, 4, 6
> >>
> >> SDC: 2, 3, 5, 6
> >
> > It's nice to know that MD has redefined RAID-10 to be different to
> > the industry standard definition that has been used for 20 years and
> > optimised filesystem layouts for. Rotoring data across odd numbers
> > of disks like this is going to really, really suck on filesystems
> > that are stripe layout aware..
>
> Actually, I thought that the industry standard definition referred to
> Raid-1+0. It's just colloquially referred to as raid-10.
https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10
"However, a nonstandard definition of "RAID 10" was created for the
Linux MD driver"
So it's not just me who thinks what MD is doing is non-standard.
> > For example, XFS has hot-spot prevention algorithms in its
> > internal physical layout for striped devices. It aligns AGs across
> > different stripe units so that metadata and data doesn't all get
> > aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> > rotoring across disks themselves, then we're going to end up back in
> > the same position we started with - multiple AGs aligned to the
> > same disk.
>
> Are you telling me that xfs is aware of the internal structure of an
> md-raid array?
It's aware of the /alignment/ characteristics of block devices, and
these alignment characteristics are exported by MD. e.g. These are
exported in /sys/block/<dev>/queue in
minimum_io_size
- typically the stripe chunk size
optimal_io_size
- typically the stripe width
We get this stuff from DM and MD devices, hardware raid (via scsi
code pages), thinp devices (i.e. to tell us the allocation
granularity so we can align/size IO to match it) and any other block
device that wants to tell us about optimal IO geometry. libblkid
provides us with this information, and it's not just mkfs.xfs that
uses it. e.g. mkfs.ext4 also uses it for the exact same purpose as
XFS....
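Turning those queue hints into mkfs-style stripe values is just a
division by the filesystem block size. A sketch with hypothetical
sysfs values for a 20-disk raid10 with 512 KiB chunks (they match the
sunit=128/swidth=1280 geometry quoted earlier in the thread):

```python
def stripe_geometry(minimum_io_size, optimal_io_size, fs_block=4096):
    """Derive sunit/swidth (in filesystem blocks) from the sysfs hints."""
    sunit = minimum_io_size // fs_block   # stripe chunk size
    swidth = optimal_io_size // fs_block  # full stripe width
    return sunit, swidth

# hypothetical values read from /sys/block/md5/queue/{minimum,optimal}_io_size
print(stripe_geometry(524288, 5242880))  # (128, 1280)
```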
> Given that md-raid is an abstraction layer, this seems
> rather dangerous to me - you're breaking the abstraction and this could
> explain the OP's problem. Md-raid changed underneath the filesystem, on
> the assumption that the filesystem wouldn't notice, and the filesystem
> *did*. BANG!
No, we aren't breaking any abstractions. It's always been the case
that the filesystem needs to be correctly aligned to the underlying
storage geometry if performance is desired. Think about old skool
filesystems that were aware of the old C/H/S layout of drives back
in the 80s. Optimising layouts for "cylinder groups" in the hardware
gave major performance improvements and we can trace ext4's block
group concept all the way back to those specific hardware geometry
requirements.
I suspect that the problem here is that relatively few people
understand why alignment to the underlying storage geometry is
necessary, and don't realise the lengths to which the storage stack
goes to ensure alignment is optimal. It's mostly hidden and automatic
these days because most users lack the knowledge to be able to set
this sort of stuff up correctly.
> > The result is that many XFS workloads are going to hotspot disks and
> > result in unbalanced load when there are an odd number of disks in a
> > RAID-10 array. Actually, it's probably worse than having no
> > alignment, because it makes hotspot occurrence and behaviour very
> > unpredictable.
> >
> > Worse is the fact that there's absolutely nothing we can do to
> > optimise allocation alignment or IO behaviour at the filesystem
> > level. We'll have to make mkfs.xfs aware of this clusterfuck and
> > turn off stripe alignment when we detect such a layout, but that
> > doesn't help all the existing user installations out there right
> > now.
>
> So you're telling me that mkfs.xfs *IS* aware of the underlying raid
> structure. OOPS! What happens when that structure changes for instance a
> raid-5 is converted to raid-6, or another disk is added?
RAID-5 to RAID-6 doesn't change the stripe alignment. That's still
N data disks per stripe, so the geometry and alignment is unchanged
and has no impact on the layout.
But changing the stripe geometry (i.e. number of data disks)
completely fucks IO alignment and that impacts overall storage
performance. None of the existing data in the filesystem is aligned
to the underlying storage anymore so overwrites will cause all sorts
of RMW storms, you'll get IO hotspots because what used to be on
separate disks is now all on the same disk, etc. And the filesystem
won't be able to fix this because, unless you increase the number
of data disks by an integer multiple, the alignment cannot be changed
due to fixed locations of metadata in the filesystem.
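A simplified illustration of why (plain round-robin chunk placement,
ignoring mirroring and md's actual reshape mechanics): offsets that
used to begin a full 10-data-disk stripe end up scattered across all
11 disks after growing by one.

```python
# offsets (in chunks) that began a stripe on 10 data disks
aligned = list(range(0, 200, 10))
before = {c % 10 for c in aligned}  # disk each offset started on, 10 disks
after = {c % 11 for c in aligned}   # same offsets after growing to 11 disks
print(before)      # {0} - every stripe-aligned write started on disk 0
print(len(after))  # 11 - the same offsets now start on every disk
```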
> If you have to
> have special code to deal with md-raid and changes in said raid, where's
> the problem with more code for raid-10?
I didn't say we had code to handle "changes in said raid". That's
explicitly what we /don't have/. To handle a geometry/alignment
change in the underlying storage we have to *resilver the entire
filesystem*. And, well, we can't easily do that because that means
we'd have to completely rewrite and re-index the filesystem. It's
faster, easier and more reliable to dump/mkfs/restore the filesystem
than it is to resilver it.
There are many, many reasons why RAID reshaping is considered harmful
and is not recommended by anyone who understands the whole storage
stack intimately.
> > IMO, odd-numbered disks in RAID-10 should be considered harmful and
> > never used....
> >
> What about when you have an odd number of mirrors? :-)
Be a smart-ass all you want, but it doesn't change the fact that the
"grow-by-one-disk" clusterfuck occurs when you have an odd number of
mirrors, too.
> Seriously, can't you just make sure that xfs rotates the stripe units
> using a number that is relatively prime to the number of disks?
Who said we don't already rotate through stripe units?
And, well, there are situations where ignoring geometry is good
(e.g. delayed allocation allows us to pack lots of small files
together so they aggregate into full stripe writes and avoid RMW
cycles) and there are situations where stripe width rather than
stripe unit alignment is desirable for a single allocation (e.g.
large sequential direct IO writes so we avoid RMW cycles due to
partial stripe overlaps in IO).
These IO alignment optimisations are all done on-the-fly by
filesystems. Filesystems do far more than you realise with the
geometry information they are provided with and that's why assuming
that you can transparently change the storage geometry without the
filesystem (and hence users) caring about such changes is
fundamentally wrong.
> (Just so's you know who I am, I've taken over editorship of the raid
> wiki. This is exactly the stuff that belongs on there, so as soon as I
> understand what's going on I'll write it up, and I'm happy to be
> educated :-) But I do like to really grasp what's going on, so expect
> lots of naive questions ... There's not a lot of information on how raid
> and filesystems interact, and I haven't really got to grips with any of
> that at the moment, and I don't use xfs. I use ext4 on gentoo, and the
> default btrfs on SUSE.)
You've got an awful lot of learning to do, then.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Growing RAID10 with active XFS filesystem
2018-01-10 14:10 ` Phil Turmel
2018-01-10 21:57 ` Wols Lists
@ 2018-01-11 3:07 ` Dave Chinner
2018-01-12 13:32 ` Wols Lists
1 sibling, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2018-01-11 3:07 UTC (permalink / raw)
To: Phil Turmel; +Cc: Wols Lists, linux-xfs, linux-raid
On Wed, Jan 10, 2018 at 09:10:55AM -0500, Phil Turmel wrote:
> On 01/09/2018 05:25 PM, Dave Chinner wrote:
>
> > It's nice to know that MD has redefined RAID-10 to be different to
> > the industry standard definition that has been used for 20 years and
> > optimised filesystem layouts for. Rotoring data across odd numbers
> > of disks like this is going to really, really suck on filesystems
> > that are stripe layout aware..
>
> You're a bit late to this party, Dave. MD has implemented raid10 like
> this as far back as I can remember, and it is especially valuable when
> running more than two copies. Running raid10,n3 across four or five
> devices is a nice capacity boost without giving up triple copies (when
> multiples of three aren't available) or giving up the performance of
> mirrored raid.
XFS comes from a different background - high performance, high
reliability and hardware RAID storage. Think hundreds of drives in a
filesystem, not a handful. i.e. The XFS world is largely enterprise
and HPC storage, not small DIY solutions for a home or back-room
office. We live in a different world, and MD rarely enters mine.
> > For example, XFS has hot-spot prevention algorithms in its
> > internal physical layout for striped devices. It aligns AGs across
> > different stripe units so that metadata and data don't all get
> > aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> > rotoring across disks themselves, then we're going to end up back in
> > the same position we started with - multiple AGs aligned to the
> > same disk.
>
> All of MD's default raid5 and raid6 layouts rotate stripes, too, so that
> parity and syndrome are distributed uniformly.
Well, yes, but it appears you haven't thought through what that
typically means. Take a 4+1, chunk size 128k, stripe width 512k:
A B C D E
0 0 0 0 P
P 1 1 1 1
2 P 2 2 2
3 3 P 3 3
4 4 4 P 4
For every 5 stripe widths, each disk holds one stripe unit of
parity. Hence 80% of data accesses aligned to a specific data offset
hit that disk. i.e. disk A is hit by 0-128k, parity for 512-1024k,
1024-1152k, 1536-1664k and 2048-2176k. IOWs, if we align stuff to
512k, we're going to hit disk A 80% of the time and disk B 20% of
the time.
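Dave's 80/20 arithmetic can be checked with a short simulation. This is an illustrative sketch only: the sizes and the rotation pattern (parity rotating one disk per stripe, data chunks filling the remaining disks in disk order) are taken from the table above, not from any real array.

```python
from collections import Counter

NDISKS = 5                      # 4 data disks + 1 rotating parity (A..E)
CHUNK = 128 * 1024              # stripe unit: 128k
SWIDTH = (NDISKS - 1) * CHUNK   # stripe width: 512k of data per stripe

def parity_disk(stripe):
    # Matches the table: parity on E for stripe 0, then A, B, C, D, ...
    return (stripe + NDISKS - 1) % NDISKS

def disk_for_chunk(stripe, chunk_idx):
    # Data chunks fill the non-parity disks in disk order.
    data_disks = [d for d in range(NDISKS) if d != parity_disk(stripe)]
    return data_disks[chunk_idx]

# Accesses aligned to the 512k stripe width always hit the first data
# chunk of some stripe.  Count which disk that lands on:
hits = Counter(disk_for_chunk(s, 0) for s in range(1000))
print({chr(ord('A') + d): n for d, n in sorted(hits.items())})
# -> {'A': 800, 'B': 200}: disk A takes 80% of the aligned accesses.
```

Per group of 5 stripes, the first data chunk lands on A four times (stripes 0, 2, 3, 4) and on B once (stripe 1, where A holds parity), which is exactly the 80/20 split described above.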
So, if mkfs.xfs ends up aligning all AGs to a multiple of 512k, then
all our static AG metadata is aligned to disk A. Further, all the
AGs will align their first stripe unit in a stripe width to Disk A,
too. Hence this results in a major IO hotspot on disk A, and
smaller hotspot on disk B. Disks C, D, and E will have the least IO
load on them.
By telling XFS that the stripe unit is 128k and the stripe width is
512k, we can avoid this problem. mkfs.xfs will rotor its AG
alignment by some number of stripe units at a time. i.e. AG 0 aligns
to disk A, AG 1 aligns to disk B, AG 2 aligns to disk C, and so on.
The result is that base alignment used by the filesystem is now
distributed evenly across all disks in the RAID array and so all
disks get loaded evenly. The hot spots go away because the
filesystem has aligned its layout appropriately for the underlying
storage geometry. This applies to any RAID geometry that stripes
data across multiple disks in a regular/predictable pattern.
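A minimal sketch of that rotoring idea, using a plain 5-disk stripe (RAID0, no parity, so disk selection is a simple modulo) and invented sizes; the real mkfs.xfs calculation is more involved:

```python
NDISKS = 5
CHUNK = 128 * 1024
SWIDTH = NDISKS * CHUNK          # full stripe width across all 5 disks
AGSIZE = 1024 * SWIDTH           # AG size: a whole multiple of the stripe width
NAGS = 32

def disk_at(offset):
    # RAID0: chunks rotate round-robin across the disks.
    return (offset // CHUNK) % NDISKS

# Naive layout: every AG starts on a stripe-width boundary, so every
# AG's static metadata lands on the same disk.
naive = [disk_at(ag * AGSIZE) for ag in range(NAGS)]

# Rotored layout: shift each successive AG's start by one extra stripe
# unit, so the per-AG base alignment walks across the disks.
rotored = [disk_at(ag * AGSIZE + (ag % NDISKS) * CHUNK) for ag in range(NAGS)]

print(set(naive))    # all 32 AGs aligned to one disk
print(set(rotored))  # alignment spread over all 5 disks
```

With the naive layout every AG's metadata hits disk 0; with the rotored starts the base alignment is distributed evenly, which is the hotspot prevention described above.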
[ I'd cite an internal SGI paper written in 1999 that measured and
analysed all this on RAID0 in real world workloads and industry
standard benchmarks like AIM7 and SpecSFS and led to the mkfs.xfs
changes I described above, but, well, I haven't had access to that
since I left SGI 10 years ago... ]
> > IMO, odd-numbered disks in RAID-10 should be considered harmful and
> > never used....
>
> Users are perfectly able to layer raid1+0 or raid0+1 if they don't want
> the features of raid10. Given the advantages of MD's raid10, a pedant
> could say XFS's lack of support for it should be considered harmful and
> XFS never used. (-:
MD RAID is fine with XFS as long as you use a sane layout and avoid
doing stupid things that require reshaping and changing the geometry
of the underlying device. Reshaping is where the trouble all
starts...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Growing RAID10 with active XFS filesystem
2018-01-11 2:14 ` Dave Chinner
@ 2018-01-12 2:16 ` Guoqing Jiang
0 siblings, 0 replies; 37+ messages in thread
From: Guoqing Jiang @ 2018-01-12 2:16 UTC (permalink / raw)
To: Dave Chinner, Wols Lists; +Cc: linux-xfs, linux-raid
Hi Dave,
On 01/11/2018 10:14 AM, Dave Chinner wrote:
>
>> Are you telling me that xfs is aware of the internal structure of an
>> md-raid array?
> It's aware of the /alignment/ characteristics of block devices, and
> these alignment characteristics are exported by MD. e.g. These are
> exported in /sys/block/<dev>/queue in
>
> minimum_io_size
> - typically the stripe chunk size
> optimal_io_size
> - typically the stripe width
>
> We get this stuff from DM and MD devices, hardware raid (via scsi
> code pages), thinp devices (i.e. to tell us the allocation
> granularity so we can align/size IO to match it) and any other block
> device that wants to tell us about optimal IO geometry. libblkid
> provides us with this information, and it's not just mkfs.xfs that
> uses it. e.g. mkfs.ext4 also uses it for the exact same purpose as
> XFS....
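Those two sysfs hints translate mechanically into the filesystem's sunit/swidth. A rough, illustrative sketch of the conversion (simplified; the real logic lives in libblkid and mkfs.xfs, and the sample numbers below just mirror the xfs_info output at the start of this thread):

```python
def geometry_to_sunit_swidth(minimum_io, optimal_io, blocksize=4096):
    """Map the sysfs IO-size hints (bytes) to sunit/swidth expressed in
    filesystem blocks, roughly as mkfs.xfs does via libblkid."""
    if minimum_io <= blocksize or optimal_io < minimum_io:
        return None  # no usable geometry advertised
    return minimum_io // blocksize, optimal_io // blocksize

# A 512k chunk across 10 data disks (e.g. the 20-disk RAID10 in this
# thread) shows minimum_io_size=524288 and optimal_io_size=5242880:
print(geometry_to_sunit_swidth(524288, 5242880))
# -> (128, 1280), matching sunit=128/swidth=1280 blks in the xfs_info output
```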
I see that xfs can detect the geometry via "sunit" and "swidth"; ext4
and gfs2 can do similar things as well.
I have one question about xfs on top of raid5. Is it possible that multiple
write operations happen in the same stripe at the same time? Or, given
those parameters, does xfs aggregate them so that no conflict can happen
on the stripe's parity? Thanks in advance!
Regards,
Guoqing
* Re: Growing RAID10 with active XFS filesystem
2018-01-11 3:07 ` Dave Chinner
@ 2018-01-12 13:32 ` Wols Lists
2018-01-12 14:25 ` Emmanuel Florac
0 siblings, 1 reply; 37+ messages in thread
From: Wols Lists @ 2018-01-12 13:32 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, linux-raid
On 11/01/18 03:07, Dave Chinner wrote:
> XFS comes from a different background - high performance, high
> reliability and hardware RAID storage. Think hundreds of drives in a
> filesystem, not a handful. i.e. The XFS world is largely enterprise
> and HPC storage, not small DIY solutions for a home or back-room
> office. We live in a different world, and MD rarely enters mine.
So what happens when the hardware raid structure changes?
Ext allows you to grow a filesystem. Btrfs allows you to grow a
filesystem. Reiser allows you to grow a file system. Can you add more
disks to XFS and grow the filesystem?
My point is that all this causes geometries to change, and ext and btrfs
amongst others can clearly handle this. Can XFS?
Because if it can, it seems to me the obvious solution to changing raid
geometries is that you need to grow the filesystem, and get that to
adjust its geometries.
Bear in mind, SUSE has now adopted XFS as the default filesystem for
partitions other than /. This means you are going to get a lot of
"hobbyist" systems running XFS on top of MD and LVM. Are you telling me
that XFS is actually very badly suited to be a default filesystem for SUSE?
What concerns me here is, not having a clue how LVM handles changing
partition sizes, what effect this will have on filesystems ... The
problem is the Unix philosophy of "do one thing and do it well".
Sometimes that's just not practical. The Unix philosophy says "leave
partition management to lvm, leave redundancy to md, leave the files to
the filesystem, ..." and then the filesystem comes along and says "hey,
I can't do my job very well, if I don't have a clue about the physical
disk layout". It's a hard circle to square ... :-)
(Anecdotes about btrfs are that it's made a right pig's ear of trying to
do everything itself.)
Cheers,
Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-12 13:32 ` Wols Lists
@ 2018-01-12 14:25 ` Emmanuel Florac
2018-01-12 17:52 ` Wols Lists
2018-01-14 21:33 ` Wol's lists
0 siblings, 2 replies; 37+ messages in thread
From: Emmanuel Florac @ 2018-01-12 14:25 UTC (permalink / raw)
To: Wols Lists; +Cc: Dave Chinner, linux-xfs, linux-raid
Le Fri, 12 Jan 2018 13:32:49 +0000
Wols Lists <antlists@youngman.org.uk> écrivait:
> On 11/01/18 03:07, Dave Chinner wrote:
> > XFS comes from a different background - high performance, high
> > reliability and hardware RAID storage. Think hundreds of drives in a
> > filesystem, not a handful. i.e. The XFS world is largely enterprise
> > and HPC storage, not small DIY solutions for a home or back-room
> > office. We live in a different world, and MD rarely enters mine.
>
> So what happens when the hardware raid structure changes?
Hardware RAID controllers don't expose the RAID structure to the software,
so as far as XFS knows, a hardware RAID is just a very large disk.
That's when the stripe unit and stripe width options to
mkfs.xfs make sense.
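For instance, a hypothetical invocation (the device name and geometry are invented for illustration; `su` is the controller's chunk size and `sw` is the number of *data* disks, not the total disk count):

```shell
# Hypothetical: hardware RAID6 of 8 data + 2 parity disks, 128KiB chunk.
# su = stripe unit (chunk size), sw = stripe width in data disks.
mkfs.xfs -d su=128k,sw=8 /dev/sdb
```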
> Ext allows you to grow a filesystem. Btrfs allows you to grow a
> filesystem. Reiser allows you to grow a file system. Can you add more
> disks to XFS and grow the filesystem?
Of course. xfs_growfs is your friend. It worked on mounted filesystems
many years before that functionality came to other filesystems.
> My point is that all this causes geometries to change, and ext and
> btrfs amongst others can clearly handle this. Can XFS?
Neither XFS, ext4 nor btrfs can handle this. That's why Dave mentioned
the fact that growing your RAID is almost always the wrong solution.
A much better solution is to add a new array and use LVM to aggregate
it with the existing ones.
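That route looks roughly like this, with hypothetical device and volume names (md6, datavg, datalv and /srv/data are invented for illustration):

```shell
# Add the new array as a physical volume, extend the volume group and
# logical volume, then grow the mounted XFS filesystem online.
pvcreate /dev/md6
vgextend datavg /dev/md6
lvextend -l +100%FREE datavg/datalv
xfs_growfs /srv/data        # xfs_growfs takes the mount point
```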
Basically, growing an array and then the filesystem on it generally works
OK, BUT it may kill performance (or not). YMMV. At the least, you *probably
won't* get the performance gain that the wider stripe
would permit if you were starting anew.
> Because if it can, it seems to me the obvious solution to changing
> raid geometries is that you need to grow the filesystem, and get that
> to adjust its geometries.
Unfortunately that's nigh impossible. No filesystem in existence does
that. The closest thing is ZFS's ability to dynamically change stripe
sizes, but when you extend a ZFS zpool it doesn't rebalance existing
files and data (and offers absolutely no way to do it). Sorry, no pony.
> Bear in mind, SUSE has now adopted XFS as the default filesystem for
> partitions other than /. This means you are going to get a lot of
> "hobbyist" systems running XFS on top of MD and LVM. Are you telling
> me that XFS is actually very badly suited to be a default filesystem
> for SUSE?
Doesn't seem so. In fact XFS is less permissive than other filesystems,
and it's a *darn good thing* IMO. It's better to get a frightening
"XFS forced shutdown" error message than corrupted data, isn't it?
> What concerns me here is, not having a clue how LVM handles changing
> partition sizes, what effect this will have on filesystems ... The
> problem is the Unix philosophy of "do one thing and do it well".
> Sometimes that's just not practical.
LVM volume changes are propagated to upper levels.
If you don't like Unix principles, use Windows then :)
> The Unix philosophy says "leave
> partition management to lvm, leave redundancy to md, leave the files
> to the filesystem, ..." and then the filesystem comes along and says
> "hey, I can't do my job very well, if I don't have a clue about the
> physical disk layout". It's a hard circle to square ... :-)
Yeah, that was apparently the very same thinking that brought us ZFS.
> (Anecdotes about btrfs are that it's made a right pig's ear of trying
> to do everything itself.)
>
Not so sure. Btrfs is excellent, taking into account how little love it
received for many years at Oracle.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Growing RAID10 with active XFS filesystem
2018-01-12 14:25 ` Emmanuel Florac
@ 2018-01-12 17:52 ` Wols Lists
2018-01-12 18:37 ` Emmanuel Florac
2018-01-13 0:20 ` Stan Hoeppner
2018-01-14 21:33 ` Wol's lists
1 sibling, 2 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-12 17:52 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid
On 12/01/18 14:25, Emmanuel Florac wrote:
> Le Fri, 12 Jan 2018 13:32:49 +0000
> Wols Lists <antlists@youngman.org.uk> écrivait:
>
>> On 11/01/18 03:07, Dave Chinner wrote:
>>> XFS comes from a different background - high performance, high
>>> reliability and hardware RAID storage. Think hundreds of drives in a
>>> filesystem, not a handful. i.e. The XFS world is largely enterprise
>>> and HPC storage, not small DIY solutions for a home or back-room
>>> office. We live in a different world, and MD rarely enters mine.
>>
>> So what happens when the hardware raid structure changes?
>
> hardware RAID controllers don't expose the RAID structure to the software.
> So as far as XFS knows, a hardware RAID is just a very large disk.
> That's when the stripe unit and stripe width options to
> mkfs.xfs make sense.
Umm... So you can't partially populate a chassis and add more disks as
you need them? So you have to manually pass stripe unit and width at
creation time and then they are set in stone? Sorry that doesn't sound
that enterprisey to me :-(
>
>> Ext allows you to grow a filesystem. Btrfs allows you to grow a
>> filesystem. Reiser allows you to grow a file system. Can you add more
>> disks to XFS and grow the filesystem?
>
> Of course. xfs_growfs is your friend. Worked with online filesystems
> many years before that functionality came to other filesystems.
>
>> My point is that all this causes geometries to change, and ext and
>> btrfs amongst others can clearly handle this. Can XFS?
>
> Neither XFS, ext4 nor btrfs can handle this. That's why Dave mentioned
> the fact that growing your RAID is almost always the wrong solution.
> A much better solution is to add a new array and use LVM to aggregate
> it with the existing ones.
Isn't this what btrfs does with a rebalance? And I may well be wrong,
but I got the impression that some file systems could change stripe
geometries dynamically.
Adding a new array imho breaks the KISS principle. So I now have
multiple arrays sitting on the hard drives (wasting parity disks if I
have raid5/6), multiple instances of LVM on top of that, and then the
filesystem sitting on top of multiple volumes.
As a hobbyist I want one array, with one LVM on top of that, and one
filesystem per volume. Anything else starts to get confusing. And if I'm
a professional sys-admin I would want that in spades! It's all very well
expecting a sys-admin to cope, but the fewer boobytraps and landmines
left lying around, the better!
Squaring the circle, again :-(
>
> Basically growing an array then the filesystem on it generally works
> OK, BUT it may kill performance (or not). YMMV. At least, you *probably
> won't* get the performance gain that the difference of stripe width
> would permit when starting anew.
>
Point taken - but how are you going to backup your huge petabyte XFS
filesystem to get the performance on your bigger array? Catch 22 ...
>> Because if it can, it seems to me the obvious solution to changing
>> raid geometries is that you need to grow the filesystem, and get that
>> to adjust its geometries.
>
> Unfortunately that's nigh impossible. No filesystem in existence does
> that. The closest thing is ZFS ability to dynamically change stripe
> sizes, but when you extend a ZFS zpool it doesn't rebalance existing
> files and data (and offers absolutely no way to do it). Sorry, no pony.
>
Well, how does raid get away with it, rebalancing and restriping
everything :-)
Yes I know, it's a major change if the original file system design
didn't allow for it, and major file system changes can be extremely
destructive to user data ...
>> Bear in mind, SUSE has now adopted XFS as the default filesystem for
>> partitions other than /. This means you are going to get a lot of
>> "hobbyist" systems running XFS on top of MD and LVM. Are you telling
>> me that XFS is actually very badly suited to be a default filesystem
>> for SUSE?
>
> Doesn't seem so. In fact XFS is less permissive than other filesystems,
> and it's a *darn good thing* IMO. It's better having frightening error
> messages "XFS force shutdown" than corrupted data, isn't it?
False dichotomy, I'm afraid. Do you really want a filesystem that
guarantees integrity, but trashes performance when you want to take
advantage of features such as resizing? I'd rather have integrity,
performance *and* features :-) (Pick any two, I know :-)
>
>> What concerns me here is, not having a clue how LVM handles changing
>> partition sizes, what effect this will have on filesystems ... The
>> problem is the Unix philosophy of "do one thing and do it well".
>> Sometimes that's just not practical.
>
> LVM volumes changes are propagated to upper levels.
And what does the filesystem do with them? If LVM is sat on MD, what then?
>
> If you don't like Unix principles, use Windows then :)
>
The phrase "a rock and a hard place" comes to mind. Neither were
designed with commercial solidity and integrity and reliability in mind.
And having used commercial systems I get the impression NIH is alive and
kicking far too much. Both Linux and Windows are much more reliable and
solid than they were, but too many of those features are bolt-ons, and
they feel like it ...
>> The Unix philosophy says "leave
>> partition management to lvm, leave redundancy to md, leave the files
>> to the filesystem, ..." and then the filesystem comes along and says
>> "hey, I can't do my job very well, if I don't have a clue about the
>> physical disk layout". It's a hard circle to square ... :-)
>
> Yeah, that was apparently the very same thinking that brought us ZFS.
>
> >> (Anecdotes about btrfs are that it's made a right pig's ear of trying
>> to do everything itself.)
>>
>
> Not so sure. Btrfs is excellent, taking into account how little love it
> received for many years at Oracle.
>
Yep. The solid features are just that - solid. Snag is, a lot of the
nice features are still experimental, and dangerous! Parity raid, for
example ... and I've heard rumours that the flaws could be unfixable, at
least not until btrfs-2 whenever that gets started ...
When MD adds disks, it rewrites the array from top to bottom or the
other way round, moving everything over to the new layout. Is there no
way a file system can do the same sort of thing? Okay, it would probably
need to be a defrag-like utility and linux prides itself on not needing
defrag :-)
Or could it simply switch over to optimising for the new geometry,
accept the fact that the reshape will have caused hotspots, and every
time it rewrites (meta)data, it adjusts it to the new geometry to
reduce/remove hotspots over time?
Cheers,
Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-12 17:52 ` Wols Lists
@ 2018-01-12 18:37 ` Emmanuel Florac
2018-01-12 19:35 ` Wol's lists
2018-01-13 0:20 ` Stan Hoeppner
1 sibling, 1 reply; 37+ messages in thread
From: Emmanuel Florac @ 2018-01-12 18:37 UTC (permalink / raw)
To: Wols Lists; +Cc: Dave Chinner, linux-xfs, linux-raid
Le Fri, 12 Jan 2018 17:52:59 +0000
Wols Lists <antlists@youngman.org.uk> écrivait:
> >>
> >> So what happens when the hardware raid structure changes?
> >
> > hardware RAID controllers don't expose the RAID structure to the
> > software. So as far as XFS knows, a hardware RAID is just a very
> > large disk. That's when the stripe unit and stripe width
> > options to mkfs.xfs make sense.
>
> Umm... So you can't partially populate a chassis and add more disks as
> you need them? So you have to manually pass stripe unit and width at
> creation time and then they are set in stone? Sorry that doesn't sound
> that enterprisey to me :-(
You *can*, but it's generally frowned upon. Adding disks in large
batches of 6, 8 or 10 and creating new arrays is always better. Adding
one or two disks at a time is at best a useful but cheap hack.
> >
> > Neither XFS, ext4 nor btrfs can handle this. That's why Dave
> > mentioned the fact that growing your RAID is almost always the
> > wrong solution. A much better solution is to add a new array and
> > use LVM to aggregate it with the existing ones.
>
> Isn't this what btrfs does with a rebalance? And I may well be wrong,
> but I got the impression that some file systems could change stripe
> geometries dynamically.
If btrfs does rebalancing, that's fine then. I suppose running xfs_fsr
on XFS could also rebalance data. Would be nice to have an option to
force rewriting of all files, that would solve this particular problem.
> Adding a new array imho breaks the KISS principle. So I now have
> multiple arrays sitting on the hard drives (wasting parity disks if I
> have raid5/6), multiple instances of LVM on top of that, and then the
> filesystem sitting on top of multiple volumes.
No, you need only to declare additional arrays as new physical volumes,
add them to your existing volume group, then extend your existing LVs
as needed. That's standard storage management fare.
You're not supposed to have arrays with tens of drives anyway (unless
you really don't care about your data).
> As a hobbyist I want one array, with one LVM on top of that, and one
> filesystem per volume.
As a hobbyist you don't really have to care about performance. A single
modern hard drive can easily feed a gigabit ethernet connection, anyway.
The systems I set up these times commonly require disk throughput of 3
to 10 GB/s to feed 40GigE lines. Different problems.
> Anything else starts to get confusing. And if
> I'm a professional sys-admin I would want that in spades! It's all
> very well expecting a sys-admin to cope, but the fewer boobytraps and
> landmines left lying around, the better!
>
> Squaring the circle, again :-(
Not really, modern tools like lsblk and friends make it really easy to
sort out.
> >
> > Basically growing an array then the filesystem on it generally works
> > OK, BUT it may kill performance (or not). YMMV. At least, you
> > *probably won't* get the performance gain that the difference of
> > stripe width would permit when starting anew.
> >
> Point taken - but how are you going to backup your huge petabyte XFS
> filesystem to get the performance on your bigger array? Catch 22 ...
Through big networks, or with big tape drives (LTO-8 is ~1GB/s
compressed).
>
> >> Because if it can, it seems to me the obvious solution to changing
> >> raid geometries is that you need to grow the filesystem, and get
> >> that to adjust its geometries.
> >
> > Unfortunately that's nigh impossible. No filesystem in existence
> > does that. The closest thing is ZFS ability to dynamically change
> > stripe sizes, but when you extend a ZFS zpool it doesn't rebalance
> > existing files and data (and offers absolutely no way to do it).
> > Sorry, no pony.
> Well, how does raid get away with it, rebalancing and restriping
> everything :-)
>
Then that must be because Dave is lazy :)
> > Doesn't seem so. In fact XFS is less permissive than other
> > filesystems, and it's a *darn good thing* IMO. It's better having
> > frightening error messages "XFS force shutdown" than corrupted
> > data, isn't it?
>
> False dichotomy, I'm afraid. Do you really want a filesystem that
> guarantees integrity, but trashes performance when you want to take
> advantage of features such as resizing? I'd rather have integrity,
> performance *and* features :-) (Pick any two, I know :-)
XFS is clearly optimized for performance, and is currently gaining
interesting new features (thin copy, then probably snapshots, etc). If
what you're looking for is features first, well, there are other
filesystems :)
> > LVM volumes changes are propagated to upper levels.
>
> And what does the filesystem do with them? If LVM is sat on MD, what
> then?
MD propagates to LVM, which propagates to the FS, actually. Everybody
works together nowadays (it didn't use to be that way).
> >
> > If you don't like Unix principles, use Windows then :)
> >
> The phrase "a rock and a hard place" comes to mind. Neither were
> designed with commercial solidity and integrity and reliability in
> mind. And having used commercial systems I get the impression NIH is
> alive and kicking far too much. Both Linux and Windows are much more
> reliable and solid than they were, but too many of those features are
> bolt-ons, and they feel like it ...
Linux gives you choice. You want to resize volumes at will? Use ZFS.
You want to squeeze all the performance out of your disks? Use XFS. You
don't care either way? Use ext4. And so on.
> > Not so sure. Btrfs is excellent, taking into account how little
> > love it received for many years at Oracle.
> >
> Yep. The solid features are just that - solid. Snag is, a lot of the
> nice features are still experimental, and dangerous! Parity raid, for
> example ... and I've heard rumours that the flaws could be unfixable,
> at least not until btrfs-2 whenever that gets started ...
Well I don't know much about btrfs so I can't comment.
> When MD adds disks, it rewrites the array from top to bottom or the
> other way round, moving everything over to the new layout. Is there no
> way a file system can do the same sort of thing? Okay, it would
> probably need to be a defrag-like utility and linux prides itself on
> not needing defrag :-)
>
> Or could it simply switch over to optimising for the new geometry,
> accept the fact that the reshape will have caused hotspots, and every
> time it rewrites (meta)data, it adjusts it to the new geometry to
> reduce/remove hotspots over time?
>
I suppose it's doable, but it's not a sufficiently prominent use case
to bother much about.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Growing RAID10 with active XFS filesystem
2018-01-12 18:37 ` Emmanuel Florac
@ 2018-01-12 19:35 ` Wol's lists
2018-01-13 12:30 ` Brad Campbell
0 siblings, 1 reply; 37+ messages in thread
From: Wol's lists @ 2018-01-12 19:35 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid
On 12/01/18 18:37, Emmanuel Florac wrote:
>> Or could it simply switch over to optimising for the new geometry,
>> accept the fact that the reshape will have caused hotspots, and every
>> time it rewrites (meta)data, it adjusts it to the new geometry to
>> reduce/remove hotspots over time?
>>
> I suppose it's doable but not sufficiently a prominent use case to
> bother much.
Stick it on the "nice to have when somebody gets round to it" list :-)
But it should at least get put on the list ...
I'll get round to writing all this up soon, so the wiki will try and
persuade people that resizing arrays is not actually the brightest of
ideas.
The trouble is a lack of people talking to each other, and thinking that
they can rely on "do one job and do it well". Except of course that
you can't square the circle ... :-)
But this really is stuff that needs to be on the wiki, and it's stuff
the MD and filesystem people don't talk about with each other, I expect :-(
Cheers,
Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-12 17:52 ` Wols Lists
2018-01-12 18:37 ` Emmanuel Florac
@ 2018-01-13 0:20 ` Stan Hoeppner
2018-01-13 19:29 ` Wol's lists
1 sibling, 1 reply; 37+ messages in thread
From: Stan Hoeppner @ 2018-01-13 0:20 UTC (permalink / raw)
To: Wols Lists, Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid
On 01/12/2018 11:52 AM, Wols Lists wrote:
> On 12/01/18 14:25, Emmanuel Florac wrote:
>> Le Fri, 12 Jan 2018 13:32:49 +0000
>> Wols Lists <antlists@youngman.org.uk> écrivait:
>>
>>> On 11/01/18 03:07, Dave Chinner wrote:
>>>> XFS comes from a different background - high performance, high
>>>> reliability and hardware RAID storage. Think hundreds of drives in a
>>>> filesystem, not a handful. i.e. The XFS world is largely enterprise
>>>> and HPC storage, not small DIY solutions for a home or back-room
>>>> office. We live in a different world, and MD rarely enters mine.
>>> So what happens when the hardware raid structure changes?
>> hardware RAID controllers don't expose the RAID structure to the software.
>> So as far as XFS knows, a hardware RAID is just a very large disk.
>> That's when the stripe unit and stripe width options to
>> mkfs.xfs make sense.
> Umm... So you can't partially populate a chassis and add more disks as
> you need them? So you have to manually pass stripe unit and width at
> creation time and then they are set in stone? Sorry that doesn't sound
> that enterprisey to me :-(
It's not set in stone. If the RAID geometry changes one can specify the
new geometry at mount say in fstab. New writes to the filesystem will
obey the new specified geometry.
>>> Ext allows you to grow a filesystem. Btrfs allows you to grow a
>>> filesystem. Reiser allows you to grow a file system. Can you add more
>>> disks to XFS and grow the filesystem?
>> Of course. xfs_growfs is your friend. Worked with online filesystems
>> many years before that functionality came to other filesystems.
>>
>>> My point is that all this causes geometries to change, and ext and
>>> btrfs amongst others can clearly handle this. Can XFS?
>> Neither XFS, ext4 or btrfs can handle this. That's why Dave mentioned
>> the fact that growing your RAID is almost always the wrong solution.
>> A much better solution is to add a new array and use LVM to aggregate
>> it with the existing ones.
> Isn't this what btrfs does with a rebalance? And I may well be wrong,
> but I got the impression that some file systems could change stripe
> geometries dynamically.
>
> Adding a new array imho breaks the KISS principle. So I now have
> multiple arrays sitting on the hard drives (wasting parity disks if I
> have raid5/6), multiple instances of LVM on top of that, and then the
> filesystem sitting on top of multiple volumes.
>
> As a hobbyist I want one array, with one LVM on top of that, and one
> filesystem per volume. Anything else starts to get confusing. And if I'm
> a professional sys-admin I would want that in spades! It's all very well
> expecting a sys-admin to cope, but the fewer boobytraps and landmines
> left lying around, the better!
>
> Squaring the circle, again :-(
>> Basically growing an array then the filesystem on it generally works
>> OK, BUT it may kill performance (or not). YMMV. At least, you *probably
>> won't* get the performance gain that the difference of stripe width
>> would permit when starting anew.
>>
> Point taken - but how are you going to backup your huge petabyte XFS
> filesystem to get the performance on your bigger array? Catch 22 ...
>
>>> Because if it can, it seems to me the obvious solution to changing
>>> raid geometries is that you need to grow the filesystem, and get that
>>> to adjust its geometries.
>> Unfortunately that's nigh impossible. No filesystem in existence does
>> that. The closest thing is ZFS ability to dynamically change stripe
>> sizes, but when you extend a ZFS zpool it doesn't rebalance existing
>> files and data (and offers absolutely no way to do it). Sorry, no pony.
>>
> Well, how does raid get away with it, rebalancing and restriping
> everything :-)
>
> Yes I know, it's a major change if the original file system design
> didn't allow for it, and major file system changes can be extremely
> destructive to user data ...
>
>>> Bear in mind, SUSE has now adopted XFS as the default filesystem for
>>> partitions other than /. This means you are going to get a lot of
>>> "hobbyist" systems running XFS on top of MD and LVM. Are you telling
>>> me that XFS is actually very badly suited to be a default filesystem
>>> for SUSE?
>> Doesn't seem so. In fact XFS is less permissive than other filesystems,
>> and it's a *darn good thing* IMO. It's better having frightening error
>> messages "XFS force shutdown" than corrupted data, isn't it?
> False dichotomy, I'm afraid. Do you really want a filesystem that
> guarantees integrity, but trashes performance when you want to take
> advantage of features such as resizing? I'd rather have integrity,
> performance *and* features :-) (Pick any two, I know :-)
>>> What concerns me here is, not having a clue how LVM handles changing
>>> partition sizes, what effect this will have on filesystems ... The
>>> problem is the Unix philosophy of "do one thing and do it well".
>>> Sometimes that's just not practical.
>> LVM volumes changes are propagated to upper levels.
> And what does the filesystem do with them? If LVM is sat on MD, what then?
>> If you don't like Unix principles, use Windows then :)
>>
> The phrase "a rock and a hard place" comes to mind. Neither were
> designed with commercial solidity and integrity and reliability in mind.
> And having used commercial systems I get the impression NIH is alive and
> kicking far too much. Both Linux and Windows are much more reliable and
> solid than they were, but too many of those features are bolt-ons, and
> they feel like it ...
>
>>> The Unix philosophy says "leave
>>> partition management to lvm, leave redundancy to md, leave the files
>>> to the filesystem, ..." and then the filesystem comes along and says
>>> "hey, I can't do my job very well, if I don't have a clue about the
>>> physical disk layout". It's a hard circle to square ... :-)
>> Yeah, that was apparently the very same thinking that brought us ZFS.
>>
>>> (Anecdotes about btrfs are that it's made a right pigs ear of trying
>>> to do everything itself.)
>>>
>> Not so sure. Btrfs is excellent, taking into account how little love it
>> received for many years at Oracle.
>>
> Yep. The solid features are just that - solid. Snag is, a lot of the
> nice features are still experimental, and dangerous! Parity raid, for
> example ... and I've heard rumours that the flaws could be unfixable, at
> least not until btrfs-2 whenever that gets started ...
>
> When MD adds disks, it rewrites the array from top to bottom or the
> other way round, moving everything over to the new layout. Is there no
> way a file system can do the same sort of thing? Okay, it would probably
> need to be a defrag-like utility and linux prides itself on not needing
> defrag :-)
>
> Or could it simply switch over to optimising for the new geometry,
> accept the fact that the reshape will have caused hotspots, and every
> time it rewrites (meta)data, it adjusts it to the new geometry to
> reduce/remove hotspots over time?
>
> Cheers,
> Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-12 19:35 ` Wol's lists
@ 2018-01-13 12:30 ` Brad Campbell
2018-01-13 13:18 ` Wols Lists
0 siblings, 1 reply; 37+ messages in thread
From: Brad Campbell @ 2018-01-13 12:30 UTC (permalink / raw)
To: Wol's lists, Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid
On 13/01/18 03:35, Wol's lists wrote:
> I'll get round to writing all this up soon, so the wiki will try and
> persuade people that resizing arrays is not actually the brightest of
> ideas.
Now hang on. Don't go tarring every use case with the same brush.
There are many use cases for a bucket of disks and high performance is
but one of them.
Leaving aside XFS, let's look at EXT3/4 as they seem to be generally the
most common filesystems in use for your average "install it and run it"
user (i.e. *ME*).
If you read the mke2fs man page and check out stripe and stride (which
you *used* to have to specify manually), both of them imply they are
important for letting the filesystem know the construction of your RAID
for *performance* reasons.
Nowhere does *anything* make any mention of changing geometry, and if
you gave a 10 second thought to those parameters and their explanations
you'd have to think "This filesystem was optimised for the RAID geometry
it was built with. If I change that, then I won't have the same
performance I did have at the time of creation". Or maybe that was only
obvious to me.
Anyway, I happily grew several large arrays over the years *knowing*
that there would be a performance impact, because for my use case I
didn't actually care.
"Enterprise" don't grow arrays. They build a storage solution that is
often extremely finely tuned for exactly their workload and they use it.
If they need more storage they either replicate or build another (with
the consequential months of tests/tuning) storage configuration. I see
Stan Hoeppner replied. If you want a good read, get him going on
workload specific XFS tuning.
It's only hacks like me that tack disks onto built arrays, but I did it
*knowing* it wasn't going to affect my workload as all I wanted was a
huge bucket of storage with quick reads. Writes don't happen often
enough to matter.
Exposing the geometry to the filesystem is there to give the filesystem
a chance of performing operations in a manner least likely to create a
performance hotspot (as pointed out by Dave Chinner). They are hints.
Change the geometry after the fact and all bets are off.
On another note, personally I've used XFS in a couple of performance
sensitive roles over the years (when it *really* mattered), but as I
don't often wade into that end of the pool I tend to stick with the ext
series.
e2fsck has gotten me out of some really tight spots and I can rely on it
making the best of a really bad mess. With XFS I've never had the
pleasure of running it on anything other than top of the line hardware,
so it never had to clean up after me. It does go like a stung cat though
when it's tuned up.
If I were to suggest an addition to the RAID wiki, it'd be to elaborate
on the *creation* time tuning a filesystem create tool does with the
RAID geometry, and to point out that once you grow the RAID, all
performance bets are off. I've never met a filesystem that would break
however.
I've grown RAID 1, 5 & 6. Growing RAID10 with anything other than a near
configuration and adding another set of disks just feels like a disaster
waiting to happen. Even I'm not that game.
I do have a staging machine now with a few spare disks, so I might have
a crack at it, but I won't be using a kernel and userspace as old as the
thread initiator.
Regards,
Brad
* Re: Growing RAID10 with active XFS filesystem
2018-01-13 12:30 ` Brad Campbell
@ 2018-01-13 13:18 ` Wols Lists
0 siblings, 0 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-13 13:18 UTC (permalink / raw)
To: Brad Campbell, Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid
On 13/01/18 12:30, Brad Campbell wrote:
>
> If I were to suggest an addition to the RAID wiki, it'd be to elaborate
> on the *creation* time tuning a filesystem create tool does with the
> RAID geometry, and to point out that once you grow the RAID, all
> performance bets are off. I've never met a filesystem that would break
> however.
You know me ...
If you read the wiki you'll notice I tend to be very much "pros and
cons. Choose what works for you". This write-up will be very much in the
same vein ...
Cheers,
Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-13 0:20 ` Stan Hoeppner
@ 2018-01-13 19:29 ` Wol's lists
2018-01-13 22:40 ` Dave Chinner
0 siblings, 1 reply; 37+ messages in thread
From: Wol's lists @ 2018-01-13 19:29 UTC (permalink / raw)
To: Stan Hoeppner, Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid
On 13/01/18 00:20, Stan Hoeppner wrote:
> It's not set in stone. If the RAID geometry changes one can specify the
> new geometry at mount say in fstab. New writes to the filesystem will
> obey the new specified geometry.
Does this then update the defaults, or do you need to specify the new
geometry every mount? Inquiring minds need to know :-)
Cheers,
Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-13 19:29 ` Wol's lists
@ 2018-01-13 22:40 ` Dave Chinner
2018-01-13 23:04 ` Wols Lists
0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2018-01-13 22:40 UTC (permalink / raw)
To: Wol's lists; +Cc: Stan Hoeppner, Emmanuel Florac, linux-xfs, linux-raid
On Sat, Jan 13, 2018 at 07:29:19PM +0000, Wol's lists wrote:
> On 13/01/18 00:20, Stan Hoeppner wrote:
> >It's not set in stone. If the RAID geometry changes one can
> >specify the new geometry at mount say in fstab. New writes to the
> >filesystem will obey the new specified geometry.
FWIW, I've been assuming in everything I've said that an admin
would use these mount options to ensure new data writes were
properly aligned after a reshape.
> Does this then update the defaults, or do you need to specify the
> new geometry every mount? Inquiring minds need to know :-)
If you're going to document it, then you should observe its
behaviour yourself, right? You don't even need an MD/RAID device to
test it - just set su/sw manually on the mkfs command line, then
see what happens when you try to change them on subsequent mounts.
Anyway, start by reading Documentation/filesystems/xfs.txt or 'man 5
xfs' where the mount options are documented. That should answer most
FAQs on this subject.
"Typically the only time these mount options are necessary
if after an underlying RAID device has had it's geometry
modified, such as adding a new disk to a RAID5 lun and
reshaping it."
It should be pretty obvious from this that we know that people
reshape arrays and that we've had the means to support it all
along. Despite this, we still don't recommend people administer
their RAID-based XFS storage in this manner....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Growing RAID10 with active XFS filesystem
2018-01-13 22:40 ` Dave Chinner
@ 2018-01-13 23:04 ` Wols Lists
0 siblings, 0 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-13 23:04 UTC (permalink / raw)
To: Dave Chinner; +Cc: Stan Hoeppner, Emmanuel Florac, linux-xfs, linux-raid
On 13/01/18 22:40, Dave Chinner wrote:
> On Sat, Jan 13, 2018 at 07:29:19PM +0000, Wol's lists wrote:
>> On 13/01/18 00:20, Stan Hoeppner wrote:
>>> It's not set in stone. If the RAID geometry changes one can
>>> specify the new geometry at mount say in fstab. New writes to the
>>> filesystem will obey the new specified geometry.
>
> FWIW, I've been assuming in everything I've said that an admin
> would use these mount options to ensure new data writes were
> properly aligned after a reshape.
>
>> Does this then update the defaults, or do you need to specify the
>> new geometry every mount? Inquiring minds need to know :-)
>
> If you're going to document it, then you should observe it's
> behaviour yourself, right? You don't even need a MD/RAID device to
> test it - just set su/sw manually on the mkfs command line, then
> see what happens when you try to change them on subsequent mounts.
I suppose I could set up a VM ...
>
> Anyway, start by reading Documentation/filesystems/xfs.txt or 'man 5
> xfs' where the mount options are documented. That's answer most FAQs
> on this subject.
>
> "Typically the only time these mount options are necessary
> if after an underlying RAID device has had it's geometry
> modified, such as adding a new disk to a RAID5 lun and
> reshaping it."
anthony@ashdown /usr/src $ man 5 xfs
No entry for xfs in section 5 of the manual
anthony@ashdown /usr/src $
>
> It should be pretty obvious from this that we know that people
> reshape arrays and that we've have had the means to support it all
> along. Despite this, we still don't recommend people administer
> their RAID-based XFS storage in this manner....
>
Note I described myself as *editor* of the raid wiki. Yes I'd love to
play around with all this stuff, but I don't have the hardware, and my
nice new system I was planning to do all this sort of stuff on won't
POST. I've had that problem before, it's finding time to debug a new
system in the face of family demands... and at present I don't have an
xfs partition anywhere.
Reading xfs.txt doesn't seem to answer the question, though. It sounds
like it doesn't update the underlying defaults so it's required every
mount (which is a safe assumption to make), but it could easily be read
the other way, too.
Thanks. I'll document it to the level I understand, make a mental note
to go back and improve it (I try and do that all the time :-), and then
when my new system is up and running, I'll be playing with that to see
how things behave.
Cheers,
Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-12 14:25 ` Emmanuel Florac
2018-01-12 17:52 ` Wols Lists
@ 2018-01-14 21:33 ` Wol's lists
2018-01-15 17:08 ` Emmanuel Florac
1 sibling, 1 reply; 37+ messages in thread
From: Wol's lists @ 2018-01-14 21:33 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid
On 12/01/18 14:25, Emmanuel Florac wrote:
>> My point is that all this causes geometries to change, and ext and
>> btrfs amongst others can clearly handle this. Can XFS?
> Neither XFS, ext4 or btrfs can handle this. That's why Dave mentioned
> the fact that growing your RAID is almost always the wrong solution.
> A much better solution is to add a new array and use LVM to aggregate
> it with the existing ones.
Does the new array need the same geometry as the old one?
What happens if my original array is a 4-disk raid-5, and then I add a
3-disk raid-5? Can XFS cope with the different optimisations required
for the different layouts on the different arrays?
>
> Basically growing an array then the filesystem on it generally works
> OK, BUT it may kill performance (or not). YMMV. At least, you *probably
> won't* get the performance gain that the difference of stripe width
> would permit when starting anew.
>
Cheers,
Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-14 21:33 ` Wol's lists
@ 2018-01-15 17:08 ` Emmanuel Florac
0 siblings, 0 replies; 37+ messages in thread
From: Emmanuel Florac @ 2018-01-15 17:08 UTC (permalink / raw)
To: Wol's lists; +Cc: Dave Chinner, linux-xfs, linux-raid
Le Sun, 14 Jan 2018 21:33:17 +0000
"Wol's lists" <antlists@youngman.org.uk> écrivait:
> On 12/01/18 14:25, Emmanuel Florac wrote:
> >> My point is that all this causes geometries to change, and ext and
> >> btrfs amongst others can clearly handle this. Can XFS?
>
> > Neither XFS, ext4 or btrfs can handle this. That's why Dave
> > mentioned the fact that growing your RAID is almost always the
> > wrong solution. A much better solution is to add a new array and
> > use LVM to aggregate it with the existing ones.
>
> Does the new array need the same geometry as the old one?
That's the best way to preserve performance, yes.
> What happens if my original array is a 4-disk raid-5, and then I add
> a 3-disk raid-5? Can XFS cope with the different optimisations
> required for the different layouts on the different arrays?
No because your array will remain optimised for the initial layout.
However, if you add a new array with the same stripe characteristics you
should at least NOT lose performance. See also the mount options Stan
mentioned earlier :)
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Growing RAID10 with active XFS filesystem
2018-01-08 15:16 ` Wols Lists
2018-01-08 15:34 ` Reindl Harald
2018-01-08 16:24 ` Wolfgang Denk
@ 2018-01-10 1:57 ` Guoqing Jiang
2 siblings, 0 replies; 37+ messages in thread
From: Guoqing Jiang @ 2018-01-10 1:57 UTC (permalink / raw)
To: Wols Lists, mdraid.pkoch, linux-raid
On 01/08/2018 11:16 PM, Wols Lists wrote:
> On 08/01/18 07:31, Guoqing Jiang wrote:
>>
>> On 01/06/2018 11:44 PM, mdraid.pkoch@dfgh.net wrote:
>>> Dear MD-experts:
>>>
>>> I was under the impression that growing a RAID10 device could be done
>>> with an active filesystem running on the device.
>> It depends on whether the specific filesystem provides related tool or
>> not, eg,
>> resize2fs can serve ext fs:
> Sorry Guoqing, but I think you've *completely* missed the point :-(
Yes, I just want to point out that it is safer to umount the fs first,
then do the reshape.
There is another deadlock issue, mentioned by BingJing, which happened
during the reshape stage while some I/O came from the vfs layer, though
it is not the same issue.
>> https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem
>>
>> And you can use xfs_growfs for your purpose.
> You extend the filesystem *after* you've grown the array. The act of
> growing the array has caused the filesystem to crash. That should NOT
> happen - the act of growing the array should be *invisible* to the
> filesystem.
>
> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs
IMHO, there could be a potential issue inside raid10/5, so I choose 2.
Anyway, I will try to simulate the scenario and see what will happen ...
Thanks,
Guoqing
* Growing RAID10 with active XFS filesystem
@ 2018-01-08 19:06 mdraid.pkoch
0 siblings, 0 replies; 37+ messages in thread
From: mdraid.pkoch @ 2018-01-08 19:06 UTC (permalink / raw)
To: linux-raid
Dear Linux-Raid and Linux-XFS experts:
I'm posting this on both the linux-raid and linux-xfs
mailing lists as it's not clear at this point whether
this is an MD or XFS problem.
I have described my problem in a recent posting on
linux-raid and Wol's conclusion was:
> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs
>
> Cheers,
> Wol
There's very important data on our RAID10 device but I doubt
it's important enough for God to take a hand into our storage.
But let me first summarize what happened and why I believe that
this is an XFS-problem:
Machine running Linux 3.14.69 with no kernel-patches.
XFS filesystem was created with XFS userutils 3.1.11.
I did a fresh compile of xfsprogs-4.9.0 yesterday when
I realized that the 3.1.11 xfs_repair did not help.
mdadm is V3.3
/dev/md5 is a RAID10-device that was created in Feb 2013
with 10 2TB disks and an ext3 filesystem on it. Once in a
while I added two more 2TB disks. Reshaping was done
while the ext3 filesystem was mounted. Then the ext3
filesystem was unmounted, resized, and mounted again. That
worked until I resized the RAID10 from 16 to 20 disks and
realized that ext3 does not support filesystems >16TB.
I switched to XFS and created a 20TB filesystem. Here are
the details:
# xfs_info /dev/md5
meta-data=/dev/md5               isize=256    agcount=32, agsize=152608128 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=4883457280, imaxpct=5
         =                       sunit=128    swidth=1280 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Please notice: this XFS-filesystem has a size of
4883457280*4K = 19,533,829,120K
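As a sanity check, the geometry reported above is internally consistent with the array; a quick shell calculation, using only numbers taken verbatim from the xfs_info and mdadm output in this thread:

```shell
# Cross-check xfs_info geometry against the RAID10 (sunit/swidth here are
# in 4 KiB filesystem blocks, as printed by xfs_info).
SUNIT_FSB=128; SWIDTH_FSB=1280; BSIZE=4096; FS_BLOCKS=4883457280
echo "chunk   = $((SUNIT_FSB * BSIZE / 1024))K"   # 512K, matches mdadm -D
echo "stripes = $((SWIDTH_FSB / SUNIT_FSB))"      # 10 data stripes: 20 disks, near=2
echo "fs size = $((FS_BLOCKS * 4))K"              # 19,533,829,120K
```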
On Saturday I tried to add two more 2TB disks to the RAID10
and the XFS filesystem was mounted (and in medium use) at that
time. Commands were:
# mdadm /dev/md5 --add /dev/sdo
# mdadm --grow /dev/md5 --raid-devices=21
# mdadm -D /dev/md5
/dev/md5:
Version : 1.2
Creation Time : Sun Feb 10 16:58:10 2013
Raid Level : raid10
Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
Raid Devices : 21
Total Devices : 21
Persistence : Superblock is persistent
Update Time : Sat Jan 6 15:08:37 2018
State : clean, reshaping
Active Devices : 21
Working Devices : 21
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Reshape Status : 1% complete
Delta Devices : 1, (20->21)
Name : backup:5 (local to host backup)
UUID : 9030ff07:6a292a3c:26589a26:8c92a488
Events : 86002
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 65 48 1 active sync /dev/sdt
2 8 64 2 active sync /dev/sde
3 65 96 3 active sync /dev/sdw
4 8 112 4 active sync /dev/sdh
5 65 144 5 active sync /dev/sdz
6 8 160 6 active sync /dev/sdk
7 65 192 7 active sync /dev/sdac
8 8 208 8 active sync /dev/sdn
9 65 240 9 active sync /dev/sdaf
10 65 0 10 active sync /dev/sdq
11 66 32 11 active sync /dev/sdai
12 8 32 12 active sync /dev/sdc
13 65 64 13 active sync /dev/sdu
14 8 80 14 active sync /dev/sdf
15 65 112 15 active sync /dev/sdx
16 8 128 16 active sync /dev/sdi
17 65 160 17 active sync /dev/sdaa
18 8 176 18 active sync /dev/sdl
19 65 208 19 active sync /dev/sdad
20 8 224 20 active sync /dev/sdo
Please notice: this RAID10-device has a size of 19,533,829,120K,
exactly the same size as the contained XFS-filesystem.
Immediately after the RAID10 reshape operation started, the
XFS-filesystem reported I/O-errors and was severely damaged.
I waited for the reshape operation to finish and tried to repair
the filesystem with xfs_repair (version 3.1.11), but xfs_repair
crashed, so I tried the 4.9.0 version of xfs_repair with no luck
either.
/dev/md5 is now mounted ro,norecovery with an overlay filesystem
on top of it (thanks very much to Andreas for that idea) and I have
setup a new server today. Rsyncing the data to the new server will
take a while and I'm sure I will stumble on lots of corrupted files.
I proceeded from XFS to ZFS (skipped YFS) so lengthy reshape
operations won't happen in the future anymore.
Here are the relevant log messages:
> Jan 6 14:45:00 backup kernel: md: reshape of RAID array md5
> Jan 6 14:45:00 backup kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> Jan 6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Jan 6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
> Jan 6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> Jan 6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> Jan 6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> Jan 6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> ... hundreds of the above XFS-messages deleted
> Jan 6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected. Shutting down filesystem
> Jan 6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)
Please notice: no error message about hardware-problems.
All 21 disks are fine and the next messages from the
md-driver were:
> Jan 7 02:28:02 backup kernel: md: md5: reshape done.
> Jan 7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680
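The size of that jump is about what adding a single disk to a near=2 RAID10 should contribute: half of one member device, minus a little chunk-size rounding. A quick check with the numbers from the log and mdadm -D above:

```shell
# Capacity delta from the log vs. half of one member device (20 -> 21 disks).
OLD=20002641018880; NEW=21002772807680   # bytes, from the kernel log
HALF_DEV=$((1953382912 * 1024 / 2))      # Used Dev Size (KiB) in bytes, halved
echo "delta    = $((NEW - OLD))"
echo "half dev = ${HALF_DEV}"            # differs only by sub-chunk rounding
```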
I'm wondering about one thing: the first xfs message is about a
metadata I/O error on block 0x12c08f360. Since the xfs filesystem
has a blocksize of 4K this block is located at position 20135005568K
which is beyond the end of the RAID10 device. No wonder that the
xfs driver receives an I/O error. And also no wonder that the
filesystem is severely corrupted right now.
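That arithmetic can be checked directly in the shell:

```shell
# Verify that the failing 4 KiB metadata block lies beyond the end of the
# 19,533,829,120K filesystem/array.
FS_END_K=$((4883457280 * 4))    # filesystem size in KiB
ERR_K=$((0x12c08f360 * 4))      # offset of the failing block, in KiB
echo "fs ends at ${FS_END_K}K, error at ${ERR_K}K"
test "$ERR_K" -gt "$FS_END_K" && echo "the error is past the old end of the device"
```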
Question 1: How did the xfs driver know on Jan 6 that the RAID10
device was about to be increased from 20TB to 21TB on Jan 7?
Question 2: Why did the xfs driver start to use the additional
space that was not yet there, without me executing xfs_growfs?
This looks like a severe XFS-problem to me.
But my hope is that all the data that was within the filesystem
before Jan 6 14:45 is not involved in the corruption. If xfs
started to use space beyond the end of the underlying raid
device this should have affected only data that was created,
modified or deleted after Jan 6 14:45.
If that was true we could clearly distinguish between data
that we must dump and data that we can keep. The machine is
our backup system (as you may have guessed from its name)
and I would like to keep old backup-files.
I remember that mkfs.xfs is clever enough to adapt the
filesystem parameters to the underlying hardware of the
block device that the xfs filesystem is created on. Hence
from the xfs drivers point of view the underlying block
device is not just a sequence of data blocks, but the xfs
driver knows something about the layout of the underlying
hardware.
If that was true - how does the xfs driver react if that
information about the layout of the underlying hardware
changes while the xfs-filesystem is mounted?
Seems to be an interesting problem
Kind regards
Peter Koch
* Re: Growing RAID10 with active XFS filesystem
2018-01-08 15:16 ` Wols Lists
2018-01-08 15:34 ` Reindl Harald
@ 2018-01-08 16:24 ` Wolfgang Denk
2018-01-10 1:57 ` Guoqing Jiang
2 siblings, 0 replies; 37+ messages in thread
From: Wolfgang Denk @ 2018-01-08 16:24 UTC (permalink / raw)
To: Wols Lists; +Cc: Guoqing Jiang, mdraid.pkoch, linux-raid
Dear Wol,
In message <5A538B30.4080601@youngman.org.uk> you wrote:
>
> You extend the filesystem *after* you've grown the array. The act of
> growing the array has caused the filesystem to crash. That should NOT
> happen - the act of growing the array should be *invisible* to the
> filesystem.
Not if this causes any hard I/O errors...
> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs
The original log contained this:
| XFS (md5): metadata I/O error: block 0x12c08f360
| ("xfs_trans_read_buf_map") error 5 numblks 16
| XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
| XFS (md5): metadata I/O error: block 0x12c08f360
| ("xfs_trans_read_buf_map") error 5 numblks 16
| XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
| XFS (md5): metadata I/O error: block 0xebb62c00
| ("xfs_trans_read_buf_map") error 5 numblks 16
| XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
| ...
| ... lots of the above messages deleted
| ...
| XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file
| fs/xfs/xfs_bmap_util.c. Return address = 0xffffffff8113908f
| XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error 5
| numblks 64
| XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
| fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
| XFS (md5): Log I/O Error Detected. Shutting down filesystem
To me this looks as if during the growing of the array some hard I/O
errors happened. That may have been triggered by the growing of
the array, but only insofar as it caused additional disk load /
reading of otherwise idle areas.
I cannot see any indications for 2) or 3) here, so yes, it was 1),
if you consider spurious I/O errors as such.
Or am I missing something else?
Best regards,
Wolfgang Denk
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Experience is that marvelous thing that enable you to recognize a
mistake when you make it again. - Franklin P. Jones
* Re: Growing RAID10 with active XFS filesystem
2018-01-08 15:16 ` Wols Lists
@ 2018-01-08 15:34 ` Reindl Harald
2018-01-08 16:24 ` Wolfgang Denk
2018-01-10 1:57 ` Guoqing Jiang
2 siblings, 0 replies; 37+ messages in thread
From: Reindl Harald @ 2018-01-08 15:34 UTC (permalink / raw)
To: Wols Lists, Guoqing Jiang, mdraid.pkoch, linux-raid
Am 08.01.2018 um 16:16 schrieb Wols Lists:
> On 08/01/18 07:31, Guoqing Jiang wrote:
>> https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem
>>
>> And you can use xfs_growfs for your purpose.
>
> You extend the filesystem *after* you've grown the array. The act of
> growing the array has caused the filesystem to crash. That should NOT
> happen - the act of growing the array should be *invisible* to the
> filesystem.
>
> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs
3) should not be possible, because as long as 2) is running the filesystem
should not know that anything is changing at all - and even "should" is the
wrong word: it MUST NOT
* Re: Growing RAID10 with active XFS filesystem
2018-01-08 7:31 ` Guoqing Jiang
@ 2018-01-08 15:16 ` Wols Lists
2018-01-08 15:34 ` Reindl Harald
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-08 15:16 UTC (permalink / raw)
To: Guoqing Jiang, mdraid.pkoch, linux-raid
On 08/01/18 07:31, Guoqing Jiang wrote:
>
>
> On 01/06/2018 11:44 PM, mdraid.pkoch@dfgh.net wrote:
>> Dear MD-experts:
>>
>> I was under the impression that growing a RAID10 device could be done
>> with an active filesystem running on the device.
>
> It depends on whether the specific filesystem provides related tool or
> not, eg,
> resize2fs can serve ext fs:
Sorry Guoqing, but I think you've *completely* missed the point :-(
>
> https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem
>
> And you can use xfs_growfs for your purpose.
You extend the filesystem *after* you've grown the array. The act of
growing the array has caused the filesystem to crash. That should NOT
happen - the act of growing the array should be *invisible* to the
filesystem.
In other words, one or more of the following three are true :-
1) The OP has been caught by some random act of God
2) There's a serious flaw in "mdadm --grow"
3) There's a serious flaw in xfs
Cheers,
Wol
* Re: Growing RAID10 with active XFS filesystem
2018-01-06 15:44 mdraid.pkoch
2018-01-07 19:33 ` John Stoffel
2018-01-07 20:16 ` Andreas Klauer
@ 2018-01-08 7:31 ` Guoqing Jiang
2018-01-08 15:16 ` Wols Lists
2 siblings, 1 reply; 37+ messages in thread
From: Guoqing Jiang @ 2018-01-08 7:31 UTC (permalink / raw)
To: mdraid.pkoch, linux-raid
On 01/06/2018 11:44 PM, mdraid.pkoch@dfgh.net wrote:
> Dear MD-experts:
>
> I was under the impression that growing a RAID10 device could be done
> with an active filesystem running on the device.
It depends on whether the specific filesystem provides related tool or
not, eg,
resize2fs can serve ext fs:
https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem
And you can use xfs_growfs for your purpose.
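For reference, the sequence implied here looks roughly like the following sketch (device names and the mount point are illustrative, not taken from this thread):

```shell
mdadm /dev/md5 --add /dev/sdX             # add the new disk as a spare
mdadm --grow /dev/md5 --raid-devices=21   # start the reshape
mdadm --wait /dev/md5                     # block until the reshape completes
xfs_growfs /data                          # then grow the (mounted) XFS filesystem
```

Note that xfs_growfs operates on a mounted filesystem and is run only after the array reshape has finished.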
>
> I did this a couple of times when I added additional 2TB disks to our
> production RAID10 running an ext3 filesystem. That was a very
> time-consuming process, and we had to use the filesystem during the reshape.
>
> When I increased the size of the RAID10 from 16 to 20 2TB-disks I could
> not use ext3 anymore due to the 16TB maximum size limitation of ext3
> and I replaced the ext3 filesystem by xfs.
>
> Now today I increased the RAID10 again from 20 to 21 disks with the
> following commands:
>
> mdadm /dev/md5 --add /dev/sdo
> mdadm --grow /dev/md5 --raid-devices=21
>
> My plans were to add another disk after that and then grow
> the XFS-filesystem. I do not add multiple disks at once since
> it's hard to predict which disk will end up in which disk set.
>
> Here's mdadm -D /dev/md5 output:
> /dev/md5:
> Version : 1.2
> Creation Time : Sun Feb 10 16:58:10 2013
> Raid Level : raid10
> Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
> Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
> Raid Devices : 21
> Total Devices : 21
> Persistence : Superblock is persistent
>
> Update Time : Sat Jan 6 15:08:37 2018
> State : clean, reshaping
> Active Devices : 21
> Working Devices : 21
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : near=2
> Chunk Size : 512K
>
> Reshape Status : 1% complete
> Delta Devices : 1, (20->21)
>
> Name : backup:5 (local to host backup)
> UUID : 9030ff07:6a292a3c:26589a26:8c92a488
> Events : 86002
>
> Number Major Minor RaidDevice State
> 0 8 16 0 active sync /dev/sdb
> 1 65 48 1 active sync /dev/sdt
> 2 8 64 2 active sync /dev/sde
> 3 65 96 3 active sync /dev/sdw
> 4 8 112 4 active sync /dev/sdh
> 5 65 144 5 active sync /dev/sdz
> 6 8 160 6 active sync /dev/sdk
> 7 65 192 7 active sync /dev/sdac
> 8 8 208 8 active sync /dev/sdn
> 9 65 240 9 active sync /dev/sdaf
> 10 65 0 10 active sync /dev/sdq
> 11 66 32 11 active sync /dev/sdai
> 12 8 32 12 active sync /dev/sdc
> 13 65 64 13 active sync /dev/sdu
> 14 8 80 14 active sync /dev/sdf
> 15 65 112 15 active sync /dev/sdx
> 16 8 128 16 active sync /dev/sdi
> 17 65 160 17 active sync /dev/sdaa
> 18 8 176 18 active sync /dev/sdl
> 19 65 208 19 active sync /dev/sdad
> 20 8 224 20 active sync /dev/sdo
>
>
> As you can see the array-size is still 20TB.
Because the reshaping is not finished yet.
>
> Just one second after starting the reshape operation
> XFS failed with the following messages:
>
> # dmesg
> ...
> RAID10 conf printout:
> --- wd:21 rd:21
> disk 0, wo:0, o:1, dev:sdb
> disk 1, wo:0, o:1, dev:sdt
> disk 2, wo:0, o:1, dev:sde
> disk 3, wo:0, o:1, dev:sdw
> disk 4, wo:0, o:1, dev:sdh
> disk 5, wo:0, o:1, dev:sdz
> disk 6, wo:0, o:1, dev:sdk
> disk 7, wo:0, o:1, dev:sdac
> disk 8, wo:0, o:1, dev:sdn
> disk 9, wo:0, o:1, dev:sdaf
> disk 10, wo:0, o:1, dev:sdq
> disk 11, wo:0, o:1, dev:sdai
> disk 12, wo:0, o:1, dev:sdc
> disk 13, wo:0, o:1, dev:sdu
> disk 14, wo:0, o:1, dev:sdf
> disk 15, wo:0, o:1, dev:sdx
> disk 16, wo:0, o:1, dev:sdi
> disk 17, wo:0, o:1, dev:sdaa
> disk 18, wo:0, o:1, dev:sdl
> disk 19, wo:0, o:1, dev:sdad
> disk 20, wo:1, o:1, dev:sdo
> md: reshape of RAID array md5
> md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> md: using maximum available idle IO bandwidth (but not more than
> 200000 KB/sec) for reshape.
> md: using 128k window, over a total of 19533829120k.
> XFS (md5): metadata I/O error: block 0x12c08f360
> ("xfs_trans_read_buf_map") error 5 numblks 16
> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> XFS (md5): metadata I/O error: block 0x12c08f360
> ("xfs_trans_read_buf_map") error 5 numblks 16
> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> XFS (md5): metadata I/O error: block 0xebb62c00
> ("xfs_trans_read_buf_map") error 5 numblks 16
> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> ...
> ... lots of the above messages deleted
> ...
> XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file
> fs/xfs/xfs_bmap_util.c. Return address = 0xffffffff8113908f
> XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
> XFS (md5): Log I/O Error Detected. Shutting down filesystem
> XFS (md5): Please umount the filesystem and rectify the problem(s)
> XFS (md5): metadata I/O error: block 0x48c710b40 ("xlog_iodone") error
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710b80 ("xlog_iodone") error
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710bc0 ("xlog_iodone") error
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710c00 ("xlog_iodone") error
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710c40 ("xlog_iodone") error
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710c80 ("xlog_iodone") error
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710cc0 ("xlog_iodone") error
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
> XFS (md5): I/O Error Detected. Shutting down filesystem
I guess the I/Os from XFS were competing with md's internal reshape I/O - not good.
> I did an "umount /dev/md5" and now I'm wondering what my options are:
Though XFS filesystems can be grown while mounted, it is better to unmount
first if an md reshape is in progress.
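A conservative ordering along those lines might look like this (the mount point and spare device are examples, not from this thread):

```shell
umount /data                              # quiesce the filesystem first
mdadm /dev/md5 --add /dev/sdX             # add the new disk as a spare
mdadm --grow /dev/md5 --raid-devices=21   # start the reshape
mdadm --wait /dev/md5                     # wait for the reshape to finish
mount /dev/md5 /data                      # remount...
xfs_growfs /data                          # ...then grow (xfs_growfs needs a mounted fs)
```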
> Should I wait until the reshape has finished? I assume yes since
> stopping that operation will most likely make things worse.
> Unfortunately reshaping a 20TB RAID10 to 21TB will take about
> 10 hours, but it's Saturday and I have approx. 40 hours to fix the
> problem until Monday morning.
>
> Should I reduce array-size back to 20 disks?
>
> My plans are to run xfs_check first, maybe followed by xfs_repair and
> see what happens.
>
> Any other suggestions?
>
> Do you have an explanation why reshaping a RAID10 with a running
> ext3 filesystem does work, while a running XFS filesystem fails during
> a reshape?
>
> How did the XFS-filesystem notice that a reshape was running? I was
> sure that during the reshape operation every single block of the RAID10
> device could be read or written, no matter whether it belongs to the part
> of the RAID that was already reshaped or not. Obviously that's working
> in theory only - or with ext3-filesystems only.
If the I/O from the filesystem can conflict with the reshape I/O, that
could mean trouble. So again, it is safer to unmount the filesystem
before reshaping - my $0.02.
Thanks,
Guoqing
* Re: Growing RAID10 with active XFS filesystem
2018-01-06 15:44 mdraid.pkoch
2018-01-07 19:33 ` John Stoffel
@ 2018-01-07 20:16 ` Andreas Klauer
2018-01-08 7:31 ` Guoqing Jiang
2 siblings, 0 replies; 37+ messages in thread
From: Andreas Klauer @ 2018-01-07 20:16 UTC (permalink / raw)
To: mdraid.pkoch; +Cc: linux-raid
On Sat, Jan 06, 2018 at 04:44:12PM +0100, mdraid.pkoch@dfgh.net wrote:
> Now today I increased the RAID10 again from 20 to 21 disks with the
> following commands:
>
> mdadm /dev/md5 --add /dev/sdo
> mdadm --grow /dev/md5 --raid-devices=21
>
> Just one second after starting the reshape operation
> XFS failed with the following messages:
>
> md: reshape of RAID array md5
> md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> md: using maximum available idle IO bandwidth (but not more than 200000
> KB/sec) for reshape.
> md: using 128k window, over a total of 19533829120k.
> XFS (md5): metadata I/O error: block 0x12c08f360
> ("xfs_trans_read_buf_map") error 5 numblks 16
Ouch. No idea what happened there.
Use overlays to try to recover. Don't write to the array anymore.
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
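The overlay technique from the wiki page above can be sketched as follows; this is a hedged outline only, and the device list, overlay sizes, and paths are assumptions, not details from this thread:

```shell
# For each member disk, stack a device-mapper snapshot on top of it so that
# all writes land in a sparse overlay file and the disk itself stays untouched.
for d in /dev/sdb /dev/sdt /dev/sde; do        # ...one entry per member disk
    ovl="/tmp/overlay-$(basename "$d").img"
    truncate -s 4G "$ovl"                      # sparse file absorbs the writes
    loop=$(losetup -f --show "$ovl")
    size=$(blockdev --getsz "$d")              # size in 512-byte sectors
    # snapshot target: reads come from $d, writes go only to the overlay
    dmsetup create "ovl-$(basename "$d")" \
        --table "0 $size snapshot $d $loop P 8"
done
# then assemble and repair against /dev/mapper/ovl-* instead of /dev/sd*
```

If a repair attempt on the overlays goes wrong, the overlays can simply be torn down and recreated; the underlying disks are never modified.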
I tried to reproduce your problem, created a 20 drive RAID,
and a while loop to grow to 21 drives, then shrink back to 20.
truncate -s 100M {001..021}
losetup ...
mdadm --create /dev/md42 --level=10 --raid-devices=20 /dev/loop{1..20}
mdadm --grow /dev/md42 --add /dev/loop21
while :
do
mdadm --wait /dev/md42
mdadm --grow /dev/md42 --raid-devices=21
mdadm --wait /dev/md42
mdadm --grow /dev/md42 --array-size 1013760
mdadm --wait /dev/md42
mdadm --grow /dev/md42 --raid-devices=20
done
Then I put XFS on top and another while loop to extract a Linux tarball.
while :
do
tar xf linux-4.13.4.tar.xz
sync
rm -rf linux-4.13.4
sync
done
Both running in parallel ad infinitum.
I couldn't get the XFS to corrupt.
mdadm itself eventually died, though.
It told me two drives had failed although none did, and it refused to
continue the grow operation. Unless I'm missing something, the degraded
counter seems to have gone out of whack. There was nothing in dmesg.
# cat /sys/block/md42/md/degraded
2
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md42 : active raid10 loop20[19] loop19[18] loop18[17] loop17[16] loop16[15] loop15[14] loop14[13] loop13[12] loop12[11] loop11[10] loop10[9] loop9[8] loop8[7] loop7[6] loop6[5] loop5[4] loop4[3] loop3[2] loop2[1] loop1[0]
1013760 blocks super 1.2 512K chunks 2 near-copies [20/18] [UUUUUUUUUUUUUUUUUUUU]
After stopping and re-assembling the array, degraded went back to 0.
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md42 : active raid10 loop1[0] loop20[19] loop19[18] loop18[17] loop17[16] loop16[15] loop15[14] loop14[13] loop13[12] loop12[11] loop11[10] loop10[9] loop9[8] loop8[7] loop7[6] loop6[5] loop5[4] loop4[3] loop3[2] loop2[1]
1013760 blocks super 1.2 512K chunks 2 near-copies [20/20] [UUUUUUUUUUUUUUUUUUUU]
But this should be unrelated to your issue.
No idea what happened to you.
Sorry.
Regards
Andreas Klauer
* Re: Growing RAID10 with active XFS filesystem
2018-01-06 15:44 mdraid.pkoch
@ 2018-01-07 19:33 ` John Stoffel
2018-01-07 20:16 ` Andreas Klauer
2018-01-08 7:31 ` Guoqing Jiang
2 siblings, 0 replies; 37+ messages in thread
From: John Stoffel @ 2018-01-07 19:33 UTC (permalink / raw)
To: mdraid.pkoch; +Cc: linux-raid
mdraid> I was under the impression that growing a RAID10 device could
mdraid> be done with an active filesystem running on the device.
It should be just fine. But in this case, you might also want to talk
with the XFS experts.
mdraid> I did this a couple of times when I added additional 2TB disks
mdraid> to our production RAID10 running an ext3 filesystem. That was
mdraid> a very time-consuming process, and we had to use the filesystem
mdraid> during the reshape.
What kernel and distro are you running here? What are the mdadm tools
versions? You need to give more details please.
mdraid> When I increased the size of the RAID10 from 16 to 20
mdraid> 2TB-disks I could not use ext3 anymore due to the 16TB maximum
mdraid> size limitation of ext3, and I replaced the ext3 filesystem by
mdraid> xfs.
That must have been fun... not.
mdraid> Now today I increased the RAID10 again from 20 to 21 disks with the
mdraid> following commands:
mdraid> mdadm /dev/md5 --add /dev/sdo
mdraid> mdadm --grow /dev/md5 --raid-devices=21
mdraid> My plans were to add another disk after that and then grow
mdraid> the XFS-filesystem. I do not add multiple disks at once since
mdraid> it's hard to predict which disk will end up in which disk set.
mdraid> Here's mdadm -D /dev/md5 output:
mdraid> /dev/md5:
mdraid> Version : 1.2
mdraid> Creation Time : Sun Feb 10 16:58:10 2013
mdraid> Raid Level : raid10
mdraid> Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
mdraid> Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
mdraid> Raid Devices : 21
mdraid> Total Devices : 21
mdraid> Persistence : Superblock is persistent
mdraid> Update Time : Sat Jan 6 15:08:37 2018
mdraid> State : clean, reshaping
mdraid> Active Devices : 21
mdraid> Working Devices : 21
mdraid> Failed Devices : 0
mdraid> Spare Devices : 0
mdraid> Layout : near=2
mdraid> Chunk Size : 512K
mdraid> Reshape Status : 1% complete
mdraid> Delta Devices : 1, (20->21)
mdraid> Name : backup:5 (local to host backup)
mdraid> UUID : 9030ff07:6a292a3c:26589a26:8c92a488
mdraid> Events : 86002
mdraid> Number Major Minor RaidDevice State
mdraid> 0 8 16 0 active sync /dev/sdb
mdraid> 1 65 48 1 active sync /dev/sdt
mdraid> 2 8 64 2 active sync /dev/sde
mdraid> 3 65 96 3 active sync /dev/sdw
mdraid> 4 8 112 4 active sync /dev/sdh
mdraid> 5 65 144 5 active sync /dev/sdz
mdraid> 6 8 160 6 active sync /dev/sdk
mdraid> 7 65 192 7 active sync /dev/sdac
mdraid> 8 8 208 8 active sync /dev/sdn
mdraid> 9 65 240 9 active sync /dev/sdaf
mdraid> 10 65 0 10 active sync /dev/sdq
mdraid> 11 66 32 11 active sync /dev/sdai
mdraid> 12 8 32 12 active sync /dev/sdc
mdraid> 13 65 64 13 active sync /dev/sdu
mdraid> 14 8 80 14 active sync /dev/sdf
mdraid> 15 65 112 15 active sync /dev/sdx
mdraid> 16 8 128 16 active sync /dev/sdi
mdraid> 17 65 160 17 active sync /dev/sdaa
mdraid> 18 8 176 18 active sync /dev/sdl
mdraid> 19 65 208 19 active sync /dev/sdad
mdraid> 20 8 224 20 active sync /dev/sdo
This all looks fine... but I'm thinking what you *should* have done
instead is build a bunch of 2TB pairs, and then use LVM to span across
them with a volume, then build your XFS filesystem on top of that.
This way you would have /dev/md1,2,3,4,5,6,7,8,9,10 all inside a VG,
then you would use LVM to stripe across the pairs.
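A rough sketch of that layout, purely hypothetical (device names, VG/LV names, and stripe size are made up for illustration):

```shell
# Ten RAID1 pairs instead of one big RAID10:
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdt
# ...repeat for md2..md10 with the remaining pairs...

pvcreate /dev/md{1..10}                    # each mirror pair becomes a PV
vgcreate vg_backup /dev/md{1..10}
lvcreate -i 10 -I 512 -l 100%FREE \
         -n lv_backup vg_backup            # stripe across all ten pairs
mkfs.xfs /dev/vg_backup/lv_backup
```

Growing then means creating one more RAID1 pair, running vgextend plus lvextend, and finally xfs_growfs, with no reshape of existing arrays at all.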
But that's water under the bridge now.
mdraid> As you can see the array-size is still 20TB.
mdraid> Just one second after starting the reshape operation
mdraid> XFS failed with the following messages:
I *think* the mdadm --grow did something, or XFS noticed the change in
array size and grew on its own. Can you provide the output of
'xfs_info /dev/md5' for us?
mdraid> # dmesg
mdraid> ...
mdraid> RAID10 conf printout:
mdraid> --- wd:21 rd:21
mdraid> disk 0, wo:0, o:1, dev:sdb
mdraid> disk 1, wo:0, o:1, dev:sdt
mdraid> disk 2, wo:0, o:1, dev:sde
mdraid> disk 3, wo:0, o:1, dev:sdw
mdraid> disk 4, wo:0, o:1, dev:sdh
mdraid> disk 5, wo:0, o:1, dev:sdz
mdraid> disk 6, wo:0, o:1, dev:sdk
mdraid> disk 7, wo:0, o:1, dev:sdac
mdraid> disk 8, wo:0, o:1, dev:sdn
mdraid> disk 9, wo:0, o:1, dev:sdaf
mdraid> disk 10, wo:0, o:1, dev:sdq
mdraid> disk 11, wo:0, o:1, dev:sdai
mdraid> disk 12, wo:0, o:1, dev:sdc
mdraid> disk 13, wo:0, o:1, dev:sdu
mdraid> disk 14, wo:0, o:1, dev:sdf
mdraid> disk 15, wo:0, o:1, dev:sdx
mdraid> disk 16, wo:0, o:1, dev:sdi
mdraid> disk 17, wo:0, o:1, dev:sdaa
mdraid> disk 18, wo:0, o:1, dev:sdl
mdraid> disk 19, wo:0, o:1, dev:sdad
mdraid> disk 20, wo:1, o:1, dev:sdo
mdraid> md: reshape of RAID array md5
mdraid> md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
mdraid> md: using maximum available idle IO bandwidth (but not more than 200000
mdraid> KB/sec) for reshape.
mdraid> md: using 128k window, over a total of 19533829120k.
mdraid> XFS (md5): metadata I/O error: block 0x12c08f360
mdraid> ("xfs_trans_read_buf_map") error 5 numblks 16
mdraid> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
mdraid> XFS (md5): metadata I/O error: block 0x12c08f360
mdraid> ("xfs_trans_read_buf_map") error 5 numblks 16
mdraid> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
mdraid> XFS (md5): metadata I/O error: block 0xebb62c00
mdraid> ("xfs_trans_read_buf_map") error 5 numblks 16
mdraid> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
mdraid> ...
mdraid> ... lots of the above messages deleted
mdraid> ...
mdraid> XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file
mdraid> fs/xfs/xfs_bmap_util.c. Return address = 0xffffffff8113908f
mdraid> XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error 5
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
mdraid> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): Log I/O Error Detected. Shutting down filesystem
mdraid> XFS (md5): Please umount the filesystem and rectify the problem(s)
mdraid> XFS (md5): metadata I/O error: block 0x48c710b40 ("xlog_iodone") error 5
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
mdraid> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710b80 ("xlog_iodone") error 5
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
mdraid> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710bc0 ("xlog_iodone") error 5
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
mdraid> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710c00 ("xlog_iodone") error 5
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
mdraid> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710c40 ("xlog_iodone") error 5
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
mdraid> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710c80 ("xlog_iodone") error 5
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
mdraid> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710cc0 ("xlog_iodone") error 5
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
mdraid> fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): I/O Error Detected. Shutting down filesystem
mdraid> I did an "umount /dev/md5" and now I'm wondering what my options are:
What does 'xfs_repair -n /dev/md5' say?
mdraid> Should I wait until the reshape has finished? I assume yes
mdraid> since stopping that operation will most likely make things
mdraid> worse. Unfortunately reshaping a 20TB RAID10 to 21TB will
mdraid> take about 10 hours, but it's Saturday and I have approx. 40
mdraid> hours to fix the problem until Monday morning.
Are you still having the problem?
mdraid> Should I reduce array-size back to 20 disks?
I don't think so.
mdraid> My plans are to run xfs_check first, maybe followed by
mdraid> xfs_repair and see what happens.
Talk to the XFS folks first, before you do anything!
mdraid> Any other suggestions?
mdraid> Do you have an explanation why reshaping a RAID10 with a running
mdraid> ext3 filesystem does work, while a running XFS filesystem fails during
mdraid> a reshape?
mdraid> How did the XFS-filesystem notice that a reshape was running? I was
mdraid> sure that during the reshape operation every single block of the RAID10
mdraid> device could be read or written, no matter whether it belongs to the part
mdraid> of the RAID that was already reshaped or not. Obviously that's working
mdraid> in theory only - or with ext3-filesystems only.
mdraid> Or was I totally wrong in my assumption?
mdraid> Much thanks in advance for any assistance.
mdraid> Peter Koch
* Growing RAID10 with active XFS filesystem
@ 2018-01-06 15:44 mdraid.pkoch
2018-01-07 19:33 ` John Stoffel
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: mdraid.pkoch @ 2018-01-06 15:44 UTC (permalink / raw)
To: linux-raid
Dear MD-experts:
I was under the impression that growing a RAID10 device could be done
with an active filesystem running on the device.
I did this a couple of times when I added additional 2TB disks to our
production RAID10 running an ext3 filesystem. That was a very
time-consuming process, and we had to use the filesystem during the reshape.
When I increased the size of the RAID10 from 16 to 20 2TB-disks I could
not use ext3 anymore due to the 16TB maximum size limitation of ext3,
and I replaced the ext3 filesystem by xfs.
Now today I increased the RAID10 again from 20 to 21 disks with the
following commands:
mdadm /dev/md5 --add /dev/sdo
mdadm --grow /dev/md5 --raid-devices=21
My plans were to add another disk after that and then grow
the XFS-filesystem. I do not add multiple disks at once since
it's hard to predict which disk will end up in which disk set.
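The one-disk-at-a-time approach can be followed with commands along these lines (a sketch; nothing here is specific to this array):

```shell
cat /proc/mdstat                            # shows "reshape = N% ..." progress
mdadm --detail /dev/md5 | grep -i reshape   # Reshape Status / Delta Devices
mdadm --wait /dev/md5                       # returns only once the reshape is done
# only then add the next disk and start the next reshape
```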
Here's mdadm -D /dev/md5 output:
/dev/md5:
Version : 1.2
Creation Time : Sun Feb 10 16:58:10 2013
Raid Level : raid10
Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
Raid Devices : 21
Total Devices : 21
Persistence : Superblock is persistent
Update Time : Sat Jan 6 15:08:37 2018
State : clean, reshaping
Active Devices : 21
Working Devices : 21
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Reshape Status : 1% complete
Delta Devices : 1, (20->21)
Name : backup:5 (local to host backup)
UUID : 9030ff07:6a292a3c:26589a26:8c92a488
Events : 86002
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 65 48 1 active sync /dev/sdt
2 8 64 2 active sync /dev/sde
3 65 96 3 active sync /dev/sdw
4 8 112 4 active sync /dev/sdh
5 65 144 5 active sync /dev/sdz
6 8 160 6 active sync /dev/sdk
7 65 192 7 active sync /dev/sdac
8 8 208 8 active sync /dev/sdn
9 65 240 9 active sync /dev/sdaf
10 65 0 10 active sync /dev/sdq
11 66 32 11 active sync /dev/sdai
12 8 32 12 active sync /dev/sdc
13 65 64 13 active sync /dev/sdu
14 8 80 14 active sync /dev/sdf
15 65 112 15 active sync /dev/sdx
16 8 128 16 active sync /dev/sdi
17 65 160 17 active sync /dev/sdaa
18 8 176 18 active sync /dev/sdl
19 65 208 19 active sync /dev/sdad
20 8 224 20 active sync /dev/sdo
As you can see the array-size is still 20TB.
Just one second after starting the reshape operation
XFS failed with the following messages:
# dmesg
...
RAID10 conf printout:
--- wd:21 rd:21
disk 0, wo:0, o:1, dev:sdb
disk 1, wo:0, o:1, dev:sdt
disk 2, wo:0, o:1, dev:sde
disk 3, wo:0, o:1, dev:sdw
disk 4, wo:0, o:1, dev:sdh
disk 5, wo:0, o:1, dev:sdz
disk 6, wo:0, o:1, dev:sdk
disk 7, wo:0, o:1, dev:sdac
disk 8, wo:0, o:1, dev:sdn
disk 9, wo:0, o:1, dev:sdaf
disk 10, wo:0, o:1, dev:sdq
disk 11, wo:0, o:1, dev:sdai
disk 12, wo:0, o:1, dev:sdc
disk 13, wo:0, o:1, dev:sdu
disk 14, wo:0, o:1, dev:sdf
disk 15, wo:0, o:1, dev:sdx
disk 16, wo:0, o:1, dev:sdi
disk 17, wo:0, o:1, dev:sdaa
disk 18, wo:0, o:1, dev:sdl
disk 19, wo:0, o:1, dev:sdad
disk 20, wo:1, o:1, dev:sdo
md: reshape of RAID array md5
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000
KB/sec) for reshape.
md: using 128k window, over a total of 19533829120k.
XFS (md5): metadata I/O error: block 0x12c08f360
("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
XFS (md5): metadata I/O error: block 0x12c08f360
("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
XFS (md5): metadata I/O error: block 0xebb62c00
("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
...
... lots of the above messages deleted
...
XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file
fs/xfs/xfs_bmap_util.c. Return address = 0xffffffff8113908f
XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error 5
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
XFS (md5): Log I/O Error Detected. Shutting down filesystem
XFS (md5): Please umount the filesystem and rectify the problem(s)
XFS (md5): metadata I/O error: block 0x48c710b40 ("xlog_iodone") error 5
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710b80 ("xlog_iodone") error 5
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710bc0 ("xlog_iodone") error 5
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c00 ("xlog_iodone") error 5
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c40 ("xlog_iodone") error 5
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c80 ("xlog_iodone") error 5
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710cc0 ("xlog_iodone") error 5
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
fs/xfs/xfs_log.c. Return address = 0xffffffff8117cdf4
XFS (md5): I/O Error Detected. Shutting down filesystem
I did an "umount /dev/md5" and now I'm wondering what my options are:
Should I wait until the reshape has finished? I assume yes since
stopping that operation will most likely make things worse.
Unfortunately reshaping a 20TB RAID10 to 21TB will take about
10 hours, but it's Saturday and I have approx. 40 hours to fix the
problem until Monday morning.
Should I reduce array-size back to 20 disks?
My plans are to run xfs_check first, maybe followed by xfs_repair and
see what happens.
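One cautious way to order those checks, sketched here under the assumption that the filesystem stays unmounted and the reshape has completed first (note that xfs_check is deprecated in newer xfsprogs in favor of xfs_repair -n):

```shell
xfs_repair -n /dev/md5   # dry run: report problems, modify nothing
# only if the dry run looks sane, and ideally against overlays or with
# backups in place:
xfs_repair /dev/md5
```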
Any other suggestions?
Do you have an explanation why reshaping a RAID10 with a running
ext3 filesystem does work, while a running XFS filesystem fails during
a reshape?
How did the XFS-filesystem notice that a reshape was running? I was
sure that during the reshape operation every single block of the RAID10
device could be read or written, no matter whether it belongs to the part
of the RAID that was already reshaped or not. Obviously that's working
in theory only - or with ext3-filesystems only.
Or was I totally wrong in my assumption?
Much thanks in advance for any assistance.
Peter Koch
end of thread, other threads:[~2018-01-15 17:08 UTC | newest]
Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-08 19:08 Growing RAID10 with active XFS filesystem xfs.pkoch
2018-01-08 19:26 ` Darrick J. Wong
2018-01-08 22:01 ` Dave Chinner
2018-01-08 23:44 ` mdraid.pkoch
2018-01-08 23:44 ` xfs.pkoch
2018-01-09 9:36 ` Wols Lists
2018-01-09 21:47 ` IMAP-FCC:Sent
2018-01-09 22:25 ` Dave Chinner
2018-01-09 22:32 ` Reindl Harald
2018-01-10 6:17 ` Wols Lists
2018-01-11 2:14 ` Dave Chinner
2018-01-12 2:16 ` Guoqing Jiang
2018-01-10 14:10 ` Phil Turmel
2018-01-10 21:57 ` Wols Lists
2018-01-11 3:07 ` Dave Chinner
2018-01-12 13:32 ` Wols Lists
2018-01-12 14:25 ` Emmanuel Florac
2018-01-12 17:52 ` Wols Lists
2018-01-12 18:37 ` Emmanuel Florac
2018-01-12 19:35 ` Wol's lists
2018-01-13 12:30 ` Brad Campbell
2018-01-13 13:18 ` Wols Lists
2018-01-13 0:20 ` Stan Hoeppner
2018-01-13 19:29 ` Wol's lists
2018-01-13 22:40 ` Dave Chinner
2018-01-13 23:04 ` Wols Lists
2018-01-14 21:33 ` Wol's lists
2018-01-15 17:08 ` Emmanuel Florac
-- strict thread matches above, loose matches on Subject: below --
2018-01-08 19:06 mdraid.pkoch
2018-01-06 15:44 mdraid.pkoch
2018-01-07 19:33 ` John Stoffel
2018-01-07 20:16 ` Andreas Klauer
2018-01-08 7:31 ` Guoqing Jiang
2018-01-08 15:16 ` Wols Lists
2018-01-08 15:34 ` Reindl Harald
2018-01-08 16:24 ` Wolfgang Denk
2018-01-10 1:57 ` Guoqing Jiang