All of lore.kernel.org
 help / color / mirror / Atom feed
* Growing RAID10 with active XFS filesystem
@ 2018-01-08 19:08 xfs.pkoch
  2018-01-08 19:26 ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: xfs.pkoch @ 2018-01-08 19:08 UTC (permalink / raw)
  To: linux-xfs

Dear Linux-Raid and Linux-XFS experts:

I'm posting this on both the linux-raid and linux-xfs
mailing lists as it's not clear at this point whether
this is an MD or an XFS problem.

I have described my problem in a recent posting on
linux-raid and Wol's conclusion was:

> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs
>
> Cheers,
> Wol

There's very important data on our RAID10 device but I doubt
it's important enough for God to take a hand into our storage.

But let me first summarize what happened and why I believe that
this is an XFS-problem:

Machine running Linux 3.14.69 with no kernel-patches.

XFS filesystem was created with XFS userutils 3.1.11.
I did a fresh compile of xfsprogs-4.9.0 yesterday when
I realized that the 3.1.11 xfs_repair did not help.

mdadm is V3.3

/dev/md5 is a RAID10-device that was created in Feb 2013
with 10 2TB disks and an ext3 filesystem on it. Once in a
while I added two more 2TB disks. Reshaping was done
while the ext3 filesystem was mounted. Then the ext3
filesystem was unmounted, resized and mounted again. That
worked until I resized the RAID10 from 16 to 20 disks and
realized that ext3 does not support filesystems >16TB.

I switched to XFS and created a 20TB filesystem. Here are
the details:

# xfs_info /dev/md5
meta-data=/dev/md5               isize=256    agcount=32, agsize=152608128 blks
           =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=4883457280, imaxpct=5
           =                       sunit=128    swidth=1280 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
           =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Please notice: This XFS filesystem has a size of
4883457280*4K = 19,533,829,120K

On Saturday I tried to add two more 2TB disks to the RAID10
and the XFS filesystem was mounted (and in medium use) at that
time. Commands were:

# mdadm /dev/md5 --add /dev/sdo
# mdadm --grow /dev/md5 --raid-devices=21

# mdadm -D /dev/md5
/dev/md5:
          Version : 1.2
    Creation Time : Sun Feb 10 16:58:10 2013
       Raid Level : raid10
       Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
    Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
     Raid Devices : 21
    Total Devices : 21
      Persistence : Superblock is persistent

      Update Time : Sat Jan  6 15:08:37 2018
            State : clean, reshaping
   Active Devices : 21
Working Devices : 21
   Failed Devices : 0
    Spare Devices : 0

           Layout : near=2
       Chunk Size : 512K

   Reshape Status : 1% complete
    Delta Devices : 1, (20->21)

             Name : backup:5  (local to host backup)
             UUID : 9030ff07:6a292a3c:26589a26:8c92a488
           Events : 86002

      Number   Major   Minor   RaidDevice State
         0       8       16        0      active sync   /dev/sdb
         1      65       48        1      active sync   /dev/sdt
         2       8       64        2      active sync   /dev/sde
         3      65       96        3      active sync   /dev/sdw
         4       8      112        4      active sync   /dev/sdh
         5      65      144        5      active sync   /dev/sdz
         6       8      160        6      active sync   /dev/sdk
         7      65      192        7      active sync   /dev/sdac
         8       8      208        8      active sync   /dev/sdn
         9      65      240        9      active sync   /dev/sdaf
        10      65        0       10      active sync   /dev/sdq
        11      66       32       11      active sync   /dev/sdai
        12       8       32       12      active sync   /dev/sdc
        13      65       64       13      active sync   /dev/sdu
        14       8       80       14      active sync   /dev/sdf
        15      65      112       15      active sync   /dev/sdx
        16       8      128       16      active sync   /dev/sdi
        17      65      160       17      active sync   /dev/sdaa
        18       8      176       18      active sync   /dev/sdl
        19      65      208       19      active sync   /dev/sdad
        20       8      224       20      active sync   /dev/sdo

Please notice: This RAID10 device has a size of 19,533,829,120K,
which is exactly the same size as the contained XFS filesystem.

Immediately after the RAID10 reshape operation started, the
XFS filesystem reported I/O errors and was severely damaged.
I waited for the reshape operation to finish and tried to repair
the filesystem with xfs_repair (version 3.1.11), but xfs_repair
crashed, so I tried the 4.9.0 version of xfs_repair, with no luck
either.

/dev/md5 is now mounted ro,norecovery with an overlay filesystem
on top of it (thanks very much to Andreas for that idea) and I have
set up a new server today. Rsyncing the data to the new server will
take a while and I'm sure I will stumble on lots of corrupted files.
I proceeded from XFS to ZFS (skipping YFS) so lengthy reshape
operations won't happen anymore.
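
For readers unfamiliar with the overlay trick: a copy-on-write overlay
over a block device can be built roughly like this with a device-mapper
snapshot backed by a sparse file (only a sketch - names and sizes are
made up, and this is not necessarily what was done here):

# truncate -s 100G /tmp/md5-cow.img      # sparse file to hold copy-on-write data
# losetup /dev/loop0 /tmp/md5-cow.img
# dmsetup create md5-overlay --table "0 $(blockdev --getsz /dev/md5) snapshot /dev/md5 /dev/loop0 P 8"
# mount -o ro,norecovery /dev/mapper/md5-overlay /mnt/recovery

Anything that writes to the overlay (repair experiments, log replay)
then lands in the sparse file instead of on /dev/md5 itself.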

Here are the relevant log messages:

> Jan  6 14:45:00 backup kernel: md: reshape of RAID array md5
> Jan  6 14:45:00 backup kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> Jan  6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Jan  6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
> Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> ... hundreds of the above XFS-messages deleted
> Jan  6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected.  Shutting down filesystem
> Jan  6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)

Please notice: no error messages about hardware problems.
All 21 disks are fine and the next messages from the
md-driver were:

> Jan  7 02:28:02 backup kernel: md: md5: reshape done.
> Jan  7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680

I'm wondering about one thing: the first XFS message is about a
metadata I/O error on block 0x12c08f360. Since the XFS filesystem
has a blocksize of 4K, this block is located at position 20,135,005,568K,
which is beyond the end of the RAID10 device. No wonder that the
xfs driver receives an I/O error. And also no wonder that the
filesystem is severely corrupted right now.

Question 1: How did the xfs driver know on Jan 6 that the RAID10
device was about to be increased from 20TB to 21TB on Jan 7?

Question 2: Why did the xfs driver start to use the additional
space that was not yet there, without me executing xfs_growfs?

This looks like a severe XFS-problem to me.

But my hope is that all the data that was within the filesystem
before Jan 6 14:45 is not involved in the corruption. If XFS
started to use space beyond the end of the underlying RAID
device, this should have affected only data that was created,
modified or deleted after Jan 6 14:45.

If that were true, we could clearly distinguish between data
that we must dump and data that we can keep. The machine is
our backup system (as you may have guessed from its name)
and I would like to keep old backup-files.

I remember that mkfs.xfs is clever enough to adapt the
filesystem parameters to the underlying hardware of the
block device that the XFS filesystem is created on. Hence,
from the xfs driver's point of view the underlying block
device is not just a sequence of data blocks, but the xfs
driver knows something about the layout of the underlying
hardware.

If that is true - how does the xfs driver react if that
information about the layout of the underlying hardware
changes while the XFS filesystem is mounted?

Seems to be an interesting problem

Kind regards

Peter Koch


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-08 19:08 Growing RAID10 with active XFS filesystem xfs.pkoch
@ 2018-01-08 19:26 ` Darrick J. Wong
  2018-01-08 22:01   ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2018-01-08 19:26 UTC (permalink / raw)
  To: xfs.pkoch; +Cc: linux-xfs

On Mon, Jan 08, 2018 at 08:08:09PM +0100, xfs.pkoch@dfgh.net wrote:
> Dear Linux-Raid and Linux-XFS experts:
> 
> I'm posting this on both the linux-raid and linux-xfs
> mailing lists as it's not clear at this point whether
> this is an MD or an XFS problem.
> 
> I have described my problem in a recent posting on
> linux-raid and Wol's conclusion was:
> 
> >In other words, one or more of the following three are true :-
> >1) The OP has been caught by some random act of God
> >2) There's a serious flaw in "mdadm --grow"
> >3) There's a serious flaw in xfs
> >
> >Cheers,
> >Wol
> 
> There's very important data on our RAID10 device but I doubt
> it's important enough for God to take a hand into our storage.
> 
> But let me first summarize what happened and why I believe that
> this is an XFS-problem:
> 
> Machine running Linux 3.14.69 with no kernel-patches.
> 
> XFS filesystem was created with XFS userutils 3.1.11.
> I did a fresh compile of xfsprogs-4.9.0 yesterday when
> I realized that the 3.1.11 xfs_repair did not help.
> 
> mdadm is V3.3
> 
> /dev/md5 is a RAID10-device that was created in Feb 2013
> with 10 2TB disks and an ext3 filesystem on it. Once in a
> while I added two more 2TB disks. Reshaping was done
> while the ext3 filesystem was mounted. Then the ext3
> filesystem was unmounted, resized and mounted again. That
> worked until I resized the RAID10 from 16 to 20 disks and
> realized that ext3 does not support filesystems >16TB.
> 
> I switched to XFS and created a 20TB filesystem. Here are
> the details:
> 
> # xfs_info /dev/md5
> meta-data=/dev/md5               isize=256    agcount=32, agsize=152608128 blks
>           =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=4883457280, imaxpct=5
>           =                       sunit=128    swidth=1280 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>           =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Please notice: This XFS filesystem has a size of
> 4883457280*4K = 19,533,829,120K
> 
> On Saturday I tried to add two more 2TB disks to the RAID10
> and the XFS filesystem was mounted (and in medium use) at that
> time. Commands were:
> 
> # mdadm /dev/md5 --add /dev/sdo
> # mdadm --grow /dev/md5 --raid-devices=21
> 
> # mdadm -D /dev/md5
> /dev/md5:
>          Version : 1.2
>    Creation Time : Sun Feb 10 16:58:10 2013
>       Raid Level : raid10
>       Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
>    Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
>     Raid Devices : 21
>    Total Devices : 21
>      Persistence : Superblock is persistent
> 
>      Update Time : Sat Jan  6 15:08:37 2018
>            State : clean, reshaping
>   Active Devices : 21
> Working Devices : 21
>   Failed Devices : 0
>    Spare Devices : 0
> 
>           Layout : near=2
>       Chunk Size : 512K
> 
>   Reshape Status : 1% complete
>    Delta Devices : 1, (20->21)
> 
>             Name : backup:5  (local to host backup)
>             UUID : 9030ff07:6a292a3c:26589a26:8c92a488
>           Events : 86002
> 
>      Number   Major   Minor   RaidDevice State
>         0       8       16        0      active sync   /dev/sdb
>         1      65       48        1      active sync   /dev/sdt
>         2       8       64        2      active sync   /dev/sde
>         3      65       96        3      active sync   /dev/sdw
>         4       8      112        4      active sync   /dev/sdh
>         5      65      144        5      active sync   /dev/sdz
>         6       8      160        6      active sync   /dev/sdk
>         7      65      192        7      active sync   /dev/sdac
>         8       8      208        8      active sync   /dev/sdn
>         9      65      240        9      active sync   /dev/sdaf
>        10      65        0       10      active sync   /dev/sdq
>        11      66       32       11      active sync   /dev/sdai
>        12       8       32       12      active sync   /dev/sdc
>        13      65       64       13      active sync   /dev/sdu
>        14       8       80       14      active sync   /dev/sdf
>        15      65      112       15      active sync   /dev/sdx
>        16       8      128       16      active sync   /dev/sdi
>        17      65      160       17      active sync   /dev/sdaa
>        18       8      176       18      active sync   /dev/sdl
>        19      65      208       19      active sync   /dev/sdad
>        20       8      224       20      active sync   /dev/sdo
> 
> Please notice: This RAID10 device has a size of 19,533,829,120K,
> which is exactly the same size as the contained XFS filesystem.
> 
> Immediately after the RAID10 reshape operation started, the
> XFS filesystem reported I/O errors and was severely damaged.
> I waited for the reshape operation to finish and tried to repair
> the filesystem with xfs_repair (version 3.1.11), but xfs_repair
> crashed, so I tried the 4.9.0 version of xfs_repair, with no luck
> either.
> 
> /dev/md5 is now mounted ro,norecovery with an overlay filesystem
> on top of it (thanks very much to Andreas for that idea) and I have
> set up a new server today. Rsyncing the data to the new server will
> take a while and I'm sure I will stumble on lots of corrupted files.
> I proceeded from XFS to ZFS (skipping YFS) so lengthy reshape
> operations won't happen anymore.
> 
> Here are the relevant log messages:
> 
> >Jan  6 14:45:00 backup kernel: md: reshape of RAID array md5
> >Jan  6 14:45:00 backup kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> >Jan  6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> >Jan  6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
> >Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> >Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> >Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> >Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> >... hundreds of the above XFS-messages deleted
> >Jan  6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected.  Shutting down filesystem
> >Jan  6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)
> 
> Please notice: no error messages about hardware problems.
> All 21 disks are fine and the next messages from the
> md-driver were:
> 
> >Jan  7 02:28:02 backup kernel: md: md5: reshape done.
> >Jan  7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680
> 
> I'm wondering about one thing: the first XFS message is about a
> metadata I/O error on block 0x12c08f360. Since the XFS filesystem

I'm sure Dave will have more to say about this, but...

"block 0x12c08f360" == units of sectors, not fs blocks.

IOWs, this IO error happened at offset 2,577,280,712,704 (~2.5TB)
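
A quick shell check of the two readings (illustrative only):

$ printf '%d\n' $((0x12c08f360 * 512))    # interpreted as 512-byte sectors
2577280712704
$ printf '%d\n' $((0x12c08f360 * 4096))   # misread as 4KiB filesystem blocks
20618245701632

The second number (~20.6TB) is where the "beyond the end of the
device" impression came from; the first (~2.5TB) is the real offset.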

XFS doesn't change the fs size until you tell it to (via growfs);
even if the underlying storage geometry changes, XFS won't act on it
until the admin tells it to.
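
For reference, the usual grow sequence after an md reshape looks
roughly like this (a sketch; device and mount point are placeholders):

# mdadm --wait /dev/md5            # block until the reshape has finished
# mdadm -D /dev/md5 | grep 'Array Size'
# xfs_growfs /mnt/backup           # only now grow the mounted filesystem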

What did xfs_repair do?

--D

> has a blocksize of 4K, this block is located at position 20,135,005,568K,
> which is beyond the end of the RAID10 device. No wonder that the
> xfs driver receives an I/O error. And also no wonder that the
> filesystem is severely corrupted right now.
> 
> Question 1: How did the xfs driver know on Jan 6 that the RAID10
> device was about to be increased from 20TB to 21TB on Jan 7?
> 
> Question 2: Why did the xfs driver start to use the additional
> space that was not yet there, without me executing xfs_growfs?
> 
> This looks like a severe XFS-problem to me.
> 
> But my hope is that all the data that was within the filesystem
> before Jan 6 14:45 is not involved in the corruption. If XFS
> started to use space beyond the end of the underlying RAID
> device, this should have affected only data that was created,
> modified or deleted after Jan 6 14:45.
> 
> If that were true, we could clearly distinguish between data
> that we must dump and data that we can keep. The machine is
> our backup system (as you may have guessed from its name)
> and I would like to keep old backup-files.
> 
> I remember that mkfs.xfs is clever enough to adapt the
> filesystem parameters to the underlying hardware of the
> block device that the XFS filesystem is created on. Hence,
> from the xfs driver's point of view the underlying block
> device is not just a sequence of data blocks, but the xfs
> driver knows something about the layout of the underlying
> hardware.
> 
> If that is true - how does the xfs driver react if that
> information about the layout of the underlying hardware
> changes while the XFS filesystem is mounted?
> 
> Seems to be an interesting problem
> 
> Kind regards
> 
> Peter Koch
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-08 19:26 ` Darrick J. Wong
@ 2018-01-08 22:01   ` Dave Chinner
  2018-01-08 23:44       ` xfs.pkoch
  2018-01-09  9:36     ` Wols Lists
  0 siblings, 2 replies; 37+ messages in thread
From: Dave Chinner @ 2018-01-08 22:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs.pkoch, linux-xfs, linux-raid

[cc linux-raid, like the OP intended to do]

[For XFS folk, the original linux-raid thread is here:
 https://marc.info/?l=linux-raid&m=151525346428531&w=2 ]

On Mon, Jan 08, 2018 at 11:26:07AM -0800, Darrick J. Wong wrote:
> On Mon, Jan 08, 2018 at 08:08:09PM +0100, xfs.pkoch@dfgh.net wrote:
> > Dear Linux-Raid and Linux-XFS experts:
> > 
> > I'm posting this on both the linux-raid and linux-xfs
> > mailing lists as it's not clear at this point whether
> > this is an MD or an XFS problem.
> > 
> > I have described my problem in a recent posting on
> > linux-raid and Wol's conclusion was:
> > 
> > >In other words, one or more of the following three are true :-
> > >1) The OP has been caught by some random act of God
> > >2) There's a serious flaw in "mdadm --grow"
> > >3) There's a serious flaw in xfs
> > >
> > >Cheers,
> > >Wol
> > 
> > There's very important data on our RAID10 device but I doubt
> > it's important enough for God to take a hand into our storage.
> > 
> > But let me first summarize what happened and why I believe that
> > this is an XFS-problem:

The evidence doesn't support that claim.

tl;dr: block device IO errors occurred immediately after a MD
reshape started and the filesystem simply reported and responded
appropriately to those MD device IO errors.

> > Machine running Linux 3.14.69 with no kernel-patches.

So really old kernel....

> > XFS filesystem was created with XFS userutils 3.1.11.

And a really old userspace, too.

> > I did a fresh compile of xfsprogs-4.9.0 yesterday when
> > I realized that the 3.1.11 xfs_repair did not help.
> > 
> > mdadm is V3.3
> > 
> > /dev/md5 is a RAID10-device that was created in Feb 2013
> > with 10 2TB disks and an ext3 filesystem on it. Once in a
> > while I added two more 2TB disks. Reshaping was done
> > while the ext3 filesystem was mounted. Then the ext3
> > filesystem was unmounted, resized and mounted again. That
> > worked until I resized the RAID10 from 16 to 20 disks and
> > realized that ext3 does not support filesystems >16TB.
> > 
> > I switched to XFS and created a 20TB filesystem. Here are
> > the details:
> > 
> > # xfs_info /dev/md5
> > meta-data=/dev/md5               isize=256    agcount=32, agsize=152608128 blks
> >           =                       sectsz=512   attr=2
> > data     =                       bsize=4096   blocks=4883457280, imaxpct=5
> >           =                       sunit=128    swidth=1280 blks
> > naming   =version 2              bsize=4096   ascii-ci=0
> > log      =internal               bsize=4096   blocks=521728, version=2
> >           =                       sectsz=512   sunit=8 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > Please notice: This XFS filesystem has a size of
> > 4883457280*4K = 19,533,829,120K
> > 
> > On Saturday I tried to add two more 2TB disks to the RAID10
> > and the XFS filesystem was mounted (and in medium use) at that
> > time. Commands were:
> > 
> > # mdadm /dev/md5 --add /dev/sdo
> > # mdadm --grow /dev/md5 --raid-devices=21

You added one device, not two. That's a recipe for a reshape that
moves every block of data in the device to a different location.

> > # mdadm -D /dev/md5
> > /dev/md5:
> >          Version : 1.2
> >    Creation Time : Sun Feb 10 16:58:10 2013
> >       Raid Level : raid10
> >       Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
> >    Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
> >     Raid Devices : 21
> >    Total Devices : 21
> >      Persistence : Superblock is persistent
> > 
> >      Update Time : Sat Jan  6 15:08:37 2018
> >            State : clean, reshaping
> >   Active Devices : 21
> > Working Devices : 21
> >   Failed Devices : 0
> >    Spare Devices : 0
> > 
> >           Layout : near=2
> >       Chunk Size : 512K
> > 
> >   Reshape Status : 1% complete
> >    Delta Devices : 1, (20->21)

Yup, 21 devices in a RAID 10. That's a really nasty config for
RAID10 which requires an even number of disks to mirror correctly.
Why does MD even allow this sort of whacky, sub-optimal
configuration?

[....]

> > Immediately after the RAID10 reshape operation started, the
> > XFS filesystem reported I/O errors and was severely damaged.
> > I waited for the reshape operation to finish and tried to repair
> > the filesystem with xfs_repair (version 3.1.11), but xfs_repair
> > crashed, so I tried the 4.9.0 version of xfs_repair, with no luck
> > either.

[...]

> > Here are the relevant log messages:
> > 
> > >Jan  6 14:45:00 backup kernel: md: reshape of RAID array md5
> > >Jan  6 14:45:00 backup kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> > >Jan  6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> > >Jan  6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
> > >Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> > >Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> > >Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
> > >Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> > >... hundreds of the above XFS-messages deleted
> > >Jan  6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected.  Shutting down filesystem
> > >Jan  6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)

IOWs, within /a second/ of the reshape starting, the active, error
free XFS filesystem received hundreds of IO errors on both read and
write IOs from the MD device and shut down the filesystem.

XFS is just the messenger here - something has gone badly wrong at
the MD layer when the reshape kicked off.

> > Please notice: no error messages about hardware problems.
> > All 21 disks are fine and the next messages from the
> > md-driver were:
> > 
> > >Jan  7 02:28:02 backup kernel: md: md5: reshape done.
> > >Jan  7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680

Ok, so the reshape took about 12 hours to run, and it grew to 21TB.
A 12 hour long operation is what I'd expect for a major
rearrangement of every block in the MD device....

> > I'm wondering about one thing: the first XFS message is about a
> > metadata I/O error on block 0x12c08f360. Since the XFS filesystem
> 
> I'm sure Dave will have more to say about this, but...
> 
> "block 0x12c08f360" == units of sectors, not fs blocks.
> 
> IOWs, this IO error happened at offset 2,577,280,712,704 (~2.5TB)

That's correct, Darrick - it's well within the known filesystem
bounds.

> XFS doesn't change the fs size until you tell it to (via growfs);
> even if the underlying storage geometry changes, XFS won't act on it
> until the admin tells it to.
> 
> What did xfs_repair do?

Yeah, I'd like to see that output (from 4.9.0) too, but experience
tells me it did nothing helpful w.r.t data recovery from a badly
corrupted device.... :/

> > This looks like a severe XFS-problem to me.

I'll say this again: the evidence does not support that conclusion.

XFS has done exactly the right thing to protect the filesystem when
fatal IO errors started occurring at the block layer: it shut down
and stopped trying to modify the filesystem.  What caused those
errors and any filesystem and/or data corruption to occur, OTOH, has
nothing to do with XFS.

> > But my hope is that all the data that was within the filesystem
> > before Jan 6 14:45 is not involved in the corruption. If xfs
> > started to use space beyond the end of the underlying raid
> > device this should have affected only data that was created,
> > modified or deleted after Jan 6 14:45.

Experience tells me that you cannot trust a single byte of data in
that block device now, regardless of its age and when it was last
modified. The MD reshape may have completed, but what it did is
highly questionable and you need to verify the contents of every
single directory and file.

When this sort of things happens, often the best data recovery
strategy (i.e. fastest and most reliable) is to simply throw
everything away and restore from known good backups...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-08 22:01   ` Dave Chinner
@ 2018-01-08 23:44       ` xfs.pkoch
  2018-01-09  9:36     ` Wols Lists
  1 sibling, 0 replies; 37+ messages in thread
From: mdraid.pkoch @ 2018-01-08 23:44 UTC (permalink / raw)
  Cc: linux-raid, xfs.pkoch.f85f873813.linux-xfs#vger.linux-raid

Hi Dave and Darrick:

Thanks for your answers - seems like my interpretation of the
block number was wrong.

So the culprit is the md-driver again. It's producing I/O errors
without any hardware errors.

The machine was set up in 2013, so everything is 5 years old
except for the xfsprogs, which I compiled yesterday.

The xfs_repair output is very long and my impression is that things
were getting worse with every invocation. xfs_repair itself seemed
to have problems. I don't remember the exact message, but
xfs_repair was complaining a lot about a failed write verifier test.

I will copy as much data as I can from the corrupt filesystem to
our new system. For most files we have md5 checksums, so I
can test whether their contents are OK or not.
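
For example, roughly like this (paths are made up):

# cd /mnt/recovered-copy
# md5sum -c /path/to/checksums.md5 2>/dev/null | grep -v ': OK$' > /root/suspect-files.txt

md5sum -c prints one "<file>: OK" or "<file>: FAILED" line per entry,
so everything that is not OK ends up in the suspect list.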

I started xfs_repair -n 20 minutes ago and it has already printed
1165088 lines of messages.

Here are some of these lines:

Phase 1 - find and verify superblock...
         - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
         - zero log...
         - scan filesystem freespace and inode maps...
block (30,18106993-18106993) multiply claimed by cnt space tree, state - 2
block (30,18892669-18892669) multiply claimed by cnt space tree, state - 2
block (30,18904839-18904839) multiply claimed by cnt space tree, state - 2
block (30,19815542-19815542) multiply claimed by cnt space tree, state - 2
block (30,15440783-15440783) multiply claimed by cnt space tree, state - 2
block (30,17658438-17658438) multiply claimed by cnt space tree, state - 2
block (30,18749167-18749167) multiply claimed by cnt space tree, state - 2
block (30,19778684-19778684) multiply claimed by cnt space tree, state - 2
block (30,19951864-19951864) multiply claimed by cnt space tree, state - 2
block (30,19816441-19816441) multiply claimed by cnt space tree, state - 2
block (30,18742154-18742154) multiply claimed by cnt space tree, state - 2
block (30,18132613-18132613) multiply claimed by cnt space tree, state - 2
block (30,15502870-15502870) multiply claimed by cnt space tree, state - 2
agf_freeblks 12543116, counted 12543086 in ag 9
block (30,18168170-18168170) multiply claimed by cnt space tree, state - 2
agf_freeblks 6317001, counted 6316991 in ag 25
agf_freeblks 8962131, counted 8962128 in ag 0
block (1,6142-6142) multiply claimed by cnt space tree, state - 2
block (1,6150-6150) multiply claimed by cnt space tree, state - 2
agf_freeblks 8043945, counted 8043942 in ag 21
agf_freeblks 6833504, counted 6833499 in ag 24
block (1,5777-5777) multiply claimed by cnt space tree, state - 2
agf_freeblks 9032166, counted 9032109 in ag 19
agf_freeblks 16877231, counted 16874747 in ag 30
agf_freeblks 6645873, counted 6645861 in ag 27
block (1,8388992-8388992) multiply claimed by cnt space tree, state - 2
agf_freeblks 21229271, counted 21234873 in ag 1
agf_freeblks 11090766, counted 11090638 in ag 14
agf_freeblks 8424280, counted 8424279 in ag 13
agf_freeblks 1618763, counted 1618764 in ag 16
agf_freeblks 5380834, counted 5380831 in ag 15
agf_freeblks 11211636, counted 11211543 in ag 12
agf_freeblks 14135461, counted 14135434 in ag 11
sb_fdblocks 344528311, counted 344530989
         - 00:51:27: scanning filesystem freespace - 32 of 32 allocation 
groups done
         - found root inode chunk
Phase 3 - for each AG...
         - scan (but don't clear) agi unlinked lists...
         - 00:51:27: scanning agi unlinked lists - 32 of 32 allocation 
groups done
         - process known inodes and perform inode discovery...
         - agno = 0
         - agno = 30
         - agno = 15
bad nblocks 17 for inode 64425222202, would reset to 18
bad nextents 12 for inode 64425222202, would reset to 13
Invalid inode number 0xfeffffffffffffff
xfs_dir_ino_validate: XFS_ERROR_REPORT
Metadata corruption detected at xfs_dir3_data block 0x4438f5c60/0x1000
entry "/463380382.M621183P10446.mail,S=2075,W=2116" at block 12 offset 
2192 in directory inode 64425222202 references invalid inode 
18374686479671623679
         would clear inode number in entry at offset 2192...
entry at block 12 offset 2192 in directory inode 64425222202 has illegal 
name "/463380382.M621183P10446.mail,S=2075,W=2116": would clear entry
entry "/463466963.M420615P6276.mail,S=2202,W=2261" at block 12 offset 
2472 in directory inode 64425222202 references invalid inode 
18374686479671623679
         would clear inode number in entry at offset 2472...
entry at block 12 offset 2472 in directory inode 64425222202 has illegal 
name "/463466963.M420615P6276.mail,S=2202,W=2261": would clear entry
entry "/463980159.M342359P4014.mail,S=3285,W=3378" at block 12 offset 
3376 in directory inode 64425222202 references invalid inode 
18374686479671623679
         would clear inode number in entry at offset 3376...
entry at block 12 offset 3376 in directory inode 64425222202 has illegal 
name "/463980159.M342359P4014.mail,S=3285,W=3378": would clear entry
entry "/463984373.M513992P19720.mail,S=10818,W=11143" at block 12 offset 
3432 in directory inode 64425222202 references invalid inode 
18374686479671623679
.....
..... thousands of messages about directory inodes referencing inode
0xfeffffffffffffff
..... and illegal names where the first character has been replaced by /
..... most agnos have these messages, but some agnos are fine
.....
Phase 4 - check for duplicate blocks...
         - setting up duplicate extent list...
         - 01:10:03: setting up duplicate extent list - 32 of 32 
allocation groups done
         - check for inodes claiming duplicate blocks...
         - agno = 15
         - agno = 30
         - agno = 0
entry ".." at block 0 offset 32 in directory inode 128849025043 
references non-existent inode 124835665944
entry ".." at block 0 offset 32 in directory inode 128849348634 
references non-existent inode 124554268735
entry ".." at block 0 offset 32 in directory inode 128849348643 
references non-existent inode 124554274826
entry ".." at block 0 offset 32 in directory inode 128849350697 
references non-existent inode 4295153945
entry ".." at block 0 offset 32 in directory inode 128849352738 
references non-existent inode 124554268679
entry ".." at block 0 offset 32 in directory inode 128849352744 
references non-existent inode 124554268687
entry ".." at block 0 offset 32 in directory inode 128849393697 
references non-existent inode 124554315786
entry ".." at block 0 offset 32 in directory inode 128849397786 
references non-existent inode 124678412289
entry ".." at block 0 offset 32 in directory inode 128849397815 
references non-existent inode 124678412340
entry ".." at block 0 offset 32 in directory inode 128849397821 
references non-existent inode 4295878668
entry ".." at block 0 offset 32 in directory inode 128849399852 
references non-existent inode 124554274851
entry ".." at block 0 offset 32 in directory inode 128849399867 
references non-existent inode 4295020775
entry ".." at block 0 offset 32 in directory inode 128849403936 
references non-existent inode 124554340368
entry ".." at block 0 offset 32 in directory inode 128849412109 
references non-existent inode 124554403877
entry ".." at block 0 offset 32 in directory inode 64425142305 
references non-existent inode 4295153925
bad nblocks 17 for inode 64425222202, would reset to 18
bad nextents 12 for inode 64425222202, would reset to 13
Invalid inode number 0xfeffffffffffffff
xfs_dir_ino_validate: XFS_ERROR_REPORT
Metadata corruption detected at xfs_dir3_data block 0x4438f5c60/0x1000
would clear entry
would clear entry
would clear entry
.....
..... entry ".." at block 0 offset 32 - messages repeat over and over 
with differnt inodes
.....

Phase 5, which produced a lot of messages as well, is skipped
when the -n option is used.

> You added one device, not two. That's a recipe for a reshape that
> moves every block of data in the device to a different location.
Of course I was planning to add another one. If I add both in one
step I cannot predict which disk will end up in disk set A and which
will end up in disk set B. Since both disk sets are at different
locations, I have to add the additional disk at location A first and
then the second disk at location B. Adding two disks in one step does
move every piece of data as well.
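
(For comparison, adding both disks in a single step would have looked
roughly like this - device names are made up:)

# mdadm /dev/md5 --add /dev/sdo /dev/sdp
# mdadm --grow /dev/md5 --raid-devices=22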

> IOWs, within /a second/ of the reshape starting, the active, error
> free XFS filesystem received hundreds of IO errors on both read and
> write IOs from the MD device and shut down the filesystem.
>
> XFS is just the messenger here - something has gone badly wrong at
> the MD layer when the reshape kicked off.
You are right - and this has happened without hardware problems.
> Yeah, I'd like to see that output (from 4.9.0) too, but experience
> tells me it did nothing helpful w.r.t data recovery from a badly
> corrupted device.... :/
You are right again.

>> This looks like a severe XFS-problem to me.
> I'll say this again: the evidence does not support that conclusion.
So let's see what the MD experts have to say.

Kind regards

Peter


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-08 22:01   ` Dave Chinner
  2018-01-08 23:44       ` xfs.pkoch
@ 2018-01-09  9:36     ` Wols Lists
  2018-01-09 21:47       ` IMAP-FCC:Sent
  2018-01-09 22:25       ` Dave Chinner
  1 sibling, 2 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-09  9:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-raid

On 08/01/18 22:01, Dave Chinner wrote:
> Yup, 21 devices in a RAID 10. That's a really nasty config for
> RAID10 which requires an even number of disks to mirror correctly.
> Why does MD even allow this sort of whacky, sub-optimal
> configuration?

Just to point out - if this is raid-10 (and not raid-1+0 which is a
completely different beast) this is actually a normal linux config. I'm
planning to set up a raid-10 across 3 devices. What happens is
that raid-10 writes X copies across Y devices. If X = Y then it's a
normal mirror config, if X < Y it makes good use of space (and if X > Y
it doesn't make sense :-)

SDA: 1, 2, 4, 5

SDB: 1, 3, 4, 6

SDC: 2, 3, 5, 6
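
Such a layout can be created directly, e.g. (a sketch, with purely
illustrative device names):

# mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc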

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-09  9:36     ` Wols Lists
@ 2018-01-09 21:47       ` IMAP-FCC:Sent
  2018-01-09 22:25       ` Dave Chinner
  1 sibling, 0 replies; 37+ messages in thread
From: IMAP-FCC:Sent @ 2018-01-09 21:47 UTC (permalink / raw)
  To: Wols Lists; +Cc: Dave Chinner, linux-xfs, linux-raid

>>>>> "Wols" == Wols Lists <antlists@youngman.org.uk> writes:

Wols> On 08/01/18 22:01, Dave Chinner wrote:
>> Yup, 21 devices in a RAID 10. That's a really nasty config for
>> RAID10 which requires an even number of disks to mirror correctly.
>> Why does MD even allow this sort of whacky, sub-optimal
>> configuration?

Wols> Just to point out - if this is raid-10 (and not raid-1+0 which is a
Wols> completely different beast) this is actually a normal linux config. I'm
Wols> planning to set up a raid-10 across 3 devices. What happens is
Wols> that raid-10 writes X copies across Y devices. If X = Y then it's a
Wols> normal mirror config, if X < Y it makes good use of space (and if X > Y
Wols> it doesn't make sense :-)

Wols> SDA: 1, 2, 4, 5
Wols> SDB: 1, 3, 4, 6
Wols> SDC: 2, 3, 5, 6

This is a nice idea, but honestly, I think it's just asking for
trouble down the line. It's almost more like RAID4 in some ways, but
without parity, just copies.

So I suspect that the problem that's happened here is that some bug in
RAID10 has been found when you do a re-shape (on an old kernel,
RHEL6? Debian?  Not clear...) with a large number of devices.  Since
you have to re-balance the data as new disks are added... it might get
problematic.

In any case, I would recommend that you simply set up RAID1 pairs, then
pull them all into a VG, then create an LV which spans all the RAID1
pairs.  Then you can add new pairs to the system and grow or
shrink the array easily.
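
A rough sketch of that layout (all device names, the VG/LV names and
the mount point are illustrative):

# mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda /dev/sdb
# mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
# pvcreate /dev/md10 /dev/md11
# vgcreate backupvg /dev/md10 /dev/md11
# lvcreate -l 100%FREE -n backuplv backupvg
# mkfs.xfs /dev/backupvg/backuplv

Growing later is then just another pair plus vgextend/lvextend and a
filesystem grow:

# mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sde /dev/sdf
# vgextend backupvg /dev/md12
# lvextend -l +100%FREE /dev/backupvg/backuplv
# xfs_growfs /mountpoint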

This also lets you replace the 2TB disks with 4TB or larger disks more
easily as time goes on.  And of course I'd *also* put in some hot
spares.

But then again, if this is just a dumping ground for data with mostly
reads, or just large sequential writes (say for media, images, video,
etc) then going to RAID6 sets (say 10 or so disks per set) which you THEN
stripe over using LVM is a better way to go.

I'll see if I can find some time to try setting up a bunch of test
loop devices on my own to see what happens here.  But I'm also running
a newer kernel and the Debian Jessie distribution.

But it will probably be Neil who needs to debug the real issue, I
don't know the code well at all.

John

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-09  9:36     ` Wols Lists
  2018-01-09 21:47       ` IMAP-FCC:Sent
@ 2018-01-09 22:25       ` Dave Chinner
  2018-01-09 22:32         ` Reindl Harald
                           ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Dave Chinner @ 2018-01-09 22:25 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-xfs, linux-raid

On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
> On 08/01/18 22:01, Dave Chinner wrote:
> > Yup, 21 devices in a RAID 10. That's a really nasty config for
> > RAID10 which requires an even number of disks to mirror correctly.
> > Why does MD even allow this sort of whacky, sub-optimal
> > configuration?
> 
> Just to point out - if this is raid-10 (and not raid-1+0 which is a
> completely different beast) this is actually a normal linux config. I'm
> planning to set up a raid-10 across 3 devices. What happens is
> that raid-10 writes X copies across Y devices. If X = Y then it's a
> normal mirror config, if X < Y it makes good use of space (and if X > Y
> it doesn't make sense :-)
> 
> SDA: 1, 2, 4, 5
> 
> SDB: 1, 3, 4, 6
> 
> SDC: 2, 3, 5, 6

It's nice to know that MD has redefined RAID-10 to be different to
the industry standard definition that has been used for 20 years and
optimised filesystem layouts for.  Rotoring data across odd numbers
of disks like this is going to really, really suck on filesystems
that are stripe layout aware..

For example, XFS has hot-spot prevention algorithms in its
internal physical layout for striped devices. It aligns AGs across
different stripe units so that metadata and data doesn't all get
aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
rotoring across disks themselves, then we're going to end up back in
the same position we started with - multiple AGs aligned to the
same disk.

The result is that many XFS workloads are going to hotspot disks and
result in unbalanced load when there are an odd number of disks in a
RAID-10 array.  Actually, it's probably worse than having no
alignment, because it makes hotspot occurrence and behaviour very
unpredictable.

Worse is the fact that there's absolutely nothing we can do to
optimise allocation alignment or IO behaviour at the filesystem
level. We'll have to make mkfs.xfs aware of this clusterfuck and
turn off stripe alignment when we detect such a layout, but that
doesn't help all the existing user installations out there right
now.
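
(For what it's worth, for a newly created filesystem the alignment can
already be forced off, or pinned by hand, at mkfs time - a sketch,
assuming the -d noalign and su/sw options of current mkfs.xfs:)

# mkfs.xfs -d noalign /dev/md5            # no stripe alignment at all
# mkfs.xfs -d su=512k,sw=10 /dev/md5      # or pin an explicit geometry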

IMO, odd-numbered disks in RAID-10 should be considered harmful and
never used....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-09 22:25       ` Dave Chinner
@ 2018-01-09 22:32         ` Reindl Harald
  2018-01-10  6:17         ` Wols Lists
  2018-01-10 14:10         ` Phil Turmel
  2 siblings, 0 replies; 37+ messages in thread
From: Reindl Harald @ 2018-01-09 22:32 UTC (permalink / raw)
  To: Dave Chinner, Wols Lists; +Cc: linux-xfs, linux-raid



Am 09.01.2018 um 23:25 schrieb Dave Chinner:
> On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
>> Just to point out - if this is raid-10 (and not raid-1+0 which is a
>> completely different beast) this is actually a normal linux config. I'm
>> planning to set up a raid-10 across 3 devices. What happens is
>> that raid-10 writes X copies across Y devices. If X = Y then it's a
>> normal mirror config, if X < Y it makes good use of space (and if X > Y
>> it doesn't make sense :-)
>>
> IMO, odd-numbered disks in RAID-10 should be considered harmful and
> never used....

agreed and then "writemostly" could work without the lame excues that 
one could have a crazy RAID10 layout.....

https://www.spinics.net/lists/raid/msg55797.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-09 22:25       ` Dave Chinner
  2018-01-09 22:32         ` Reindl Harald
@ 2018-01-10  6:17         ` Wols Lists
  2018-01-11  2:14           ` Dave Chinner
  2018-01-10 14:10         ` Phil Turmel
  2 siblings, 1 reply; 37+ messages in thread
From: Wols Lists @ 2018-01-10  6:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-raid

On 09/01/18 22:25, Dave Chinner wrote:
> On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
>> On 08/01/18 22:01, Dave Chinner wrote:
>>> Yup, 21 devices in a RAID 10. That's a really nasty config for
>>> RAID10 which requires an even number of disks to mirror correctly.
>>> Why does MD even allow this sort of whacky, sub-optimal
>>> configuration?
>>
>> Just to point out - if this is raid-10 (and not raid-1+0 which is a
>> completely different beast) this is actually a normal linux config. I'm
>> planning to set up a raid-10 across 3 devices. What happens is
>> that raid-10 writes X copies across Y devices. If X = Y then it's a
>> normal mirror config, if X < Y it makes good use of space (and if X > Y
>> it doesn't make sense :-)
>>
>> SDA: 1, 2, 4, 5
>>
>> SDB: 1, 3, 4, 6
>>
>> SDC: 2, 3, 5, 6
> 
> It's nice to know that MD has redefined RAID-10 to be different to
> the industry standard definition that has been used for 20 years and
> optimised filesystem layouts for.  Rotoring data across odd numbers
> of disks like this is going to really, really suck on filesystems
> that are stripe layout aware..

Actually, I thought that the industry standard definition referred to
Raid-1+0. It's just colloquially referred to as raid-10.
> 
> For example, XFS has hot-spot prevention algorithms in its
> internal physical layout for striped devices. It aligns AGs across
> different stripe units so that metadata and data doesn't all get
> aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> rotoring across disks themselves, then we're going to end up back in
> the same position we started with - multiple AGs aligned to the
> same disk.

Are you telling me that xfs is aware of the internal structure of an
md-raid array? Given that md-raid is an abstraction layer, this seems
rather dangerous to me - you're breaking the abstraction and this could
explain the OP's problem. Md-raid changed underneath the filesystem, on
the assumption that the filesystem wouldn't notice, and the filesystem
*did*. BANG!
> 
> The result is that many XFS workloads are going to hotspot disks and
> result in unbalanced load when there are an odd number of disks in a
> RAID-10 array.  Actually, it's probably worse than having no
> alignment, because it makes hotspot occurrence and behaviour very
> unpredictable.
> 
> Worse is the fact that there's absolutely nothing we can do to
> optimise allocation alignment or IO behaviour at the filesystem
> level. We'll have to make mkfs.xfs aware of this clusterfuck and
> turn off stripe alignment when we detect such a layout, but that
> doesn't help all the existing user installations out there right
> now.

So you're telling me that mkfs.xfs *IS* aware of the underlying raid
structure. OOPS! What happens when that structure changes, for instance when a
raid-5 is converted to raid-6, or another disk is added? If you have to
have special code to deal with md-raid and changes in said raid, where's
the problem with more code for raid-10?
> 
> IMO, odd-numbered disks in RAID-10 should be considered harmful and
> never used....
> 
What about when you have an odd number of mirrors? :-)

Seriously, can't you just make sure that xfs rotates the stripe units
using a number that is relatively prime to the number of disks? If you
have to notice and adjust for changes in the underlying raid structure
anyway, surely that's no greater hardship?

(Just so's you know who I am, I've taken over editorship of the raid
wiki. This is exactly the stuff that belongs on there, so as soon as I
understand what's going on I'll write it up, and I'm happy to be
educated :-) But I do like to really grasp what's going on, so expect
lots of naive questions ... There's not a lot of information on how raid
and filesystems interact, and I haven't really got to grips with any of
that at the moment, and I don't use xfs. I use ext4 on gentoo, and the
default btrfs on SUSE.)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-09 22:25       ` Dave Chinner
  2018-01-09 22:32         ` Reindl Harald
  2018-01-10  6:17         ` Wols Lists
@ 2018-01-10 14:10         ` Phil Turmel
  2018-01-10 21:57           ` Wols Lists
  2018-01-11  3:07           ` Dave Chinner
  2 siblings, 2 replies; 37+ messages in thread
From: Phil Turmel @ 2018-01-10 14:10 UTC (permalink / raw)
  To: Dave Chinner, Wols Lists; +Cc: linux-xfs, linux-raid

On 01/09/2018 05:25 PM, Dave Chinner wrote:

> It's nice to know that MD has redefined RAID-10 to be different to
> the industry standard definition that has been used for 20 years and
> optimised filesystem layouts for.  Rotoring data across odd numbers
> of disks like this is going to really, really suck on filesystems
> that are stripe layout aware..

You're a bit late to this party, Dave.  MD has implemented raid10 like
this as far back as I can remember, and it is especially valuable when
running more than two copies.  Running raid10,n3 across four or five
devices is a nice capacity boost without giving up triple copies (when
multiples of three aren't available) or giving up the performance of
mirrored raid.

> For example, XFS has hot-spot prevention algorithms in it's
> internal physical layout for striped devices. It aligns AGs across
> different stripe units so that metadata and data doesn't all get
> aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> rotoring across disks themselves, then we're going to end up back in
> the same position we started with - multiple AGs aligned to the
> same disk.

All of MD's default raid5 and raid6 layouts rotate stripes, too, so that
parity and syndrome are distributed uniformly.

> The result is that many XFS workloads are going to hotspot disks and
> result in unbalanced load when there are an odd number of disks in a
> RAID-10 array.  Actually, it's probably worse than having no
> alignment, because it makes hotspot occurrence and behaviour very
> unpredictable.
> 
> Worse is the fact that there's absolutely nothing we can do to
> optimise allocation alignment or IO behaviour at the filesystem
> level. We'll have to make mkfs.xfs aware of this clusterfuck and
> turn off stripe alignment when we detect such a layout, but that
> doesn't help all the existing user installations out there right
> now.
> 
> IMO, odd-numbered disks in RAID-10 should be considered harmful and
> never used....

Users are perfectly able to layer raid1+0 or raid0+1 if they don't want
the features of raid10.  Given the advantages of MD's raid10, a pedant
could say XFS's lack of support for it should be considered harmful and
XFS never used.  (-:

FWIW, while I'm sometimes a pendant, I'm not in this case.  I use both
MD raid10 and xfs.

Phil

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-10 14:10         ` Phil Turmel
@ 2018-01-10 21:57           ` Wols Lists
  2018-01-11  3:07           ` Dave Chinner
  1 sibling, 0 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-10 21:57 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

On 10/01/18 14:10, Phil Turmel wrote:
> FWIW, while I'm sometimes a pendant, I'm not in this case.  I use both
> MD raid10 and xfs.

So you sometimes like being left hanging ... :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-10  6:17         ` Wols Lists
@ 2018-01-11  2:14           ` Dave Chinner
  2018-01-12  2:16             ` Guoqing Jiang
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2018-01-11  2:14 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-xfs, linux-raid

On Wed, Jan 10, 2018 at 06:17:11AM +0000, Wols Lists wrote:
> On 09/01/18 22:25, Dave Chinner wrote:
> > On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
> >> On 08/01/18 22:01, Dave Chinner wrote:
> >>> Yup, 21 devices in a RAID 10. That's a really nasty config for
> >>> RAID10 which requires an even number of disks to mirror correctly.
> >>> Why does MD even allow this sort of whacky, sub-optimal
> >>> configuration?
> >>
> >> Just to point out - if this is raid-10 (and not raid-1+0 which is a
> >> completely different beast) this is actually a normal linux config. I'm
> >> planning to set up a raid-10 across 3 devices. What happens is that
> >> raid-10 writes X copies across Y devices. If X = Y then it's a
> >> normal mirror config, if X < Y it makes good use of space (and if X > Y
> >> it doesn't make sense :-)
> >>
> >> SDA: 1, 2, 4, 5
> >>
> >> SDB: 1, 3, 4, 6
> >>
> >> SDC: 2, 3, 5, 6
> > 
> > It's nice to know that MD has redefined RAID-10 to be different to
> > the industry standard definition that has been used for 20 years and
> > optimised filesystem layouts for.  Rotoring data across odd numbers
> > of disks like this is going to really, really suck on filesystems
> > that are stripe layout aware..
> 
> Actually, I thought that the industry standard definition referred to
> Raid-1+0. It's just colloquially referred to as raid-10.

https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10

"However, a nonstandard definition of "RAID 10" was created for the
Linux MD driver"

So it's not just me who thinks what MD is doing is non-standard.

> > For example, XFS has hot-spot prevention algorithms in it's
> > internal physical layout for striped devices. It aligns AGs across
> > different stripe units so that metadata and data doesn't all get
> > aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> > rotoring across disks themselves, then we're going to end up back in
> > the same position we started with - multiple AGs aligned to the
> > same disk.
> 
> Are you telling me that xfs is aware of the internal structure of an
> md-raid array?

It's aware of the /alignment/ characteristics of block devices, and
these alignment characteristics are exported by MD. e.g.  These are
exported in /sys/block/<dev>/queue in

	minimum_io_size
		- typically the stripe chunk size
	optimal_io_size
		- typically the stripe width

We get this stuff from DM and MD devices, hardware raid (via scsi
code pages), thinp devices (i.e. to tell us the allocation
granularity so we can align/size IO to match it) and any other block
device that wants to tell us about optimal IO geometry. libblkid
provides us with this information, and it's not just mkfs.xfs that
uses it. e.g. mkfs.ext4 also uses it for the exact same purpose as
XFS....
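
For example, on an MD device you can see exactly what the filesystem
tools will be told (made-up numbers here: a 512k chunk and 10 data
disks):

	# cat /sys/block/md5/queue/minimum_io_size   # chunk size in bytes
	524288
	# cat /sys/block/md5/queue/optimal_io_size   # full stripe width in bytes
	5242880

mkfs.xfs (via libblkid) picks these up automatically and sets
sunit/swidth from them, with nothing needed on the command line.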

> Given that md-raid is an abstraction layer, this seems
> rather dangerous to me - you're breaking the abstraction and this could
> explain the OP's problem. Md-raid changed underneath the filesystem, on
> the assumption that the filesystem wouldn't notice, and the filesystem
> *did*. BANG!

No, we aren't breaking any abstractions. It's always been the case
that the filesystem needs to be correctly aligned to the underlying
storage geometry if performance is desired. Think about old skool
filesystems that were aware of the old C/H/S layout of drives back
in the 80s. Optimising layouts for "cylinder groups" in the hardware
gave major performance improvements and we can trace ext4's block
group concept all the way back to those specific hardware geometry
requirements.

I suspect that the problem here is that relatively few people
understand why alignment to the underlying storage geometry is
necessary, and don't realise the lengths to which the storage stack
goes to ensure that alignment is optimal.  It's mostly hidden and automatic
these days because most users lack the knowledge to be able to set
this sort of stuff up correctly.

> > The result is that many XFS workloads are going to hotspot disks and
> > result in unbalanced load when there are an odd number of disks in a
> > RAID-10 array.  Actually, it's probably worse than having no
> > alignment, because it makes hotspot occurrence and behaviour very
> > unpredictable.
> > 
> > Worse is the fact that there's absolutely nothing we can do to
> > optimise allocation alignment or IO behaviour at the filesystem
> > level. We'll have to make mkfs.xfs aware of this clusterfuck and
> > turn off stripe alignment when we detect such a layout, but that
> > doesn't help all the existing user installations out there right
> > now.
> 
> So you're telling me that mkfs.xfs *IS* aware of the underlying raid
> structure. OOPS! What happens when that structure changes for instance a
> raid-5 is converted to raid-6, or another disk is added?

RAID-5 to RAID-6 doesn't change the stripe alignment. That's still
N data disks per stripe, so the geometry and alignment is unchanged
and has no impact on the layout.

But changing the stripe geometry (i.e. number of data disks)
completely fucks IO alignment and that impacts overall storage
performance.  None of the existing data in the filesystem is aligned
to the underlying storage anymore so overwrites will cause all sorts
of RMW storms, you'll get IO hotspots because what used to be on
separate disks is now all on the same disk, etc. And the filesystem
won't be able to fix this because, unless you increase the number of
data disks by an integer multiple, the alignment cannot be changed
due to fixed locations of metadata in the filesystem.
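
To put rough numbers on that (an illustration with made-up values): with
4 data disks and a 128k chunk the stripe width is 4 x 128k = 512k, so an
AG that starts at a multiple of 512k always starts on the first data disk
of a stripe. Add one disk and the stripe width becomes 5 x 128k = 640k;
512k is not a multiple of 640k, so every previously aligned offset now
lands at some arbitrary, shifting position within a stripe, and only a
complete rewrite of the filesystem could realign it.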

> If you have to
> have special code to deal with md-raid and changes in said raid, where's
> the problem with more code for raid-10?

I didn't say we had code to handle "changes in said raid". That's
explicitly what we /don't have/. To handle a geometry/alignment
change in the underlying storage we have to *resilver the entire
filesystem*. And, well, we can't easily do that because that means
we'd have to completely rewrite and re-index the filesystem. It's
faster, easier and more reliable to dump/mkfs/restore the filesystem
than it is to resilver it.

There's many, many reasons why RAID reshaping is considered harmful
and is not recommended by anyone who understands the whole storage
stack intimately.

> > IMO, odd-numbered disks in RAID-10 should be considered harmful and
> > never used....
> > 
> What about when you have an odd number of mirrors? :-)

Be a smart-ass all you want, but it doesn't change the fact that the
"grow-by-one-disk" clusterfuck occurs when you have an odd number of
mirrors, too.

> Seriously, can't you just make sure that xfs rotates the stripe units
> using a number that is relatively prime to the number of disks?

Who said we don't already rotate through stripe units?

And, well, there are situations where ignoring geometry is good
(e.g. delayed allocation allows us to pack lots of small files
together so they aggregate into full stripe writes and avoid RMW
cycles) and there are situations where stripe width rather than
stripe unit alignment is desirable for a single allocation (e.g.
large sequential direct IO writes so we avoid RMW cycles due to
partial stripe overlaps in IO).

These IO alignment optimisations are all done on-the-fly by
filesystems.  Filesystems do far more than you realise with the
geometry information they are provided with and that's why assuming
that you can transparently change the storage geometry without the
filesystem (and hence users) caring about such changes is
fundamentally wrong.

> (Just so's you know who I am, I've taken over editorship of the raid
> wiki. This is exactly the stuff that belongs on there, so as soon as I
> understand what's going on I'll write it up, and I'm happy to be
> educated :-) But I do like to really grasp what's going on, so expect
> lots of naive questions ... There's not a lot of information on how raid
> and filesystems interact, and I haven't really got to grips with any of
> that at the moment, and I don't use xfs. I use ext4 on gentoo, and the
> default btrfs on SUSE.)

You've got an awful lot of learning to do, then.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-10 14:10         ` Phil Turmel
  2018-01-10 21:57           ` Wols Lists
@ 2018-01-11  3:07           ` Dave Chinner
  2018-01-12 13:32             ` Wols Lists
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2018-01-11  3:07 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, linux-xfs, linux-raid

On Wed, Jan 10, 2018 at 09:10:55AM -0500, Phil Turmel wrote:
> On 01/09/2018 05:25 PM, Dave Chinner wrote:
> 
> > It's nice to know that MD has redefined RAID-10 to be different to
> > the industry standard definition that has been used for 20 years and
> > optimised filesystem layouts for.  Rotoring data across odd numbers
> > of disks like this is going to really, really suck on filesystems
> > that are stripe layout aware..
> 
> You're a bit late to this party, Dave.  MD has implemented raid10 like
> this as far back as I can remember, and it is especially valuable when
> running more than two copies.  Running raid10,n3 across four or five
> devices is a nice capacity boost without giving up triple copies (when
> multiples of three aren't available) or giving up the performance of
> mirrored raid.

XFS comes from a different background - high performance, high
reliability and hardware RAID storage. Think hundreds of drives in a
filesystem, not a handful. i.e. The XFS world is largely enterprise
and HPC storage, not small DIY solutions for a home or back-room
office.  We live in a different world, and MD rarely enters mine.

> > For example, XFS has hot-spot prevention algorithms in it's
> > internal physical layout for striped devices. It aligns AGs across
> > different stripe units so that metadata and data doesn't all get
> > aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> > rotoring across disks themselves, then we're going to end up back in
> > the same position we started with - multiple AGs aligned to the
> > same disk.
> 
> All of MD's default raid5 and raid6 layouts rotate stripes, too, so that
> parity and syndrome are distributed uniformly.

Well, yes, but it appears you haven't thought through what that
typically means.  Take a 4+1, chunk size 128k, stripe width 512k

A	B	C	D	E
0	0	0	0	P
P	1	1	1	1
2	P	2	2	2
3	3	P	3	3
4	4	4	P	4

For every 5 stripe widths, each disk holds one stripe unit of
parity. Hence 80% of data accesses aligned to a specific data offset
hit that disk. i.e. disk A is hit by 0-128k, parity for 512-1024k,
1024-1152k, 1536-1664k and 2048-2176k. IOWs, if we align stuff to
512k, we're going to hit disk A 80% of the time and disk B 20% of
the time.

So, if mkfs.xfs ends up aligning all AGs to a multiple of 512k, then
all our static AG metadata is aligned to disk A. Further, all the
AGs will align their first stripe unit in a stripe width to Disk A,
too.  Hence this results in a major IO hotspot on disk A, and
smaller hotspot on disk B. Disks C, D, and E will have the least IO
load on them.

By telling XFS that the stripe unit is 128k and the stripe width is
512k, we can avoid this problem. mkfs.xfs will rotor its AG
alignment by some number of stripe units at a time. i.e. AG 0 aligns
to disk A, AG 1 aligns to disk B, AG 2 aligns to disk C, and so on.

The result is that base alignment used by the filesystem is now
distributed evenly across all disks in the RAID array and so all
disks get loaded evenly. The hot spots go away because the
filesystem has aligned its layout appropriately for the underlying
storage geometry.  This applies to any RAID geometry that stripes
data across multiple disks in a regular/predictable pattern.
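
In mkfs.xfs terms, the geometry for the 4+1 example above would be
expressed as something like (illustrative only, it's normally picked up
automatically from the device):

	# mkfs.xfs -d su=128k,sw=4 /dev/md5

where su is the chunk (stripe unit) and sw is the number of data disks.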

[ I'd cite an internal SGI paper written in 1999 that measured and
analysed all this on RAID0 in real world workloads and industry
standard benchmarks like AIM7 and SpecSFS and led to the mkfs.xfs
changes I described above, but, well, I haven't had access to that
since I left SGI 10 years ago... ]

> > IMO, odd-numbered disks in RAID-10 should be considered harmful and
> > never used....
> 
> Users are perfectly able to layer raid1+0 or raid0+1 if they don't want
> the features of raid10.  Given the advantages of MD's raid10, a pedant
> could say XFS's lack of support for it should be considered harmful and
> XFS never used.  (-:

MD RAID is fine with XFS as long as you use a sane layout and avoid
doing stupid things that require reshaping and changing the geometry
of the underlying device. Reshaping is where the trouble all
starts...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-11  2:14           ` Dave Chinner
@ 2018-01-12  2:16             ` Guoqing Jiang
  0 siblings, 0 replies; 37+ messages in thread
From: Guoqing Jiang @ 2018-01-12  2:16 UTC (permalink / raw)
  To: Dave Chinner, Wols Lists; +Cc: linux-xfs, linux-raid

Hi Dave,

On 01/11/2018 10:14 AM, Dave Chinner wrote:
>
>> Are you telling me that xfs is aware of the internal structure of an
>> md-raid array?
> It's aware of the /alignment/ characteristics of block devices, and
> these alignment characteristics are exported by MD. e.g.  These are
> exported in /sys/block/<dev>/queue in
>
> 	minimum_io_size
> 		- typically the stripe chunk size
> 	optimal_io_size
> 		- typically the stripe width
>
> We get this stuff from DM and MD devices, hardware raid (via scsi
> code pages), thinp devices (i.e. to tell us the allocation
> granularity so we can align/size IO to match it) and any other block
> device that wants to tell us about optimal IO geometry. libblkid
> provides us with this information, and it's not just mkfs.xfs that
> uses it. e.g. mkfs.ext4 also uses it for the exact same purpose as
> XFS....

I see xfs can detect the geometry with "sunit" and "swidth", ext4 and gfs2
can do similar things as well.

I have one question about xfs on top of raid5. Is it possible for multiple
write operations to happen in the same stripe at the same time? Or, given
those parameters, does xfs aggregate them so that no conflict happens on
the parity of the stripe? Thanks in advance!

Regards,
Guoqing


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-11  3:07           ` Dave Chinner
@ 2018-01-12 13:32             ` Wols Lists
  2018-01-12 14:25               ` Emmanuel Florac
  0 siblings, 1 reply; 37+ messages in thread
From: Wols Lists @ 2018-01-12 13:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-raid

On 11/01/18 03:07, Dave Chinner wrote:
> XFS comes from a different background - high performance, high
> reliability and hardware RAID storage. Think hundreds of drives in a
> filesystem, not a handful. i.e. The XFS world is largely enterprise
> and HPC storage, not small DIY solutions for a home or back-room
> office.  We live in a different world, and MD rarely enters mine.

So what happens when the hardware raid structure changes?

Ext allows you to grow a filesystem. Btrfs allows you to grow a
filesystem. Reiser allows you to grow a file system. Can you add more
disks to XFS and grow the filesystem?

My point is that all this causes geometries to change, and ext and btrfs
amongst others can clearly handle this. Can XFS?

Because if it can, it seems to me the obvious solution to changing raid
geometries is that you need to grow the filesystem, and get that to
adjust its geometries.

Bear in mind, SUSE has now adopted XFS as the default filesystem for
partitions other than /. This means you are going to get a lot of
"hobbyist" systems running XFS on top of MD and LVM. Are you telling me
that XFS is actually very badly suited to be a default filesystem for SUSE?

What concerns me here is, not having a clue how LVM handles changing
partition sizes, what effect this will have on filesystems ... The
problem is the Unix philosophy of "do one thing and do it well".
Sometimes that's just not practical. The Unix philosophy says "leave
partition management to lvm, leave redundancy to md, leave the files to
the filesystem, ..." and then the filesystem comes along and says "hey,
I can't do my job very well, if I don't have a clue about the physical
disk layout". It's a hard circle to square ... :-)

(Anecdotes about btrfs are that it's made a right pig's ear of trying to
do everything itself.)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-12 13:32             ` Wols Lists
@ 2018-01-12 14:25               ` Emmanuel Florac
  2018-01-12 17:52                 ` Wols Lists
  2018-01-14 21:33                 ` Wol's lists
  0 siblings, 2 replies; 37+ messages in thread
From: Emmanuel Florac @ 2018-01-12 14:25 UTC (permalink / raw)
  To: Wols Lists; +Cc: Dave Chinner, linux-xfs, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3983 bytes --]

Le Fri, 12 Jan 2018 13:32:49 +0000
Wols Lists <antlists@youngman.org.uk> écrivait:

> On 11/01/18 03:07, Dave Chinner wrote:
> > XFS comes from a different background - high performance, high
> > reliability and hardware RAID storage. Think hundreds of drives in a
> > filesystem, not a handful. i.e. The XFS world is largely enterprise
> > and HPC storage, not small DIY solutions for a home or back-room
> > office.  We live in a different world, and MD rarely enters mine.  
> 
> So what happens when the hardware raid structure changes?

Hardware RAID controllers don't expose the RAID structure to the software,
so as far as XFS knows, a hardware RAID is just a very large disk.
That's when passing the stripe unit and stripe width options to
mkfs.xfs by hand makes sense.

> Ext allows you to grow a filesystem. Btrfs allows you to grow a
> filesystem. Reiser allows you to grow a file system. Can you add more
> disks to XFS and grow the filesystem?

Of course. xfs_growfs is your friend. It worked on online (mounted)
filesystems many years before that functionality came to other filesystems.
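
For example (with a made-up mount point), after enlarging the underlying
device:

	# xfs_growfs -d /srv/data     # grow the data section to fill the device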

> My point is that all this causes geometries to change, and ext and
> btrfs amongst others can clearly handle this. Can XFS?

Neither XFS, ext4 nor btrfs can handle this. That's why Dave mentioned
the fact that growing your RAID is almost always the wrong solution.
A much better solution is to add a new array and use LVM to aggregate
it with the existing ones.

Basically, growing an array and then the filesystem on it generally works
OK, BUT it may kill performance (or not). YMMV. At the very least, you
*probably won't* get the performance gain that the larger stripe width
would give you if you were starting anew.

> Because if it can, it seems to me the obvious solution to changing
> raid geometries is that you need to grow the filesystem, and get that
> to adjust its geometries.

Unfortunately that's nigh impossible. No filesystem in existence does
that. The closest thing is ZFS's ability to dynamically change stripe
sizes, but when you extend a ZFS zpool it doesn't rebalance existing
files and data (and offers absolutely no way to do it). Sorry, no pony.

> Bear in mind, SUSE has now adopted XFS as the default filesystem for
> partitions other than /. This means you are going to get a lot of
> "hobbyist" systems running XFS on top of MD and LVM. Are you telling
> me that XFS is actually very badly suited to be a default filesystem
> for SUSE?

Doesn't seem so. In fact XFS is less permissive than other filesystems,
and that's a *darn good thing* IMO. It's better to get a frightening
"XFS force shutdown" error message than corrupted data, isn't it?

> What concerns me here is, not having a clue how LVM handles changing
> partition sizes, what effect this will have on filesystems ... The
> problem is the Unix philosophy of "do one thing and do it well".
> Sometimes that's just not practical.

LVM volume changes are propagated to the upper levels.

If you don't like Unix principles, use Windows then :)

> The Unix philosophy says "leave
> partition management to lvm, leave redundancy to md, leave the files
> to the filesystem, ..." and then the filesystem comes along and says
> "hey, I can't do my job very well, if I don't have a clue about the
> physical disk layout". It's a hard circle to square ... :-)

Yeah, that was apparently the very same thinking that brought us ZFS.

> (Anecdotes about btrfs are that it's made a right pigs ear of trying
> to do everything itself.)
> 

Not so sure. Btrfs is excellent, taking into account how little love it
received for many years at Oracle.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

[-- Attachment #2: Signature digitale OpenPGP --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-12 14:25               ` Emmanuel Florac
@ 2018-01-12 17:52                 ` Wols Lists
  2018-01-12 18:37                   ` Emmanuel Florac
  2018-01-13  0:20                   ` Stan Hoeppner
  2018-01-14 21:33                 ` Wol's lists
  1 sibling, 2 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-12 17:52 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid

On 12/01/18 14:25, Emmanuel Florac wrote:
> Le Fri, 12 Jan 2018 13:32:49 +0000
> Wols Lists <antlists@youngman.org.uk> écrivait:
> 
>> On 11/01/18 03:07, Dave Chinner wrote:
>>> XFS comes from a different background - high performance, high
>>> reliability and hardware RAID storage. Think hundreds of drives in a
>>> filesystem, not a handful. i.e. The XFS world is largely enterprise
>>> and HPC storage, not small DIY solutions for a home or back-room
>>> office.  We live in a different world, and MD rarely enters mine.  
>>
>> So what happens when the hardware raid structure changes?
> 
> Hardware RAID controllers don't expose the RAID structure to the software,
> so as far as XFS knows, a hardware RAID is just a very large disk.
> That's when passing the stripe unit and stripe width options to
> mkfs.xfs by hand makes sense.

Umm... So you can't partially populate a chassis and add more disks as
you need them? So you have to manually pass stripe unit and width at
creation time and then they are set in stone? Sorry, that doesn't sound
that enterprisey to me :-(
> 
>> Ext allows you to grow a filesystem. Btrfs allows you to grow a
>> filesystem. Reiser allows you to grow a file system. Can you add more
>> disks to XFS and grow the filesystem?
> 
> Of course. xfs_growfs is your friend. Worked with online filesystems
> many years before that functionality came to other filesystems.
> 
>> My point is that all this causes geometries to change, and ext and
>> btrfs amongst others can clearly handle this. Can XFS?
> 
> Neither XFS, ext4 or btrfs can handle this. That's why Dave mentioned
> the fact that growing your RAID is almost always the wrong solution.
> A much better solution is to add a new array and use LVM to aggregate
> it with the existing ones.

Isn't this what btrfs does with a rebalance? And I may well be wrong,
but I got the impression that some file systems could change stripe
geometries dynamically.

Adding a new array imho breaks the KISS principle. So I now have
multiple arrays sitting on the hard drives (wasting parity disks if I
have raid5/6), multiple instances of LVM on top of that, and then the
filesystem sitting on top of multiple volumes.

As a hobbyist I want one array, with one LVM on top of that, and one
filesystem per volume. Anything else starts to get confusing. And if I'm
a professional sys-admin I would want that in spades! It's all very well
expecting a sys-admin to cope, but the fewer boobytraps and landmines
left lying around, the better!

Squaring the circle, again :-(
> 
> Basically growing an array then the filesystem on it generally works
> OK, BUT it may kill performance (or not). YMMV. At least, you *probably
> won't* get the performance gain that the difference of stripe width
> would permit when starting anew.
> 
Point taken - but how are you going to back up your huge petabyte XFS
filesystem to get the performance on your bigger array? Catch 22 ...

>> Because if it can, it seems to me the obvious solution to changing
>> raid geometries is that you need to grow the filesystem, and get that
>> to adjust its geometries.
> 
> Unfortunately that's nigh impossible. No filesystem in existence does
> that. The closest thing is ZFS ability to dynamically change stripe
> sizes, but when you extend a ZFS zpool it doesn't rebalance existing
> files and data (and offers absolutely no way to do it). Sorry, no pony.
> 
Well, how does raid get away with it, rebalancing and restriping
everything :-)

Yes I know, it's a major change if the original file system design
didn't allow for it, and major file system changes can be extremely
destructive to user data ...

>> Bear in mind, SUSE has now adopted XFS as the default filesystem for
>> partitions other than /. This means you are going to get a lot of
>> "hobbyist" systems running XFS on top of MD and LVM. Are you telling
>> me that XFS is actually very badly suited to be a default filesystem
>> for SUSE?
> 
> Doesn't seem so. In fact XFS is less permissive than other filesystems,
> and it's a *darn good thing* IMO. It's better having frightening error
> messages "XFS force shutdown" than corrupted data, isn't it?

False dichotomy, I'm afraid. Do you really want a filesystem that
guarantees integrity, but trashes performance when you want to take
advantage of features such as resizing? I'd rather have integrity,
performance *and* features :-) (Pick any two, I know :-)
> 
>> What concerns me here is, not having a clue how LVM handles changing
>> partition sizes, what effect this will have on filesystems ... The
>> problem is the Unix philosophy of "do one thing and do it well".
>> Sometimes that's just not practical.
> 
> LVM volumes changes are propagated to upper levels. 

And what does the filesystem do with them? If LVM is sat on MD, what then?
> 
> If you don't like Unix principles, use Windows then :)
> 
The phrase "a rock and a hard place" comes to mind. Neither were
designed with commercial solidity and integrity and reliability in mind.
And having used commercial systems I get the impression NIH is alive and
kicking far too much. Both Linux and Windows are much more reliable and
solid than they were, but too many of those features are bolt-ons, and
they feel like it ...

>> The Unix philosophy says "leave
>> partition management to lvm, leave redundancy to md, leave the files
>> to the filesystem, ..." and then the filesystem comes along and says
>> "hey, I can't do my job very well, if I don't have a clue about the
>> physical disk layout". It's a hard circle to square ... :-)
> 
> Yeah, that was apparently the very same thinking that brought us ZFS.
> 
>> (Anecdotes about btrfs are that it's made a right pigs ear of trying
>> to do everything itself.)
>>
> 
> Not so sure. Btrfs is excellent, taking into account how little love it
> received for many years at Oracle.
> 
Yep. The solid features are just that - solid. Snag is, a lot of the
nice features are still experimental, and dangerous! Parity raid, for
example ... and I've heard rumours that the flaws could be unfixable, at
least until btrfs-2, whenever that gets started ...

When MD adds disks, it rewrites the array from top to bottom or the
other way round, moving everything over to the new layout. Is there no
way a file system can do the same sort of thing? Okay, it would probably
need to be a defrag-like utility and linux prides itself on not needing
defrag :-)

Or could it simply switch over to optimising for the new geometry,
accept the fact that the reshape will have caused hotspots, and every
time it rewrites (meta)data, it adjusts it to the new geometry to
reduce/remove hotspots over time?

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-12 17:52                 ` Wols Lists
@ 2018-01-12 18:37                   ` Emmanuel Florac
  2018-01-12 19:35                     ` Wol's lists
  2018-01-13  0:20                   ` Stan Hoeppner
  1 sibling, 1 reply; 37+ messages in thread
From: Emmanuel Florac @ 2018-01-12 18:37 UTC (permalink / raw)
  To: Wols Lists; +Cc: Dave Chinner, linux-xfs, linux-raid

[-- Attachment #1: Type: text/plain, Size: 7052 bytes --]

Le Fri, 12 Jan 2018 17:52:59 +0000
Wols Lists <antlists@youngman.org.uk> écrivait:

> >>
> >> So what happens when the hardware raid structure changes?  
> > 
> > hardware RAID controllers don't expose RAID structure to the
> > software. So As far as XFS knows, a hardware RAID is just a very
> > large disk. That's when using stripe unit and stripe width options
> > make sense in mkfs_xfs.  
> 
> Umm... So you can't partially populate a chassis and add more disks as
> you need them? So you have to manually pass stripe unit and width at
> creation time and then they are set in stone? Sorry that doesn't sound
> that enterprisey to me :-(

You *can*, but it's generally frowned upon. Adding disks in large
batches of 6, 8 or 10 and creating new arrays is always better. Adding
one or two disks at a time is useful, but a cheap hack at best.

> > 
> > Neither XFS, ext4 or btrfs can handle this. That's why Dave
> > mentioned the fact that growing your RAID is almost always the
> > wrong solution. A much better solution is to add a new array and
> > use LVM to aggregate it with the existing ones.  
> 
> Isn't this what btrfs does with a rebalance? And I may well be wrong,
> but I got the impression that some file systems could change stripe
> geometries dynamically.

If btrfs does rebalancing, that's fine then. I suppose running xfs_fsr
on XFS could also rebalance data. It would be nice to have an option to
force rewriting of all files; that would solve this particular problem.
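
Something along the lines of (mount point made up):

	# xfs_fsr -v /srv/data        # reorganise/rewrite fragmented files

though xfs_fsr only touches files it considers worth defragmenting, so
it wouldn't be a complete rebalance on its own.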
 
> Adding a new array imho breaks the KISS principle. So I now have
> multiple arrays sitting on the hard drives (wasting parity disks if I
> have raid5/6), multiple instances of LVM on top of that, and then the
> filesystem sitting on top of multiple volumes.

No, you need only declare additional arrays as new physical volumes,
add them to your existing volume group, then extend your existing LVs
as needed. That's standard storage management fare.
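
As a sketch, with made-up names (new array /dev/md6, volume group vg0,
logical volume "data" mounted on /srv/data):

	# pvcreate /dev/md6
	# vgextend vg0 /dev/md6
	# lvextend -l +100%FREE /dev/vg0/data
	# xfs_growfs -d /srv/data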

You're not supposed to have arrays with tens of drives anyway (unless
you really don't care about your data).

> As a hobbyist I want one array, with one LVM on top of that, and one
> filesystem per volume.

As a hobbyist you don't really have to care about performance. A single
modern hard drive can easily feed a gigabit ethernet connection, anyway.
The systems I set up these days commonly require disk throughput of 3
to 10 GB/s to feed 40GigE lines. Different problems. 

> Anything else starts to get confusing. And if
> I'm a professional sys-admin I would want that in spades! It's all
> very well expecting a sys-admin to cope, but the fewer boobytraps and
> landmines left lying around, the better!
> 
> Squaring the circle, again :-(

Not really, modern tools like lsblk and friends make it really easy to
sort out.
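
For instance, lsblk can show the whole stack along with the IO hints each
layer reports (MIN-IO/OPT-IO are the sysfs values mentioned earlier in
the thread):

	# lsblk -o NAME,TYPE,SIZE,MIN-IO,OPT-IO,MOUNTPOINT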

> > 
> > Basically growing an array then the filesystem on it generally works
> > OK, BUT it may kill performance (or not). YMMV. At least, you
> > *probably won't* get the performance gain that the difference of
> > stripe width would permit when starting anew.
> >   
> Point taken - but how are you going to backup your huge petabyte XFS
> filesystem to get the performance on your bigger array? Catch 22 ...

Through big networks, or with big tape drives (LTO-8 is ~1GB/s
compressed).

> 
> >> Because if it can, it seems to me the obvious solution to changing
> >> raid geometries is that you need to grow the filesystem, and get
> >> that to adjust its geometries.  
> > 
> > Unfortunately that's nigh impossible. No filesystem in existence
> > does that. The closest thing is ZFS ability to dynamically change
> > stripe sizes, but when you extend a ZFS zpool it doesn't rebalance
> > existing files and data (and offers absolutely no way to do it).
> > Sorry, no pony. 
> Well, how does raid get away with it, rebalancing and restriping
> everything :-)
> 

Then that must be because Dave is lazy :)

> > Doesn't seem so. In fact XFS is less permissive than other
> > filesystems, and it's a *darn good thing* IMO. It's better having
> > frightening error messages "XFS force shutdown" than corrupted
> > data, isn't it?  
> 
> False dichotomy, I'm afraid. Do you really want a filesystem that
> guarantees integrity, but trashes performance when you want to take
> advantage of features such as resizing? I'd rather have integrity,
> performance *and* features :-) (Pick any two, I know :-)

XFS is clearly optimized for performance, and is currently gaining
interesting new features (thin copy, then probably snapshots, etc). If
what you're looking for is features first, well, there are other
filesystems :)

> > LVM volumes changes are propagated to upper levels.   
> 
> And what does the filesystem do with them? If LVM is sat on MD, what
> then?

MD propagates to LVM, which propagates to the FS, actually. Everybody
works together nowadays (they didn't use to).

> > 
> > If you don't like Unix principles, use Windows then :)
> >   
> The phrase "a rock and a hard place" comes to mind. Neither were
> designed with commercial solidity and integrity and reliability in
> mind. And having used commercial systems I get the impression NIH is
> alive and kicking far too much. Both Linux and Windows are much more
> reliable and solid than they were, but too many of those features are
> bolt-ons, and they feel like it ...

Linux gives you choice. You want to resize volumes at will? Use ZFS.
You want to juice all the performance out of your disks? Use XFS. You
don't care either way? Use ext4. And so on.

> > Not so sure. Btrfs is excellent, taking into account how little
> > love it received for many years at Oracle.
> >   
> Yep. The solid features are just that - solid. Snag is, a lot of the
> nice features are still experimental, and dangerous! Parity raid, for
> example ... and I've heard rumours that the flaws could be unfixable,
> at least not until btrfs-2 whenever that gets started ...

Well I don't know much about btrfs so I can't comment.
 
> When MD adds disks, it rewrites the array from top to bottom or the
> other way round, moving everything over to the new layout. Is there no
> way a file system can do the same sort of thing? Okay, it would
> probably need to be a defrag-like utility and linux prides itself on
> not needing defrag :-)
> 
> Or could it simply switch over to optimising for the new geometry,
> accept the fact that the reshape will have caused hotspots, and every
> time it rewrites (meta)data, it adjusts it to the new geometry to
> reduce/remove hotspots over time?
> 

I suppose it's doable, but it's not a sufficiently prominent use case to
be worth much bother.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

[-- Attachment #2: Signature digitale OpenPGP --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-12 18:37                   ` Emmanuel Florac
@ 2018-01-12 19:35                     ` Wol's lists
  2018-01-13 12:30                       ` Brad Campbell
  0 siblings, 1 reply; 37+ messages in thread
From: Wol's lists @ 2018-01-12 19:35 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid

On 12/01/18 18:37, Emmanuel Florac wrote:
>> Or could it simply switch over to optimising for the new geometry,
>> accept the fact that the reshape will have caused hotspots, and every
>> time it rewrites (meta)data, it adjusts it to the new geometry to
>> reduce/remove hotspots over time?
>>
> I suppose it's doable but not sufficiently a prominent use case to
> bother much.

Stick it on the "nice to have when somebody gets round to it" list :-) 
But at least it should get put on the list ...

I'll get round to writing all this up soon, so the wiki will try and 
persuade people that resizing arrays is not actually the brightest of 
ideas.

The trouble is a lack of people talking to each other, and the assumption 
that they can rely on "do one thing and do it well". Except of course 
you can't square a circle ... :-)

But this really is stuff that needs to be on the wiki, and it's stuff 
the MD and filesystem people don't talk about with each other, I expect :-(

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-12 17:52                 ` Wols Lists
  2018-01-12 18:37                   ` Emmanuel Florac
@ 2018-01-13  0:20                   ` Stan Hoeppner
  2018-01-13 19:29                     ` Wol's lists
  1 sibling, 1 reply; 37+ messages in thread
From: Stan Hoeppner @ 2018-01-13  0:20 UTC (permalink / raw)
  To: Wols Lists, Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid

On 01/12/2018 11:52 AM, Wols Lists wrote:
> On 12/01/18 14:25, Emmanuel Florac wrote:
>> Le Fri, 12 Jan 2018 13:32:49 +0000
>> Wols Lists <antlists@youngman.org.uk> écrivait:
>>
>>> On 11/01/18 03:07, Dave Chinner wrote:
>>>> XFS comes from a different background - high performance, high
>>>> reliability and hardware RAID storage. Think hundreds of drives in a
>>>> filesystem, not a handful. i.e. The XFS world is largely enterprise
>>>> and HPC storage, not small DIY solutions for a home or back-room
>>>> office.  We live in a different world, and MD rarely enters mine.
>>> So what happens when the hardware raid structure changes?
>> hardware RAID controllers don't expose RAID structure to the software.
>> So As far as XFS knows, a hardware RAID is just a very large disk.
>> That's when using stripe unit and stripe width options make sense in
>> mkfs_xfs.
> Umm... So you can't partially populate a chassis and add more disks as
> you need them? So you have to manually pass stripe unit and width at
> creation time and then they are set in stone? Sorry that doesn't sound
> that enterprisey to me :-(

It's not set in stone.  If the RAID geometry changes, one can specify the 
new geometry at mount time, say in fstab.  New writes to the filesystem will 
obey the newly specified geometry.
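
As an illustration (made-up numbers: 128k chunk, 5 data disks; sunit and
swidth are given in 512-byte sectors), the fstab entry might look like:

	/dev/md5  /srv/data  xfs  sunit=256,swidth=1280  0 0
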
>>> Ext allows you to grow a filesystem. Btrfs allows you to grow a
>>> filesystem. Reiser allows you to grow a file system. Can you add more
>>> disks to XFS and grow the filesystem?
>> Of course. xfs_growfs is your friend. Worked with online filesystems
>> many years before that functionality came to other filesystems.
>>
>>> My point is that all this causes geometries to change, and ext and
>>> btrfs amongst others can clearly handle this. Can XFS?
>> Neither XFS, ext4 or btrfs can handle this. That's why Dave mentioned
>> the fact that growing your RAID is almost always the wrong solution.
>> A much better solution is to add a new array and use LVM to aggregate
>> it with the existing ones.
> Isn't this what btrfs does with a rebalance? And I may well be wrong,
> but I got the impression that some file systems could change stripe
> geometries dynamically.
>
> Adding a new array imho breaks the KISS principle. So I now have
> multiple arrays sitting on the hard drives (wasting parity disks if I
> have raid5/6), multiple instances of LVM on top of that, and then the
> filesystem sitting on top of multiple volumes.
>
> As a hobbyist I want one array, with one LVM on top of that, and one
> filesystem per volume. Anything else starts to get confusing. And if I'm
> a professional sys-admin I would want that in spades! It's all very well
> expecting a sys-admin to cope, but the fewer boobytraps and landmines
> left lying around, the better!
>
> Squaring the circle, again :-(
>> Basically growing an array then the filesystem on it generally works
>> OK, BUT it may kill performance (or not). YMMV. At least, you *probably
>> won't* get the performance gain that the difference of stripe width
>> would permit when starting anew.
>>
> Point taken - but how are you going to backup your huge petabyte XFS
> filesystem to get the performance on your bigger array? Catch 22 ...
>
>>> Because if it can, it seems to me the obvious solution to changing
>>> raid geometries is that you need to grow the filesystem, and get that
>>> to adjust its geometries.
>> Unfortunately that's nigh impossible. No filesystem in existence does
>> that. The closest thing is ZFS ability to dynamically change stripe
>> sizes, but when you extend a ZFS zpool it doesn't rebalance existing
>> files and data (and offers absolutely no way to do it). Sorry, no pony.
>>
> Well, how does raid get away with it, rebalancing and restriping
> everything :-)
>
> Yes I know, it's a major change if the original file system design
> didn't allow for it, and major file system changes can be extremely
> destructive to user data ...
>
>>> Bear in mind, SUSE has now adopted XFS as the default filesystem for
>>> partitions other than /. This means you are going to get a lot of
>>> "hobbyist" systems running XFS on top of MD and LVM. Are you telling
>>> me that XFS is actually very badly suited to be a default filesystem
>>> for SUSE?
>> Doesn't seem so. In fact XFS is less permissive than other filesystems,
>> and it's a *darn good thing* IMO. It's better having frightening error
>> messages "XFS force shutdown" than corrupted data, isn't it?
> False dichotomy, I'm afraid. Do you really want a filesystem that
> guarantees integrity, but trashes performance when you want to take
> advantage of features such as resizing? I'd rather have integrity,
> performance *and* features :-) (Pick any two, I know :-)
>>> What concerns me here is, not having a clue how LVM handles changing
>>> partition sizes, what effect this will have on filesystems ... The
>>> problem is the Unix philosophy of "do one thing and do it well".
>>> Sometimes that's just not practical.
>> LVM volumes changes are propagated to upper levels.
> And what does the filesystem do with them? If LVM is sat on MD, what then?
>> If you don't like Unix principles, use Windows then :)
>>
> The phrase "a rock and a hard place" comes to mind. Neither were
> designed with commercial solidity and integrity and reliability in mind.
> And having used commercial systems I get the impression NIH is alive and
> kicking far too much. Both Linux and Windows are much more reliable and
> solid than they were, but too many of those features are bolt-ons, and
> they feel like it ...
>
>>> The Unix philosophy says "leave
>>> partition management to lvm, leave redundancy to md, leave the files
>>> to the filesystem, ..." and then the filesystem comes along and says
>>> "hey, I can't do my job very well, if I don't have a clue about the
>>> physical disk layout". It's a hard circle to square ... :-)
>> Yeah, that was apparently the very same thinking that brought us ZFS.
>>
>>> (Anecdotes about btrfs are that it's made a right pigs ear of trying
>>> to do everything itself.)
>>>
>> Not so sure. Btrfs is excellent, taking into account how little love it
>> received for many years at Oracle.
>>
> Yep. The solid features are just that - solid. Snag is, a lot of the
> nice features are still experimental, and dangerous! Parity raid, for
> example ... and I've heard rumours that the flaws could be unfixable, at
> least not until btrfs-2 whenever that gets started ...
>
> When MD adds disks, it rewrites the array from top to bottom or the
> other way round, moving everything over to the new layout. Is there no
> way a file system can do the same sort of thing? Okay, it would probably
> need to be a defrag-like utility and linux prides itself on not needing
> defrag :-)
>
> Or could it simply switch over to optimising for the new geometry,
> accept the fact that the reshape will have caused hotspots, and every
> time it rewrites (meta)data, it adjusts it to the new geometry to
> reduce/remove hotspots over time?
>
> Cheers,
> Wol


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-12 19:35                     ` Wol's lists
@ 2018-01-13 12:30                       ` Brad Campbell
  2018-01-13 13:18                         ` Wols Lists
  0 siblings, 1 reply; 37+ messages in thread
From: Brad Campbell @ 2018-01-13 12:30 UTC (permalink / raw)
  To: Wol's lists, Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid

On 13/01/18 03:35, Wol's lists wrote:

> I'll get round to writing all this up soon, so the wiki will try and 
> persuade people that resizing arrays is not actually the brightest of 
> ideas.

Now hang on. Don't go tarring every use case with the same brush.

There are many use cases for a bucket of disks and high performance is 
but one of them.

Leaving aside XFS, let's look at EXT3/4 as they seem to be generally the 
most common filesystems in use for your average "install it and run it" 
user (i.e. *ME*).

If you read the mke2fs man page and check out stripe and stride (which 
you *used* to have to specify manually), both of them imply they are 
important for letting the filesystem know the construction of your RAID 
for *performance* reasons.

Nowhere does *anything* make any mention of changing geometry, and if 
you gave a 10 second thought to those parameters and their explanations 
you'd have to think "This filesystem was optimised for the RAID geometry 
it was built with. If I change that, then I won't have the same 
performance I did have at the time of creation". Or maybe that was only 
obvious to me.
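
For the record, that tuning looks something like this (made-up numbers: 
4k blocks, 128k chunk, 4 data disks; option names from memory of 
mke2fs(8), so double-check the exact spelling):

	# mke2fs -t ext4 -E stride=32,stripe_width=128 /dev/md5

where stride = chunk / block size and stripe_width = stride x data disks.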

Anyway, I happily grew several large arrays over the years *knowing* 
that there would be a performance impact, because for my use case I 
didn't actually care.

"Enterprise" don't grow arrays. They build a storage solution that is 
often extremely finely tuned for exactly their workload and they use it. 
If they need more storage they either replicate or build another (with 
the consequential months of tests/tuning) storage configuration. I see 
Stan Hoeppner replied. If you want a good read, get him going on 
workload specific XFS tuning.

It's only hacks like me that tack disks onto built arrays, but I did it 
*knowing* it wasn't going to affect my workload as all I wanted was a 
huge bucket of storage with quick reads. Writes don't happen often 
enough to matter.

Exposing the geometry to the filesystem is there to give the filesystem 
a chance of performing operations in a manner least likely to create a 
performance hotspot (as pointed out by Dave Chinner). They are hints. 
Change the geometry after the fact and all bets are off.

On another note, personally I've used XFS in a couple of performance 
sensitive roles over the years (when it *really* mattered), but as I 
don't often wade into that end of the pool I tend to stick with the ext 
series.

e2fsck has gotten me out of some really tight spots and I can rely on it 
making the best of a really bad mess. With XFS I've never had the 
pleasure of running it on anything other than top of the line hardware, 
so it never had to clean up after me. It does go like a stung cat though 
when it's tuned up.

If I were to suggest an addition to the RAID wiki, it'd be to elaborate 
on the *creation* time tuning a filesystem create tool does with the 
RAID geometry, and to point out that once you grow the RAID, all 
performance bets are off. I've never met a filesystem that would break, 
however.

I've grown RAID 1, 5 & 6. Growing RAID10 with anything other than a near 
configuration and adding another set of disks just feels like a disaster 
waiting to happen. Even I'm not that game.

I do have a staging machine now with a few spare disks, so I might have 
a crack at it, but I won't be using a kernel and userspace as old as the 
thread initiator's.

Regards,
Brad

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-13 12:30                       ` Brad Campbell
@ 2018-01-13 13:18                         ` Wols Lists
  0 siblings, 0 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-13 13:18 UTC (permalink / raw)
  To: Brad Campbell, Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid

On 13/01/18 12:30, Brad Campbell wrote:
> 
> If I were to suggest an addition to the RAID wiki, it'd be to elaborate
> on the *creation* time tuning a filesystem create tool does with the
> RAID geometry, and to point out that once you grow the RAID, all
> performance bets are off. I've never met a filesystem that would break
> however.

You know me ...

If you read the wiki you'll notice I tend to be very much "pros and
cons. Choose what works for you". This write-up will be very much in the
same vein ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-13  0:20                   ` Stan Hoeppner
@ 2018-01-13 19:29                     ` Wol's lists
  2018-01-13 22:40                       ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Wol's lists @ 2018-01-13 19:29 UTC (permalink / raw)
  To: Stan Hoeppner, Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid

On 13/01/18 00:20, Stan Hoeppner wrote:
> It's not set in stone.  If the RAID geometry changes one can specify the 
> new geometry at mount say in fstab.  New writes to the filesystem will 
> obey the new specified geometry.

Does this then update the defaults, or do you need to specify the new 
geometry on every mount? Inquiring minds need to know :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-13 19:29                     ` Wol's lists
@ 2018-01-13 22:40                       ` Dave Chinner
  2018-01-13 23:04                         ` Wols Lists
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2018-01-13 22:40 UTC (permalink / raw)
  To: Wol's lists; +Cc: Stan Hoeppner, Emmanuel Florac, linux-xfs, linux-raid

On Sat, Jan 13, 2018 at 07:29:19PM +0000, Wol's lists wrote:
> On 13/01/18 00:20, Stan Hoeppner wrote:
> >It's not set in stone.  If the RAID geometry changes one can
> >specify the new geometry at mount say in fstab.  New writes to the
> >filesystem will obey the new specified geometry.

FWIW, I've been assuming in everything I've said that an admin
would use these mount options to ensure new data writes were
properly aligned after a reshape.

> Does this then update the defaults, or do you need to specify the
> new geometry every mount? Inquiring minds need to know :-)

If you're going to document it, then you should observe its
behaviour yourself, right? You don't even need an MD/RAID device to
test it - just set su/sw manually on the mkfs command line, then
see what happens when you try to change them on subsequent mounts.
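
i.e. roughly (made-up geometry; any scratch device or loopback file will
do):

	# mkfs.xfs -d su=128k,sw=4 /dev/sdX
	# mount /dev/sdX /mnt && xfs_info /mnt      # note sunit/swidth
	# umount /mnt
	# mount -o sunit=256,swidth=1280 /dev/sdX /mnt
	# xfs_info /mnt                             # did the new values stick?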

Anyway, start by reading Documentation/filesystems/xfs.txt or 'man 5
xfs' where the mount options are documented. That answers most FAQs
on this subject.

	"Typically the only time these mount options are necessary
	if after an underlying RAID device has  had  it's  geometry
	modified, such as adding a new disk to a RAID5 lun and
	reshaping it."

It should be pretty obvious from this that we know that people
reshape arrays and that we've had the means to support it all
along. Despite this, we still don't recommend people administer
their RAID-based XFS storage in this manner....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-13 22:40                       ` Dave Chinner
@ 2018-01-13 23:04                         ` Wols Lists
  0 siblings, 0 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-13 23:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Stan Hoeppner, Emmanuel Florac, linux-xfs, linux-raid

On 13/01/18 22:40, Dave Chinner wrote:
> On Sat, Jan 13, 2018 at 07:29:19PM +0000, Wol's lists wrote:
>> On 13/01/18 00:20, Stan Hoeppner wrote:
>>> It's not set in stone.  If the RAID geometry changes one can
>>> specify the new geometry at mount say in fstab.  New writes to the
>>> filesystem will obey the new specified geometry.
> 
> FWIW, I've been assuming in everything I've said that an admin
> would use these mount options to ensure new data writes were
> properly aligned after a reshape.
> 
>> Does this then update the defaults, or do you need to specify the
>> new geometry every mount? Inquiring minds need to know :-)
> 
> If you're going to document it, then you should observe it's
> behaviour yourself, right? You don't even need a MD/RAID device to
> test it - just set su/sw manually on the mkfs command line, then
> see what happens when you try to change them on subsequent mounts.

I suppose I could set up a VM ...
> 
> Anyway, start by reading Documentation/filesystems/xfs.txt or 'man 5
> xfs' where the mount options are documented. That's answer most FAQs
> on this subject.
> 
> 	"Typically the only time these mount options are necessary
> 	if after an underlying RAID device has  had  it's  geometry
> 	modified, such as adding a new disk to a RAID5 lun and
> 	reshaping it."

anthony@ashdown /usr/src $ man 5 xfs
No entry for xfs in section 5 of the manual
anthony@ashdown /usr/src $

> 
> It should be pretty obvious from this that we know that people
> reshape arrays and that we've have had the means to support it all
> along. Despite this, we still don't recommend people administer
> their RAID-based XFS storage in this manner....
> 
Note I described myself as *editor* of the raid wiki. Yes I'd love to
play around with all this stuff, but I don't have the hardware, and my
nice new system I was planning to do all this sort of stuff on won't
POST. I've had that problem before, it's finding time to debug a new
system in the face of family demands... and at present I don't have an
xfs partition anywhere.

Reading xfs.txt doesn't seem to answer the question, though. It sounds
like it doesn't update the underlying defaults so it's required every
mount (which is a safe assumption to make), but it could easily be read
the other way, too.

Thanks. I'll document it to the level I understand, make a mental note
to go back and improve it (I try and do that all the time :-), and then
when my new system is up and running, I'll be playing with that to see
how things behave.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-12 14:25               ` Emmanuel Florac
  2018-01-12 17:52                 ` Wols Lists
@ 2018-01-14 21:33                 ` Wol's lists
  2018-01-15 17:08                   ` Emmanuel Florac
  1 sibling, 1 reply; 37+ messages in thread
From: Wol's lists @ 2018-01-14 21:33 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Dave Chinner, linux-xfs, linux-raid

On 12/01/18 14:25, Emmanuel Florac wrote:
>> My point is that all this causes geometries to change, and ext and
>> btrfs amongst others can clearly handle this. Can XFS?

> Neither XFS, ext4 or btrfs can handle this. That's why Dave mentioned
> the fact that growing your RAID is almost always the wrong solution.
> A much better solution is to add a new array and use LVM to aggregate
> it with the existing ones.

Does the new array need the same geometry as the old one?

What happens if my original array is a 4-disk raid-5, and then I add a 
3-disk raid-5? Can XFS cope with the different optimisations required 
for the different layouts on the different arrays?
> 
> Basically growing an array then the filesystem on it generally works
> OK, BUT it may kill performance (or not). YMMV. At least, you *probably
> won't* get the performance gain that the difference of stripe width
> would permit when starting anew.
> 
Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-14 21:33                 ` Wol's lists
@ 2018-01-15 17:08                   ` Emmanuel Florac
  0 siblings, 0 replies; 37+ messages in thread
From: Emmanuel Florac @ 2018-01-15 17:08 UTC (permalink / raw)
  To: Wol's lists; +Cc: Dave Chinner, linux-xfs, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1422 bytes --]

Le Sun, 14 Jan 2018 21:33:17 +0000
"Wol's lists" <antlists@youngman.org.uk> écrivait:

> On 12/01/18 14:25, Emmanuel Florac wrote:
> >> My point is that all this causes geometries to change, and ext and
> >> btrfs amongst others can clearly handle this. Can XFS?  
> 
> > Neither XFS, ext4 or btrfs can handle this. That's why Dave
> > mentioned the fact that growing your RAID is almost always the
> > wrong solution. A much better solution is to add a new array and
> > use LVM to aggregate it with the existing ones.  
> 
> Does the new array need the same geometry as the old one?

That's the best way to preserve performance, yes.
 
> What happens if my original array is a 4-disk raid-5, and then I add
> a 3-disk raid-5? Can XFS cope with the different optimisations
> required for the different layouts on the different arrays?

No, because your array will remain optimised for the initial layout.
However, if you add a new array with the same stripe characteristics you
should at least NOT lose performance. See also the mount options Stan
mentioned earlier :)
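
A rough sketch of that approach, assuming the filesystem already sits on
an LVM logical volume (all device, VG and LV names here are made up):

    # new array with the same chunk size and layout as the old one
    mdadm --create /dev/md6 --level=10 --chunk=512 --raid-devices=20 <new disks>
    pvcreate /dev/md6
    vgextend vg_backup /dev/md6
    lvextend -l +100%FREE /dev/vg_backup/lv_backup
    xfs_growfs /srv/backup       # grow the mounted fs into the new space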

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

[-- Attachment #2: Signature digitale OpenPGP --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-08 15:16   ` Wols Lists
  2018-01-08 15:34     ` Reindl Harald
  2018-01-08 16:24     ` Wolfgang Denk
@ 2018-01-10  1:57     ` Guoqing Jiang
  2 siblings, 0 replies; 37+ messages in thread
From: Guoqing Jiang @ 2018-01-10  1:57 UTC (permalink / raw)
  To: Wols Lists, mdraid.pkoch, linux-raid



On 01/08/2018 11:16 PM, Wols Lists wrote:
> On 08/01/18 07:31, Guoqing Jiang wrote:
>>
>> On 01/06/2018 11:44 PM, mdraid.pkoch@dfgh.net wrote:
>>> Dear MD-experts:
>>>
>>> I was under the impression that growing a RAID10 device could be done
>>> with an active filesystem running on the device.
>> It depends on whether the specific filesystem provides related tool or
>> not, eg,
>> resize2fs can serve ext fs:
> Sorry Guoqing, but I think you've *completely* missed the point :-(

Yes, I just want to point out that it is safer to umount the fs first, then 
do the reshape. There is another deadlock issue, mentioned by BingJing, which 
happened during the reshape stage while some I/O came in from the VFS layer, 
though it is not the same issue.

>> https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem
>>
>> And you can use xfs_growfs for your purpose.
> You extend the filesystem *after* you've grown the array. The act of
> growing the array has caused the filesystem to crash. That should NOT
> happen - the act of growing the array should be *invisible* to the
> filesystem.
>
> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs

IMHO, there could be a potential issue inside raid10/5, so I choose 2. 
Anyway, I will try to simulate the scenario and see what happens ...

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Growing RAID10 with active XFS filesystem
@ 2018-01-08 19:06 mdraid.pkoch
  0 siblings, 0 replies; 37+ messages in thread
From: mdraid.pkoch @ 2018-01-08 19:06 UTC (permalink / raw)
  To: linux-raid

Dear Linux-Raid and Linux-XFS experts:

I'm posting this on both the linux-raid and linux-xfs
mailing lists as it's not clear at this point whether
this is an MD- or XFS-problem.

I have described my problem in a recent posting on
linux-raid and Wol's conclusion was:

 > In other words, one or more of the following three are true :-
 > 1) The OP has been caught by some random act of God
 > 2) There's a serious flaw in "mdadm --grow"
 > 3) There's a serious flaw in xfs
 >
 > Cheers,
 > Wol

There's very important data on our RAID10 device but I doubt
it's important enough for God to take a hand into our storage.

But let me first summarize what happened and why I believe that
this is an XFS-problem:

Machine running Linux 3.14.69 with no kernel-patches.

XFS filesystem was created with XFS userutils 3.1.11.
I did a fresh compile of xfsprogs-4.9.0 yesterday when
I realized that the 3.1.11 xfs_repair did not help.

mdadm is V3.3

/dev/md5 is a RAID10-device that was created in Feb 2013
with 10 2TB disks and an ext3 filesystem on it. Once in a
while I added two more 2TB disks. Reshaping was done
while the ext3 filesystem was mounted. Then the ext3
filesystem was unmounted, resized and mounted again. That
worked until I resized the RAID10 from 16 to 20 disks and
realized that ext3 does not support filesystems >16TB.

I switched to XFS and created a 20TB filesystem. Here are
the details:

# xfs_info /dev/md5
meta-data=/dev/md5               isize=256    agcount=32,
agsize=152608128 blks
           =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=4883457280, imaxpct=5
           =                       sunit=128    swidth=1280 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
           =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Please notice: This XFS-filesystem has a size of
4883457280*4K = 19,533,829,120K

On saturday I tried to add two more 2TB disks to the RAID10
and the XFS filesystem was mounted (and in medium use) at that
time. Commands were:

# mdadm /dev/md5 --add /dev/sdo
# mdadm --grow /dev/md5 --raid-devices=21

# mdadm -D /dev/md5
/dev/md5:
          Version : 1.2
    Creation Time : Sun Feb 10 16:58:10 2013
       Raid Level : raid10
       Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
    Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
     Raid Devices : 21
    Total Devices : 21
      Persistence : Superblock is persistent

      Update Time : Sat Jan  6 15:08:37 2018
            State : clean, reshaping
   Active Devices : 21
Working Devices : 21
   Failed Devices : 0
    Spare Devices : 0

           Layout : near=2
       Chunk Size : 512K

   Reshape Status : 1% complete
    Delta Devices : 1, (20->21)

             Name : backup:5  (local to host backup)
             UUID : 9030ff07:6a292a3c:26589a26:8c92a488
           Events : 86002

      Number   Major   Minor   RaidDevice State
         0       8       16        0      active sync   /dev/sdb
         1      65       48        1      active sync   /dev/sdt
         2       8       64        2      active sync   /dev/sde
         3      65       96        3      active sync   /dev/sdw
         4       8      112        4      active sync   /dev/sdh
         5      65      144        5      active sync   /dev/sdz
         6       8      160        6      active sync   /dev/sdk
         7      65      192        7      active sync   /dev/sdac
         8       8      208        8      active sync   /dev/sdn
         9      65      240        9      active sync   /dev/sdaf
        10      65        0       10      active sync   /dev/sdq
        11      66       32       11      active sync   /dev/sdai
        12       8       32       12      active sync   /dev/sdc
        13      65       64       13      active sync   /dev/sdu
        14       8       80       14      active sync   /dev/sdf
        15      65      112       15      active sync   /dev/sdx
        16       8      128       16      active sync   /dev/sdi
        17      65      160       17      active sync   /dev/sdaa
        18       8      176       18      active sync   /dev/sdl
        19      65      208       19      active sync   /dev/sdad
        20       8      224       20      active sync   /dev/sdo

Please notice: This RAID10-device has a size of 19,533,829,120K,
which is exactly the same size as the contained XFS-filesystem.

Immediately after the RAID10 reshape operation started, the
XFS-filesystem reported I/O-errors and was severely damaged.
I waited for the reshape operation to finish and tried to repair
the filesystem with xfs_repair (version 3.1.11), but xfs_repair
crashed, so I tried the 4.9.0 version of xfs_repair with no luck
either.

/dev/md5 is now mounted ro,norecovery with an overlay filesystem
on top of it (thanks very much to Andreas for that idea) and I have
set up a new server today. Rsyncing the data to the new server will
take a while and I'm sure I will stumble on lots of corrupted files.
I proceeded from XFS to ZFS (skipped YFS) so lengthy reshape
operations won't happen anymore.

Here are the relevant log messages:

 > Jan  6 14:45:00 backup kernel: md: reshape of RAID array md5
 > Jan  6 14:45:00 backup kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
 > Jan  6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
 > Jan  6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
 > Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
 > Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
 > Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
 > Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
 > ... hundreds of the above XFS-messages deleted
 > Jan  6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected.  Shutting down filesystem
 > Jan  6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)

Please notice: no error message about hardware-problems.
All 21 disks are fine and the next messages from the
md-driver was:

 > Jan  7 02:28:02 backup kernel: md: md5: reshape done.
 > Jan  7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680

I'm wondering about one thing: the first xfs message is about a
metadata I/O error on block 0x12c08f360. Since the xfs filesystem
has a blocksize of 4K, this block is located at position 20135005568K,
which is beyond the end of the RAID10 device. No wonder that the
xfs driver receives an I/O error. And also no wonder that the
filesystem is severely corrupted right now.
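
Spelled out (assuming, as above, that the reported block number is in
4K filesystem blocks):

    printf '%d\n' 0x12c08f360       # 5033751392
    echo $(( 0x12c08f360 * 4 ))     # 20135005568 K
    # filesystem/array size: 4883457280 * 4 = 19533829120 K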

Question 1: How did the xfs driver know on Jan 6 that the RAID10
device was about to be increased from 20TB to 21TB on Jan 7?

Question 2: Why did the xfs driver start to use the additional
space that was not yet there, without me executing xfs_growfs?

This looks like a severe XFS-problem to me.

But my hope is that all the data that was within the filesystem
before Jan 6 14:45 is not involved in the corruption. If xfs
started to use space beyond the end of the underlying raid
device, this should have affected only data that was created,
modified or deleted after Jan 6 14:45.

If that was true we could clearly distinguish between data
that we must dump and data that we can keep. The machine is
our backup system (as you may have guessed from its name)
and I would like to keep old backup-files.

I remember that mkfs.xfs is clever enough to adapt the
filesystem parameters to the underlying hardware of the
block device that the xfs filesystem is created on. Hence,
from the xfs driver's point of view, the underlying block
device is not just a sequence of data blocks, but the xfs
driver knows something about the layout of the underlying
hardware.
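
Roughly speaking, mkfs.xfs (via libblkid) picks this up from the block
device's I/O topology; a sketch of where that information lives on a
typical setup:

    cat /sys/block/md5/queue/minimum_io_size    # md chunk size   -> sunit
    cat /sys/block/md5/queue/optimal_io_size    # chunk * stripes -> swidth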

If that was true - how does the xfs driver react if that
information about the layout of the underlying hardware
changes while the xfs-filesystem is mounted?

Seems to be an interesting problem

Kind regards

Peter Koch


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-08 15:16   ` Wols Lists
  2018-01-08 15:34     ` Reindl Harald
@ 2018-01-08 16:24     ` Wolfgang Denk
  2018-01-10  1:57     ` Guoqing Jiang
  2 siblings, 0 replies; 37+ messages in thread
From: Wolfgang Denk @ 2018-01-08 16:24 UTC (permalink / raw)
  To: Wols Lists; +Cc: Guoqing Jiang, mdraid.pkoch, linux-raid

Dear Wol,

In message <5A538B30.4080601@youngman.org.uk> you wrote:
>
> You extend the filesystem *after* you've grown the array. The act of
> growing the array has caused the filesystem to crash. That should NOT
> happen - the act of growing the array should be *invisible* to the
> filesystem.

Not if this causes any hard I/O errors...

> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs

The original log contained this:

| XFS (md5): metadata I/O error: block 0x12c08f360
| ("xfs_trans_read_buf_map") error 5 numblks 16
| XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
| XFS (md5): metadata I/O error: block 0x12c08f360
| ("xfs_trans_read_buf_map") error 5 numblks 16
| XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
| XFS (md5): metadata I/O error: block 0xebb62c00
| ("xfs_trans_read_buf_map") error 5 numblks 16
| XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
| ...
| ... lots of the above messages deleted
| ...
| XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file
| fs/xfs/xfs_bmap_util.c.  Return address = 0xffffffff8113908f
| XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error 5
| numblks 64
| XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file
| fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
| XFS (md5): Log I/O Error Detected.  Shutting down filesystem

To me this looks as if some hard I/O errors happened during the
growing of the array.  They may have been triggered by the growing of
the array, but only insofar as it caused additional disk load /
reading of otherwise idle areas.

I cannot see any indications for 2) or 3) here, so yes, it was 1),
if you consider spurious I/O errors as such.

Or am I missing something else?


Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Experience is that marvelous thing that enable  you  to  recognize  a
mistake when you make it again.                   - Franklin P. Jones

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-08 15:16   ` Wols Lists
@ 2018-01-08 15:34     ` Reindl Harald
  2018-01-08 16:24     ` Wolfgang Denk
  2018-01-10  1:57     ` Guoqing Jiang
  2 siblings, 0 replies; 37+ messages in thread
From: Reindl Harald @ 2018-01-08 15:34 UTC (permalink / raw)
  To: Wols Lists, Guoqing Jiang, mdraid.pkoch, linux-raid



Am 08.01.2018 um 16:16 schrieb Wols Lists:
> On 08/01/18 07:31, Guoqing Jiang wrote:
>> https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem
>>
>> And you can use xfs_growfs for your purpose.
> 
> You extend the filesystem *after* you've grown the array. The act of
> growing the array has caused the filesystem to crash. That should NOT
> happen - the act of growing the array should be *invisible* to the
> filesystem.
> 
> In other words, one or more of the following three are true :-
> 1) The OP has been caught by some random act of God
> 2) There's a serious flaw in "mdadm --grow"
> 3) There's a serious flaw in xfs
3) should not be possible because as long as 2) is running the filesystem 
should not know that anything is changing at all - even "should" is the 
wrong word: MUST NOT

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-08  7:31 ` Guoqing Jiang
@ 2018-01-08 15:16   ` Wols Lists
  2018-01-08 15:34     ` Reindl Harald
                       ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Wols Lists @ 2018-01-08 15:16 UTC (permalink / raw)
  To: Guoqing Jiang, mdraid.pkoch, linux-raid

On 08/01/18 07:31, Guoqing Jiang wrote:
> 
> 
> On 01/06/2018 11:44 PM, mdraid.pkoch@dfgh.net wrote:
>> Dear MD-experts:
>>
>> I was under the impression that growing a RAID10 device could be done
>> with an active filesystem running on the device.
> 
> It depends on whether the specific filesystem provides related tool or
> not, eg,
> resize2fs can serve ext fs:

Sorry Guoqing, but I think you've *completely* missed the point :-(
> 
> https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem
> 
> And you can use xfs_growfs for your purpose.

You extend the filesystem *after* you've grown the array. The act of
growing the array has caused the filesystem to crash. That should NOT
happen - the act of growing the array should be *invisible* to the
filesystem.

In other words, one or more of the following three are true :-
1) The OP has been caught by some random act of God
2) There's a serious flaw in "mdadm --grow"
3) There's a serious flaw in xfs

Cheers,
Wol

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-06 15:44 mdraid.pkoch
  2018-01-07 19:33 ` John Stoffel
  2018-01-07 20:16 ` Andreas Klauer
@ 2018-01-08  7:31 ` Guoqing Jiang
  2018-01-08 15:16   ` Wols Lists
  2 siblings, 1 reply; 37+ messages in thread
From: Guoqing Jiang @ 2018-01-08  7:31 UTC (permalink / raw)
  To: mdraid.pkoch, linux-raid



On 01/06/2018 11:44 PM, mdraid.pkoch@dfgh.net wrote:
> Dear MD-experts:
>
> I was under the impression that growing a RAID10 device could be done
> with an active filesystem running on the device.

It depends on whether the specific filesystem provides related tool or 
not, eg,
resize2fs can serve ext fs:

https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem

And you can use xfs_growfs for your purpose.

>
> I did this a couple of times when I added additional 2TB disks to our
> production RAID10 running an ext3 Filesystem. That was a very time
> consuming process and we had to use the filesystem during the reshape.
>
> When I increased the size of the RAID10 from 16 to 20 2TB-disks I could
> not use ext3 anymore due to the 16TB maimum size limitation of ext3
> and I replaced the ext3 filesystem by xfs.
>
> Now today I increased the RAID10 again from 20 to 21 disks with the
> following commands:
>
> mdadm /dev/md5 --add /dev/sdo
> mdadm --grow /dev/md5 --raid-devices=21
>
> My plans were to add another disk after that and then grow
> the XFS-filesystem. I do not add multiple disks at once since
> its hard to predict which disk will end up in what disk-set
>
> Here's mdadm -D /dev/md5 output:
> /dev/md5:
>         Version : 1.2
>   Creation Time : Sun Feb 10 16:58:10 2013
>      Raid Level : raid10
>      Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
>   Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
>    Raid Devices : 21
>   Total Devices : 21
>     Persistence : Superblock is persistent
>
>     Update Time : Sat Jan  6 15:08:37 2018
>           State : clean, reshaping
>  Active Devices : 21
> Working Devices : 21
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : near=2
>      Chunk Size : 512K
>
>  Reshape Status : 1% complete
>   Delta Devices : 1, (20->21)
>
>            Name : backup:5  (local to host backup)
>            UUID : 9030ff07:6a292a3c:26589a26:8c92a488
>          Events : 86002
>
>     Number   Major   Minor   RaidDevice State
>        0       8       16        0      active sync   /dev/sdb
>        1      65       48        1      active sync   /dev/sdt
>        2       8       64        2      active sync   /dev/sde
>        3      65       96        3      active sync   /dev/sdw
>        4       8      112        4      active sync   /dev/sdh
>        5      65      144        5      active sync   /dev/sdz
>        6       8      160        6      active sync   /dev/sdk
>        7      65      192        7      active sync   /dev/sdac
>        8       8      208        8      active sync   /dev/sdn
>        9      65      240        9      active sync   /dev/sdaf
>       10      65        0       10      active sync   /dev/sdq
>       11      66       32       11      active sync   /dev/sdai
>       12       8       32       12      active sync   /dev/sdc
>       13      65       64       13      active sync   /dev/sdu
>       14       8       80       14      active sync   /dev/sdf
>       15      65      112       15      active sync   /dev/sdx
>       16       8      128       16      active sync   /dev/sdi
>       17      65      160       17      active sync   /dev/sdaa
>       18       8      176       18      active sync   /dev/sdl
>       19      65      208       19      active sync   /dev/sdad
>       20       8      224       20      active sync   /dev/sdo
>
>
> As you can see the array-size is still 20TB.

Because the reshaping is not finished yet.

>
> Just one second after starting the reshape operation
> XFS failed with the following messages:
>
> # dmesg
> ...
> RAID10 conf printout:
>  --- wd:21 rd:21
>  disk 0, wo:0, o:1, dev:sdb
>  disk 1, wo:0, o:1, dev:sdt
>  disk 2, wo:0, o:1, dev:sde
>  disk 3, wo:0, o:1, dev:sdw
>  disk 4, wo:0, o:1, dev:sdh
>  disk 5, wo:0, o:1, dev:sdz
>  disk 6, wo:0, o:1, dev:sdk
>  disk 7, wo:0, o:1, dev:sdac
>  disk 8, wo:0, o:1, dev:sdn
>  disk 9, wo:0, o:1, dev:sdaf
>  disk 10, wo:0, o:1, dev:sdq
>  disk 11, wo:0, o:1, dev:sdai
>  disk 12, wo:0, o:1, dev:sdc
>  disk 13, wo:0, o:1, dev:sdu
>  disk 14, wo:0, o:1, dev:sdf
>  disk 15, wo:0, o:1, dev:sdx
>  disk 16, wo:0, o:1, dev:sdi
>  disk 17, wo:0, o:1, dev:sdaa
>  disk 18, wo:0, o:1, dev:sdl
>  disk 19, wo:0, o:1, dev:sdad
>  disk 20, wo:1, o:1, dev:sdo
> md: reshape of RAID array md5
> md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> md: using maximum available idle IO bandwidth (but not more than 
> 200000 KB/sec) for reshape.
> md: using 128k window, over a total of 19533829120k.
> XFS (md5): metadata I/O error: block 0x12c08f360 
> ("xfs_trans_read_buf_map") error 5 numblks 16
> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> XFS (md5): metadata I/O error: block 0x12c08f360 
> ("xfs_trans_read_buf_map") error 5 numblks 16
> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> XFS (md5): metadata I/O error: block 0xebb62c00 
> ("xfs_trans_read_buf_map") error 5 numblks 16
> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
> ...
> ... lots of the above messages deleted
> ...
> XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file 
> fs/xfs/xfs_bmap_util.c.  Return address = 0xffffffff8113908f
> XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error 
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
> XFS (md5): Log I/O Error Detected.  Shutting down filesystem
> XFS (md5): Please umount the filesystem and rectify the problem(s)
> XFS (md5): metadata I/O error: block 0x48c710b40 ("xlog_iodone") error 
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710b80 ("xlog_iodone") error 
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710bc0 ("xlog_iodone") error 
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710c00 ("xlog_iodone") error 
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710c40 ("xlog_iodone") error 
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710c80 ("xlog_iodone") error 
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
> XFS (md5): metadata I/O error: block 0x48c710cc0 ("xlog_iodone") error 
> 5 numblks 64
> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
> XFS (md5): I/O Error Detected. Shutting down filesystem

I guess the IOs from xfs were competing with the md internal IO, not good.

> I did an "umount /dev/md5" and now I'm wondering what my options are:

Though XFS filesystems can be grown while mounted, it is better to umount
first if an md reshape is in progress.
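
For example (the mount point is illustrative):

    umount /mnt/backup
    mdadm /dev/md5 --add /dev/sdo
    mdadm --grow /dev/md5 --raid-devices=21
    mdadm --wait /dev/md5          # block until the reshape finishes
    mount /dev/md5 /mnt/backup
    xfs_growfs /mnt/backup         # then grow the fs into the new space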

> Should I wait until the reshape has finisched? I assume yes since 
> stopping that operation will most likely make things worse.
> Unfortunately reshaping a 20TB RAID10 to 21TB will last about
> 10 hours but it's saturday and I have approx. 40 hours to fix the 
> problem until monday morning.
>
> Should I reduce array-size back to 20 disks?
>
> My plans are to run xfs_check first, maybe followed by xfs_repair and
> see what happens.
>
> Any other suggestions?
>
> Do you have an explanation why reshaping a RAID10 with a running
> ext3 filesystem does work while a running XFS-filesystems fails during
> a reshape?
>
> How did the XFS-filesystem notice that a reshape was running? I was
> sure that during the reshape operation every single block of the RAID10
> device could be read or written no matter wether it belongs to the part
> of the RAID that was already reshaped or not. Obviously that's working
> in theory only - or with ext3-filesystems only.

If the IO from the fs could conflict with the reshape IO, then it could be 
trouble, so again, it is safer to umount the fs first before reshaping, 
my $0.02.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-06 15:44 mdraid.pkoch
  2018-01-07 19:33 ` John Stoffel
@ 2018-01-07 20:16 ` Andreas Klauer
  2018-01-08  7:31 ` Guoqing Jiang
  2 siblings, 0 replies; 37+ messages in thread
From: Andreas Klauer @ 2018-01-07 20:16 UTC (permalink / raw)
  To: mdraid.pkoch; +Cc: linux-raid

On Sat, Jan 06, 2018 at 04:44:12PM +0100, mdraid.pkoch@dfgh.net wrote:
> Now today I increased the RAID10 again from 20 to 21 disks with the
> following commands:
> 
> mdadm /dev/md5 --add /dev/sdo
> mdadm --grow /dev/md5 --raid-devices=21
> 
> Just one second after starting the reshape operation
> XFS failed with the following messages:
> 
> md: reshape of RAID array md5
> md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> md: using maximum available idle IO bandwidth (but not more than 200000 
> KB/sec) for reshape.
> md: using 128k window, over a total of 19533829120k.
> XFS (md5): metadata I/O error: block 0x12c08f360 
> ("xfs_trans_read_buf_map") error 5 numblks 16

Ouch. No idea what happened there.

Use overlays to try to recover. Don't write anymore.

https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
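
One way to set that up, as a sketch of the general overlay idea (the wiki
page does it per member disk; paths and names here are illustrative):

    SECTORS=$(blockdev --getsz /dev/md5)
    truncate -s $(blockdev --getsize64 /dev/md5) /store/md5.overlay
    LOOP=$(losetup -f --show /store/md5.overlay)
    dmsetup create md5_overlay --table "0 $SECTORS snapshot /dev/md5 $LOOP N 8"
    # work on /dev/mapper/md5_overlay; writes land in the overlay file only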

I tried to reproduce your problem, created a 20 drive RAID, 
and a while loop to grow to 21 drives, then shrink back to 20.

    truncate -s 100M {001..021}
    losetup ...
    mdadm --create /dev/md42 --level=10 --raid-devices=20 /dev/loop{1..20}
    mdadm --grow /dev/md42 --add /dev/loop21

    while :
    do
        mdadm --wait /dev/md42
        mdadm --grow /dev/md42 --raid-devices=21
        mdadm --wait /dev/md42
        mdadm --grow /dev/md42 --array-size 1013760
        mdadm --wait /dev/md42
        mdadm --grow /dev/md42 --raid-devices=20
    done

Then I put XFS on top and another while loop to extract a Linux tarball.

    while :
    do
        tar xf linux-4.13.4.tar.xz
        sync
        rm -rf linux-4.13.4
        sync
    done

Both running in parallel ad infinitum.

I couldn't get the XFS to corrupt.

mdadm itself eventually died though.

It told me two drives had failed though none did, and it would refuse to 
continue the grow operation. Unless I'm missing something, the degraded 
counter seems to have gone out of whack. There was nothing in dmesg.

# cat /sys/block/md42/md/degraded 
2

# cat /proc/mdstat 
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md42 : active raid10 loop20[19] loop19[18] loop18[17] loop17[16] loop16[15] loop15[14] loop14[13] loop13[12] loop12[11] loop11[10] loop10[9] loop9[8] loop8[7] loop7[6] loop6[5] loop5[4] loop4[3] loop3[2] loop2[1] loop1[0]
      1013760 blocks super 1.2 512K chunks 2 near-copies [20/18] [UUUUUUUUUUUUUUUUUUUU]

Stopping and re-assembling and degraded went back to 0.

# cat /proc/mdstat 
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md42 : active raid10 loop1[0] loop20[19] loop19[18] loop18[17] loop17[16] loop16[15] loop15[14] loop14[13] loop13[12] loop12[11] loop11[10] loop10[9] loop9[8] loop8[7] loop7[6] loop6[5] loop5[4] loop4[3] loop3[2] loop2[1]
      1013760 blocks super 1.2 512K chunks 2 near-copies [20/20] [UUUUUUUUUUUUUUUUUUUU]

But this should be unrelated to your issue.
No idea what happened to you.
Sorry.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Growing RAID10 with active XFS filesystem
  2018-01-06 15:44 mdraid.pkoch
@ 2018-01-07 19:33 ` John Stoffel
  2018-01-07 20:16 ` Andreas Klauer
  2018-01-08  7:31 ` Guoqing Jiang
  2 siblings, 0 replies; 37+ messages in thread
From: John Stoffel @ 2018-01-07 19:33 UTC (permalink / raw)
  To: mdraid.pkoch; +Cc: linux-raid


mdraid> I was under the impression that growing a RAID10 device could
mdraid> be done with an active filesystem running on the device.

It should be just fine.  But in this case, you might also want to talk
with the XFS experts.  

mdraid> I did this a couple of times when I added additional 2TB disks
mdraid> to our production RAID10 running an ext3 Filesystem. That was
mdraid> a very time consuming process and we had to use the filesystem
mdraid> during the reshape.

What kernel and distro are you running here?  What are the mdadm tools
versions?  You need to give more details please. 

mdraid> When I increased the size of the RAID10 from 16 to 20
mdraid> 2TB-disks I could not use ext3 anymore due to the 16TB maimum
mdraid> size limitation of ext3 and I replaced the ext3 filesystem by
mdraid> xfs.

That must have been fun... not. 

mdraid> Now today I increased the RAID10 again from 20 to 21 disks with the
mdraid> following commands:

mdraid> mdadm /dev/md5 --add /dev/sdo
mdraid> mdadm --grow /dev/md5 --raid-devices=21

mdraid> My plans were to add another disk after that and then grow
mdraid> the XFS-filesystem. I do not add multiple disks at once since
mdraid> its hard to predict which disk will end up in what disk-set

mdraid> Here's mdadm -D /dev/md5 output:
mdraid> /dev/md5:
mdraid>          Version : 1.2
mdraid>    Creation Time : Sun Feb 10 16:58:10 2013
mdraid>       Raid Level : raid10
mdraid>       Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
mdraid>    Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
mdraid>     Raid Devices : 21
mdraid>    Total Devices : 21
mdraid>      Persistence : Superblock is persistent

mdraid>      Update Time : Sat Jan  6 15:08:37 2018
mdraid>            State : clean, reshaping
mdraid>   Active Devices : 21
mdraid> Working Devices : 21
mdraid>   Failed Devices : 0
mdraid>    Spare Devices : 0

mdraid>           Layout : near=2
mdraid>       Chunk Size : 512K

mdraid>   Reshape Status : 1% complete
mdraid>    Delta Devices : 1, (20->21)

mdraid>             Name : backup:5  (local to host backup)
mdraid>             UUID : 9030ff07:6a292a3c:26589a26:8c92a488
mdraid>           Events : 86002

mdraid>      Number   Major   Minor   RaidDevice State
mdraid>         0       8       16        0      active sync   /dev/sdb
mdraid>         1      65       48        1      active sync   /dev/sdt
mdraid>         2       8       64        2      active sync   /dev/sde
mdraid>         3      65       96        3      active sync   /dev/sdw
mdraid>         4       8      112        4      active sync   /dev/sdh
mdraid>         5      65      144        5      active sync   /dev/sdz
mdraid>         6       8      160        6      active sync   /dev/sdk
mdraid>         7      65      192        7      active sync   /dev/sdac
mdraid>         8       8      208        8      active sync   /dev/sdn
mdraid>         9      65      240        9      active sync   /dev/sdaf
mdraid>        10      65        0       10      active sync   /dev/sdq
mdraid>        11      66       32       11      active sync   /dev/sdai
mdraid>        12       8       32       12      active sync   /dev/sdc
mdraid>        13      65       64       13      active sync   /dev/sdu
mdraid>        14       8       80       14      active sync   /dev/sdf
mdraid>        15      65      112       15      active sync   /dev/sdx
mdraid>        16       8      128       16      active sync   /dev/sdi
mdraid>        17      65      160       17      active sync   /dev/sdaa
mdraid>        18       8      176       18      active sync   /dev/sdl
mdraid>        19      65      208       19      active sync   /dev/sdad
mdraid>        20       8      224       20      active sync   /dev/sdo

This all looks fine... but I'm thinking what you *should* have done
instead is build a bunch of 2TB pairs, and then use LVM to span across
them with a volume, then build your XFS filesystem on top of that.
This way you would have /dev/md1,2,3,4,5,6,7,8,9,10 all inside a VG,
then you would use LVM to stripe across the pairs.

But that's water under the bridge now.
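
Something along these lines, as a sketch (assuming mirrored pairs; device
and volume names are made up):

    # one RAID1 pair per two disks (md10 ... md19 for ten pairs)
    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb /dev/sdt
    # ...repeat for the remaining pairs...
    pvcreate /dev/md1{0..9}
    vgcreate vg_backup /dev/md1{0..9}
    lvcreate -n lv_backup -i 10 -I 512k -l 100%FREE vg_backup   # stripe across pairs
    mkfs.xfs /dev/vg_backup/lv_backup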

mdraid> As you can see the array-size is still 20TB.

mdraid> Just one second after starting the reshape operation
mdraid> XFS failed with the following messages:

I *think* the mdadm --grow did something, or XFS noticed the change in
array size and grew on its own.  Can you provide the output of
'xfs_info /dev/md5' for us?  


mdraid> # dmesg
mdraid> ...
mdraid> RAID10 conf printout:
mdraid>   --- wd:21 rd:21
mdraid>   disk 0, wo:0, o:1, dev:sdb
mdraid>   disk 1, wo:0, o:1, dev:sdt
mdraid>   disk 2, wo:0, o:1, dev:sde
mdraid>   disk 3, wo:0, o:1, dev:sdw
mdraid>   disk 4, wo:0, o:1, dev:sdh
mdraid>   disk 5, wo:0, o:1, dev:sdz
mdraid>   disk 6, wo:0, o:1, dev:sdk
mdraid>   disk 7, wo:0, o:1, dev:sdac
mdraid>   disk 8, wo:0, o:1, dev:sdn
mdraid>   disk 9, wo:0, o:1, dev:sdaf
mdraid>   disk 10, wo:0, o:1, dev:sdq
mdraid>   disk 11, wo:0, o:1, dev:sdai
mdraid>   disk 12, wo:0, o:1, dev:sdc
mdraid>   disk 13, wo:0, o:1, dev:sdu
mdraid>   disk 14, wo:0, o:1, dev:sdf
mdraid>   disk 15, wo:0, o:1, dev:sdx
mdraid>   disk 16, wo:0, o:1, dev:sdi
mdraid>   disk 17, wo:0, o:1, dev:sdaa
mdraid>   disk 18, wo:0, o:1, dev:sdl
mdraid>   disk 19, wo:0, o:1, dev:sdad
mdraid>   disk 20, wo:1, o:1, dev:sdo
mdraid> md: reshape of RAID array md5
mdraid> md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
mdraid> md: using maximum available idle IO bandwidth (but not more than 200000 
mdraid> KB/sec) for reshape.
mdraid> md: using 128k window, over a total of 19533829120k.
mdraid> XFS (md5): metadata I/O error: block 0x12c08f360 
mdraid> ("xfs_trans_read_buf_map") error 5 numblks 16
mdraid> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
mdraid> XFS (md5): metadata I/O error: block 0x12c08f360 
mdraid> ("xfs_trans_read_buf_map") error 5 numblks 16
mdraid> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
mdraid> XFS (md5): metadata I/O error: block 0xebb62c00 
mdraid> ("xfs_trans_read_buf_map") error 5 numblks 16
mdraid> XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
mdraid> ...
mdraid> ... lots of the above messages deleted
mdraid> ...
mdraid> XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file 
mdraid> fs/xfs/xfs_bmap_util.c.  Return address = 0xffffffff8113908f
mdraid> XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error 5 
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
mdraid> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): Log I/O Error Detected.  Shutting down filesystem
mdraid> XFS (md5): Please umount the filesystem and rectify the problem(s)
mdraid> XFS (md5): metadata I/O error: block 0x48c710b40 ("xlog_iodone") error 5 
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
mdraid> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710b80 ("xlog_iodone") error 5 
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
mdraid> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710bc0 ("xlog_iodone") error 5 
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
mdraid> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710c00 ("xlog_iodone") error 5 
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
mdraid> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710c40 ("xlog_iodone") error 5 
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
mdraid> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710c80 ("xlog_iodone") error 5 
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
mdraid> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): metadata I/O error: block 0x48c710cc0 ("xlog_iodone") error 5 
mdraid> numblks 64
mdraid> XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
mdraid> fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
mdraid> XFS (md5): I/O Error Detected. Shutting down filesystem

mdraid> I did an "umount /dev/md5" and now I'm wondering what my options are:

What does 'xfs_fsck -n /dev/md5' say? 

mdraid> Should I wait until the reshape has finisched? I assume yes
mdraid> since stopping that operation will most likely make things
mdraid> worse.  Unfortunately reshaping a 20TB RAID10 to 21TB will
mdraid> last about 10 hours but it's saturday and I have approx. 40
mdraid> hours to fix the problem until monday morning.

Are you still having the problem?

mdraid> Should I reduce array-size back to 20 disks?

I don't think so.

mdraid> My plans are to run xfs_check first, maybe followed by
mdraid> xfs_repair and see what happens.

Talk to the XFS folks first, before you do anything!  

mdraid> Any other suggestions?

mdraid> Do you have an explanation why reshaping a RAID10 with a running
mdraid> ext3 filesystem does work while a running XFS-filesystems fails during
mdraid> a reshape?

mdraid> How did the XFS-filesystem notice that a reshape was running? I was
mdraid> sure that during the reshape operation every single block of the RAID10
mdraid> device could be read or written no matter wether it belongs to the part
mdraid> of the RAID that was already reshaped or not. Obviously that's working
mdraid> in theory only - or with ext3-filesystems only.

mdraid> Or was i totally wrong with my assumption?

mdraid> Much thanks in advance for any assistance.

mdraid> Peter Koch

mdraid> --
mdraid> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
mdraid> the body of a message to majordomo@vger.kernel.org
mdraid> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Growing RAID10 with active XFS filesystem
@ 2018-01-06 15:44 mdraid.pkoch
  2018-01-07 19:33 ` John Stoffel
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: mdraid.pkoch @ 2018-01-06 15:44 UTC (permalink / raw)
  To: linux-raid

Dear MD-experts:

I was under the impression that growing a RAID10 device could be done
with an active filesystem running on the device.

I did this a couple of times when I added additional 2TB disks to our
production RAID10 running an ext3 Filesystem. That was a very time
consuming process and we had to use the filesystem during the reshape.

When I increased the size of the RAID10 from 16 to 20 2TB-disks I could
not use ext3 anymore due to the 16TB maximum size limitation of ext3,
and I replaced the ext3 filesystem with xfs.

Now today I increased the RAID10 again from 20 to 21 disks with the
following commands:

mdadm /dev/md5 --add /dev/sdo
mdadm --grow /dev/md5 --raid-devices=21

My plans were to add another disk after that and then grow
the XFS-filesystem. I do not add multiple disks at once since
it's hard to predict which disk will end up in which disk set.

Here's mdadm -D /dev/md5 output:
/dev/md5:
         Version : 1.2
   Creation Time : Sun Feb 10 16:58:10 2013
      Raid Level : raid10
      Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
   Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
    Raid Devices : 21
   Total Devices : 21
     Persistence : Superblock is persistent

     Update Time : Sat Jan  6 15:08:37 2018
           State : clean, reshaping
  Active Devices : 21
Working Devices : 21
  Failed Devices : 0
   Spare Devices : 0

          Layout : near=2
      Chunk Size : 512K

  Reshape Status : 1% complete
   Delta Devices : 1, (20->21)

            Name : backup:5  (local to host backup)
            UUID : 9030ff07:6a292a3c:26589a26:8c92a488
          Events : 86002

     Number   Major   Minor   RaidDevice State
        0       8       16        0      active sync   /dev/sdb
        1      65       48        1      active sync   /dev/sdt
        2       8       64        2      active sync   /dev/sde
        3      65       96        3      active sync   /dev/sdw
        4       8      112        4      active sync   /dev/sdh
        5      65      144        5      active sync   /dev/sdz
        6       8      160        6      active sync   /dev/sdk
        7      65      192        7      active sync   /dev/sdac
        8       8      208        8      active sync   /dev/sdn
        9      65      240        9      active sync   /dev/sdaf
       10      65        0       10      active sync   /dev/sdq
       11      66       32       11      active sync   /dev/sdai
       12       8       32       12      active sync   /dev/sdc
       13      65       64       13      active sync   /dev/sdu
       14       8       80       14      active sync   /dev/sdf
       15      65      112       15      active sync   /dev/sdx
       16       8      128       16      active sync   /dev/sdi
       17      65      160       17      active sync   /dev/sdaa
       18       8      176       18      active sync   /dev/sdl
       19      65      208       19      active sync   /dev/sdad
       20       8      224       20      active sync   /dev/sdo


As you can see the array-size is still 20TB.

Just one second after starting the reshape operation
XFS failed with the following messages:

# dmesg
...
RAID10 conf printout:
  --- wd:21 rd:21
  disk 0, wo:0, o:1, dev:sdb
  disk 1, wo:0, o:1, dev:sdt
  disk 2, wo:0, o:1, dev:sde
  disk 3, wo:0, o:1, dev:sdw
  disk 4, wo:0, o:1, dev:sdh
  disk 5, wo:0, o:1, dev:sdz
  disk 6, wo:0, o:1, dev:sdk
  disk 7, wo:0, o:1, dev:sdac
  disk 8, wo:0, o:1, dev:sdn
  disk 9, wo:0, o:1, dev:sdaf
  disk 10, wo:0, o:1, dev:sdq
  disk 11, wo:0, o:1, dev:sdai
  disk 12, wo:0, o:1, dev:sdc
  disk 13, wo:0, o:1, dev:sdu
  disk 14, wo:0, o:1, dev:sdf
  disk 15, wo:0, o:1, dev:sdx
  disk 16, wo:0, o:1, dev:sdi
  disk 17, wo:0, o:1, dev:sdaa
  disk 18, wo:0, o:1, dev:sdl
  disk 19, wo:0, o:1, dev:sdad
  disk 20, wo:1, o:1, dev:sdo
md: reshape of RAID array md5
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 
KB/sec) for reshape.
md: using 128k window, over a total of 19533829120k.
XFS (md5): metadata I/O error: block 0x12c08f360 
("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
XFS (md5): metadata I/O error: block 0x12c08f360 
("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
XFS (md5): metadata I/O error: block 0xebb62c00 
("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
...
... lots of the above messages deleted
...
XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file 
fs/xfs/xfs_bmap_util.c.  Return address = 0xffffffff8113908f
XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error 5 
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): Log I/O Error Detected.  Shutting down filesystem
XFS (md5): Please umount the filesystem and rectify the problem(s)
XFS (md5): metadata I/O error: block 0x48c710b40 ("xlog_iodone") error 5 
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710b80 ("xlog_iodone") error 5 
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710bc0 ("xlog_iodone") error 5 
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c00 ("xlog_iodone") error 5 
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c40 ("xlog_iodone") error 5 
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c80 ("xlog_iodone") error 5 
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710cc0 ("xlog_iodone") error 5 
numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file 
fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): I/O Error Detected. Shutting down filesystem

I did an "umount /dev/md5" and now I'm wondering what my options are:

Should I wait until the reshape has finished? I assume yes, since 
stopping that operation will most likely make things worse.
Unfortunately reshaping a 20TB RAID10 to 21TB will last about
10 hours, but it's Saturday and I have approx. 40 hours to fix the 
problem until Monday morning.

Should I reduce array-size back to 20 disks?

My plans are to run xfs_check first, maybe followed by xfs_repair and
see what happens.
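
For a first, non-destructive pass, something along the lines of

    xfs_repair -n /dev/md5      # -n: no-modify mode, report problems only

on the unmounted device should be safe while I decide what to do next.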

Any other suggestions?

Do you have an explanation why reshaping a RAID10 with a running
ext3 filesystem does work while a running XFS-filesystems fails during
a reshape?

How did the XFS-filesystem notice that a reshape was running? I was
sure that during the reshape operation every single block of the RAID10
device could be read or written no matter whether it belongs to the part
of the RAID that was already reshaped or not. Obviously that's working
in theory only - or with ext3-filesystems only.

Or was I totally wrong with my assumption?

Much thanks in advance for any assistance.

Peter Koch


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2018-01-15 17:08 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-08 19:08 Growing RAID10 with active XFS filesystem xfs.pkoch
2018-01-08 19:26 ` Darrick J. Wong
2018-01-08 22:01   ` Dave Chinner
2018-01-08 23:44     ` mdraid.pkoch
2018-01-08 23:44       ` xfs.pkoch
2018-01-09  9:36     ` Wols Lists
2018-01-09 21:47       ` IMAP-FCC:Sent
2018-01-09 22:25       ` Dave Chinner
2018-01-09 22:32         ` Reindl Harald
2018-01-10  6:17         ` Wols Lists
2018-01-11  2:14           ` Dave Chinner
2018-01-12  2:16             ` Guoqing Jiang
2018-01-10 14:10         ` Phil Turmel
2018-01-10 21:57           ` Wols Lists
2018-01-11  3:07           ` Dave Chinner
2018-01-12 13:32             ` Wols Lists
2018-01-12 14:25               ` Emmanuel Florac
2018-01-12 17:52                 ` Wols Lists
2018-01-12 18:37                   ` Emmanuel Florac
2018-01-12 19:35                     ` Wol's lists
2018-01-13 12:30                       ` Brad Campbell
2018-01-13 13:18                         ` Wols Lists
2018-01-13  0:20                   ` Stan Hoeppner
2018-01-13 19:29                     ` Wol's lists
2018-01-13 22:40                       ` Dave Chinner
2018-01-13 23:04                         ` Wols Lists
2018-01-14 21:33                 ` Wol's lists
2018-01-15 17:08                   ` Emmanuel Florac
  -- strict thread matches above, loose matches on Subject: below --
2018-01-08 19:06 mdraid.pkoch
2018-01-06 15:44 mdraid.pkoch
2018-01-07 19:33 ` John Stoffel
2018-01-07 20:16 ` Andreas Klauer
2018-01-08  7:31 ` Guoqing Jiang
2018-01-08 15:16   ` Wols Lists
2018-01-08 15:34     ` Reindl Harald
2018-01-08 16:24     ` Wolfgang Denk
2018-01-10  1:57     ` Guoqing Jiang
