* [BUG] non-metadata arrays cannot use more than 27 component devices
@ 2017-02-24 12:08 ian_bruce
  2017-02-24 15:20 ` Phil Turmel
  2017-02-27  5:55 ` NeilBrown
  0 siblings, 2 replies; 19+ messages in thread
From: ian_bruce @ 2017-02-24 12:08 UTC (permalink / raw)
  To: linux-raid

When assembling non-metadata arrays ("mdadm --build"), the in-kernel
superblock apparently defaults to the MD-RAID v0.90 type. This imposes a
maximum of 27 component block devices, presumably as well as limits on
device size.

mdadm does not allow you to override this default, by specifying the
v1.2 superblock. It is not clear whether mdadm tells the kernel to use
the v0.90 superblock, or the kernel assumes this by itself. One or other
of them should be fixed; there does not appear to be any reason why the
v1.2 superblock should not be the default in this case.

details are here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=855871

This problem is easy to reproduce, even with simple hardware. You can
use /dev/loop devices as the array components, as explained in the link.
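
A condensed reproduction (sizes, device count, and the use of loop
devices are illustrative):

# truncate -s 64M img64m.{00..31}
# RAID=$(for x in img64m.* ; do losetup --show -f $x ; done)
# mdadm --build /dev/md/md-test --level=linear --raid-devices=32 $RAID
mdadm: ADD_NEW_DISK failed for /dev/loop27: Device or resource busy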


-- Ian Bruce

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-24 12:08 [BUG] non-metadata arrays cannot use more than 27 component devices ian_bruce
@ 2017-02-24 15:20 ` Phil Turmel
  2017-02-24 16:40   ` ian_bruce
  2017-02-27  5:55 ` NeilBrown
  1 sibling, 1 reply; 19+ messages in thread
From: Phil Turmel @ 2017-02-24 15:20 UTC (permalink / raw)
  To: ian_bruce, linux-raid

On 02/24/2017 07:08 AM, ian_bruce@mail.ru wrote:
> When assembling non-metadata arrays ("mdadm --build"), the in-kernel 
> superblock apparently defaults to the MD-RAID v0.90 type. This
> imposes a maximum of 27 component block devices, presumably as well
> as limits on device size.
> 
> mdadm does not allow you to override this default, by specifying the 
> v1.2 superblock. It is not clear whether mdadm tells the kernel to
> use the v0.90 superblock, or the kernel assumes this by itself. One
> or other of them should be fixed; there does not appear to be any
> reason why the v1.2 superblock should not be the default in this
> case.
> 
> details are here:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=855871
> 
> This problem is easy to reproduce, even with simple hardware. You
> can use /dev/loop devices as the array components, as explained in
> the link.

Considering the existence of --build is strictly to support arrays that
predate MD raid, it seems a bit of a stretch to claim this as a bug
instead of a feature request.

When you implement this feature, you might want to consider extending
modern MD raid's support for external metadata to use RAM; then you
could do whatever you please with the array events.

See "man mdmon" for a summary of external metadata events.

Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-24 15:20 ` Phil Turmel
@ 2017-02-24 16:40   ` ian_bruce
  2017-02-24 20:46     ` Phil Turmel
  0 siblings, 1 reply; 19+ messages in thread
From: ian_bruce @ 2017-02-24 16:40 UTC (permalink / raw)
  To: linux-raid

On Fri, 24 Feb 2017 10:20:52 -0500
Phil Turmel <philip@turmel.org> wrote:

> Considering the existence of --build is strictly to support arrays
> that predate MD raid, it seems a bit of a stretch to claim this as a
> bug instead of a feature request.

quoting from the mdadm manual page:

    *Build*
    
    Build an array that doesn't have per-device metadata (superblocks).
    For these sorts of arrays, mdadm cannot differentiate between
    initial creation and subsequent assembly of an array. It also cannot
    perform any checks that appropriate components have been requested.
    Because of this, the Build mode should only be used together with a
    complete understanding of what you are doing.

No mention of "arrays that predate MD RAID" there. Nor any mention of a
27-component limit, either. Nor does the eventual error message mention
any such thing (although "mdadm --create --metadata=0 --raid-devices=28"
does). I'd call that a bug.

Since there's no pre-existing superblock, and the kernel has to create
one, it could just as easily use the v1.2 format as the v0.90 format, as
it does with "mdadm --create". Why shouldn't the v1.2 format be the
default for "mdadm --build" as well? That would be more consistent --
why should these two options behave differently in this regard, in the
absence of any material reason to do so?

--create : initialize v1.2 kernel superblock and write to disk

--build  : initialize v1.2 kernel superblock but don't write to disk

It seems like it would actually be simpler to treat the two cases the
same, rather than differently.
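
To make the asymmetry concrete (the exact invocations are illustrative;
a full failing transcript with 32 loop devices appears later in this
thread):

# mdadm --create /dev/md0 --level=linear --raid-devices=28 /dev/loop{0..27}  # accepted: default v1.2 superblock
# mdadm --build /dev/md0 --level=linear --raid-devices=28 /dev/loop{0..27}
mdadm: ADD_NEW_DISK failed for /dev/loop27: Device or resource busy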


-- Ian Bruce

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-24 16:40   ` ian_bruce
@ 2017-02-24 20:46     ` Phil Turmel
  2017-02-25 20:05       ` Anthony Youngman
  0 siblings, 1 reply; 19+ messages in thread
From: Phil Turmel @ 2017-02-24 20:46 UTC (permalink / raw)
  To: ian_bruce, linux-raid

On 02/24/2017 11:40 AM, ian_bruce@mail.ru wrote:
> On Fri, 24 Feb 2017 10:20:52 -0500 Phil Turmel <philip@turmel.org> 
> wrote:
> 
>> Considering the existence of --build is strictly to support arrays 
>> that predate MD raid, it seems a bit of a stretch to claim this as
>> a bug instead of a feature request.
> 
> quoting from the mdadm manual page:

Quote all you like; it doesn't change the history.  Note that build mode
doesn't support a bunch of other MD raid features either, like all of
the parity raid levels.  That it doesn't support v1+ metadata isn't a
surprise, and --build isn't the only legacy feature that is restricted
to legacy metadata (built-in kernel auto-assembly gets the most whining,
actually).

Anyways, though I can't speak for the maintainers, it seems that build
mode is there to keep the MD maintainers from being yelled at by Linus
for breaking legacy setups.  Nothing more.

If you think it's trivial to implement --build with v1.x metadata, go
right ahead.  Post your patches for review.

Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-24 20:46     ` Phil Turmel
@ 2017-02-25 20:05       ` Anthony Youngman
  2017-02-25 22:00         ` Phil Turmel
  0 siblings, 1 reply; 19+ messages in thread
From: Anthony Youngman @ 2017-02-25 20:05 UTC (permalink / raw)
  To: Phil Turmel, ian_bruce, linux-raid

On 24/02/17 20:46, Phil Turmel wrote:
> On 02/24/2017 11:40 AM, ian_bruce@mail.ru wrote:
>> On Fri, 24 Feb 2017 10:20:52 -0500 Phil Turmel <philip@turmel.org>
>> wrote:
>>
>>> Considering the existence of --build is strictly to support arrays
>>> that predate MD raid, it seems a bit of a stretch to claim this as
>>> a bug instead of a feature request.
>> quoting from the mdadm manual page:
> Quote all you like; it doesn't change the history.  Note that build mode
> doesn't support a bunch of other MD raid features either, like all of
> the parity raid levels.  That it doesn't support v1+ metadata isn't a
> surprise, and --build isn't the only legacy feature that is restricted
> to legacy metadata (built-in kernel auto-assembly gets the most
> whining, actually).
>
> Anyways, though I can't speak for the maintainers, it seems that build
> mode is there to keep the MD maintainers from being yelled at by Linus
> for breaking legacy setups.  Nothing more.

Although I would have thought build mode was superb for doing backups 
without needing to stop using the system ... I haven't seen any 
documentation about things like breaking raid to do backups and all that 
sort of thing.

I need to investigate it, but I'd like to know how to suspend a mirror, 
back it up, and then resume. The databases I work with have an option 
that suspends all new writes, but flushes all current transactions to 
disk so the disk is consistent for backing up. So if you do that and 
back up the database you know your backup is consistent.

This is all a rather important usage of raid, actually, imho. It seems 
so obvious - create a temporary mirror, wait for the sync to complete, 
suspend i/o to get the disk consistent, then you can break the mirror 
and carry on. Terabytes :-) of data safely backed up in the space of 
seconds.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-25 20:05       ` Anthony Youngman
@ 2017-02-25 22:00         ` Phil Turmel
  2017-02-25 23:30           ` Wols Lists
  0 siblings, 1 reply; 19+ messages in thread
From: Phil Turmel @ 2017-02-25 22:00 UTC (permalink / raw)
  To: Anthony Youngman, ian_bruce, linux-raid

On 02/25/2017 03:05 PM, Anthony Youngman wrote:

> Although I would have thought build mode was superb for doing
> backups without needing to stop using the system ... I haven't seen
> any documentation about things like breaking raid to do backups and
> all that sort of thing.
> 
> I need to investigate it, but I'd like to know how to suspend a
> mirror, back it up, and then resume. The databases I work with have
> an option that suspends all new writes, but flushes all current
> transactions to disk so the disk is consistent for backing up. So if
> you do that and back up the database you know your backup is
> consistent.
> 
> This is all a rather important usage of raid, actually, imho. It
> seems so obvious - create a temporary mirror, wait for the sync to
> complete, suspend i/o to get the disk consistent, then you can break
> the mirror and carry on. Terabytes :-) of data safely backed up in
> the space of seconds.

No. Don't go there.  There's already a technology out there that does
this correctly, called LVM snapshots.  And they let you resume normal
operations after a very brief hesitation, and the snapshot holds the
static image while you copy it off.

Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-25 22:00         ` Phil Turmel
@ 2017-02-25 23:30           ` Wols Lists
  2017-02-25 23:41             ` Phil Turmel
  0 siblings, 1 reply; 19+ messages in thread
From: Wols Lists @ 2017-02-25 23:30 UTC (permalink / raw)
  To: Phil Turmel, ian_bruce, linux-raid

On 25/02/17 22:00, Phil Turmel wrote:
>> This is all a rather important usage of raid, actually, imho. It
>> > seems so obvious - create a temporary mirror, wait for the sync to
>> > complete, suspend i/o to get the disk consistent, then you can break
>> > the mirror and carry on. Terabytes :-) of data safely backed up in
>> > the space of seconds.

> No. Don't go there.  There's already a technology out there that does
> this correctly, called LVM snapshots.  And they let you resume normal
> operations after a very brief hesitation, and the snapshot holds the
> static image while you copy it off.

Will it let you put that snapshot on a hot-plug disk you can remove? For
my little system I'd quite happily mirror it off onto a hard-disk and
unplug it.

Oh - and I'm not running lvm. Not that I think there's anything wrong
with that, it's just yet another layer that I'm not (currently)
comfortable with.

Is there a sound technical reason not to go there, or is it simply a
case of "learn another tool for that job"? The less tools I have to know
the better, imho.

(Although why I'm worrying, I don't know. I know btrfs is planning to
make that obsolete :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-25 23:30           ` Wols Lists
@ 2017-02-25 23:41             ` Phil Turmel
  2017-02-25 23:55               ` Wols Lists
  0 siblings, 1 reply; 19+ messages in thread
From: Phil Turmel @ 2017-02-25 23:41 UTC (permalink / raw)
  To: Wols Lists, ian_bruce, linux-raid

On 02/25/2017 06:30 PM, Wols Lists wrote:
> On 25/02/17 22:00, Phil Turmel wrote:

>> No. Don't go there.  There's already a technology out there that does
>> this correctly, called LVM snapshots.  And they let you resume normal
>> operations after a very brief hesitation, and the snapshot holds the
>> static image while you copy it off.
> 
> Will it let you put that snapshot on a hot-plug disk you can remove? For
> my little system I'd quite happily mirror it off onto a hard-disk and
> unplug it.

You can copy it off to any block device you like, or dd it to a file, or
dd and gzip to a compressed file.  Anything you can do to copy a
partition to backup can be used on the snapshot.

> Oh - and I'm not running lvm. Not that I think there's anything wrong
> with that, it's just yet another layer that I'm not (currently)
> comfortable with.

So you know how to use a hammer, and don't feel comfortable with using a
handsaw, so you're going to smash a board in two instead of sawing it?

Ok, maybe that was too facetious. (-:

> Is there a sound technical reason not to go there, or is it simply a
> case of "learn another tool for that job"? The less tools I have to know
> the better, imho.

Um, no, imnsho.  Learn new tools when you need them.

Linux raid has no formal mechanism to cleanly separate a mirror from a
running array, access it as a backup, and not risk corruption when
re-attaching it to the array.  Most filesystems write to the partition
when mounting, even for read-only mounts.  You cannot safely access the
disconnected member except via pure block reads.
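
For example, about the only safe thing you can do with the detached
member is a raw image copy (device and destination are illustrative):

# dd if=/dev/sdX of=/backup/member.img bs=1M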

Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-25 23:41             ` Phil Turmel
@ 2017-02-25 23:55               ` Wols Lists
  2017-02-26  0:07                 ` Phil Turmel
  0 siblings, 1 reply; 19+ messages in thread
From: Wols Lists @ 2017-02-25 23:55 UTC (permalink / raw)
  To: Phil Turmel, ian_bruce, linux-raid

On 25/02/17 23:41, Phil Turmel wrote:
>> Is there a sound technical reason not to go there, or is it simply a
>> > case of "learn another tool for that job"? The less tools I have to know
>> > the better, imho.

> Um, no, imnsho.  Learn new tools when you need them.

I don't have a problem with that. All too often people use the tool
they're familiar with when it's the wrong tool. But there's a reason
they do that - it's a familiar tool!
> 
> Linux raid has no formal mechanism to cleanly separate a mirror from a
> running array, access it as a backup, and not risk corruption when
> re-attaching it to the array.  Most filesystems write to the partition
> when mounting, even for read-only mounts.  You cannot safely access the
> disconnected member except via pure block reads.

Because to do so doesn't make sense? Or because nobody's bothered to do
it? I get grumpy when people implement corner cases without bothering to
implement the logically sensible options - bit like those extremely
annoying dialog boxes that give you three choices, "yes", "no", "yes to
all". What about no to all?

I feel like mirror-raid is perfect for doing backups. I take your point
that linux hasn't implemented that feature (particularly well), but
surely it's a feature that *should* be there. I know I know - "patches
welcome" :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-25 23:55               ` Wols Lists
@ 2017-02-26  0:07                 ` Phil Turmel
  2017-03-01 15:02                   ` Wols Lists
  0 siblings, 1 reply; 19+ messages in thread
From: Phil Turmel @ 2017-02-26  0:07 UTC (permalink / raw)
  To: Wols Lists, ian_bruce, linux-raid

On 02/25/2017 06:55 PM, Wols Lists wrote:
> On 25/02/17 23:41, Phil Turmel wrote:
>>> Is there a sound technical reason not to go there, or is it simply a
>>>> case of "learn another tool for that job"? The less tools I have to know
>>>> the better, imho.
> 
>> Um, no, imnsho.  Learn new tools when you need them.
> 
> I don't have a problem with that. All too often people use the tool
> they're familiar with when it's the wrong tool. But there's a reason
> they do that - it's a familiar tool!
>>
>> Linux raid has no formal mechanism to cleanly separate a mirror from a
>> running array, access it as a backup, and not risk corruption when
>> re-attaching it to the array.  Most filesystems write to the partition
>> when mounting, even for read-only mounts.  You cannot safely access the
>> disconnected member except via pure block reads.
> 
> Because to do so doesn't make sense? Or because nobody's bothered to do
> it? I get grumpy when people implement corner cases without bothering to
> implement the logically sensible options - bit like those extremely
> annoying dialog boxes that give you three choices, "yes", "no", "yes to
> all". What about no to all?

Because while the member is disconnected, the array accumulates
write-intent bits indicating where that device is out of date, but it
has no way to know what writes are happening to the member itself.
Therefore any re-add will introduce unknowable corruptions.  There is no
way to control what writes happen to that member, and drives don't
naturally keep a log of writes that have happened.  The data needed to
safely do what you want simply doesn't exist.  Your only known safe
choice is to disable write-intent bitmaps, forcing a complete resync on
--re-add.
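
Roughly, that safe-but-slow route looks like this (array and device
names are illustrative):

# mdadm --grow /dev/md0 --bitmap=none        # drop the write-intent bitmap
# mdadm /dev/md0 --fail /dev/sdX --remove /dev/sdX
  ... use /dev/sdX as the "backup" ...
# mdadm /dev/md0 --add /dev/sdX              # full resync when it returns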

> I feel like mirror-raid is perfect for doing backups.

Your feelings are wrong.  Sorry.  LVM is the perfect tool because it
entirely controls the snapshot and doesn't have to re-add it.

> I take your point
> that linux hasn't implemented that feature (particularly well), but
> surely it's a feature that *should* be there. I know I know - "patches
> welcome" :-)

Good luck creating the necessary data from thin air.  It's not a
question of writing patches.

Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-24 12:08 [BUG] non-metadata arrays cannot use more than 27 component devices ian_bruce
  2017-02-24 15:20 ` Phil Turmel
@ 2017-02-27  5:55 ` NeilBrown
  2017-02-28 10:25   ` ian_bruce
  1 sibling, 1 reply; 19+ messages in thread
From: NeilBrown @ 2017-02-27  5:55 UTC (permalink / raw)
  To: ian_bruce, linux-raid

On Fri, Feb 24 2017, ian_bruce@mail.ru wrote:

> When assembling non-metadata arrays ("mdadm --build"), the in-kernel
> superblock apparently defaults to the MD-RAID v0.90 type. This imposes a
> maximum of 27 component block devices, presumably as well as limits on
> device size.
>
> mdadm does not allow you to override this default, by specifying the
> v1.2 superblock. It is not clear whether mdadm tells the kernel to use
> the v0.90 superblock, or the kernel assumes this by itself. One or other
> of them should be fixed; there does not appear to be any reason why the
> v1.2 superblock should not be the default in this case.

Can you see if this change improves the behavior for you?

NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index ba485dcf1064..e0ac7f5a8e68 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6464,9 +6464,8 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
 	mddev->layout        = info->layout;
 	mddev->chunk_sectors = info->chunk_size >> 9;
 
-	mddev->max_disks     = MD_SB_DISKS;
-
 	if (mddev->persistent) {
+		mddev->max_disks     = MD_SB_DISKS;
 		mddev->flags         = 0;
 		mddev->sb_flags         = 0;
 	}



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-27  5:55 ` NeilBrown
@ 2017-02-28 10:25   ` ian_bruce
  2017-02-28 20:29     ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: ian_bruce @ 2017-02-28 10:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Mon, 27 Feb 2017 16:55:56 +1100
NeilBrown <neilb@suse.com> wrote:

>> When assembling non-metadata arrays ("mdadm --build"), the in-kernel
>> superblock apparently defaults to the MD-RAID v0.90 type. This
>> imposes a maximum of 27 component block devices, presumably as well
>> as limits on device size.
>>
>> mdadm does not allow you to override this default, by specifying the
>> v1.2 superblock. It is not clear whether mdadm tells the kernel to
>> use the v0.90 superblock, or the kernel assumes this by itself. One
>> or other of them should be fixed; there does not appear to be any
>> reason why the v1.2 superblock should not be the default in this
>> case.
> 
> Can you see if this change improves the behavior for you?

Unfortunately, I'm not set up for kernel compilation at the moment. But
here is my test case; it shouldn't be any harder to reproduce than this,
on extremely ordinary hardware (= no actual disk RAID array):


# truncate -s 64M img64m.{00..31}   # requires no space on ext4,
#                                   # because sparse files are created
# 
# ls img64m.*
img64m.00  img64m.04  img64m.08  img64m.12  img64m.16  img64m.20  img64m.24  img64m.28
img64m.01  img64m.05  img64m.09  img64m.13  img64m.17  img64m.21  img64m.25  img64m.29
img64m.02  img64m.06  img64m.10  img64m.14  img64m.18  img64m.22  img64m.26  img64m.30
img64m.03  img64m.07  img64m.11  img64m.15  img64m.19  img64m.23  img64m.27  img64m.31
# 
# RAID=$(for x in img64m.* ; do losetup --show -f $x ; done)
# 
# echo $RAID
/dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6 /dev/loop7
/dev/loop8 /dev/loop9 /dev/loop10 /dev/loop11 /dev/loop12 /dev/loop13 /dev/loop14 /dev/loop15
/dev/loop16 /dev/loop17 /dev/loop18 /dev/loop19 /dev/loop20 /dev/loop21 /dev/loop22 /dev/loop23
/dev/loop24 /dev/loop25 /dev/loop26 /dev/loop27 /dev/loop28 /dev/loop29 /dev/loop30 /dev/loop31
# 
# mdadm --build /dev/md/md-test --level=linear --raid-devices=32 $RAID
mdadm: ADD_NEW_DISK failed for /dev/loop27: Device or resource busy
# 

kernel log:

    kernel: [109524.168624] md: nonpersistent superblock ...
    kernel: [109524.168638] md: md125: array is limited to 27 devices
    kernel: [109524.168643] md: export_rdev(loop27)
    kernel: [109524.180676] md: md125 stopped.


It appears that I was wrong in assuming that the MD-RAID v0.90
limitation of 4TB per component device would be in effect:


# truncate -s 5T img5t.{00..03}   # sparse files again
# 
# ls -l img5t.*
-rw-r--r-- 1 root root 5497558138880 Feb 28 00:09 img5t.00
-rw-r--r-- 1 root root 5497558138880 Feb 28 00:09 img5t.01
-rw-r--r-- 1 root root 5497558138880 Feb 28 00:09 img5t.02
-rw-r--r-- 1 root root 5497558138880 Feb 28 00:09 img5t.03
# 
# RAID=$(for x in img5t.* ; do losetup --show -f $x ; done)
# 
# echo $RAID
/dev/loop32 /dev/loop33 /dev/loop34 /dev/loop35
# 
# mdadm --build /dev/md/md-test --level=linear --raid-devices=4 $RAID
mdadm: array /dev/md/md-test built and started.
# 
# mdadm --detail /dev/md/md-test
/dev/md/md-test:
        Version : 
  Creation Time : Tue Feb 28 00:18:21 2017
     Raid Level : linear
     Array Size : 21474836480 (20480.00 GiB 21990.23 GB)
   Raid Devices : 4
  Total Devices : 4

          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

       Rounding : 64K

    Number   Major   Minor   RaidDevice State
       0       7       32        0      active sync   /dev/loop32
       1       7       33        1      active sync   /dev/loop33
       2       7       34        2      active sync   /dev/loop34
       3       7       35        3      active sync   /dev/loop35
# 
# mkfs.ext4 /dev/md/md-test
mke2fs 1.43.4 (31-Jan-2017)
Discarding device blocks: done                            
Creating filesystem with 5368709120 4k blocks and 335544320 inodes
Filesystem UUID: da293fd3-b4ec-40e3-b5be-3caeef55edcf
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 
	2560000000, 3855122432

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done         

# 
# fsck.ext4 -f /dev/md/md-test
e2fsck 1.43.4 (31-Jan-2017)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md/md-test: 11/335544320 files (0.0% non-contiguous), 21625375/5368709120 blocks
# 


> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ba485dcf1064..e0ac7f5a8e68 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6464,9 +6464,8 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
>  	mddev->layout        = info->layout;
>  	mddev->chunk_sectors = info->chunk_size >> 9;
>  
> -	mddev->max_disks     = MD_SB_DISKS;
> -
>  	if (mddev->persistent) {
> +		mddev->max_disks     = MD_SB_DISKS;
>  		mddev->flags         = 0;
>  		mddev->sb_flags         = 0;
>  	}

What value does mddev->max_disks get in the opposite case,
(!mddev->persistent) ?

I note this comment from the top of the function:

    * set_array_info is used two different ways
    * The original usage is when creating a new array.
    * In this usage, raid_disks is > 0 and it together with
    *  level, size, not_persistent,layout,chunksize determine the
    *  shape of the array.
    *  This will always create an array with a type-0.90.0 superblock.

http://lxr.free-electrons.com/source/drivers/md/md.c#L6410

Surely there is an equivalent function which creates arrays with a
type-1 superblock?


-- Ian Bruce

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-28 10:25   ` ian_bruce
@ 2017-02-28 20:29     ` NeilBrown
  2017-03-01 13:05       ` ian_bruce
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2017-02-28 20:29 UTC (permalink / raw)
  To: ian_bruce; +Cc: linux-raid

On Tue, Feb 28 2017, ian_bruce@mail.ru wrote:

> On Mon, 27 Feb 2017 16:55:56 +1100
> NeilBrown <neilb@suse.com> wrote:
>
>>> When assembling non-metadata arrays ("mdadm --build"), the in-kernel
>>> superblock apparently defaults to the MD-RAID v0.90 type. This
>>> imposes a maximum of 27 component block devices, presumably as well
>>> as limits on device size.
>>>
>>> mdadm does not allow you to override this default, by specifying the
>>> v1.2 superblock. It is not clear whether mdadm tells the kernel to
>>> use the v0.90 superblock, or the kernel assumes this by itself. One
>>> or other of them should be fixed; there does not appear to be any
>>> reason why the v1.2 superblock should not be the default in this
>>> case.
>> 
>> Can you see if this change improves the behavior for you?
>
> Unfortunately, I'm not set up for kernel compilation at the moment. But
> here is my test case; it shouldn't be any harder to reproduce than this,
> on extremely ordinary hardware (= no actual disk RAID array):
>
>
> # truncate -s 64M img64m.{00..31}   # requires no space on ext4,
> #                                   # because sparse files are created
> # 
> # ls img64m.*
> img64m.00  img64m.04  img64m.08  img64m.12  img64m.16  img64m.20  img64m.24  img64m.28
> img64m.01  img64m.05  img64m.09  img64m.13  img64m.17  img64m.21  img64m.25  img64m.29
> img64m.02  img64m.06  img64m.10  img64m.14  img64m.18  img64m.22  img64m.26  img64m.30
> img64m.03  img64m.07  img64m.11  img64m.15  img64m.19  img64m.23  img64m.27  img64m.31
> # 
> # RAID=$(for x in img64m.* ; do losetup --show -f $x ; done)
> # 
> # echo $RAID
> /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6 /dev/loop7
> /dev/loop8 /dev/loop9 /dev/loop10 /dev/loop11 /dev/loop12 /dev/loop13 /dev/loop14 /dev/loop15
> /dev/loop16 /dev/loop17 /dev/loop18 /dev/loop19 /dev/loop20 /dev/loop21 /dev/loop22 /dev/loop23
> /dev/loop24 /dev/loop25 /dev/loop26 /dev/loop27 /dev/loop28 /dev/loop29 /dev/loop30 /dev/loop31
> # 
> # mdadm --build /dev/md/md-test --level=linear --raid-devices=32 $RAID
> mdadm: ADD_NEW_DISK failed for /dev/loop27: Device or resource busy
> # 

Thanks.  That makes it easy.
Test works with my patch applied.
....

>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index ba485dcf1064..e0ac7f5a8e68 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -6464,9 +6464,8 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
>>  	mddev->layout        = info->layout;
>>  	mddev->chunk_sectors = info->chunk_size >> 9;
>>  
>> -	mddev->max_disks     = MD_SB_DISKS;
>> -
>>  	if (mddev->persistent) {
>> +		mddev->max_disks     = MD_SB_DISKS;
>>  		mddev->flags         = 0;
>>  		mddev->sb_flags         = 0;
>>  	}
>
> What value does mddev->max_disks get in the opposite case,
> (!mddev->persistent) ?

Default value is zero, which causes no limit to be imposed.
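
For reference, the check that printed "array is limited to 27 devices"
in your log is guarded by max_disks, roughly like this (paraphrased from
bind_rdev_to_array() in drivers/md/md.c, not an exact excerpt):

	if (mddev->max_disks && rdev->desc_nr >= mddev->max_disks) {
		printk(KERN_WARNING "md: %s: array is limited to %d devices\n",
		       mdname(mddev), mddev->max_disks);
		return -EBUSY;
	}

so a max_disks of zero skips the limit entirely.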

>
> I note this comment from the top of the function:
>
>     * set_array_info is used two different ways
>     * The original usage is when creating a new array.
>     * In this usage, raid_disks is > 0 and it together with
>     *  level, size, not_persistent,layout,chunksize determine the
>     *  shape of the array.
>     *  This will always create an array with a type-0.90.0 superblock.

Unfortunately you cannot always trust comments.  They are more like hints.

>
> http://lxr.free-electrons.com/source/drivers/md/md.c#L6410
>
> Surely there is an equivalent function which creates arrays with a
> type-1 superblock?

Not really.  type-1 superblocks are created from userspace by mdadm.
mdadm then tells the kernel "here are some devices that form an array".
md reads the devices, finds the type-1 metadata, and proceeds.
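
Schematically (device names illustrative):

# mdadm --create /dev/md0 --metadata=1.2 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1

The first command writes the v1.2 superblocks from userspace; the second
hands the member devices to the kernel, which finds the metadata on them
and proceeds.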

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-28 20:29     ` NeilBrown
@ 2017-03-01 13:05       ` ian_bruce
  0 siblings, 0 replies; 19+ messages in thread
From: ian_bruce @ 2017-03-01 13:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Wed, 01 Mar 2017 07:29:28 +1100
NeilBrown <neilb@suse.com> wrote:

> Thanks.  That makes it easy.
> Test works with my patch applied.

Thanks for fixing that.

If anybody is curious, the application for this capability is as
follows. For live systems running from a USB flashdrive, we need to
loop-mount an ext4 filesystem image from the fat32-formatted flashdrive.
Unfortunately, the maximum file size on fat32 is 4GB, which is a severe
limitation when 128GB flashdrives are commonly available.

The solution is to split the ext4 image into multiple sub-4GB chunks,
associate a /dev/loop device with each of those files, have mdadm turn
those into a single RAID device, and mount that as the ext4 filesystem.
It is preferable to use non-metadata, linear-mode RAID for this, because
we can then convert back and forth between the single filesystem image
and its constituent chunks using the non-privileged utilities "cat" and
"split". With a maximum of 27 RAID component devices, the maximum
filesystem size would be 108GB, which is not quite a complete solution.
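
A rough sketch of the intended workflow (file names, sizes, and mount
points are illustrative):

# split -d -b 3584M rootfs.ext4 rootfs.part.      # sub-4GB chunks for the fat32 drive
# RAID=$(for f in rootfs.part.* ; do losetup --show -f $f ; done)
# mdadm --build /dev/md/live --level=linear --raid-devices=$(echo $RAID | wc -w) $RAID
# mount /dev/md/live /mnt
  ... use the filesystem ...
# umount /mnt ; mdadm --stop /dev/md/live
# cat rootfs.part.* > rootfs.ext4                 # back to a single image, no mdadm required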


On Fri, 24 Feb 2017 15:46:19 -0500
Phil Turmel <philip@turmel.org> wrote:

> Note that build mode doesn't support a bunch of other MD raid features
> either, like all of the parity raid levels. That it doesn't support
> v1+ metadata isn't a surprise, and isn't the only legacy feature that
> only uses legacy metadata (built-in kernel auto-assembly gets the most
> whining, actually).

> If you think it's trivial to implement --build with v1.x metadata, go
> right ahead. Post your patches for review.

I haven't tested "mdadm --build" with parity RAID myself (although the
/dev/loop trick would probably suffice for that too), but if this is so,
would the change to provide that be as simple as the patch to remove the
27-component limitation? (Although I suppose that unlike linear mode,
the component devices for parity mode would have to be initialized with
consistent data, first.)

Somebody might find a use for non-metadata, parity-mode RAID, if it were
available.


-- Ian Bruce

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-02-26  0:07                 ` Phil Turmel
@ 2017-03-01 15:02                   ` Wols Lists
  2017-03-01 17:23                     ` Phil Turmel
  0 siblings, 1 reply; 19+ messages in thread
From: Wols Lists @ 2017-03-01 15:02 UTC (permalink / raw)
  To: Phil Turmel, ian_bruce, linux-raid

On 26/02/17 00:07, Phil Turmel wrote:
>> Because to do so doesn't make sense? Or because nobody's bothered to do
>> > it? I get grumpy when people implement corner cases without bothering to
>> > implement the logically sensible options - bit like those extremely
>> > annoying dialog boxes that give you three choices, "yes", "no", "yes to
>> > all". What about no to all?

> Because while the member is disconnected, the array accumulates
> write-intent bits indicating where that device is out of date, but it
> has no way to know what writes are happening to the member itself.
> Therefore any re-add will introduce unknowable corruptions.  There is
> no way to control what writes happen to that member, and drives don't
> naturally keep a log of writes that have happened.  The data needed to
> safely do what you want simply doesn't exist.  Your only known safe
> choice is to disable write-intent bitmaps, forcing a complete resync
> on --re-add.

Sorry to drag this up again, but where are these write intent bits going
to come from? And it's a backup. Why am I going to re-add, unless I'm
going to wipe the old backup and create a new one?
> 
>> > I feel like mirror-raid is perfect for doing backups.

> Your feelings are wrong.  Sorry.  LVM is the perfect tool because it
> entirely controls the snapshot and doesn't have to re-add it.
> 
I think we're talking at cross-purposes here :-) You're talking about
creating a snapshot and backing it up. I'm talking about creating a
mirror, which IS the backup.

VERY different technique, same end result.

And your way is more complicated - more room for sys-admin cock-up :-)

>> > I take your point
>> > that linux hasn't implemented that feature (particularly well), but
>> > surely it's a feature that *should* be there. I know I know - "patches
>> > welcome" :-)

> Good luck creating the necessary data from thin air.  It's not a
> question of writing patches.
> 
mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
... hotplug sd-big ...
mdadm /dev/mdbackup --add /dev/sd-big
... wait for sync to finish ...
mdadm --stop mdbackup
... unplug sd-big ...

You've made me think about it more deeply than before - thanks - and I
can think of at least one potential show-stopper, but write-intent
bitmaps and missing raid data are most definitely not it :-)

And why do I think my way is "better" (for certain values of "better"
:-)? Because your way only works if it was planned in advance. My way -
if the show-stopper turns out not to be one - will work on ANY running
system whether planned or not. That said, my problem probably is a
show-stopper :-(

Cheers,
Wol

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-03-01 15:02                   ` Wols Lists
@ 2017-03-01 17:23                     ` Phil Turmel
  2017-03-01 18:13                       ` Phil Turmel
  0 siblings, 1 reply; 19+ messages in thread
From: Phil Turmel @ 2017-03-01 17:23 UTC (permalink / raw)
  To: Wols Lists, ian_bruce, linux-raid

On 03/01/2017 10:02 AM, Wols Lists wrote:

> Sorry to drag this up again, but where are these write intent bits
> going to come from? And it's a backup. Why am I going to re-add,
> unless I'm going to wipe the old backup and create a new one?

Given your process below, it's moot.

> And your way is more complicated - more room for sys-admin cock-up
> :-)

I strongly disagree.  This procedure, as shown, is an admin cock-up:

> mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
> ... hotplug sd-big ...
> mdadm /dev/mdbackup --add /dev/sd-big
> ... wait for sync to finish ...
> mdadm --stop mdbackup
> ... unplug sd-big ...

Are you unmounting /dev/md/home while this is going on?  If not, and
there's any significant activity, your "backup" is corrupt.  If you are
unmounting, your data is unavailable for the duration of the resync.

The corresponding procedure for a logical volume in LVM would be:

# lvcreate -s -n homesnaplv --size 10g vg0/homelv
# dd if=/dev/vg0/homesnaplv of=/dev/sd-big bs=1M
# lvremove /dev/vg0/homesnaplv

Unlike your solution, the LVM snapshot won't be changing underneath you
during the copy.  The allocated size of the snapshot, shown as 10g
above, only has to be big enough to accommodate the amount of writes to
homelv while the dd is in progress.

Also, LVM understands most mounted filesystems, and will invoke the
proper kernel calls to briefly quiesce the filesystem for the snapshot,
ensuring the filesystem copied out is consistent.  But the user sees
only a few tens or hundreds of milliseconds of hesitation and can keep
going.  Writes to homelv while the snapshot exists generate extra disk
activity (to move the replaced blocks to the snapshot storage, with some
metadata), but are otherwise invisible to the users.

LVM is *made* for this.  You should use it.

Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-03-01 17:23                     ` Phil Turmel
@ 2017-03-01 18:13                       ` Phil Turmel
  2017-03-01 19:50                         ` Anthony Youngman
  0 siblings, 1 reply; 19+ messages in thread
From: Phil Turmel @ 2017-03-01 18:13 UTC (permalink / raw)
  To: Wols Lists, ian_bruce, linux-raid

On 03/01/2017 12:23 PM, Phil Turmel wrote:
> I strongly disagree.  This procedure, as shown, is an admin cock-up:
> 
>> mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
>> ... hotplug sd-big ...
>> mdadm /dev/mdbackup --add /dev/sd-big
>> ... wait for sync to finish ...
>> mdadm --stop mdbackup
>> ... unplug sd-big ...

One more point.  The above is functionally identical in every respect
to just:

# dd if=/dev/md/home of=/dev/sd-big bs=1M

Why are you bothering to --build an array?

Phil


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-03-01 18:13                       ` Phil Turmel
@ 2017-03-01 19:50                         ` Anthony Youngman
  2017-03-01 22:20                           ` Phil Turmel
  0 siblings, 1 reply; 19+ messages in thread
From: Anthony Youngman @ 2017-03-01 19:50 UTC (permalink / raw)
  To: Phil Turmel, ian_bruce, linux-raid



On 01/03/17 18:13, Phil Turmel wrote:
> On 03/01/2017 12:23 PM, Phil Turmel wrote:
>> I strongly disagree.  This procedure, as shown, is an admin cock-up:
>>
>>> mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
>>> ... hotplug sd-big ...
>>> mdadm /dev/mdbackup --add /dev/sd-big
>>> ... wait for sync to finish ...
>>> mdadm --stop mdbackup
>>> ... unplug sd-big ...
>
> One more point.  The above is functionally identical in every respect
> to just:
>
> # dd if=/dev/md/home of=/dev/sd-big bs=1M
>
> Why are you bothering to --build an array?
>
Because - and this is a point the kernel guys seem to forget - the whole 
point of having a computer system is TO RUN APPLICATIONS, not to run an OS.

As it is, you picked up on the fatal flaw I'd spotted, namely that if
"home" is mounted, "backup" is going to be corrupt :-( That defeats the
entire purpose of my idea, which was to back up a running system without
the need to take down the system to ensure integrity.

I work with a database that, not unreasonably, seeks to cache loads of 
stuff in RAM. I've come across far too many horror stories of corrupt 
backups because the database hadn't flushed its buffers to the OS, so 
all the database files on disk were inconsistent, giving a corrupt 
backup. So the idea was: set up the mirror, flush/quiesce the database, 
break the mirror, wake up the database. System disabled for a matter of 
seconds.

It's all very well saying lvm was created with this in mind, but if the 
system wasn't installed with this originally in mind, you're up a gum 
tree. My home system is raid but not lvm, for example - how do I back up 
the system while it's live? (In reality, I don't care :-)

IF it didn't have that fatal flaw, my idea would have been able to back 
up any system. Oh well, it's flawed, time to drop it :-(

Cheers,
Wol

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
  2017-03-01 19:50                         ` Anthony Youngman
@ 2017-03-01 22:20                           ` Phil Turmel
  0 siblings, 0 replies; 19+ messages in thread
From: Phil Turmel @ 2017-03-01 22:20 UTC (permalink / raw)
  To: Anthony Youngman, ian_bruce, linux-raid

On 03/01/2017 02:50 PM, Anthony Youngman wrote:

> Because - and this is a point the kernel guys seem to forget - the
> whole point of having a computer system is TO RUN APPLICATIONS, not
> to run an OS.

If I were a kernel guy, I might be offended.  They run applications,
too, ya know.

> It's all very well saying lvm was created with this in mind, but if
> the system wasn't installed with this originally in mind, you're up a
> gum tree. My home system is raid but not lvm, for example - how do I
> back up the system while it's live? (In reality, I don't care :-)

This little tidbit would have helped.  If you can't redesign your system
to make consistent backups while running, you have to shut down to make
consistent backups.  Tautology, there.  I was arguing from the
assumption that you weren't too far gone to help.

It seems to me that if you can shut it down to make a consistent backup,
you can shut it down to re-jigger the device layering.  Possibly faster
to do that than take the backup, if you are willing to take a small risk.

If you are using a database technology that supports checkpointed
backups while running, like PostgreSQL[1], you might get the backup you
need without downtime.  But fixing the device layering issue is what you
should do, imnsho.  Not taking a backup isn't really an option.

1) Create another raid, possibly degraded, and set it up as a PV in a
new volume group.

2) Stop your database and unmount home long enough to resize your
filesystem as small as practical, at least so it fits in an LV in the
new volume group with some free space for future snapshots.

3) Create an LV in the new VG big enough to hold that FS and dd it over.

4) If you need to, break redundant devices out of /dev/md/home to add to
the new array.  When you have the new raid to a comfortable level of
redundancy, mount the LV as /home and resume operations.

5) Finish breaking down the previous array, possibly using its devices
to bolster the redundancy on the new array.

If you are feeling brave, you could use --build mode instead of the dd
in (3): resume running on /dev/md/builthome (a raid1 built from the
shrunk /dev/md/home and the new LV) while it syncs.  When the resync is
done, you'd take a second short outage to unmount /dev/md/builthome and
mount /dev/vg0/home.
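
A very rough sketch of steps 1-3 (device names, the 500G figure, and the
volume group name are all illustrative):

# mdadm --create /dev/md/newhome --level=1 --raid-devices=2 missing /dev/sdc1
# pvcreate /dev/md/newhome && vgcreate vg0 /dev/md/newhome
# umount /home
# e2fsck -f /dev/md/home
# resize2fs /dev/md/home 500G                       # shrink as small as practical
# lvcreate -n home -L 500g vg0
# dd if=/dev/md/home of=/dev/vg0/home bs=1M count=512000   # 500 GiB
# mount /dev/vg0/home /home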

Phil

[1] https://www.postgresql.org/docs/9.5/static/continuous-archiving.html

(See the pg_basebackup utility and the description of the API it uses.)

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread

Thread overview: 19+ messages
2017-02-24 12:08 [BUG] non-metadata arrays cannot use more than 27 component devices ian_bruce
2017-02-24 15:20 ` Phil Turmel
2017-02-24 16:40   ` ian_bruce
2017-02-24 20:46     ` Phil Turmel
2017-02-25 20:05       ` Anthony Youngman
2017-02-25 22:00         ` Phil Turmel
2017-02-25 23:30           ` Wols Lists
2017-02-25 23:41             ` Phil Turmel
2017-02-25 23:55               ` Wols Lists
2017-02-26  0:07                 ` Phil Turmel
2017-03-01 15:02                   ` Wols Lists
2017-03-01 17:23                     ` Phil Turmel
2017-03-01 18:13                       ` Phil Turmel
2017-03-01 19:50                         ` Anthony Youngman
2017-03-01 22:20                           ` Phil Turmel
2017-02-27  5:55 ` NeilBrown
2017-02-28 10:25   ` ian_bruce
2017-02-28 20:29     ` NeilBrown
2017-03-01 13:05       ` ian_bruce
