* how do i fix these RAID5 arrays?
@ 2022-11-23 22:07 David T-G
  2022-11-23 22:28 ` Roman Mamedov
  0 siblings, 1 reply; 62+ messages in thread
From: David T-G @ 2022-11-23 22:07 UTC (permalink / raw)
  To: Linux RAID list

Hi, all --

TL;DR : I'm providing lots of detail to try to not leave anything
unexplained, but in the end I need to remove "removed" devices from
RAID5 arrays and add them back to rebuild.


I have 3ea 10T (in round numbers, of course :-) drives 

  diskfarm:~ # fdisk -l /dev/sd[bcd]
  Disk /dev/sdb: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
  Disk model: TOSHIBA HDWR11A 
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 4096 bytes / 4096 bytes
  Disklabel type: gpt
  Disk identifier: EDF3B089-018E-454F-BD3F-6161A0A0FBFB
  
  Device            Start         End    Sectors  Size Type
  /dev/sdb51         2048  3254781951 3254779904  1.5T Linux LVM
  /dev/sdb52   3254781952  6509561855 3254779904  1.5T Linux LVM
  /dev/sdb53   6509561856  9764341759 3254779904  1.5T Linux LVM
  /dev/sdb54   9764341760 13019121663 3254779904  1.5T Linux LVM
  /dev/sdb55  13019121664 16273901567 3254779904  1.5T Linux LVM
  /dev/sdb56  16273901568 19528681471 3254779904  1.5T Linux LVM
  /dev/sdb128 19528681472 19532873694    4192223    2G Linux filesystem
  
  
  Disk /dev/sdc: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
  Disk model: TOSHIBA HDWR11A 
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 4096 bytes / 4096 bytes
  Disklabel type: gpt
  Disk identifier: 1AD8FC0A-5ADD-49E6-9BB2-6161A0BEFBFB
  
  Device            Start         End    Sectors  Size Type
  /dev/sdc51         2048  3254781951 3254779904  1.5T Linux LVM
  /dev/sdc52   3254781952  6509561855 3254779904  1.5T Linux LVM
  /dev/sdc53   6509561856  9764341759 3254779904  1.5T Linux LVM
  /dev/sdc54   9764341760 13019121663 3254779904  1.5T Linux LVM
  /dev/sdc55  13019121664 16273901567 3254779904  1.5T Linux LVM
  /dev/sdc56  16273901568 19528681471 3254779904  1.5T Linux LVM
  /dev/sdc128 19528681472 19532873694    4192223    2G Linux filesystem
  
  
  Disk /dev/sdd: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
  Disk model: TOSHIBA HDWR11A 
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 4096 bytes / 4096 bytes
  Disklabel type: gpt
  Disk identifier: EDF3B089-018E-454F-BD3F-6161A0A0FBFB
  
  Device            Start         End    Sectors  Size Type
  /dev/sdd51         2048  3254781951 3254779904  1.5T Linux LVM
  /dev/sdd52   3254781952  6509561855 3254779904  1.5T Linux LVM
  /dev/sdd53   6509561856  9764341759 3254779904  1.5T Linux LVM
  /dev/sdd54   9764341760 13019121663 3254779904  1.5T Linux LVM
  /dev/sdd55  13019121664 16273901567 3254779904  1.5T Linux LVM
  /dev/sdd56  16273901568 19528681471 3254779904  1.5T Linux LVM
  /dev/sdd128 19528681472 19532873694    4192223    2G Linux filesystem

that I've sliced, RAID5-ed 

  diskfarm:~ # mdadm -D /dev/md51
  /dev/md51:
             Version : 1.2
       Creation Time : Thu Nov  4 00:46:28 2021
          Raid Level : raid5
          Array Size : 4881773568 (4.55 TiB 5.00 TB)
       Used Dev Size : 1627257856 (1551.87 GiB 1666.31 GB)
        Raid Devices : 4
       Total Devices : 3
         Persistence : Superblock is persistent
  
       Intent Bitmap : Internal
  
         Update Time : Wed Nov 23 02:53:35 2022
               State : clean, degraded 
      Active Devices : 3
     Working Devices : 3
      Failed Devices : 0
       Spare Devices : 0
  
              Layout : left-symmetric
          Chunk Size : 512K
  
  Consistency Policy : bitmap
  
                Name : diskfarm:51  (local to host diskfarm)
                UUID : 9330e44f:35baf039:7e971a8e:da983e31
              Events : 37727
  
      Number   Major   Minor   RaidDevice State
         0     259        9        0      active sync   /dev/sdb51
         1     259        2        1      active sync   /dev/sdc51
         3     259       16        2      active sync   /dev/sdd51
         -       0        0        3      removed
  diskfarm:~ # mdadm -E /dev/md51
  /dev/md51:
            Magic : a92b4efc
          Version : 1.2
      Feature Map : 0x0
       Array UUID : cccbe073:d92c6ecd:77ba5c46:5db6b3f0
             Name : diskfarm:10T  (local to host diskfarm)
    Creation Time : Thu Nov  4 00:56:36 2021
       Raid Level : raid0
     Raid Devices : 6
  
   Avail Dev Size : 6508767232 sectors (3.03 TiB 3.33 TB)
      Data Offset : 264192 sectors
     Super Offset : 8 sectors
     Unused Space : before=264112 sectors, after=3254515712 sectors
            State : clean
      Device UUID : 4eb64186:15de3406:50925d42:54df22e1
  
      Update Time : Thu Nov  4 00:56:36 2021
    Bad Block Log : 512 entries available at offset 8 sectors
         Checksum : 45a70eae - correct
           Events : 0
  
       Chunk Size : 512K
  
     Device Role : Active device 0
     Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
  
into arrays (ignore the "degraded" for the moment), and striped 

  diskfarm:~ # mdadm -D /dev/md50
  /dev/md50:
             Version : 1.2
       Creation Time : Thu Nov  4 00:56:36 2021
          Raid Level : raid0
          Array Size : 19526301696 (18.19 TiB 19.99 TB)
        Raid Devices : 6
       Total Devices : 6
         Persistence : Superblock is persistent
  
         Update Time : Thu Nov  4 00:56:36 2021
               State : clean 
      Active Devices : 6
     Working Devices : 6
      Failed Devices : 0
       Spare Devices : 0
  
              Layout : -unknown-
          Chunk Size : 512K
  
  Consistency Policy : none
  
                Name : diskfarm:10T  (local to host diskfarm)
                UUID : cccbe073:d92c6ecd:77ba5c46:5db6b3f0
              Events : 0
  
      Number   Major   Minor   RaidDevice State
         0       9       51        0      active sync   /dev/md/51
         1       9       52        1      active sync   /dev/md/52
         2       9       53        2      active sync   /dev/md/53
         3       9       54        3      active sync   /dev/md/54
         4       9       55        4      active sync   /dev/md/55
         5       9       56        5      active sync   /dev/md/56
  diskfarm:~ # mdadm -E /dev/md50
  /dev/md50:
     MBR Magic : aa55
  Partition[0] :   4294967295 sectors at            1 (type ee)

into a 20T array, the idea being that each piece should take less time
to rebuild if something fails.  That was all great, and then I wanted
to add another disk

  diskfarm:~ # fdisk -l /dev/sdk
  Disk /dev/sdk: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
  Disk model: TOSHIBA HDWR11A 
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 4096 bytes / 4096 bytes
  Disklabel type: gpt
  Disk identifier: FAB535F8-F57B-4BA4-8DEB-B0DEB49496C1
  
  Device            Start         End    Sectors  Size Type
  /dev/sdk51         2048  3254781951 3254779904  1.5T Linux LVM
  /dev/sdk52   3254781952  6509561855 3254779904  1.5T Linux LVM
  /dev/sdk53   6509561856  9764341759 3254779904  1.5T Linux LVM
  /dev/sdk54   9764341760 13019121663 3254779904  1.5T Linux LVM
  /dev/sdk55  13019121664 16273901567 3254779904  1.5T Linux LVM
  /dev/sdk56  16273901568 19528681471 3254779904  1.5T Linux LVM
  /dev/sdk128 19528681472 19532873694    4192223    2G Linux filesystem

to it to give me 30T usable.

I sliced up the new drive as above, added each slice to its
corresponding RAID5 array, and then grew each array to take advantage
of it.  And, sure enough, for 52 it worked:

  diskfarm:~ # mdadm -D /dev/md52
  /dev/md52:
	     Version : 1.2
       Creation Time : Thu Nov  4 00:47:09 2021
	  Raid Level : raid5
	  Array Size : 4881773568 (4.55 TiB 5.00 TB)
       Used Dev Size : 1627257856 (1551.87 GiB 1666.31 GB)
	Raid Devices : 4
       Total Devices : 4
	 Persistence : Superblock is persistent
  
       Intent Bitmap : Internal
  
	 Update Time : Wed Nov 23 02:52:00 2022
	       State : clean 
      Active Devices : 4
     Working Devices : 4
      Failed Devices : 0
       Spare Devices : 0
  
	      Layout : left-symmetric
	  Chunk Size : 512K
  
  Consistency Policy : bitmap
  
		Name : diskfarm:52  (local to host diskfarm)
		UUID : d9eada18:29478a43:37654ef5:d34df19c
	      Events : 10996
  
      Number   Major   Minor   RaidDevice State
	 0     259       10        0      active sync   /dev/sdb52
	 1     259        3        1      active sync   /dev/sdc52
	 3     259       17        2      active sync   /dev/sdd52
	 4     259       24        3      active sync   /dev/sdk52
  diskfarm:~ # mdadm -E /dev/md52
  /dev/md52:
	    Magic : a92b4efc
	  Version : 1.2
      Feature Map : 0x0
       Array UUID : cccbe073:d92c6ecd:77ba5c46:5db6b3f0
	     Name : diskfarm:10T  (local to host diskfarm)
    Creation Time : Thu Nov  4 00:56:36 2021
       Raid Level : raid0
     Raid Devices : 6
  
   Avail Dev Size : 6508767232 sectors (3.03 TiB 3.33 TB)
      Data Offset : 264192 sectors
     Super Offset : 8 sectors
     Unused Space : before=264112 sectors, after=3254515712 sectors
	    State : clean
      Device UUID : 74ab812f:7e1695ec:360638b6:0c73d8b0
  
      Update Time : Thu Nov  4 00:56:36 2021
    Bad Block Log : 512 entries available at offset 8 sectors
	 Checksum : 18d743dd - correct
	   Events : 0
  
       Chunk Size : 512K
  
     Device Role : Active device 1
     Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)

THAT is what really confuses me.  I ran (sorry; they're gone) the same
commands for each device; they should work the same way!  But, obviously,
something ain't right.
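
For the record, the sequence was along these lines (reconstructed from
memory, so take it as a sketch rather than a transcript; the device
count matches what md52 shows above):

  mdadm --add /dev/md52 /dev/sdk52            # new slice goes in as a spare
  mdadm --grow /dev/md52 --raid-devices=4     # then reshape from 3 to 4 devices

and likewise for 51 and 53 through 56.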

On the 5 broken ones, we have one each removed device

  diskfarm:~ # mdadm -D /dev/md5[13456] | egrep '^/dev|active|removed'
  /dev/md51:
	 0     259        9        0      active sync   /dev/sdb51
	 1     259        2        1      active sync   /dev/sdc51
	 3     259       16        2      active sync   /dev/sdd51
	 -       0        0        3      removed
  /dev/md53:
	 0     259       11        0      active sync   /dev/sdb53
	 1     259        4        1      active sync   /dev/sdc53
	 3     259       18        2      active sync   /dev/sdd53
	 -       0        0        3      removed
  /dev/md54:
	 0     259       12        0      active sync   /dev/sdb54
	 1     259        5        1      active sync   /dev/sdc54
	 3     259       19        2      active sync   /dev/sdd54
	 -       0        0        3      removed
  /dev/md55:
	 0     259       13        0      active sync   /dev/sdb55
	 1     259        6        1      active sync   /dev/sdc55
	 3     259       20        2      active sync   /dev/sdd55
	 -       0        0        3      removed
  /dev/md56:
	 0     259       14        0      active sync   /dev/sdb56
	 1     259        7        1      active sync   /dev/sdc56
	 3     259       21        2      active sync   /dev/sdd56
	 -       0        0        3      removed

that are obviously the sdk (new disk) slices.  If md52 were also broken,
I'd figure that the disk was somehow unplugged, but I don't think I can
plug in one sixth of a disk and leave the rest unhooked :-)  So ...  In
addition to wondering how I got here, how do I remove the "removed" ones
and then re-add them to build and grow and finalize this?


TIA

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt



* Re: how do i fix these RAID5 arrays?
  2022-11-23 22:07 how do i fix these RAID5 arrays? David T-G
@ 2022-11-23 22:28 ` Roman Mamedov
  2022-11-24  0:01   ` Roger Heflin
  2022-11-24 21:10   ` how do i fix these RAID5 arrays? David T-G
  0 siblings, 2 replies; 62+ messages in thread
From: Roman Mamedov @ 2022-11-23 22:28 UTC (permalink / raw)
  To: David T-G; +Cc: Linux RAID list

On Wed, 23 Nov 2022 22:07:36 +0000
David T-G <davidtg-robot@justpickone.org> wrote:

>   diskfarm:~ # mdadm -D /dev/md50
>   /dev/md50:
>              Version : 1.2
>        Creation Time : Thu Nov  4 00:56:36 2021
>           Raid Level : raid0
>           Array Size : 19526301696 (18.19 TiB 19.99 TB)
>         Raid Devices : 6
>        Total Devices : 6
>          Persistence : Superblock is persistent
>   
>          Update Time : Thu Nov  4 00:56:36 2021
>                State : clean 
>       Active Devices : 6
>      Working Devices : 6
>       Failed Devices : 0
>        Spare Devices : 0
>   
>               Layout : -unknown-
>           Chunk Size : 512K
>   
>   Consistency Policy : none
>   
>                 Name : diskfarm:10T  (local to host diskfarm)
>                 UUID : cccbe073:d92c6ecd:77ba5c46:5db6b3f0
>               Events : 0
>   
>       Number   Major   Minor   RaidDevice State
>          0       9       51        0      active sync   /dev/md/51
>          1       9       52        1      active sync   /dev/md/52
>          2       9       53        2      active sync   /dev/md/53
>          3       9       54        3      active sync   /dev/md/54
>          4       9       55        4      active sync   /dev/md/55
>          5       9       56        5      active sync   /dev/md/56

It feels like you haven't thought this through entirely. Sequential
writes to this RAID0 array will alternate across all member arrays, and
since those are not independent disks but "vertical" slices across
partitions of the same disks, it will result in a crazy seek load: the
first 512K is written to the array of the *51 partitions, the second
512K goes to *52, then to *53, effectively requiring a full stroke of
each drive's head across the entire surface for each and every 3
*megabytes* written.

mdraid in the "linear" mode, or LVM with one large LV across all PVs (which
are the individual RAID5 arrays), or multi-device Btrfs using "single" profile
for data, all of those would avoid the described effect.
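
For example, the linear variant would look roughly like this -- purely
illustrative, and destructive to the existing md50 and whatever is on
it, so not something to run as-is:

  mdadm --stop /dev/md50
  mdadm --create /dev/md50 --level=linear --raid-devices=6 \
        /dev/md51 /dev/md52 /dev/md53 /dev/md54 /dev/md55 /dev/md56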

But I should clarify: the entire idea of splitting drives like this
seems questionable to begin with. Drives more often fail entirely, not
in part, so you will not save any time on rebuilds, and the "bitmap"
already protects you against full rebuilds after hiccups such as a
power cut. And even if a drive did fail only in part, in your current
setup -- or in the proposed ones I mentioned above -- losing even one
of these RAID5s would mean a complete loss of data anyway. Not to
mention that what you have seems like an insane amount of complexity.

To summarize, maybe it's better to blow away the entire thing and restart from
the drawing board, while it's not too late? :)

>   diskfarm:~ # mdadm -D /dev/md5[13456] | egrep '^/dev|active|removed'
>   /dev/md51:
> 	 0     259        9        0      active sync   /dev/sdb51
> 	 1     259        2        1      active sync   /dev/sdc51
> 	 3     259       16        2      active sync   /dev/sdd51
> 	 -       0        0        3      removed
>   /dev/md53:
> 	 0     259       11        0      active sync   /dev/sdb53
> 	 1     259        4        1      active sync   /dev/sdc53
> 	 3     259       18        2      active sync   /dev/sdd53
> 	 -       0        0        3      removed
>   /dev/md54:
> 	 0     259       12        0      active sync   /dev/sdb54
> 	 1     259        5        1      active sync   /dev/sdc54
> 	 3     259       19        2      active sync   /dev/sdd54
> 	 -       0        0        3      removed
>   /dev/md55:
> 	 0     259       13        0      active sync   /dev/sdb55
> 	 1     259        6        1      active sync   /dev/sdc55
> 	 3     259       20        2      active sync   /dev/sdd55
> 	 -       0        0        3      removed
>   /dev/md56:
> 	 0     259       14        0      active sync   /dev/sdb56
> 	 1     259        7        1      active sync   /dev/sdc56
> 	 3     259       21        2      active sync   /dev/sdd56
> 	 -       0        0        3      removed
> 
> that are obviously the sdk (new disk) slice.  If md52 were also broken,
> I'd figure that the disk was somehow unplugged, but I don't think I can
> plug in one sixth of a disk and leave the rest unhooked :-)  So ...  In
> addition to wondering how I got here, how do I remove the "removed" ones
> and then re-add them to build and grow and finalize this?

If you want to fix it still, without dmesg it's hard to say how this could
have happened, but what does

  mdadm --re-add /dev/md51 /dev/sdk51

say?

-- 
With respect,
Roman


* Re: how do i fix these RAID5 arrays?
  2022-11-23 22:28 ` Roman Mamedov
@ 2022-11-24  0:01   ` Roger Heflin
  2022-11-24 21:20     ` David T-G
  2022-11-24 21:10   ` how do i fix these RAID5 arrays? David T-G
  1 sibling, 1 reply; 62+ messages in thread
From: Roger Heflin @ 2022-11-24  0:01 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: David T-G, Linux RAID list

I have used the slicing scheme.  A couple of advantages: the smaller
grows will finish in a reasonable time (vs. a single grow/replace that
takes a week or more).

And the drives often have small ranges of sectors with read errors,
which causes only a single partition to get thrown out and need to be
re-added, and that smaller chunk rebuilds much faster.

I have also had SATA cabling/interface issues that caused a few
temporary failures to happen and those errors do not generally result
in all the partitions getting kicked out, leaving a much smaller
re-add.

And another advantage is that you aren't limited to the exact same
sizes of devices.

I use LVM on a 4-section raid built very similarly to his, except mine
is a linear LVM so there are no raid0 seek issues.

I think we need to see a grep -E 'md5|sdk' /var/log/messages to see
where sdk went.

I have never seen devices get removed straight away; mine go to failed,
and then I have to remove them and re-add them when one does have an
issue.


On Wed, Nov 23, 2022 at 4:41 PM Roman Mamedov <rm@romanrm.net> wrote:
>
> On Wed, 23 Nov 2022 22:07:36 +0000
> David T-G <davidtg-robot@justpickone.org> wrote:
>
> >   diskfarm:~ # mdadm -D /dev/md50
> >   /dev/md50:
> >              Version : 1.2
> >        Creation Time : Thu Nov  4 00:56:36 2021
> >           Raid Level : raid0
> >           Array Size : 19526301696 (18.19 TiB 19.99 TB)
> >         Raid Devices : 6
> >        Total Devices : 6
> >          Persistence : Superblock is persistent
> >
> >          Update Time : Thu Nov  4 00:56:36 2021
> >                State : clean
> >       Active Devices : 6
> >      Working Devices : 6
> >       Failed Devices : 0
> >        Spare Devices : 0
> >
> >               Layout : -unknown-
> >           Chunk Size : 512K
> >
> >   Consistency Policy : none
> >
> >                 Name : diskfarm:10T  (local to host diskfarm)
> >                 UUID : cccbe073:d92c6ecd:77ba5c46:5db6b3f0
> >               Events : 0
> >
> >       Number   Major   Minor   RaidDevice State
> >          0       9       51        0      active sync   /dev/md/51
> >          1       9       52        1      active sync   /dev/md/52
> >          2       9       53        2      active sync   /dev/md/53
> >          3       9       54        3      active sync   /dev/md/54
> >          4       9       55        4      active sync   /dev/md/55
> >          5       9       56        5      active sync   /dev/md/56
>
> It feels you haven't thought this through entirely. Sequential writes to this
> RAID0 array will alternate across all member arrays, and seeing how those are
> not of independent disks, but instead are "vertical" across partitions on the
> same disks, it will result in a crazy seek load, as first 512K is written to
> the array of the *51 partitions, second 512K go to *52, then to *53,
> effectively requiring a full stroke of each drive's head across the entire
> surface for each and every 3 *megabytes* written.
>
> mdraid in the "linear" mode, or LVM with one large LV across all PVs (which
> are the individual RAID5 arrays), or multi-device Btrfs using "single" profile
> for data, all of those would avoid the described effect.
>
> But I should clarify, the entire idea of splitting drives like this seems
> questionable to begin with, since drives more often fail entirely, not in part,
> so you will not save any time on rebuilds; and the "bitmap" already protects
> you against full rebuilds due to any hiccups such as a power cut; or even if a
> drive failed in part, in your current setup, or even in the proposed ones I
> mentioned above, losing even one RAID5 of all these, would result in a
> complete loss of data anyway. Not to mention what you have seems like an insane
> amount of complexity.
>
> To summarize, maybe it's better to blow away the entire thing and restart from
> the drawing board, while it's not too late? :)
>
> >   diskfarm:~ # mdadm -D /dev/md5[13456] | egrep '^/dev|active|removed'
> >   /dev/md51:
> >        0     259        9        0      active sync   /dev/sdb51
> >        1     259        2        1      active sync   /dev/sdc51
> >        3     259       16        2      active sync   /dev/sdd51
> >        -       0        0        3      removed
> >   /dev/md53:
> >        0     259       11        0      active sync   /dev/sdb53
> >        1     259        4        1      active sync   /dev/sdc53
> >        3     259       18        2      active sync   /dev/sdd53
> >        -       0        0        3      removed
> >   /dev/md54:
> >        0     259       12        0      active sync   /dev/sdb54
> >        1     259        5        1      active sync   /dev/sdc54
> >        3     259       19        2      active sync   /dev/sdd54
> >        -       0        0        3      removed
> >   /dev/md55:
> >        0     259       13        0      active sync   /dev/sdb55
> >        1     259        6        1      active sync   /dev/sdc55
> >        3     259       20        2      active sync   /dev/sdd55
> >        -       0        0        3      removed
> >   /dev/md56:
> >        0     259       14        0      active sync   /dev/sdb56
> >        1     259        7        1      active sync   /dev/sdc56
> >        3     259       21        2      active sync   /dev/sdd56
> >        -       0        0        3      removed
> >
> > that are obviously the sdk (new disk) slice.  If md52 were also broken,
> > I'd figure that the disk was somehow unplugged, but I don't think I can
> > plug in one sixth of a disk and leave the rest unhooked :-)  So ...  In
> > addition to wondering how I got here, how do I remove the "removed" ones
> > and then re-add them to build and grow and finalize this?
>
> If you want to fix it still, without dmesg it's hard to say how this could
> have happened, but what does
>
>   mdadm --re-add /dev/md51 /dev/sdk51
>
> say?
>
> --
> With respect,
> Roman


* Re: how do i fix these RAID5 arrays?
  2022-11-23 22:28 ` Roman Mamedov
  2022-11-24  0:01   ` Roger Heflin
@ 2022-11-24 21:10   ` David T-G
  2022-11-24 21:33     ` Wol
  2022-11-25 14:49     ` how do i fix these RAID5 arrays? Wols Lists
  1 sibling, 2 replies; 62+ messages in thread
From: David T-G @ 2022-11-24 21:10 UTC (permalink / raw)
  To: Linux RAID list

Roman, et al --

...and then Roman Mamedov said...
% On Wed, 23 Nov 2022 22:07:36 +0000
% David T-G <davidtg-robot@justpickone.org> wrote:
% 
% >   diskfarm:~ # mdadm -D /dev/md50
...
% >          0       9       51        0      active sync   /dev/md/51
% >          1       9       52        1      active sync   /dev/md/52
% >          2       9       53        2      active sync   /dev/md/53
% >          3       9       54        3      active sync   /dev/md/54
% >          4       9       55        4      active sync   /dev/md/55
% >          5       9       56        5      active sync   /dev/md/56
% 
% It feels you haven't thought this through entirely. Sequential writes to this

Well, it's at least possible that I don't know what I'm doing.  I'm just
a dumb ol' Sys Admin, and career-changed out of the biz a few years back
to boot.  I'm certainly open to advice.  Would changing the default RAID5
or RAID0 stripe size help?


...
% 
% mdraid in the "linear" mode, or LVM with one large LV across all PVs (which
% are the individual RAID5 arrays), or multi-device Btrfs using "single" profile
% for data, all of those would avoid the described effect.

How is linear different from RAID0?  I took a quick look but don't quite
know what I'm reading.  If that's better then, hey, I'd try it (or at
least learn more).

I've played little enough with md, but I haven't played with LVM at all.
I imagine that it's fine to mix them since you've suggested it.  Got any
pointers to a good primer? :-)

I don't want to try BtrFS.  That's another area where I have no experience,
but from what I've seen and read I really don't want to go there yet.


% 
% But I should clarify, the entire idea of splitting drives like this seems
% questionable to begin with, since drives more often fail entirely, not in part,
...
% complete loss of data anyway. Not to mention what you have seems like an insane
% amount of complexity.

To make a long story short, my understanding of a big problem with RAID5
is that rebuilds take a ridiculously long time as the devices get larger.
Using smaller "devices", like partitions of the actual disk, helps get
around that.  If I lose an entire disk, it's no worse than replacing an
entire disk; it's half a dozen rebuilds, but at least they come in small
chunks that I can manage.  If I have read errors or bad sector problems
on just a part, I can toss in a 2T disk to "spare" that piece until I get
another large drive and replace each piece.

As I also understand it, since I wasn't a storage engineer but did have
to automate against big shiny arrays, striping together RAID5 volumes is
pretty straightforward and pretty common.  Maybe my problem is that I
need a couple of orders of magnitude more drives, though.

The whole idea is to allow fault tolerance while also allowing recovery,
with growth by adding another device every once in a while pretty simple.


% 
% To summarize, maybe it's better to blow away the entire thing and restart from
% the drawing board, while it's not too late? :)

I'm open to that idea as well, as long as I can understand where I'm
headed :-)  But what's best?


% 
% >   diskfarm:~ # mdadm -D /dev/md5[13456] | egrep '^/dev|active|removed'
...
% > that are obviously the sdk (new disk) slice.  If md52 were also broken,
% > I'd figure that the disk was somehow unplugged, but I don't think I can
...
% > and then re-add them to build and grow and finalize this?
% 
% If you want to fix it still, without dmesg it's hard to say how this could
% have happened, but what does
% 
%   mdadm --re-add /dev/md51 /dev/sdk51
% 
% say?

Only that it doesn't like the stale pieces:

  diskfarm:~ # dmesg | egrep sdk
  [    8.238044] sd 9:2:0:0: [sdk] 19532873728 512-byte logical blocks: (10.0 TB/9.10 TiB)
  [    8.238045] sd 9:2:0:0: [sdk] 4096-byte physical blocks
  [    8.238051] sd 9:2:0:0: [sdk] Write Protect is off
  [    8.238052] sd 9:2:0:0: [sdk] Mode Sense: 00 3a 00 00
  [    8.238067] sd 9:2:0:0: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
  [    8.290084]  sdk: sdk51 sdk52 sdk53 sdk54 sdk55 sdk56 sdk128
  [    8.290747] sd 9:2:0:0: [sdk] Attached SCSI removable disk
  [   17.920802] md: kicking non-fresh sdk51 from array!
  [   17.923119] md/raid:md52: device sdk52 operational as raid disk 3
  [   18.307507] md: kicking non-fresh sdk53 from array!
  [   18.311051] md: kicking non-fresh sdk54 from array!
  [   18.314854] md: kicking non-fresh sdk55 from array!
  [   18.317730] md: kicking non-fresh sdk56 from array!

Does it look like --re-add will be safe?  [Yes, maybe I'll start over,
but clearing this problem would be a nice first step.]
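
If so, I suppose the whole batch boils down to something like this
(untested here; just the obvious loop over the five kicked slices):

  for N in 51 53 54 55 56 ; do mdadm --re-add /dev/md$N /dev/sdk$N ; done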


% 
% -- 
% With respect,
% Roman


Thanks again & HAND & Happy Thanksgiving in the US

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt



* Re: how do i fix these RAID5 arrays?
  2022-11-24  0:01   ` Roger Heflin
@ 2022-11-24 21:20     ` David T-G
  2022-11-24 21:49       ` Wol
  0 siblings, 1 reply; 62+ messages in thread
From: David T-G @ 2022-11-24 21:20 UTC (permalink / raw)
  To: Linux RAID list

Roger, et al --

...and then Roger Heflin said...
...
% 
%  I use LVM on a 4 section raid built very similar to his, except mine
% is a linear lvm so no raid0 seek issues.

Got any pointers to instructions?


% 
% I think we need to see a grep -E 'md5|sdk' /var/log/messages to see
% where sdk went.
[snip]

Oops!  I knew I had overlooked something.  Not much to see, though :-/

  diskfarm:~ # for M in /var/log/messages-* ; do echo $M ; xz -d $M | egrep 'md5|sdk' ; echo '' ; done                                
  /var/log/messages-20221109.xz
  
  /var/log/messages-20221112.xz
  
  /var/log/messages-20221115.xz
  
  /var/log/messages-20221118.xz
  
  /var/log/messages-20221121.xz
  
  /var/log/messages-20221124.xz
  
  diskfarm:~ # egrep 'md5|sdk' /var/log/messages
  2022-11-24T03:00:35.886257+00:00 diskfarm smartd[1766]: Device: /dev/sdk [SAT], starting scheduled Short Self-Test.
  2022-11-24T03:30:32.154462+00:00 diskfarm smartd[1766]: Device: /dev/sdk [SAT], previous self-test completed without error


Thanks again

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt



* Re: how do i fix these RAID5 arrays?
  2022-11-24 21:10   ` how do i fix these RAID5 arrays? David T-G
@ 2022-11-24 21:33     ` Wol
  2022-11-25  1:16       ` Roger Heflin
  2022-11-25 13:30       ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") David T-G
  2022-11-25 14:49     ` how do i fix these RAID5 arrays? Wols Lists
  1 sibling, 2 replies; 62+ messages in thread
From: Wol @ 2022-11-24 21:33 UTC (permalink / raw)
  To: David T-G, Linux RAID list

On 24/11/2022 21:10, David T-G wrote:
> How is linear different from RAID0?  I took a quick look but don't quite
> know what I'm reading.  If that's better then, hey, I'd try it (or at
> least learn more).

Linear tacks one drive on to the end of another. Raid-0 stripes across 
all drives. Both effectively combine a bunch of drives into one big drive.

Striped gives you speed, a big file gets spread over all the drives. The 
problem, of course, is that losing one drive can easily trash pretty 
much every big file on the array, irretrievably.

Linear means that much of your array can be recovered if a drive fails. 
But it's no faster than a single drive because pretty much every file is 
stored on just the one drive. And depending on what drive you lose, it 
can wipe your directory structure such that you just end up with a 
massive lost+found directory.

That's why there's raid-10. Note that outside of Linux (and often 
inside) when people say "raid-10" they actually mean "raid 1+0": a 
raid-0 stripe across mirrored (raid-1) pairs.

Linux raid-10 is different. It means you have at least two copies of 
each chunk, smeared across all the disks.

Either version (10, or 1+0) gives you the speed of striping and the 
safety of a mirror. 10, however, can use an odd number of disks, and 
disks of random sizes.
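
For example, something like this (made-up device names) gives you two
copies of everything across three disks:

  mdadm --create /dev/md100 --level=10 --layout=n2 --raid-devices=3 \
        /dev/sda1 /dev/sdb1 /dev/sdc1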

Cheers,
Wol


* Re: how do i fix these RAID5 arrays?
  2022-11-24 21:20     ` David T-G
@ 2022-11-24 21:49       ` Wol
  2022-11-25 13:36         ` and dm-integrity, too (was "Re: how do i fix these RAID5 arrays?") David T-G
  0 siblings, 1 reply; 62+ messages in thread
From: Wol @ 2022-11-24 21:49 UTC (permalink / raw)
  To: David T-G, Linux RAID list

On 24/11/2022 21:20, David T-G wrote:
> Roger, et al --
> 
> ...and then Roger Heflin said...
> ...
> %
> %  I use LVM on a 4 section raid built very similar to his, except mine
> % is a linear lvm so no raid0 seek issues.
> 
> Got any pointers to instructions?

Dunno whether my setup would work for you, but I've raid-5'd my disks, 
then put a single lvm volume over the top, and then broken that up into 
partitions as required.

Note that I didn't raid the entire disk; I left space for swap, boot, and /.

And I always plan to expand the array by doubling the size of the disks. 
Effectively a raid-5+0 sort of thing. Add two double-size disks, raid-0 
the disks I've removed and add them, and there's probably enough space 
to resize onto those three before doubling up what's left and adding 
that back.

(Note also I've got dm-integrity in there too, but that's me.)

https://raid.wiki.kernel.org/index.php/System2020

Cheers,
Wol


* Re: how do i fix these RAID5 arrays?
  2022-11-24 21:33     ` Wol
@ 2022-11-25  1:16       ` Roger Heflin
  2022-11-25 13:22         ` David T-G
  2022-11-25 13:30       ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") David T-G
  1 sibling, 1 reply; 62+ messages in thread
From: Roger Heflin @ 2022-11-25  1:16 UTC (permalink / raw)
  To: Wol; +Cc: David T-G, Linux RAID list

If you had 5 raw disks raid0 would be faster, but with the separate
partition setup linear will be faster, because linear does not have a
seek from one partition to another but instead does a linear write.
Note that on reads raid5 performs similarly to a raid0 stripe of one
less disk; writes on raid5 are slower because of the parity overhead.

Based on those messages the re-add would be expected to work fine.  It
appears that the machine was shut down wrong and/or crashed and kicked
the devices out because the data was a few events behind on those
devices.  I have seen this happen from time to time.  In general the
re-add will do the right thing or will give an error.  When something
goes wrong you always need to check messages and see what the
underlying issue is/was.

Find any lvm instructions and treat the N mdraids as "disks", but do
not stripe (the simple default will do what you want).  The default
method lvm uses will lay out the new block device as one big device,
basically using all of the first raid and then the next one and so on.
Or you can simply recreate the big device and specify an mdadm type of
"linear" rather than raid0.


On Thu, Nov 24, 2022 at 4:03 PM Wol <antlists@youngman.org.uk> wrote:
>
> On 24/11/2022 21:10, David T-G wrote:
> > How is linear different from RAID0?  I took a quick look but don't quite
> > know what I'm reading.  If that's better then, hey, I'd try it (or at
> > least learn more).
>
> Linear tacks one drive on to the end of another. Raid-0 stripes across
> all drives. Both effectively combine a bunch of drives into one big drive.
>
> Striped gives you speed, a big file gets spread over all the drives. The
> problem, of course, is that losing one drive can easily trash pretty
> much every big file on the array, irretrievably.
>
> Linear means that much of your array can be recovered if a drive fails.
> But it's no faster than a single drive because pretty much every file is
> stored on just the one drive. And depending on what drive you lose, it
> can wipe your directory structure such that you just end up with a
> massive lost+found directory.
>
> That's why there's raid-10. Note that outside of Linux (and often
> inside) when people say "raid-10" they actually mean "raid 1+0". That's
> two striped raid-0's, mirrored.
>
> Linux raid-10 is different. it means you have at least two copies of
> each stripe, smeared across all the disks.
>
> Either version (10, or 1+0), gives you get the speed of striping, and
> the safety of a mirror. 10, however, can use an odd number of disks, and
> disks of random sizes.
>
> Cheers,
> Wol


* Re: how do i fix these RAID5 arrays?
  2022-11-25  1:16       ` Roger Heflin
@ 2022-11-25 13:22         ` David T-G
       [not found]           ` <CAAMCDed1-4zFgHMS760dO1pThtkrn8K+FMuG-QQ+9W-FE0iq9Q@mail.gmail.com>
  0 siblings, 1 reply; 62+ messages in thread
From: David T-G @ 2022-11-25 13:22 UTC (permalink / raw)
  To: Linux RAID list

Roger, et al --

...and then Roger Heflin said...
...
% 
% Based on those messages The re-add would be expected to work fine.  it

Sure enough, it did:

  diskfarm:~ # mdadm -D /dev/md5[123456] | egrep '^/dev|Stat|Size'
  /dev/md51:
	  Array Size : 4881773568 (4.55 TiB 5.00 TB)
       Used Dev Size : 1627257856 (1551.87 GiB 1666.31 GB)
	       State : clean 
	  Chunk Size : 512K
      Number   Major   Minor   RaidDevice State
  /dev/md52:
	  Array Size : 4881773568 (4.55 TiB 5.00 TB)
       Used Dev Size : 1627257856 (1551.87 GiB 1666.31 GB)
	       State : clean 
	  Chunk Size : 512K
      Number   Major   Minor   RaidDevice State
  /dev/md53:
	  Array Size : 4881773568 (4.55 TiB 5.00 TB)
       Used Dev Size : 1627257856 (1551.87 GiB 1666.31 GB)
	       State : clean 
	  Chunk Size : 512K
      Number   Major   Minor   RaidDevice State
  /dev/md54:
	  Array Size : 4881773568 (4.55 TiB 5.00 TB)
       Used Dev Size : 1627257856 (1551.87 GiB 1666.31 GB)
	       State : clean 
	  Chunk Size : 512K
      Number   Major   Minor   RaidDevice State
  /dev/md55:
	  Array Size : 4881773568 (4.55 TiB 5.00 TB)
       Used Dev Size : 1627257856 (1551.87 GiB 1666.31 GB)
	       State : clean 
	  Chunk Size : 512K
      Number   Major   Minor   RaidDevice State
  /dev/md56:
	  Array Size : 4881773568 (4.55 TiB 5.00 TB)
       Used Dev Size : 1627257856 (1551.87 GiB 1666.31 GB)
	       State : clean 
	  Chunk Size : 512K
      Number   Major   Minor   RaidDevice State

That's a great first step.  Whew!  And the little arrays are even the right
size:

  diskfarm:~ # fdisk -l /dev/md51
  Disk /dev/md51: 4.6 TiB, 4998936133632 bytes, 9763547136 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 524288 bytes / 1572864 bytes

Yay.  But ...  The striped device never grew:

  diskfarm:~ # parted /dev/md50 p free
  Model: Linux Software RAID Array (md)
  Disk /dev/md50: 20.0TB
  Sector size (logical/physical): 512B/4096B
  Partition Table: gpt
  Disk Flags: 
  
  Number  Start   End     Size    File system  Name         Flags
	  17.4kB  3146kB  3128kB  Free Space
   1      3146kB  20.0TB  20.0TB  xfs          10Traid50md
	  20.0TB  20.0TB  3129kB  Free Space

Even though I think I'm gonna go with md linear instead of learning
something else, which means I'll rebuild the striped top layer, which
means I should get the whole 30T, I'm curious as to why I don't see it
now.  It appears that one cannot grow

  diskfarm:~ # mdadm --grow /dev/md50 --size max
  mdadm: Cannot set device size in this type of array.

a striped array.  Soooo ...  What to do?  More to the point, what will I
need to do when I add the next 10T drive?


% appears that the machine was shutdown wrong and/or crashed and kicked
...
% something goes wrong you always need to check messages and see what
% the underlying issue is/was.
[snip]

Thanks.  That's what had me confused ... :-/


Thanks again & HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt



* about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?")
  2022-11-24 21:33     ` Wol
  2022-11-25  1:16       ` Roger Heflin
@ 2022-11-25 13:30       ` David T-G
  2022-11-25 14:23         ` Wols Lists
  2022-11-25 18:00         ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") Roger Heflin
  1 sibling, 2 replies; 62+ messages in thread
From: David T-G @ 2022-11-25 13:30 UTC (permalink / raw)
  To: Linux RAID list

Wol, et al --

...and then Wol said...
% On 24/11/2022 21:10, David T-G wrote:
% > How is linear different from RAID0?  I took a quick look but don't quite
% > know what I'm reading.  If that's better then, hey, I'd try it (or at
% > least learn more).
% 
% Linear tacks one drive on to the end of another. Raid-0 stripes across all
% drives. Both effectively combine a bunch of drives into one big drive.

Ahhhhh...  I gotcha.  Thanks.


% 
...
% 
% That's why there's raid-10. Note that outside of Linux (and often inside)
% when people say "raid-10" they actually mean "raid 1+0". That's two striped
% raid-0's, mirrored.

That's basically what I have on the web server:

  jpo:~ # mdadm -D /dev/md41 | egrep '/dev|Level'
  /dev/md41:
	  Raid Level : raid1
	 0       8       17        0      active sync   /dev/sdb1
	 1       8       34        1      active sync   /dev/sdc2
  jpo:~ # mdadm -D /dev/md42 | egrep '/dev|Level'
  /dev/md42:
	  Raid Level : raid1
	 0       8       18        0      active sync   /dev/sdb2
	 1       8       33        1      active sync   /dev/sdc1
  jpo:~ # mdadm -D /dev/md40 | egrep '/dev|Level'
  /dev/md40:
	  Raid Level : raid0
	 0       9       41        0      active sync   /dev/md/md41
	 1       9       42        1      active sync   /dev/md/md42
  jpo:~ #
  jpo:~ #
  jpo:~ # parted /dev/sdb p
  Model: ATA ST4000VN008-2DR1 (scsi)
  Disk /dev/sdb: 4001GB
  Sector size (logical/physical): 512B/4096B
  Partition Table: gpt
  Disk Flags:
  
  Number  Start   End     Size    File system  Name                    Flags
   1      1049kB  2000GB  2000GB               Raid1-1
   2      2000GB  4001GB  2000GB               Raid1-2
   4      4001GB  4001GB  860kB   ext2         Seag4000-ZDHB2X37-ext2
  
  jpo:~ # parted /dev/sdc p
  Model: ATA ST4000VN008-2DR1 (scsi)
  Disk /dev/sdc: 4001GB
  Sector size (logical/physical): 512B/4096B
  Partition Table: gpt
  Disk Flags:
  
  Number  Start   End     Size    File system  Name                    Flags
   1      1049kB  2000GB  2000GB               Raid1-2
   2      2000GB  4001GB  2000GB               Raid1-1
   4      4001GB  4001GB  860kB                Seag4000-ZDHBKZTG-ext2


% 
...
% 
% Either version (10, or 1+0), gives you get the speed of striping, and the
% safety of a mirror. 10, however, can use an odd number of disks, and disks
% of random sizes.

That's still magic to me :-)  Mirroring (but not doubling up the
redundancy) on an odd number of disks?!?


% 
% Cheers,
% Wol


HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt



* and dm-integrity, too (was "Re: how do i fix these RAID5 arrays?")
  2022-11-24 21:49       ` Wol
@ 2022-11-25 13:36         ` David T-G
  0 siblings, 0 replies; 62+ messages in thread
From: David T-G @ 2022-11-25 13:36 UTC (permalink / raw)
  To: Wol; +Cc: Linux RAID list

Wol, et al --

...and then Wol said...
...
% 
% (Note also I've got dm-integrity in there too, but that's me.)
% 
% https://raid.wiki.kernel.org/index.php/System2020

I read *so* much of this and so much more trying to recover a different
array after a couple of drive failures.  I never got overlays and integrity
to work :-(  The latter is another thing on the "future enhancements" list!


% 
% Cheers,
% Wol


HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt



* Re: about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?")
  2022-11-25 13:30       ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") David T-G
@ 2022-11-25 14:23         ` Wols Lists
  2022-11-25 19:50           ` about linear and about RAID10 David T-G
  2022-11-25 18:00         ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") Roger Heflin
  1 sibling, 1 reply; 62+ messages in thread
From: Wols Lists @ 2022-11-25 14:23 UTC (permalink / raw)
  To: David T-G, Linux RAID list

On 25/11/2022 13:30, David T-G wrote:
> % Either version (10, or 1+0), gives you get the speed of striping, and the
> % safety of a mirror. 10, however, can use an odd number of disks, and disks
> % of random sizes.
> 
> That's still magic to me 😄  Mirroring (but not doubling up the
> redundancy) on an odd number of disks?!?

Disk:     a   b   c

Stripe:   1   1   2
           2   3   3
           4   4   5
           5   6   6

and so on.

I was trying to work out how I'd smear them a lot more randomly, but it 
was a nightmare. Iirc, no matter how many drives you have, (for two 
copies) it seems that drive a is only mirrored to drives b and c, for 
any value of a. So if you lose drive a, and then either b or c, you are 
guaranteed to lose half a drive of contents.

It also means that replacing a failed drive will hammer just two drives 
to replace it and not touch the others. I wanted to try and spread stuff 
far more evenly so it read from all the other drives, not just two. 
Okay, it increases the risk that you will lose *some* data to a double 
failure, but reduces the *amount* of data at risk (and also reduces the 
risk of a double failure!). Because if the first failure *provokes* the 
second, data loss is pretty much guaranteed.

Cheers,
Wol


* Re: how do i fix these RAID5 arrays?
  2022-11-24 21:10   ` how do i fix these RAID5 arrays? David T-G
  2022-11-24 21:33     ` Wol
@ 2022-11-25 14:49     ` Wols Lists
  2022-11-26 20:02       ` John Stoffel
  1 sibling, 1 reply; 62+ messages in thread
From: Wols Lists @ 2022-11-25 14:49 UTC (permalink / raw)
  To: David T-G, Linux RAID list

On 24/11/2022 21:10, David T-G wrote:
> I don't want to try BtrFS.  That's another area where I have no experience,
> but from what I've seen and read I really don't want to go there yet.

Btrfs ...

It's a good idea, and provided you don't do anything esoteric it's been 
solid for years.

It used to have a terrible reputation for surviving a disk full - at a 
guess it needs some disk space to shuffle its btree to recover space - 
and a disk-full situation borked the garbage collection.

Raid-1 (mirroring) by default only mirrors the directories, the data 
isn't mirrored so you can easily still lose that ... (they call that 
user misconfiguration, I call it developer arrogance ...)

Parity raid is still borken...

At the end of the day, if you want to protect your data, DON'T rely on 
the filesystem. There are far too many cases where the developers have 
made decisions that protect the file system (and hence computer uptime) 
at the expense of the data IN the filesystem. I don't give a monkeys if 
the filesystem protects itself to enable a crashed computer to reboot 
ten seconds faster, if the consequence of that change is my computer is 
out of action for a day while I have to restore a backup to re-instate 
the integrity of my data !!!

Cheers,
Wol


* Re: about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?")
  2022-11-25 13:30       ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") David T-G
  2022-11-25 14:23         ` Wols Lists
@ 2022-11-25 18:00         ` Roger Heflin
  2022-11-28 14:46           ` about linear and about RAID10 David T-G
  1 sibling, 1 reply; 62+ messages in thread
From: Roger Heflin @ 2022-11-25 18:00 UTC (permalink / raw)
  To: David T-G; +Cc: Linux RAID list

You do not want to stripe 2 partitions on a single disk; you want that linear.

With a stripe write on 4 striped partitions you get this:
write a stripe for each disk on part1, then do an 8-10ms seek, then
write the next stripe on part2, and then seek back to part1 and repeat.

With linear you get:
write a stripe (there usually won't be any seek, and if there is it
will be a single-track seek and much quicker), then write the next
stripe.

If the head data rate is 200MB/sec (ballpark typical for a disk) then
that 10ms could have written 2MB of data.  So the larger the stripe
size, the less you waste percentage-wise on the seeks.  But if, say,
the block per disk is 256K, then that write takes around 1.25ms and
the seek to the next one takes 10ms, so the seeks significantly reduce
your read/write rates.
Do a dd if=/dev/mdXX of=/dev/null bs=1M count=100 iflag=direct on one
of the raid5s of the partitions and then on the raid0 device over
them.  I would expect the raid0 device over them to be much slower; I
am not sure how much, but 5x-20x.
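
With the device names from this thread, that comparison would be
roughly (same bs/count as above; nothing destructive, since both reads
go to /dev/null):

  dd if=/dev/md51 of=/dev/null bs=1M count=100 iflag=direct   # one raid5 leg
  dd if=/dev/md50 of=/dev/null bs=1M count=100 iflag=direct   # the raid0 over all of them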

On Fri, Nov 25, 2022 at 7:36 AM David T-G <davidtg-robot@justpickone.org> wrote:
>
> Wol, et al --
>
> ...and then Wol said...
> % On 24/11/2022 21:10, David T-G wrote:
> % > How is linear different from RAID0?  I took a quick look but don't quite
> % > know what I'm reading.  If that's better then, hey, I'd try it (or at
> % > least learn more).
> %
> % Linear tacks one drive on to the end of another. Raid-0 stripes across all
> % drives. Both effectively combine a bunch of drives into one big drive.
>
> Ahhhhh...  I gotcha.  Thanks.
>
>
> %
> ...
> %
> % That's why there's raid-10. Note that outside of Linux (and often inside)
> % when people say "raid-10" they actually mean "raid 1+0". That's two striped
> % raid-0's, mirrored.
>
> That's basically what I have on the web server:
>
>   jpo:~ # mdadm -D /dev/md41 | egrep '/dev|Level'
>   /dev/md41:
>           Raid Level : raid1
>          0       8       17        0      active sync   /dev/sdb1
>          1       8       34        1      active sync   /dev/sdc2
>   jpo:~ # mdadm -D /dev/md42 | egrep '/dev|Level'
>   /dev/md42:
>           Raid Level : raid1
>          0       8       18        0      active sync   /dev/sdb2
>          1       8       33        1      active sync   /dev/sdc1
>   jpo:~ # mdadm -D /dev/md40 | egrep '/dev|Level'
>   /dev/md40:
>           Raid Level : raid0
>          0       9       41        0      active sync   /dev/md/md41
>          1       9       42        1      active sync   /dev/md/md42
>   jpo:~ #
>   jpo:~ #
>   jpo:~ # parted /dev/sdb p
>   Model: ATA ST4000VN008-2DR1 (scsi)
>   Disk /dev/sdb: 4001GB
>   Sector size (logical/physical): 512B/4096B
>   Partition Table: gpt
>   Disk Flags:
>
>   Number  Start   End     Size    File system  Name                    Flags
>    1      1049kB  2000GB  2000GB               Raid1-1
>    2      2000GB  4001GB  2000GB               Raid1-2
>    4      4001GB  4001GB  860kB   ext2         Seag4000-ZDHB2X37-ext2
>
>   jpo:~ # parted /dev/sdc p
>   Model: ATA ST4000VN008-2DR1 (scsi)
>   Disk /dev/sdc: 4001GB
>   Sector size (logical/physical): 512B/4096B
>   Partition Table: gpt
>   Disk Flags:
>
>   Number  Start   End     Size    File system  Name                    Flags
>    1      1049kB  2000GB  2000GB               Raid1-2
>    2      2000GB  4001GB  2000GB               Raid1-1
>    4      4001GB  4001GB  860kB                Seag4000-ZDHBKZTG-ext2
>
>
> %
> ...
> %
> % Either version (10, or 1+0), gives you get the speed of striping, and the
> % safety of a mirror. 10, however, can use an odd number of disks, and disks
> % of random sizes.
>
> That's still magic to me :-)  Mirroring (but not doubling up the
> redundancy) on an odd number of disks?!?
>
>
> %
> % Cheers,
> % Wol
>
>
> HAND
>
> :-D
> --
> David T-G
> See http://justpickone.org/davidtg/email/
> See http://justpickone.org/davidtg/tofu.txt
>


* Re: how do i fix these RAID5 arrays?
       [not found]           ` <CAAMCDed1-4zFgHMS760dO1pThtkrn8K+FMuG-QQ+9W-FE0iq9Q@mail.gmail.com>
@ 2022-11-25 19:49             ` David T-G
  2022-11-28 14:24               ` md RAID0 can be grown (was "Re: how do i fix these RAID5 arrays?") David T-G
  0 siblings, 1 reply; 62+ messages in thread
From: David T-G @ 2022-11-25 19:49 UTC (permalink / raw)
  To: Linux RAID list

Roger, et al --

...and then Roger Heflin said...
% You may not be able to grow with either linear and/or raid0 under mdadm.

Um ...  uh oh.

I did some reading and see that I had confused adding devices with
growing devices.  I'll add devs to the underlying RAID5 volumes, but
there will only ever be six devs in the RAID0 array.

What do you think of

  mdadm -A --update=devicesize /dev/md50

as discussed in

  https://serverfault.com/questions/1068788/how-to-change-size-of-raid0-software-array-by-resizing-partition

recently?
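
If that is the way to go, I imagine the whole dance would look
something like this -- a guess on my part, and the mount point plus the
need to stretch the GPT partition sitting on md50 are assumptions:

  umount /mnt/10T                            # wherever md50p1 is mounted
  mdadm --stop /dev/md50
  mdadm -A --update=devicesize /dev/md50     # pick up the bigger members
  parted /dev/md50 resizepart 1 100%         # stretch the GPT partition
  mount /dev/md50p1 /mnt/10T
  xfs_growfs /mnt/10T                        # and finally grow the XFS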


% 
% And in general I do not partition md* devices as it seems most of the time
% to not be useful especially if using lvm.

Well, yes, but I like to have a separate tiny "metadata" partition
at the end of the disk (real or virtual for consistency) where I keep
useful data.  I could live without it on the big array, though.


% 
% so here is roughly how to do it (commands may not be exact) and assuming
% your devices are /dev/md5[0123]
% 
% PV == physical volume (a disk or md raid device generally).
% VG == volume group (a group of PV).
% LV == logical volume (a block device inside a vg made up of part of a PV or
% several PVs).
% 
% pvcreate /dev/md5[0123]
% vgcreate bigvg /dev/md5[0123]
% lvcreate -L <size> -n mylv bigvg
% 
% commands to see what is going on are : pvs, lvs, vgs.
% Then build a fs on /dev/bigvg/mylv   Note you can replace bigvg with
% whatever name you want, and same with the mylv.
[snip]

Thanks for the pointers.  I'll go off and read LVM docs and see if I
can come up to speed.


Have a great day

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt



* Re: about linear and about RAID10
  2022-11-25 14:23         ` Wols Lists
@ 2022-11-25 19:50           ` David T-G
  0 siblings, 0 replies; 62+ messages in thread
From: David T-G @ 2022-11-25 19:50 UTC (permalink / raw)
  To: Linux RAID list

Wol, et al --

...and then Wols Lists said...
% On 25/11/2022 13:30, David T-G wrote:
% > 
% > That's still magic to me :-)  Mirroring (but not doubling up the
% > redundancy) on an odd number of disks?!?
% 
% Disk:     a   b   c
% 
% Stripe:   1   1   2
%           2   3   3
%           4   4   5
%           5   6   6
% 
% and so on.
[snip]

Ahhhhh...  Thanks.  I was stuck in my symmetrical head and totally
overlooked a staggered layout.  Of course!


Have a great evening

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt



* Re: how do i fix these RAID5 arrays?
  2022-11-25 14:49     ` how do i fix these RAID5 arrays? Wols Lists
@ 2022-11-26 20:02       ` John Stoffel
  2022-11-27  9:33         ` Wols Lists
                           ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: John Stoffel @ 2022-11-26 20:02 UTC (permalink / raw)
  To: Wols Lists; +Cc: David T-G, Linux RAID list

>>>>> "Wols" == Wols Lists <antlists@youngman.org.uk> writes:

> On 24/11/2022 21:10, David T-G wrote:
>> I don't want to try BtrFS.  That's another area where I have no experience,
>> but from what I've seen and read I really don't want to go there yet.

> Btrfs ...

> It's a good idea, and provided you don't do anything esoteric it's
> been solid for years.

Does it count when you try to upgrade SLES 12.3 with the latest
patches to 15 and it bombs so badly that the btrfs snapshots can't get
you working again and you have to blow the system away to do a fresh
full re-install?  

I don't trust btrfs because when (not if) it goes badly, it goes
REALLY badly in my experience.  Where ext4 and xfs both will recover
enough to keep working.  You might lose data, but you won't lose the
entire filesystem.  

> It used to have a terrible reputation for surviving a disk full - at
> a guess it needs some disk space to shuffle its btree to recover
> space - and a disk-full situation borked the garbage collection.

It can't handle snapshots well either in my experience.  

> Raid-1 (mirroring) by default only mirrors the directories, the data
> isn't mirrored so you can easily still lose that ... (they call that
> user misconfiguration, I call it developer arrogance ...)

I call it a failure of the layering model.  If you want RAID, use MD.
If you want logical volumes, then put LVM on top.  Then put
filesystems into logical volumes.  

So much simpler...
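
For concreteness, that stack is just something like this (made-up
names and sizes):

  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]1
  pvcreate /dev/md0
  vgcreate datavg /dev/md0
  lvcreate -L 500G -n homelv datavg
  mkfs.xfs /dev/datavg/homelv

One block device per layer, each doing exactly one job.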

> Parity raid is still borken...

So why the hell are you recommending it?  

> At the end of the day, if you want to protect your data, DON'T rely on 
> the filesystem. There are far too many cases where the developers have 
> made decisions that protect the file system (and hence computer uptime) 
> at the expense of the data IN the filesystem. I don't give a monkeys if 
> the filesystem protects itself to enable a crashed computer to reboot 
> ten seconds faster, if the consequence of that change is my computer is 
> out of action for a day while I have to restore a backup to re-instate 
> the integrity of my data !!!

Yes, I agree 100%.  

Mirrors and backups are key.  And offsite backups are key too,
especially for family photos and other keepsakes you don't want to
lose.  Rips of CDs and DVDs aren't nearly as important in my book.  


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-26 20:02       ` John Stoffel
@ 2022-11-27  9:33         ` Wols Lists
  2022-11-27 11:46         ` Reindl Harald
  2022-11-27 14:10         ` piergiorgio.sartor
  2 siblings, 0 replies; 62+ messages in thread
From: Wols Lists @ 2022-11-27  9:33 UTC (permalink / raw)
  To: John Stoffel; +Cc: David T-G, Linux RAID list

On 26/11/2022 20:02, John Stoffel wrote:
>> Parity raid is still borken...

> So why the hell are you recommending it?
> 
I did say "so long as you don't do anything exotic". Raids 0 and 1 are 
apparently pretty rock solid, so long as you've actually GOT raid 1, see 
my comment about that!

The snapshot issue is supposedly fixed, but that's part of the disk full 
issue. I agree - it's crazy you can't drop a snapshot to recover disk 
space if the disk fills up - hell that should have been one of the FIRST 
things to be made rock solid, not one of the last!

But I do get the impression that btrfs, provided you avoid all the 
EXPERIMENTAL features (which is most of them :-), is finally a decent 
file system.

Unfortunately, it seems to be typical in the linux world: you have a 
feature which is broken (alsa, ext4, sysVinit, zfs licence etc), 
somebody tries to design and write a decent replacement, and the 
distros push it out into the "stable distro" world before it's ready. 
And so decent systems get a bad rep while in beta (or even alpha) 
through no fault of the person/team writing them. Mind you, most of 
btrfs is still alpha :-)

And no, the only place I use it is where it's the SUSE default.  Maybe 
not even there; I think SUSE uses it for /home, and I share that between 
distros ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-26 20:02       ` John Stoffel
  2022-11-27  9:33         ` Wols Lists
@ 2022-11-27 11:46         ` Reindl Harald
  2022-11-27 11:52           ` Wols Lists
  2022-11-27 14:58           ` John Stoffel
  2022-11-27 14:10         ` piergiorgio.sartor
  2 siblings, 2 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-27 11:46 UTC (permalink / raw)
  To: John Stoffel, Wols Lists; +Cc: David T-G, Linux RAID list



Am 26.11.22 um 21:02 schrieb John Stoffel:
> I call it a failure of the layering model.  If you want RAID, use MD.
> If you want logical volumes, then put LVM on top.  Then put
> filesystems into logical volumes.
> 
> So much simpler...

have you ever replaced a 6 TB drive and waited for the resync of mdadm 
in the hope that in all those hours no other drive goes down?

when your array is 10% used it's braindead
when your array is new and empty it's braindead

ZFS/BTRFS don't need to mirror/restore 90% nulls

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 11:46         ` Reindl Harald
@ 2022-11-27 11:52           ` Wols Lists
  2022-11-27 12:06             ` Reindl Harald
  2022-11-27 14:58           ` John Stoffel
  1 sibling, 1 reply; 62+ messages in thread
From: Wols Lists @ 2022-11-27 11:52 UTC (permalink / raw)
  To: Reindl Harald, John Stoffel; +Cc: David T-G, Linux RAID list

On 27/11/2022 11:46, Reindl Harald wrote:
> 
> 
> Am 26.11.22 um 21:02 schrieb John Stoffel:
>> I call it a failure of the layering model.  If you want RAID, use MD.
>> If you want logical volumes, then put LVM on top.  Then put
>> filesystems into logical volumes.
>>
>> So much simpler...
> 
> have you ever replaced a 6 TB drive and waited for the resync of mdadm 
> in the hope in all that hours no other drive goes down?
> 
> when your array is 10% used it's braindead
> when your array is new and empty it's braindead
> 
> ZFS/BTRFS don't neeed to mirror/restore 90% nulls

This is why you have trim.  Although I don't know if raid supports that - 
I hope it does, but I suspect it doesn't.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 11:52           ` Wols Lists
@ 2022-11-27 12:06             ` Reindl Harald
  2022-11-27 14:33               ` Wol
  0 siblings, 1 reply; 62+ messages in thread
From: Reindl Harald @ 2022-11-27 12:06 UTC (permalink / raw)
  To: Wols Lists, John Stoffel; +Cc: David T-G, Linux RAID list



Am 27.11.22 um 12:52 schrieb Wols Lists:
> On 27/11/2022 11:46, Reindl Harald wrote:
>>
>>
>> Am 26.11.22 um 21:02 schrieb John Stoffel:
>>> I call it a failure of the layering model.  If you want RAID, use MD.
>>> If you want logical volumes, then put LVM on top.  Then put
>>> filesystems into logical volumes.
>>>
>>> So much simpler...
>>
>> have you ever replaced a 6 TB drive and waited for the resync of mdadm 
>> in the hope in all that hours no other drive goes down?
>>
>> when your array is 10% used it's braindead
>> when your array is new and empty it's braindead
>>
>> ZFS/BTRFS don't neeed to mirror/restore 90% nulls
> 
> This is why you have trim. 

besides the fact that such large disks are typically HDDs, trim has 
nothing to do with it: after a drive replacement linux raid knows 
*nothing* about trim and does a full resync

you have been on this list long enough that you should know that

> Although I don't know if raid supports that 

surely it does - but you're missing the point


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-26 20:02       ` John Stoffel
  2022-11-27  9:33         ` Wols Lists
  2022-11-27 11:46         ` Reindl Harald
@ 2022-11-27 14:10         ` piergiorgio.sartor
  2022-11-27 18:21           ` Reindl Harald
  2 siblings, 1 reply; 62+ messages in thread
From: piergiorgio.sartor @ 2022-11-27 14:10 UTC (permalink / raw)
  To: Reindl Harald, John Stoffel, Wols Lists; +Cc: David T-G, Linux RAID list

November 27, 2022 at 12:46 PM, "Reindl Harald" <h.reindl@thelounge.net> wrote:


> 
> Am 26.11.22 um 21:02 schrieb John Stoffel:
> 
> > 
> > I call it a failure of the layering model. If you want RAID, use MD.
> >  If you want logical volumes, then put LVM on top. Then put
> >  filesystems into logical volumes.
> >  So much simpler...
> > 
> 
> have you ever replaced a 6 TB drive and waited for the resync of mdadm in the hope in all that hours no other drive goes down?
> 
> when your array is 10% used it's braindead
> when your array is new and empty it's braindead
> 
> ZFS/BTRFS don't neeed to mirror/restore 90% nulls
>

You cannot consider the amount of data in the
array as a parameter for reliability.

If the array is 99% full, MD and ZFS/BTRFS have
the same behaviour, in terms of reliability.
If the array is 0% full, the same applies.

The only advantage is you wait less, if less
data is present (for ZFS/BTRFS).

Because the day the ZFS/BTRFS array is 99% full
and you get a resync plus a failure, you also have
double the damage: a lost array and 99% of the data.

Furthermore, non-layered systems, like those two,
tend to have dependent failures, in terms of
software bugs.

Layered systems have more isolation, bug propagation
is less likely.

Meaning that, for non-layered systems, a software
bug is much more likely both to happen and to have
catastrophic effects.

bye,

pg

-- 

piergiorgio sartor

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 12:06             ` Reindl Harald
@ 2022-11-27 14:33               ` Wol
  2022-11-27 18:08                 ` Roman Mamedov
  2022-11-27 18:23                 ` Reindl Harald
  0 siblings, 2 replies; 62+ messages in thread
From: Wol @ 2022-11-27 14:33 UTC (permalink / raw)
  To: Reindl Harald, John Stoffel; +Cc: David T-G, Linux RAID list

On 27/11/2022 12:06, Reindl Harald wrote:
> 
> 
> Am 27.11.22 um 12:52 schrieb Wols Lists:
>> On 27/11/2022 11:46, Reindl Harald wrote:
>>>
>>>
>>> Am 26.11.22 um 21:02 schrieb John Stoffel:
>>>> I call it a failure of the layering model.  If you want RAID, use MD.
>>>> If you want logical volumes, then put LVM on top.  Then put
>>>> filesystems into logical volumes.
>>>>
>>>> So much simpler...
>>>
>>> have you ever replaced a 6 TB drive and waited for the resync of 
>>> mdadm in the hope in all that hours no other drive goes down?
>>>
>>> when your array is 10% used it's braindead
>>> when your array is new and empty it's braindead
>>>
>>> ZFS/BTRFS don't neeed to mirror/restore 90% nulls
>>
>> This is why you have trim. 
> 
> besides that such large disks are typically HDD trim has nothing to do 
> with the fact that after a drive replacement linux raid knows *nothing* 
> about trim and does a full resync
> 
> you are long enough on this list that you should know that

Except (1) I didn't say *H*D*D* trim, and (2) if raid just passes trim 
through to the layer below, THAT'S NOT SUPPORTING TRIM. As far as I'm 
concerned, what happens at the level below is just not relevant!

If raid supports trim, that means it intercepts the trim commands, and 
uses it to keep track of what's being used by the layer above.

In other words, if the filesystem is only using 10% of the disk, 
supporting trim means that raid knows which 10% is being used and only 
bothers syncing that!

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 11:46         ` Reindl Harald
  2022-11-27 11:52           ` Wols Lists
@ 2022-11-27 14:58           ` John Stoffel
  1 sibling, 0 replies; 62+ messages in thread
From: John Stoffel @ 2022-11-27 14:58 UTC (permalink / raw)
  To: Reindl Harald; +Cc: John Stoffel, Wols Lists, David T-G, Linux RAID list

>>>>> "Reindl" == Reindl Harald <h.reindl@thelounge.net> writes:

> Am 26.11.22 um 21:02 schrieb John Stoffel:
>> I call it a failure of the layering model.  If you want RAID, use MD.
>> If you want logical volumes, then put LVM on top.  Then put
>> filesystems into logical volumes.
>> 
>> So much simpler...

> have you ever replaced a 6 TB drive and waited for the resync of
> mdadm in the hope in all that hours no other drive goes down?

Yes, but I also run RAID6 so that if I lose a drive, I don't lose
redundancy.  I also have backups.  

> when your array is 10% used it's braindead
> when your array is new and empty it's braindead

> ZFS/BTRFS don't neeed to mirror/restore 90% nulls

I like the idea of ZFS, but in my $WORK experience with Oracle ZFS
arrays, it tended to fall off a cliff performance wise when pushed too
hard.  

And I don't like that you can only grow, not shrink zvols and the
underlying storage.  

But I'm also not smart enough to design a filesystem and make it
work.  :-)


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 14:33               ` Wol
@ 2022-11-27 18:08                 ` Roman Mamedov
  2022-11-27 19:21                   ` Wol
  2022-11-27 18:23                 ` Reindl Harald
  1 sibling, 1 reply; 62+ messages in thread
From: Roman Mamedov @ 2022-11-27 18:08 UTC (permalink / raw)
  To: Wol; +Cc: Reindl Harald, John Stoffel, David T-G, Linux RAID list

On Sun, 27 Nov 2022 14:33:37 +0000
Wol <antlists@youngman.org.uk> wrote:

> If raid supports trim, that means it intercepts the trim commands, and 
> uses it to keep track of what's being used by the layer above.
> 
> In other words, if the filesystem is only using 10% of the disk, 
> supporting trim means that raid knows which 10% is being used and only 
> bothers syncing that!

Not sure which RAID system you are speaking of, but that's not presently
implemented in mdadm RAID. It does not use TRIM of the array to keep track of
unused areas on the underlying devices, to skip those during rebuilds. And I
am unaware of any other RAID that does. Would be nice to have though.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 14:10         ` piergiorgio.sartor
@ 2022-11-27 18:21           ` Reindl Harald
  2022-11-27 19:37             ` Piergiorgio Sartor
  2022-11-27 22:05             ` Wol
  0 siblings, 2 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-27 18:21 UTC (permalink / raw)
  To: piergiorgio.sartor, John Stoffel, Wols Lists; +Cc: David T-G, Linux RAID list



Am 27.11.22 um 15:10 schrieb piergiorgio.sartor@nexgo.de:
> November 27, 2022 at 12:46 PM, "Reindl Harald" <h.reindl@thelounge.net> wrote:
> 
>>
>> Am 26.11.22 um 21:02 schrieb John Stoffel:
>>
>>>
>>> I call it a failure of the layering model. If you want RAID, use MD.
>>>   If you want logical volumes, then put LVM on top. Then put
>>>   filesystems into logical volumes.
>>>   So much simpler...
>>>
>>
>> have you ever replaced a 6 TB drive and waited for the resync of mdadm in the hope in all that hours no other drive goes down?
>>
>> when your array is 10% used it's braindead
>> when your array is new and empty it's braindead
>>
>> ZFS/BTRFS don't neeed to mirror/restore 90% nulls
>>
> 
> You cannot consider the amount of data in the
> array as parameter for reliability
> 
> If the array is 99% full, MD or ZFS/BTRFS have
> same behaviour, in terms of reliability.
> If the array is 0% full, as well

you completely miss the point!

if your mdadm-array is built with 6 TB drives, when you replace a drive 
you need to sync 6 TB no matter whether 10 MB or 5 TB are actually used

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 14:33               ` Wol
  2022-11-27 18:08                 ` Roman Mamedov
@ 2022-11-27 18:23                 ` Reindl Harald
  2022-11-27 19:30                   ` Wol
  1 sibling, 1 reply; 62+ messages in thread
From: Reindl Harald @ 2022-11-27 18:23 UTC (permalink / raw)
  To: Wol, John Stoffel; +Cc: David T-G, Linux RAID list



Am 27.11.22 um 15:33 schrieb Wol:
> On 27/11/2022 12:06, Reindl Harald wrote:
>>
>>
>> Am 27.11.22 um 12:52 schrieb Wols Lists:
>>> On 27/11/2022 11:46, Reindl Harald wrote:
>>>>
>>>>
>>>> Am 26.11.22 um 21:02 schrieb John Stoffel:
>>>>> I call it a failure of the layering model.  If you want RAID, use MD.
>>>>> If you want logical volumes, then put LVM on top.  Then put
>>>>> filesystems into logical volumes.
>>>>>
>>>>> So much simpler...
>>>>
>>>> have you ever replaced a 6 TB drive and waited for the resync of 
>>>> mdadm in the hope in all that hours no other drive goes down?
>>>>
>>>> when your array is 10% used it's braindead
>>>> when your array is new and empty it's braindead
>>>>
>>>> ZFS/BTRFS don't neeed to mirror/restore 90% nulls
>>>
>>> This is why you have trim. 
>>
>> besides that such large disks are typically HDD trim has nothing to do 
>> with the fact that after a drive replacement linux raid knows 
>> *nothing* about trim and does a full resync
>>
>> you are long enough on this list that you should know that
> 
> Except (1) I didn't say *H*D*D* trim, and (2) if raid just passes trim 
> through to the layer below, THAT'S NOT SUPPORTING TRIM. As far as I'm 
> concerned, what happens at the level below is just not relevant!

reality doesn't care about what concerns you

> If raid supports trim, that means it intercepts the trim commands, and 
> uses it to keep track of what's being used by the layer above.
> 
> In other words, if the filesystem is only using 10% of the disk, 
> supporting trim means that raid knows which 10% is being used and only 
> bothers syncing that!

this is nonsense and doesn't reflect reality

the only thing trim does is tell the underlying device which blocks can 
be used for wear-leveling



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 18:08                 ` Roman Mamedov
@ 2022-11-27 19:21                   ` Wol
  2022-11-28  1:26                     ` Reindl Harald
  0 siblings, 1 reply; 62+ messages in thread
From: Wol @ 2022-11-27 19:21 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: John Stoffel, David T-G, Linux RAID list

On 27/11/2022 18:08, Roman Mamedov wrote:
> On Sun, 27 Nov 2022 14:33:37 +0000
> Wol <antlists@youngman.org.uk> wrote:
> 
>> If raid supports trim, that means it intercepts the trim commands, and
>> uses it to keep track of what's being used by the layer above.
>>
>> In other words, if the filesystem is only using 10% of the disk,
>> supporting trim means that raid knows which 10% is being used and only
>> bothers syncing that!
> 
> Not sure which RAID system you are speaking of, but that's not presently
> implemented in mdadm RAID. It does not use TRIM of the array to keep track of
> unused areas on the underlying devices, to skip those during rebuilds. And I
> am unaware of any other RAID that does. Would be nice to have though.
> 
Yup that's what I was saying - it would very much be a "nice to have".

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 18:23                 ` Reindl Harald
@ 2022-11-27 19:30                   ` Wol
  2022-11-27 19:51                     ` Reindl Harald
  0 siblings, 1 reply; 62+ messages in thread
From: Wol @ 2022-11-27 19:30 UTC (permalink / raw)
  To: Reindl Harald, John Stoffel; +Cc: David T-G, Linux RAID list

On 27/11/2022 18:23, Reindl Harald wrote:
>> In other words, if the filesystem is only using 10% of the disk, 
>> supporting trim means that raid knows which 10% is being used and only 
>> bothers syncing that!
> 
> this is nonsense and don't reflect reality
> 
> the only thing trim does is tell the underlying device which blocks can 
> be used for wear-leveling
> 
Then why do some linux block devices THAT HAVE NOTHING TO DO WITH 
HARDWARE support trim? (Sorry, I can't name them offhand, but I've come 
across them.)

And are you telling me that you're happy with a block device trashing 
your live data because the filesystem or whatever trimmed it? If the 
file system sends a TRIM command, it's saying "I am no longer using this 
space". What the underlying block layer does with it is up that layer. 
An SSD might use it for wear leveling, I'm pretty certain 
thin-provisioning uses it to release space (oh there's my block layer 
that isn't hardware).

AND THERE IS ABSOLUTELY NO REASON why md-raid shouldn't flag it as "this 
doesn't need recovery". Okay, it would need some sort of bitmap to say 
"these stripes are/aren't in use, which would chew up some disk space, 
but it's perfectly feasible.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 18:21           ` Reindl Harald
@ 2022-11-27 19:37             ` Piergiorgio Sartor
  2022-11-27 19:52               ` Reindl Harald
  2022-11-27 22:05             ` Wol
  1 sibling, 1 reply; 62+ messages in thread
From: Piergiorgio Sartor @ 2022-11-27 19:37 UTC (permalink / raw)
  To: Reindl Harald
  Cc: piergiorgio.sartor, John Stoffel, Wols Lists, David T-G, Linux RAID list

On Sun, Nov 27, 2022 at 07:21:16PM +0100, Reindl Harald wrote:
> 
> 
> Am 27.11.22 um 15:10 schrieb piergiorgio.sartor@nexgo.de:
> > November 27, 2022 at 12:46 PM, "Reindl Harald" <h.reindl@thelounge.net> wrote:
> > 
> > > 
> > > Am 26.11.22 um 21:02 schrieb John Stoffel:
> > > 
> > > > 
> > > > I call it a failure of the layering model. If you want RAID, use MD.
> > > >   If you want logical volumes, then put LVM on top. Then put
> > > >   filesystems into logical volumes.
> > > >   So much simpler...
> > > > 
> > > 
> > > have you ever replaced a 6 TB drive and waited for the resync of mdadm in the hope in all that hours no other drive goes down?
> > > 
> > > when your array is 10% used it's braindead
> > > when your array is new and empty it's braindead
> > > 
> > > ZFS/BTRFS don't neeed to mirror/restore 90% nulls
> > > 
> > 
> > You cannot consider the amount of data in the
> > array as parameter for reliability
> > 
> > If the array is 99% full, MD or ZFS/BTRFS have
> > same behaviour, in terms of reliability.
> > If the array is 0% full, as well
> 
> you completly miss the point!
> 
> if your mdadm-array is built with 6 TB drivres wehn you replace a drive you
> need to sync 6 TB no matter if 10 MB or 5 TB are actually used

I'm not missing the point, you're not
understanding the consequences of
your way of thinking.

If the ZFS/BTRFS is 99% full, how much
time will it need to be synched?

The same (more or less) of MD.

So, what's the difference in *this* case?

None.

This means the risk of (you wrote, I believe)
another disk going down is the same.

This means that you're considering the
reliability as function of how much
the array is full (or empty).

No matter if MD takes *always* full time.
It's not the point, in relation to reliability.

In other words, ZFS/BTRFS optimize the sync
time, for sure, but this should *not* be
considered when thinking in terms of
reliability.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 19:30                   ` Wol
@ 2022-11-27 19:51                     ` Reindl Harald
  0 siblings, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-27 19:51 UTC (permalink / raw)
  To: Wol, John Stoffel; +Cc: David T-G, Linux RAID list



Am 27.11.22 um 20:30 schrieb Wol:
> On 27/11/2022 18:23, Reindl Harald wrote:
>>> In other words, if the filesystem is only using 10% of the disk, 
>>> supporting trim means that raid knows which 10% is being used and 
>>> only bothers syncing that!
>>
>> this is nonsense and don't reflect reality
>>
>> the only thing trim does is tell the underlying device which blocks 
>> can be used for wear-leveling
>>
> Then why do some linux block devices THAT HAVE NOTHING TO DO WITH 
> HARDWARE support trim? (Sorry I can't name them, I've come across them).

to pass it down until it finally reaches the physical device

> And are you telling me that you're happy with a block device trashing 
> your live data because the filesystem or whatever trimmed it? If the 
> file system sends a TRIM command, it's saying "I am no longer using this 
> space". What the underlying block layer does with it is up that layer. 

it's impressive how much nonsense one can talk!
nothing is trashing live data!

> An SSD might use it for wear leveling, I'm pretty certain 
> thin-provisioning uses it to release space (oh there's my block layer 
> that isn't hardware).

so what?
it's still the underlying device

> AND THERE IS ABSOLUTELY NO REASON why md-raid shouldn't flag it as "this 
> doesn't need recovery"

obviously there is

> Okay, it would need some sort of bitmap to say 
> "these stripes are/aren't in use, which would chew up some disk space, 
> but it's perfectly feasible

and here we are: it would need something which isn't there

boy: for about 8 years everything you have said on this mailing list has 
been guesswork while you try to sound like an expert

i told you what is fact and you babble about a perfect world which 
doesn't exist

the same way you pretended that converting a degraded RAID10 to RAID1 on 
double-sized disks is easy because it's only a metadata change, which 
pretty clearly showed that you have no clue what you are talking about! 
RAID10 is striped, RAID1 is mirrored

please stop talking about things you have no clue about

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 19:37             ` Piergiorgio Sartor
@ 2022-11-27 19:52               ` Reindl Harald
  0 siblings, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-27 19:52 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: John Stoffel, Wols Lists, David T-G, Linux RAID list



Am 27.11.22 um 20:37 schrieb Piergiorgio Sartor:
> On Sun, Nov 27, 2022 at 07:21:16PM +0100, Reindl Harald wrote:
>>> You cannot consider the amount of data in the
>>> array as parameter for reliability
>>>
>>> If the array is 99% full, MD or ZFS/BTRFS have
>>> same behaviour, in terms of reliability.
>>> If the array is 0% full, as well
>>
>> you completly miss the point!
>>
>> if your mdadm-array is built with 6 TB drivres wehn you replace a drive you
>> need to sync 6 TB no matter if 10 MB or 5 TB are actually used
> 
> I'm not missing the point, you're not
> understanding the consequences of
> your way of thinking.
> 
> If the ZFS/BTRFS is 99% full, how much
> time will it need to be synched?
> 
> The same (more or less) of MD

for the sake of god, the point was that mdadm doesn't know anything about 
the filesystem because it's a "dumb" block layer

at the end of the day that means when a 6 TB drive fails the full 6 TB 
needs to be re-synced no matter how much space is used on the FS on top

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 18:21           ` Reindl Harald
  2022-11-27 19:37             ` Piergiorgio Sartor
@ 2022-11-27 22:05             ` Wol
  2022-11-27 22:08               ` Reindl Harald
                                 ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Wol @ 2022-11-27 22:05 UTC (permalink / raw)
  To: Reindl Harald, piergiorgio.sartor, John Stoffel
  Cc: David T-G, Linux RAID list

On 27/11/2022 18:21, Reindl Harald wrote:
>> If the array is 99% full, MD or ZFS/BTRFS have
>> same behaviour, in terms of reliability.
>> If the array is 0% full, as well
> 
> you completly miss the point!
> 
> if your mdadm-array is built with 6 TB drivres wehn you replace a drive 
> you need to sync 6 TB no matter if 10 MB or 5 TB are actually used

And you are also completely missing the point!

When mdadm creates an array - IF IT SUPPORTED TRIM - you could tell it 
"this is a blank array, don't bother initialising it". So it would 
initialise an internal bitmap to say "all these stripes are empty".

As the file system above sends writes down, mdadm would update the 
bitmap to say "these stripes are in use".

AND THIS IS WHAT I MEAN BY "SUPPORTING TRIM" - when the filesystem sends 
a trim command, saying "I'm no longer using these blocks", MDADM WOULD 
REMEMBER, and if appropriate clear the bitmap.

So when a drive breaks, mdadm would know what stripes are in use, and 
what stripes to sync, and what stripes to ignore!

And no, you are COMPLETELY WRONG in assuming that the only block layers 
that implement trim are hardware. Any block layer that wants to can 
implement trim - there is no reason whatsoever why mdadm couldn't. I 
never said it had, I said I wish it did.

But if virtual file system layers did NOT implement trim, you'd have a 
lot of unhappy punters on Amazon Cloud, or Google Cloud, or whatever 
these other suppliers of virtual systems are, if they had to pay for all 
that disk storage they're not actually using, because their virtual hard 
disks couldn't free up empty space.

tldr - there is no reason why mdadm couldn't implement trim, and if it 
did, then it would know how much of the array needed to be sync'd and 
how much didn't need bothering with.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 22:05             ` Wol
@ 2022-11-27 22:08               ` Reindl Harald
  2022-11-27 22:11               ` Reindl Harald
  2022-11-27 22:17               ` Roman Mamedov
  2 siblings, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-27 22:08 UTC (permalink / raw)
  To: Wol, piergiorgio.sartor, John Stoffel; +Cc: David T-G, Linux RAID list



Am 27.11.22 um 23:05 schrieb Wol:
> On 27/11/2022 18:21, Reindl Harald wrote:
>>> If the array is 99% full, MD or ZFS/BTRFS have
>>> same behaviour, in terms of reliability.
>>> If the array is 0% full, as well
>>
>> you completly miss the point!
>>
>> if your mdadm-array is built with 6 TB drivres wehn you replace a 
>> drive you need to sync 6 TB no matter if 10 MB or 5 TB are actually used
> 
> And you are also completely missing the point!
> 
> When mdadm creates an array - IF IT SUPPORTED TRIM - you could tell it 
> "this is a blank array, don't bother initialising it". So it would 
> initialise an internal bitmap to say "all these stripes are empty"

you could - but it doesn't

> tldr - there is no reason why mdadm couldn't implement trim, and if it 
> did, then it would know how much of the array needed to be sync'd and 
> how much didn't need bothering with

who the hell cares about what it *could* do?
it simply doesn't, and that's the state of play

that's the difference between a filesystem on top of mdadm versus a 
filesystem which implements RAID on its own

that's the whole point: what things actually do

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 22:05             ` Wol
  2022-11-27 22:08               ` Reindl Harald
@ 2022-11-27 22:11               ` Reindl Harald
  2022-11-27 22:17               ` Roman Mamedov
  2 siblings, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-27 22:11 UTC (permalink / raw)
  To: Wol, piergiorgio.sartor, John Stoffel; +Cc: David T-G, Linux RAID list



Am 27.11.22 um 23:05 schrieb Wol:
> And no, you are COMPLETELY WRONG in assuming that the only block layers 
> that implement trim is hardware. 

from a layered point of view my vmware vmdk disks supporting trim are 
still hardware

> Any block layer that wants to can 
> implement trim - there is no reason whatsoever why mdadm couldn't. I 
> never said it had, I said I wish it did

but when it comes to reality nobody cares what you wish when i outline 
the differences in the real world

in the real world mdadm doesn't do what you wish, only what it does

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 22:05             ` Wol
  2022-11-27 22:08               ` Reindl Harald
  2022-11-27 22:11               ` Reindl Harald
@ 2022-11-27 22:17               ` Roman Mamedov
  2 siblings, 0 replies; 62+ messages in thread
From: Roman Mamedov @ 2022-11-27 22:17 UTC (permalink / raw)
  To: Wol
  Cc: Reindl Harald, piergiorgio.sartor, John Stoffel, David T-G,
	Linux RAID list

On Sun, 27 Nov 2022 22:05:07 +0000
Wol <antlists@youngman.org.uk> wrote:

> When mdadm creates an array - IF IT SUPPORTED TRIM - you could tell it 
> "this is a blank array, don't bother initialising it". So it would 
> initialise an internal bitmap to say "all these stripes are empty".
> 
> As the file system above sends writes down, mdadm would update the 
> bitmap to say "these stripes are in use".
> 
> AND THIS IS WHAT I MEAN BY "SUPPORTING TRIM" - when the filesystem sends 
> a trim command, saying "I'm no longer using these blocks", MDADM WOULD 
> REMEMBER, and if appropriate clear the bitmap.

mdadm supports TRIM in the sense that it properly passes down TRIM of the
array to participant member devices, for them to do their own underlying
storage management operations. Even from the complex RAID levels such as RAID5
and RAID6. This is commendable and IIRC was far from always the case.

Stretching "TRIM support" to mean a hypothetical used-blocks-accounting
feature and use of that to speed-up rebuilds, is a surefire way to get
misunderstood in a conversation. As is saying "mdadm doesn't support TRIM"
only due to it not having that specific advanced feature.
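
As an aside, anyone curious what their own stack advertises can ask lsblk;
non-zero DISC-GRAN/DISC-MAX means the device accepts discards at that
layer.  For example (device names are just placeholders):

  lsblk --discard /dev/md50 /dev/sdb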

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: how do i fix these RAID5 arrays?
  2022-11-27 19:21                   ` Wol
@ 2022-11-28  1:26                     ` Reindl Harald
  0 siblings, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-28  1:26 UTC (permalink / raw)
  To: Wol, Roman Mamedov; +Cc: John Stoffel, David T-G, Linux RAID list



Am 27.11.22 um 20:21 schrieb Wol:
> On 27/11/2022 18:08, Roman Mamedov wrote:
>> On Sun, 27 Nov 2022 14:33:37 +0000
>> Wol <antlists@youngman.org.uk> wrote:
>>
>>> If raid supports trim, that means it intercepts the trim commands, and
>>> uses it to keep track of what's being used by the layer above.
>>>
>>> In other words, if the filesystem is only using 10% of the disk,
>>> supporting trim means that raid knows which 10% is being used and only
>>> bothers syncing that!
>>
>> Not sure which RAID system you are speaking of, but that's not presently
>> implemented in mdadm RAID. It does not use TRIM of the array to keep 
>> track of
>> unused areas on the underlying devices, to skip those during rebuilds. 
>> And I
>> am unaware of any other RAID that does. Would be nice to have though.
>>
> Yup that's what I was saying - it would very much be a "nice to have"

you clearly need to distinguish "nice to have" versus "state of play" - 
nobody gains anything from nice to have when the topic is *existing* 
differences

what is "nice to have" for mdadm already exists in ZFS/BTRFS

^ permalink raw reply	[flat|nested] 62+ messages in thread

* md RAID0 can be grown (was "Re: how do i fix these RAID5 arrays?")
  2022-11-25 19:49             ` David T-G
@ 2022-11-28 14:24               ` David T-G
  2022-11-29 21:17                 ` Jani Partanen
  2022-12-03  5:41                 ` md RAID0 can be grown David T-G
  0 siblings, 2 replies; 62+ messages in thread
From: David T-G @ 2022-11-28 14:24 UTC (permalink / raw)
  To: Linux RAID list

Hi, all --

...and then David T-G home said...
% 
% ...and then Roger Heflin said...
% % You may not be able to grow with either linear and/or raid0 under mdadm.
% 
...
% What do you think of
% 
%   mdadm -A --update=devicesize /dev/md50
% 
% as discussed in
% 
%   https://serverfault.com/questions/1068788/how-to-change-size-of-raid0-software-array-by-resizing-partition
% 
% recently?

It looks like this works.  Read on for more future plans, but here's how
growing worked out.

First, you'll recall, I added the new slices to each RAID5 array and then
fixed them so that they're all working again.  Thank you, everyone :-)

Second, all I had to do was stop the array and reassemble, and md noticed
like a champ.  Awesome!

  diskfarm:~ # mdadm -D /dev/md50
  /dev/md50:
  ...
          Raid Level : raid0
          Array Size : 19526301696 (18.19 TiB 19.99 TB)
  ...
  diskfarm:~ # mdadm -S /dev/md50
  mdadm: stopped /dev/md50
  diskfarm:~ # mdadm -A --update=devicesize /dev/md50
  mdadm: /dev/md50 has been started with 6 drives.
  diskfarm:~ # mdadm -D /dev/md50
  /dev/md50:
  ...
          Raid Level : raid0
          Array Size : 29289848832 (27.28 TiB 29.99 TB)
  ...

Next I had to resize the partition to use the additional space now
available.

  diskfarm:~ # parted /dev/md50
  ...
  (parted) u s p free
  Model: Linux Software RAID Array (md)
  Disk /dev/md50: 58579697664s
  Sector size (logical/physical): 512B/4096B
  Partition Table: gpt
  Disk Flags:
  
  Number  Start         End           Size          File system  Name         Flags
          34s           6143s         6110s         Free Space
   1      6144s         39052597247s  39052591104s  xfs          10Traid50md
          39052597248s  58579697630s  19527100383s  Free Space
  
  (parted) rm 1
  (parted) mkpart pri xfs 6144s 100%
  (parted) name 1 10Traid50md
  (parted) p free
  Model: Linux Software RAID Array (md)
  Disk /dev/md50: 58579697664s
  Sector size (logical/physical): 512B/4096B
  Partition Table: gpt
  Disk Flags:
  
  Number  Start         End           Size          File system  Name         Flags
          34s           6143s         6110s         Free Space
   1      6144s         58579691519s  58579685376s  xfs          10Traid50md
          58579691520s  58579697630s  6111s         Free Space
  
  (parted) q
  
  diskfarm:~ # parted /dev/md50 p free
  Model: Linux Software RAID Array (md)
  Disk /dev/md50: 30.0TB
  Sector size (logical/physical): 512B/4096B
  Partition Table: gpt
  Disk Flags:
  
  Number  Start   End     Size    File system  Name         Flags
          17.4kB  3146kB  3128kB  Free Space
   1      3146kB  30.0TB  30.0TB  xfs          10Traid50md
          30.0TB  30.0TB  3129kB  Free Space

Finally, I had to grow the XFS filesystem.  That was simple enough,
although it's supposed to be done with the volume mounted, which just
felt ... wrong :-)

  diskfarm:~ # df -kh /mnt/10Traid50md/
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/md50p1      19T   19T   95G 100% /mnt/10Traid50md
  diskfarm:~ # xfs_growfs -n /mnt/10Traid50md
  meta-data=/dev/md50p1            isize=512    agcount=32, agsize=152549248 blks
           =                       sectsz=4096  attr=2, projid32bit=1
           =                       crc=1        finobt=1, sparse=0, rmapbt=0
           =                       reflink=0
  data     =                       bsize=4096   blocks=4881573888, imaxpct=5
           =                       sunit=128    swidth=768 blks
  naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
  log      =internal log           bsize=4096   blocks=521728, version=2
           =                       sectsz=4096  sunit=1 blks, lazy-count=1
  realtime =none                   extsz=4096   blocks=0, rtextents=0
  diskfarm:~ # xfs_growfs /mnt/10Traid50md
  meta-data=/dev/md50p1            isize=512    agcount=32, agsize=152549248 blks
           =                       sectsz=4096  attr=2, projid32bit=1
           =                       crc=1        finobt=1, sparse=0, rmapbt=0
           =                       reflink=0
  data     =                       bsize=4096   blocks=4881573888, imaxpct=5
           =                       sunit=128    swidth=768 blks
  naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
  log      =internal log           bsize=4096   blocks=521728, version=2
           =                       sectsz=4096  sunit=1 blks, lazy-count=1
  realtime =none                   extsz=4096   blocks=0, rtextents=0
  data blocks changed from 4881573888 to 7322460672
  diskfarm:~ # df -kh /mnt/10Traid50md
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/md50p1      28T   19T  9.2T  67% /mnt/10Traid50md

Et voila, we have more free space.  Yay.

So this works in theory, but ... there's that linear question :-/


% 
...
% % so here is roughly how to do it (commands may not be exact)> and assuming
% % your devices are /dev/md5[0123]
% % 
% % PV == physical volume (a disk or md raid device generally).
% % VG == volume group (a group of PV).
% % LV == logical volume (a block device inside a vg made up of part of a PV or
% % several PVs).
% % 
% % pvcreate /dev/md5[0123]
% % vgcreate bigvg /dev/md5[0123]
% % lvcreate -L <size> -n mylv bigvg
[snip]

Thanks again, and I do plan to read up on LVM.  For now, though, I'm
thinkin' I'll rebuild under md in linear mode.  Stealing from my RAID10
subthread (where I owe similar tests), I tried pulling 128MiB to 8GiB of
data from a single RAID5 slice versus the big RAID0 stripe

  diskfarm:~ # for D in 52 50 ; do for C in 128 256 512 ; do for S in 1M 4M 16M ; do CMD="dd if=/dev/md$D of=/dev/null bs=$S count=$C iflag=direct" ; echo "## $CMD" ; $CMD 2>&1 | egrep -v records ; done ; done ; done
  ## dd if=/dev/md52 of=/dev/null bs=1M count=128 iflag=direct
  134217728 bytes (134 MB, 128 MiB) copied, 1.20121 s, 112 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=4M count=128 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 1.82563 s, 294 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=16M count=128 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 9.03782 s, 238 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=1M count=256 iflag=direct
  268435456 bytes (268 MB, 256 MiB) copied, 2.6694 s, 101 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=4M count=256 iflag=direct
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.72331 s, 288 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=16M count=256 iflag=direct
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 13.6094 s, 316 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=1M count=512 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 6.39903 s, 83.9 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=4M count=512 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 7.45123 s, 288 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=16M count=512 iflag=direct
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 28.1189 s, 305 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=1M count=128 iflag=direct
  134217728 bytes (134 MB, 128 MiB) copied, 3.74023 s, 35.9 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=4M count=128 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 9.96306 s, 53.9 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=16M count=128 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 19.994 s, 107 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=1M count=256 iflag=direct
  268435456 bytes (268 MB, 256 MiB) copied, 7.25855 s, 37.0 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=4M count=256 iflag=direct
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 18.9692 s, 56.6 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=16M count=256 iflag=direct
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 40.2443 s, 107 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=1M count=512 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 14.1076 s, 38.1 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=4M count=512 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 38.6795 s, 55.5 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=16M count=512 iflag=direct
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 81.4364 s, 105 MB/s

and as expected the difference

  RAID5 / RAID0 performance
  (speedup)
  
          1M        4M       16M
      +---------+---------+---------+
  128 | 112/036 | 294/054 | 238/107 |
      | (3.1)   | (5.4)   | (2.2)   |
      +---------+---------+---------+
  256 | 101/037 | 288/057 | 316/107 |
      | (2.7)   | (5.0)   | (3.0)   |
      +---------+---------+---------+
  512 | 084/038 | 288/056 | 305/105 |
      | (2.2)   | (5.1)   | (2.9)   |
      +---------+---------+---------+

is significant.  So, yeah, I'll be wiping and rebuilding md50 as a
straight linear.  Watch for more test results when that's done :-)
Fingers crossed that I get much better results; if not, maybe it'll
be time to switch to LVM after all.
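
For the record, the rebuild I have in mind is roughly this (only a sketch,
only after everything on md50 is safely copied elsewhere, and assuming the
six RAID5 arrays are /dev/md51 through /dev/md56 as before):

  mdadm --stop /dev/md50       # retire the current RAID0 stripe
  mdadm --create /dev/md50 --level=linear --raid-devices=6 /dev/md51 /dev/md52 /dev/md53 /dev/md54 /dev/md55 /dev/md56
  # then re-partition and mkfs.xfs on top, as in the steps above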


Thanks again to all & HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-11-25 18:00         ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") Roger Heflin
@ 2022-11-28 14:46           ` David T-G
  2022-11-28 15:32             ` Reindl Harald
       [not found]             ` <CAAMCDee_YrhXo+5hp31YXgUHkyuUr-zTXOqi0-HUjMrHpYMkTQ@mail.gmail.com>
  0 siblings, 2 replies; 62+ messages in thread
From: David T-G @ 2022-11-28 14:46 UTC (permalink / raw)
  To: Linux RAID list

Hi again, all --

...and then Roger Heflin said...
% You do not want to stripe 2 partitions on a single disk, you want that linear.
% 
...
% 
% do a dd if=/dev/mdXX of=/dev/null bs=1M count=100 iflag=direct  on one
% of the raid5s of the partitions and then on the raid1 device over
% them.  I would expect the raid device over them to be much slower, I
% am not sure how much but 5x-20x.

Note that we aren't talking RAID5 here but simple RAID1; still, I follow you.
Time for more testing.  I ran the same dd tests as on the RAID5 setup

  jpo:~ # for D in 41 40 ; do for C in 128 256 512 ; do for S in 1M 4M 16M ; do CMD="dd if=/dev/md$D of=/dev/null bs=$S count=$C iflag=direct" ; echo "## $CMD" ; $CMD 2>&1 | egrep -v records ; done ; done ; done
  ## dd if=/dev/md41 of=/dev/null bs=1M count=128 iflag=direct
  134217728 bytes (134 MB, 128 MiB) copied, 0.710608 s, 189 MB/s
  ## dd if=/dev/md41 of=/dev/null bs=4M count=128 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 2.7903 s, 192 MB/s
  ## dd if=/dev/md41 of=/dev/null bs=16M count=128 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 11.3205 s, 190 MB/s
  ## dd if=/dev/md41 of=/dev/null bs=1M count=256 iflag=direct
  268435456 bytes (268 MB, 256 MiB) copied, 1.41372 s, 190 MB/s
  ## dd if=/dev/md41 of=/dev/null bs=4M count=256 iflag=direct
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.50616 s, 195 MB/s
  ## dd if=/dev/md41 of=/dev/null bs=16M count=256 iflag=direct
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 22.7846 s, 189 MB/s
  ## dd if=/dev/md41 of=/dev/null bs=1M count=512 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 3.02753 s, 177 MB/s
  ## dd if=/dev/md41 of=/dev/null bs=4M count=512 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 11.2099 s, 192 MB/s
  ## dd if=/dev/md41 of=/dev/null bs=16M count=512 iflag=direct
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 45.5623 s, 189 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=1M count=128 iflag=direct
  134217728 bytes (134 MB, 128 MiB) copied, 1.19657 s, 112 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=4M count=128 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 4.32003 s, 124 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=16M count=128 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 12.0615 s, 178 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=1M count=256 iflag=direct
  268435456 bytes (268 MB, 256 MiB) copied, 2.38074 s, 113 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=4M count=256 iflag=direct
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.62803 s, 124 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=16M count=256 iflag=direct
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 25.2467 s, 170 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=1M count=512 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 5.13948 s, 104 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=4M count=512 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 16.5954 s, 129 MB/s
  ## dd if=/dev/md40 of=/dev/null bs=16M count=512 iflag=direct
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 55.5721 s, 155 MB/s

and did the math again

          1M        4M       16M
      +---------+---------+---------+
  128 | 189/112 | 192/124 | 190/178 |
      | (1.68)  | (1.54)  | (1.06)  |
      +---------+---------+---------+
  256 | 190/113 | 195/124 | 189/170 |
      | (1.68)  | (1.57)  | (1.11)  |
      +---------+---------+---------+
  512 | 177/104 | 192/129 | 189/155 |
      | (1.70)  | (1.48)  | (1.21)  |
      +---------+---------+---------+

and ... that was NOT what I expected!  I wonder if it's because of stripe
versus linear again.  A straight mirror will run down the entire disk,
so there's no speedup; if you have to seek from one end to the other, the
head moves the whole way.  By mirroring two halves and swapping them and
then gluing them together, though, a read *should* only have to hit the
first half of either disk and thus be FASTER.  And maybe that's the case
for random versus sequential reads; I dunno.  The difference was nearly
negligible for large reads, but I get a 40% penalty on small reads -- and
this server leans much more toward small files versus large.  Bummer :-(

I don't at this time have a device free to plug in locally to back up the
volume to destroy and rebuild as linear, so that will have to wait.  When
I do get that chance, though, will that help me get to the awesome goal
of actually INCREASING performance by including a RAID0 layer?


Thanks again & HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-11-28 14:46           ` about linear and about RAID10 David T-G
@ 2022-11-28 15:32             ` Reindl Harald
       [not found]               ` <CAAMCDecXkcmUe=ZFnJ_NndND0C2=D5qSoj1Hohsrty8y1uqdfw@mail.gmail.com>
                                 ` (2 more replies)
       [not found]             ` <CAAMCDee_YrhXo+5hp31YXgUHkyuUr-zTXOqi0-HUjMrHpYMkTQ@mail.gmail.com>
  1 sibling, 3 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-28 15:32 UTC (permalink / raw)
  To: David T-G, Linux RAID list



Am 28.11.22 um 15:46 schrieb David T-G:
> I don't at this time have a device free to plug in locally to back up the
> volume to destroy and rebuild as linear, so that will have to wait.  When
> I do get that chance, though, will that help me get to the awesome goal
> of actually INCREASING performance by including a RAID0 layer?

stacking layers over layers will *never* increase performance - a pure 
RAID0 will, but if one disk is dead all is lost

additional RAID0 on top or below another RAID won't help

your main problem starts by slicing your drives in dozens of partitions 
and "the idea being that each piece of which should take less time to 
rebuild if something fails"

when a drive fails all your partitions on that drive are gone - so 
rebuild isn't faster at the end

with that slicing and layers over layers you get unpredictable 
head-movements slowing things down

keep it SIMPLE!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
       [not found]               ` <CAAMCDecXkcmUe=ZFnJ_NndND0C2=D5qSoj1Hohsrty8y1uqdfw@mail.gmail.com>
@ 2022-11-28 17:03                 ` Reindl Harald
  0 siblings, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-11-28 17:03 UTC (permalink / raw)
  To: Roger Heflin; +Cc: David T-G, Linux RAID list



Am 28.11.22 um 17:51 schrieb Roger Heflin:
> It depends on why the drive fails.
> 
> My operational experience is a complete drive failure is rare

i stopped counting the replaced HDDs in the past 20 years

> Most  of the failures are a bad sector

that's a replacement case once it makes it up to the layer where mdadm 
even has to act

> and/or a transient interface error

won't trigger a rebuild - that's what write intent bitmaps are for

> on my sliced setup most of the time only result in a single 
> slice/partition getting kicked and needing a re-add.  

how can a single partition disappear and the others not?

> And the slicing 
> does make each grow step smaller/faster, and it also allows one to play 
> games with different sized disks being sliced.

smaller but NOT faster

> My setup started with 1.5TB disks sliced into 2, and then when I went to 
> 3TB disks sliced into 4, and allows mixing and matching underlying disk 
> sizes for a system with some disks of each.   My 6's have 6 slices 
> (4x.75TB, 2x1.5TB).

yeah, it makes sense to separate os/data and so on, but at the end of 
the day the OP throws LVM and/or another RAID layer on top to end up 
with a single big storage pool where all that partition slicing is nonsense

> On Mon, Nov 28, 2022 at 10:06 AM Reindl Harald <h.reindl@thelounge.net> wrote:
> 
>     Am 28.11.22 um 15:46 schrieb David T-G:
>      > I don't at this time have a device free to plug in locally to
>     back up the
>      > volume to destroy and rebuild as linear, so that will have to
>     wait.  When
>      > I do get that chance, though, will that help me get to the
>     awesome goal
>      > of actually INCREASING performance by including a RAID0 layer?
> 
>     stacking layers over layers will *never* increase performance - a pure
>     RAID0 will but if one disk is dead all is lost
> 
>     additional RAID0 on top or below another RAID won't help
> 
>     your main problem starts by slicing your drives in dozens of partitions
>     and "the idea being that each piece of which should take less time to
>     rebuild if something fails"
> 
>     when a drive fails all your partitions on that drive are gone - so
>     rebuild isn't faster at the end
> 
>     with that slicing and layers over layers you get unpredictable
>     head-movemnets slowing things down
> 
>     keep it SIMPLE!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-11-28 15:32             ` Reindl Harald
       [not found]               ` <CAAMCDecXkcmUe=ZFnJ_NndND0C2=D5qSoj1Hohsrty8y1uqdfw@mail.gmail.com>
@ 2022-11-28 20:45               ` John Stoffel
  2022-12-03  5:58                 ` David T-G
  2022-12-03  5:45               ` David T-G
  2 siblings, 1 reply; 62+ messages in thread
From: John Stoffel @ 2022-11-28 20:45 UTC (permalink / raw)
  To: Reindl Harald; +Cc: David T-G, Linux RAID list

>>>>> "Reindl" == Reindl Harald <h.reindl@thelounge.net> writes:

> Am 28.11.22 um 15:46 schrieb David T-G:
>> I don't at this time have a device free to plug in locally to back up the
>> volume to destroy and rebuild as linear, so that will have to wait.  When
>> I do get that chance, though, will that help me get to the awesome goal
>> of actually INCREASING performance by including a RAID0 layer?

> stacking layers over layers will *never* increase performance - a pure 
> RAID0 will but if one disk is dead all is lost

> additional RAID0 on top or below another RAID won't help

> your main problem starts by slicing your drives in dozens of partitions 
> and "the idea being that each piece of which should take less time to 
> rebuild if something fails"

> when a drive fails all your partitions on that drive are gone - so 
> rebuild isn't faster at the end

> with that slicing and layers over layers you get unpredictable 
> head-movemnets slowing things down

> keep it SIMPLE!

This is my mantra as well here.  For my home system, I prefer
simplicity, robustness, and performance, so I tend to just use RAID1
mirrors of all my disks.  I really don't have all that much stuff I
need lots of disk space for.  And for that I have a scratch volume.  

I really like MD down low, with LVM on top so I can move LVs and
resize them easily (generally grow only) to make more room.  If I
really need to shrink a filesystem, it's an outage, but usually that's
ok.
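
Growing is about a one-liner, too; assuming a VG called vg0 and an LV
called home (names made up for the example):

  # extend the LV by 100G and grow the filesystem inside it in the same step
  lvextend --resizefs -L +100G /dev/vg0/home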

But keeping it simple means that when things break and you're tired
and have family whining at you to get things working again, life will
tend to end up better.

Slicing HDDs into multiple partitions and then running MD on those
multiple partitions just screams of complexity.  Yes, in some cases
you might get a quicker rebuild if you have a block of sectors go bad,
but in general if a disk starts throwing out bad sectors, I'm gonna
replace the entire disk ASAP.  

Now if you have a primary SSD and a write-mostly HDD type setup, then
complexity might be worth it.  Might.  


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: md RAID0 can be grown (was "Re: how do i fix these RAID5 arrays?")
  2022-11-28 14:24               ` md RAID0 can be grown (was "Re: how do i fix these RAID5 arrays?") David T-G
@ 2022-11-29 21:17                 ` Jani Partanen
  2022-11-29 22:22                   ` Roman Mamedov
  2022-12-03  5:41                   ` md vs LVM and VMs and ... (was "Re: md RAID0 can be grown (was ...") David T-G
  2022-12-03  5:41                 ` md RAID0 can be grown David T-G
  1 sibling, 2 replies; 62+ messages in thread
From: Jani Partanen @ 2022-11-29 21:17 UTC (permalink / raw)
  To: David T-G, Linux RAID list

Hi David,

Nice to see that there are others who like to take things to the extreme 
and live on the razor's edge. ;)

I had different sized disks, so I built my raid5 by first joining, for 
example, a 1TB and a 2TB together with md linear so I could add that as 
a member to the otherwise-3TB raid5 pool.

But stuff started to become complex, and then I had a few disk failures 
with the 3TB drives (didn't lose any data).  So I bought some more 4TB 
drives, enough that I was able to start all over.  I made a new pool from 
the 4TB drives, added it to LVM and then moved all data from the old pool 
to this shiny new one.  When the data was transferred, I took the 
remaining 3TB drives, made a raid5 pool out of them and added it to LVM.
Anyway, go LVM if you are planning to slice and dice disks like before. 
With LVM, adding space later on is a much simpler task, and LVM 
automatically works like linear but offers many other benefits.
LVM is very much worth learning.  Start up a virtual machine, assign 
multiple small vdisks to it, and you have an environment where it's safe 
to play around with md and lvm.
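
You don't even strictly need a VM; sparse files on loop devices make a
throwaway sandbox too.  A minimal sketch, with made-up paths, assuming
the loop devices come back unused:

  for i in 1 2 3 4; do truncate -s 1G /tmp/disk$i.img; done        # four 1 GiB sparse files
  for i in 1 2 3 4; do losetup --find --show /tmp/disk$i.img; done # attach them, prints /dev/loopN
  mdadm --create /dev/md100 --level=5 --raid-devices=4 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
  pvcreate /dev/md100 && vgcreate sandboxvg /dev/md100             # then play with LVM on top

Tear it down afterwards with mdadm --stop /dev/md100 and losetup -d on
each loop device.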

Here are some snips of how my setup looks now.  I have different logical 
volumes for the big data pool and for virtual machines.  The reason is 
that I don't want VMs to go through encryption; a VM can do it on its 
own, and I can also just create a logical volume and install a VM 
directly to it if I want.  The "big" datapool is btrfs because I like 
COW; quite a lot of my use benefits from it.


-----------------------------

vgdisplay
   --- Volume group ---
   VG Name               volgroup0_datapool
   System ID
   Format                lvm2
   Metadata Areas        2
   Metadata Sequence No  62
   VG Access             read/write
   VG Status             resizable
   MAX LV                0
   Cur LV                5
   Open LV               1
   Max PV                0
   Cur PV                2
   Act PV                2
   VG Size               28.19 TiB
   PE Size               4.00 MiB
   Total PE              7390684
   Alloc PE / Size       6843196 / 26.10 TiB
   Free  PE / Size       547488 / <2.09 TiB
   VG UUID               OmE21G-oG3a-Oxqb-Nc0V-uAes-PFfh-nvvP07

  pvdisplay
   --- Physical volume ---
   PV Name               /dev/md124
   VG Name               volgroup0_datapool
   PV Size               13.64 TiB / not usable 6.00 MiB
   Allocatable           yes (but full)
   PE Size               4.00 MiB
   Total PE              3576116
   Free PE               0
   Allocated PE          3576116
   PV UUID               jtE1e0-JK7h-Z6Xy-X1ms-vhnT-NkMh-6aFLRK

   --- Physical volume ---
   PV Name               /dev/md123
   VG Name               volgroup0_datapool
   PV Size               14.55 TiB / not usable 4.00 MiB
   Allocatable           yes
   PE Size               4.00 MiB
   Total PE              3814568
   Free PE               547488
   Allocated PE          3267080
   PV UUID               kYMebn-1JK8-aApe-eTSN-QY7v-c59d-FGStdu

cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md123 : active raid5 sdf1[3] sde1[2] sdd1[1] sdg1[5] sdc1[0]
       15624474624 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
       bitmap: 0/30 pages [0KB], 65536KB chunk

md124 : active raid5 sdn1[6] sdj1[1] sdl1[3] sdk1[2] sdi1[0] sdm1[4]
       14647777280 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
       bitmap: 0/22 pages [0KB], 65536KB chunk

-----------------------------

It is a shame that Linux doesn't have native support for one system to
rule them all.  There is ZFS, but because of its licence it will most
likely never be part of the kernel, so at least I am not going to use it
until it is in the kernel.  Even Windows is far ahead of Linux on that
front: Storage Spaces is quite powerful and you can do all kinds of nice
stuff with it.  On my Windows machine I have a tiered store done with
Storage Spaces, meaning that an HDD and an SSD are paired as one big pool
and the hot data stays on the SSD.

You can achieve about everything ZFS offers with Linux tools, but it
means you will need to build a very complex pool, layer after layer, and
at least I feel that the more layers you add, the more breaking points
you add.

I really like btrfs, but because raid5 scrub is broken it cannot really
be used; the broken raid5 scrub is the only deal breaker keeping me from
using it.  The write-hole bug is not an issue if you don't raid5 your
metadata.  But btrfs is still missing a few key features that ZFS offers:
encryption and a cache.


On 28/11/2022 16.24, David T-G wrote:
> Hi, all --
>
> ...and then David T-G home said...
> %
> % ...and then Roger Heflin said...
> % % You may not be able to grow with either linear and/or raid0 under mdadm.
> %
> ...
> % What do you think of
> %
> %   mdadm -A --update=devicesize /dev/md50
> %
> % as discussed in
> %
> %   https://serverfault.com/questions/1068788/how-to-change-size-of-raid0-software-array-by-resizing-partition
> %
> % recently?
>
> It looks like this works.  Read on for more future plans, but here's how
> growing worked out.
>
> First, you'll recall, I added the new slices to each RAID5 array and then
> fixed them so that they're all working again.  Thank you, everyone :-)
>
> Second, all I had to do was stop the array and reassemble, and md noticed
> like a champ.  Awesome!
>
>    diskfarm:~ # mdadm -D /dev/md50
>    /dev/md50:
>    ...
>            Raid Level : raid0
>            Array Size : 19526301696 (18.19 TiB 19.99 TB)
>    ...
>    diskfarm:~ # mdadm -S /dev/md50
>    mdadm: stopped /dev/md50
>    diskfarm:~ # mdadm -A --update=devicesize /dev/md50
>    mdadm: /dev/md50 has been started with 6 drives.
>    diskfarm:~ # mdadm -D /dev/md50
>    /dev/md50:
>    ...
>            Raid Level : raid0
>            Array Size : 29289848832 (27.28 TiB 29.99 TB)
>    ...
>
> Next I had to resize the partition to use the additional space now
> available.
>
>    diskfarm:~ # parted /dev/md50
>    ...
>    (parted) u s p free
>    Model: Linux Software RAID Array (md)
>    Disk /dev/md50: 58579697664s
>    Sector size (logical/physical): 512B/4096B
>    Partition Table: gpt
>    Disk Flags:
>    
>    Number  Start         End           Size          File system  Name         Flags
>            34s           6143s         6110s         Free Space
>     1      6144s         39052597247s  39052591104s  xfs          10Traid50md
>            39052597248s  58579697630s  19527100383s  Free Space
>    
>    (parted) rm 1
>    (parted) mkpart pri xfs 6144s 100%
>    (parted) name 1 10Traid50md
>    (parted) p free
>    Model: Linux Software RAID Array (md)
>    Disk /dev/md50: 58579697664s
>    Sector size (logical/physical): 512B/4096B
>    Partition Table: gpt
>    Disk Flags:
>    
>    Number  Start         End           Size          File system  Name         Flags
>            34s           6143s         6110s         Free Space
>     1      6144s         58579691519s  58579685376s  xfs          10Traid50md
>            58579691520s  58579697630s  6111s         Free Space
>    
>    (parted) q
>    
>    diskfarm:~ # parted /dev/md50 p free
>    Model: Linux Software RAID Array (md)
>    Disk /dev/md50: 30.0TB
>    Sector size (logical/physical): 512B/4096B
>    Partition Table: gpt
>    Disk Flags:
>    
>    Number  Start   End     Size    File system  Name         Flags
>            17.4kB  3146kB  3128kB  Free Space
>     1      3146kB  30.0TB  30.0TB  xfs          10Traid50md
>            30.0TB  30.0TB  3129kB  Free Space
>
> Finally, I had to grow the XFS filesystem.  That was simple enough,
> although it's supposed to be done with the volume mounted, which just
> felt ... wrong :-)
>
>    diskfarm:~ # df -kh /mnt/10Traid50md/
>    Filesystem      Size  Used Avail Use% Mounted on
>    /dev/md50p1      19T   19T   95G 100% /mnt/10Traid50md
>    diskfarm:~ # xfs_growfs -n /mnt/10Traid50md
>    meta-data=/dev/md50p1            isize=512    agcount=32, agsize=152549248 blks
>             =                       sectsz=4096  attr=2, projid32bit=1
>             =                       crc=1        finobt=1, sparse=0, rmapbt=0
>             =                       reflink=0
>    data     =                       bsize=4096   blocks=4881573888, imaxpct=5
>             =                       sunit=128    swidth=768 blks
>    naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>    log      =internal log           bsize=4096   blocks=521728, version=2
>             =                       sectsz=4096  sunit=1 blks, lazy-count=1
>    realtime =none                   extsz=4096   blocks=0, rtextents=0
>    diskfarm:~ # xfs_growfs /mnt/10Traid50md
>    meta-data=/dev/md50p1            isize=512    agcount=32, agsize=152549248 blks
>             =                       sectsz=4096  attr=2, projid32bit=1
>             =                       crc=1        finobt=1, sparse=0, rmapbt=0
>             =                       reflink=0
>    data     =                       bsize=4096   blocks=4881573888, imaxpct=5
>             =                       sunit=128    swidth=768 blks
>    naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>    log      =internal log           bsize=4096   blocks=521728, version=2
>             =                       sectsz=4096  sunit=1 blks, lazy-count=1
>    realtime =none                   extsz=4096   blocks=0, rtextents=0
>    data blocks changed from 4881573888 to 7322460672
>    diskfarm:~ # df -kh /mnt/10Traid50md
>    Filesystem      Size  Used Avail Use% Mounted on
>    /dev/md50p1      28T   19T  9.2T  67% /mnt/10Traid50md
>
> Et voila, we have more free space.  Yay.
>
> So this works in theory, but ... there's that linear question :-/
>
>
> %
> ...
> % % so here is roughly how to do it (commands may not be exact)> and assuming
> % % your devices are /dev/md5[0123]
> % %
> % % PV == physical volume (a disk or md raid device generally).
> % % VG == volume group (a group of PV).
> % % LV == logical volume (a block device inside a vg made up of part of a PV or
> % % several PVs).
> % %
> % % pvcreate /dev/md5[0123]
> % % vgcreate bigvg /dev/md5[0123]
> % % lvcreate -L <size> -n mylv bigvg
> [snip]
>
> Thanks again, and I do plan to read up on LVM.  For now, though, I'm
> thinkin' I'll rebuild under md in linear mode.  Stealing from my RAID10
> subthread (where I owe similar tests), I tried pulling 128MiB to 8GiB of
> data from a single RAID5 slice versus the big RAID0 stripe
>
>    diskfarm:~ # for D in 52 50 ; do for C in 128 256 512 ; do for S in 1M 4M 16M ; do CMD="dd if=/dev/md$D of=/dev/null bs=$S count=$C iflag=direct" ; echo "## $CMD" ; $CMD 2>&1 | egrep -v records ; done ; done ; done
>    ## dd if=/dev/md52 of=/dev/null bs=1M count=128 iflag=direct
>    134217728 bytes (134 MB, 128 MiB) copied, 1.20121 s, 112 MB/s
>    ## dd if=/dev/md52 of=/dev/null bs=4M count=128 iflag=direct
>    536870912 bytes (537 MB, 512 MiB) copied, 1.82563 s, 294 MB/s
>    ## dd if=/dev/md52 of=/dev/null bs=16M count=128 iflag=direct
>    2147483648 bytes (2.1 GB, 2.0 GiB) copied, 9.03782 s, 238 MB/s
>    ## dd if=/dev/md52 of=/dev/null bs=1M count=256 iflag=direct
>    268435456 bytes (268 MB, 256 MiB) copied, 2.6694 s, 101 MB/s
>    ## dd if=/dev/md52 of=/dev/null bs=4M count=256 iflag=direct
>    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.72331 s, 288 MB/s
>    ## dd if=/dev/md52 of=/dev/null bs=16M count=256 iflag=direct
>    4294967296 bytes (4.3 GB, 4.0 GiB) copied, 13.6094 s, 316 MB/s
>    ## dd if=/dev/md52 of=/dev/null bs=1M count=512 iflag=direct
>    536870912 bytes (537 MB, 512 MiB) copied, 6.39903 s, 83.9 MB/s
>    ## dd if=/dev/md52 of=/dev/null bs=4M count=512 iflag=direct
>    2147483648 bytes (2.1 GB, 2.0 GiB) copied, 7.45123 s, 288 MB/s
>    ## dd if=/dev/md52 of=/dev/null bs=16M count=512 iflag=direct
>    8589934592 bytes (8.6 GB, 8.0 GiB) copied, 28.1189 s, 305 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=1M count=128 iflag=direct
>    134217728 bytes (134 MB, 128 MiB) copied, 3.74023 s, 35.9 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=4M count=128 iflag=direct
>    536870912 bytes (537 MB, 512 MiB) copied, 9.96306 s, 53.9 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=16M count=128 iflag=direct
>    2147483648 bytes (2.1 GB, 2.0 GiB) copied, 19.994 s, 107 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=1M count=256 iflag=direct
>    268435456 bytes (268 MB, 256 MiB) copied, 7.25855 s, 37.0 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=4M count=256 iflag=direct
>    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 18.9692 s, 56.6 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=16M count=256 iflag=direct
>    4294967296 bytes (4.3 GB, 4.0 GiB) copied, 40.2443 s, 107 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=1M count=512 iflag=direct
>    536870912 bytes (537 MB, 512 MiB) copied, 14.1076 s, 38.1 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=4M count=512 iflag=direct
>    2147483648 bytes (2.1 GB, 2.0 GiB) copied, 38.6795 s, 55.5 MB/s
>    ## dd if=/dev/md50 of=/dev/null bs=16M count=512 iflag=direct
>    8589934592 bytes (8.6 GB, 8.0 GiB) copied, 81.4364 s, 105 MB/s
>
> and as expected the difference
>
>    RAID5 / RAID0 performance
>    (speedup)
>    
>            1M        4M       16M
>        +---------+---------+---------+
>    128 | 112/036 | 294/054 | 238/107 |
>        | (3.1)   | (5.4)   | (2.2)   |
>        +---------+---------+---------+
>    256 | 101/037 | 288/057 | 316/107 |
>        | (2.7)   | (5.0)   | (3.0)   |
>        +---------+---------+---------+
>    512 | 084/038 | 288/056 | 305/105 |
>        | (2.2)   | (5.1)   | (2.9)   |
>        +---------+---------+---------+
>
> is significant.  So, yeah, I'll be wiping and rebuilding md50 as a
> straight linear.  Watch for more test results when that's done :-)
> Fingers crossed that I get much better results; if not, maybe it'll
> be time to switch to LVM after all.
>
>
> Thanks again to all & HAND
>
> :-D


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: md RAID0 can be grown (was "Re: how do i fix these RAID5 arrays?")
  2022-11-29 21:17                 ` Jani Partanen
@ 2022-11-29 22:22                   ` Roman Mamedov
  2022-12-03  5:41                   ` md vs LVM and VMs and ... (was "Re: md RAID0 can be grown (was ...") David T-G
  1 sibling, 0 replies; 62+ messages in thread
From: Roman Mamedov @ 2022-11-29 22:22 UTC (permalink / raw)
  To: Jani Partanen; +Cc: David T-G, Linux RAID list

On Tue, 29 Nov 2022 23:17:15 +0200
Jani Partanen <jiipee@sotapeli.fi> wrote:

> joined example 1TB and 2TB together with md linear so I could add that as 
> member to other 3TB raid5 pool.

I realize you describe a past setup, but as a fun tip, you could have used md
RAID0 for this: with different sized devices it would automatically stripe the
first 1TB of both disks, and then continue with the tail of the 2TB drive
as-is.  This would provide some linear-read speed benefit over most of the
device, with no change in reliability compared to md linear.
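
The setup itself would be nothing special; a sketch only, with sdX1 as the
1TB member and sdY1 as the 2TB one:

  mdadm --create /dev/md/mixed --level=0 --raid-devices=2 /dev/sdX1 /dev/sdY1
  mdadm -D /dev/md/mixed | grep 'Array Size'   # roughly the sum of both members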

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 62+ messages in thread

* md vs LVM and VMs and ... (was "Re: md RAID0 can be grown (was ...")
  2022-11-29 21:17                 ` Jani Partanen
  2022-11-29 22:22                   ` Roman Mamedov
@ 2022-12-03  5:41                   ` David T-G
  2022-12-03 12:06                     ` Wols Lists
  1 sibling, 1 reply; 62+ messages in thread
From: David T-G @ 2022-12-03  5:41 UTC (permalink / raw)
  To: Linux RAID list

Jani, et al --

...and then Jani Partanen said...
% Hi David,
% 
% Nice to see that there are others who like to take things to the extreme and
% live on the razor's edge. ;)

Heh.  And here I didn't know I was doing any such thing ...  My whole
goal was to minimize rebuild time for anything less than a full disk and
to stage rebuilds for a full-disk failure.  If these were all 100M or 1T
disks then we wouldn't be worrying about rebuild time :-/


% 
% I had disks of different sizes, so to build my raid5 I first joined, for
% example, a 1TB and a 2TB together with md linear so I could add that as a
% member to the other 3TB raid5 pool.

The good news here is that I don't mix disk sizes; all of these are not
only the same size but, for the foreseeable future, the exact same model.


% 
...
% Anyway, go LVM if you are planning to slice and dice disks like before.
% With LVM, adding space later on is a much simpler task, and LVM
% automatically works like linear but offers many other benefits.

I'll definitely read up on it and see where I might play with learning.


% LVM is very much worth learning.  Start up a virtual machine, assign
% multiple small vdisks to it, and you have an environment where it's safe
% to play around with md and lvm.
[snip]

Funny you should mention that ...  Getting into VMs has been on my
list for quite a while *sigh* :-)/2  I haven't even had the time and
opportunity to go and read a primer to boot a "Hello, world" VM.


Thanks again & HANN

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: md RAID0 can be grown
  2022-11-28 14:24               ` md RAID0 can be grown (was "Re: how do i fix these RAID5 arrays?") David T-G
  2022-11-29 21:17                 ` Jani Partanen
@ 2022-12-03  5:41                 ` David T-G
  1 sibling, 0 replies; 62+ messages in thread
From: David T-G @ 2022-12-03  5:41 UTC (permalink / raw)
  To: Linux RAID list

Hi, all --

...and then David T-G home said...
%
...
%
% Thanks again, and I do plan to read up on LVM.  For now, though, I'm
% thinkin' I'll rebuild under md in linear mode.  Stealing from my RAID10

Rebuilding seems to have been straightforward.  Yay.

  diskfarm:~ # mdadm --verbose --create /dev/md50 --homehost=diskfarm --name=10Traid50md --level=linear --raid-devices=6 --symlinks=yes /dev/md/5[123456]
  mdadm: /dev/md/51 appears to be part of a raid array:
	 level=linear devices=6 ctime=Wed Nov 30 04:17:52 2022
  mdadm: /dev/md/52 appears to be part of a raid array:
	 level=linear devices=6 ctime=Wed Nov 30 04:17:52 2022
  mdadm: /dev/md/53 appears to be part of a raid array:
	 level=linear devices=6 ctime=Wed Nov 30 04:17:52 2022
  mdadm: /dev/md/54 appears to be part of a raid array:
	 level=linear devices=6 ctime=Wed Nov 30 04:17:52 2022
  mdadm: /dev/md/55 appears to be part of a raid array:
	 level=linear devices=6 ctime=Wed Nov 30 04:17:52 2022
  mdadm: /dev/md/56 appears to be part of a raid array:
	 level=linear devices=6 ctime=Wed Nov 30 04:17:52 2022
  Continue creating array? y
  mdadm: Defaulting to version 1.2 metadata
  mdadm: array /dev/md50 started.

  diskfarm:~ # mdadm -D /dev/md50
  /dev/md50:
	     Version : 1.2
       Creation Time : Wed Nov 30 04:20:11 2022
	  Raid Level : linear
	  Array Size : 29289848832 (27.28 TiB 29.99 TB)
	Raid Devices : 6
       Total Devices : 6
	 Persistence : Superblock is persistent

	 Update Time : Wed Nov 30 04:20:11 2022
	       State : clean
      Active Devices : 6
     Working Devices : 6
      Failed Devices : 0
       Spare Devices : 0

	    Rounding : 0K

  Consistency Policy : none

		Name : diskfarm:10Traid50md  (local to host diskfarm)
		UUID : f75bac01:29abcd5d:1d99ffb3:4654bf27
	      Events : 0

      Number   Major   Minor   RaidDevice State
	 0       9       51        0      active sync   /dev/md/51
	 1       9       52        1      active sync   /dev/md/52
	 2       9       53        2      active sync   /dev/md/53
	 3       9       54        3      active sync   /dev/md/54
	 4       9       55        4      active sync   /dev/md/55
	 5       9       56        5      active sync   /dev/md/56

Great so far.  The fun part is that apparently the partition table was still
valid:

  diskfarm:~ # parted /dev/md50
  GNU Parted 3.2
  Using /dev/md50
  Welcome to GNU Parted! Type 'help' to view a list of commands.
  (parted) p
  Model: Linux Software RAID Array (md)
  Disk /dev/md50: 30.0TB
  Sector size (logical/physical): 512B/4096B
  Partition Table: gpt
  Disk Flags:
  															
  Number  Start   End     Size    File system  Name         Flags
   1      3146kB  30.0TB  30.0TB               10Traid50md

  (parted) q

The filesystem didn't still exist, though.  THAT would have been a huge
surprise :-)

  diskfarm:~ # mount /mnt/10Traid50md
  mount: /mnt/10Traid50md: can't find LABEL=10Traid50md.
  diskfarm:~ # ls -goh /dev/disk/by*/* | egrep '10T|md50'
  lrwxrwxrwx 1 10 Nov 30 04:18 /dev/disk/by-id/md-name-diskfarm:10Traid50 -> ../../md50
  lrwxrwxrwx 1 12 Nov 30 04:18 /dev/disk/by-id/md-name-diskfarm:10Traid50-part1 -> ../../md50p1
  lrwxrwxrwx 1 10 Nov 30 04:18 /dev/disk/by-id/md-uuid-3a6fa2d0:071dfffe:bf2e0884:ab29a3ac -> ../../md50
  lrwxrwxrwx 1 12 Nov 30 04:18 /dev/disk/by-id/md-uuid-3a6fa2d0:071dfffe:bf2e0884:ab29a3ac-part1 -> ../../md50p1
  lrwxrwxrwx 1 12 Nov  5 21:30 /dev/disk/by-label/10Tr50-vfat -> ../../sdk128
  lrwxrwxrwx 1 12 Nov  5 21:30 /dev/disk/by-label/10Tr50md-xfs -> ../../sdd128
  lrwxrwxrwx 1 12 Nov  5 21:30 /dev/disk/by-label/10Traid50md-ext3 -> ../../sdb128
  lrwxrwxrwx 1 12 Nov  5 21:30 /dev/disk/by-label/10Traid50md-reis -> ../../sdc128
  lrwxrwxrwx 1 10 Nov  5 21:30 /dev/disk/by-label/WD10Tusb3 -> ../../sdo1
  lrwxrwxrwx 1 10 Nov  5 21:30 /dev/disk/by-partlabel/WD10Tusb3 -> ../../sdo1

  diskfarm:~ # mkfs.xfs -L 10Traid50md /dev/md50p1
  meta-data=/dev/md50p1            isize=512    agcount=32, agsize=228827008 blks
	   =                       sectsz=4096  attr=2, projid32bit=1
	   =                       crc=1        finobt=1, sparse=0, rmapbt=0
	   =                       reflink=0
  data     =                       bsize=4096   blocks=7322460672, imaxpct=5
	   =                       sunit=128    swidth=384 blks
  naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
  log      =internal log           bsize=4096   blocks=521728, version=2
	   =                       sectsz=4096  sunit=1 blks, lazy-count=1
  realtime =none                   extsz=4096   blocks=0, rtextents=0

  diskfarm:~ # df -kh !$
  df -kh /mnt/10Traid50md
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/md50p1      28T   28G   28T   1% /mnt/10Traid50md

Then it was time to hurry up and copy content back from the external
disks to the array to guard against drive failure.  Fast forward a couple
of days ...

% subthread (where I owe similar tests), I tried pulling 128MiB to 8GiB of
% data from a single RAID5 slice versus the big RAID0 stripe
%
%   diskfarm:~ # for D in 52 50 ; do for C in 128 256 512 ; do for S in 1M 4M 16M ; do CMD="dd if=/dev/md$D of=/dev/null bs=$S count=$C iflag=direct" ; echo "## $CMD" ; $CMD 2>&1 | egrep -v records ; done ; done ; done
%   ## dd if=/dev/md52 of=/dev/null bs=1M count=128 iflag=direct
%   134217728 bytes (134 MB, 128 MiB) copied, 1.20121 s, 112 MB/s
%   ## dd if=/dev/md52 of=/dev/null bs=4M count=128 iflag=direct
%   536870912 bytes (537 MB, 512 MiB) copied, 1.82563 s, 294 MB/s
%   ## dd if=/dev/md52 of=/dev/null bs=16M count=128 iflag=direct
%   2147483648 bytes (2.1 GB, 2.0 GiB) copied, 9.03782 s, 238 MB/s
%   ## dd if=/dev/md52 of=/dev/null bs=1M count=256 iflag=direct
%   268435456 bytes (268 MB, 256 MiB) copied, 2.6694 s, 101 MB/s
%   ## dd if=/dev/md52 of=/dev/null bs=4M count=256 iflag=direct
%   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.72331 s, 288 MB/s
%   ## dd if=/dev/md52 of=/dev/null bs=16M count=256 iflag=direct
%   4294967296 bytes (4.3 GB, 4.0 GiB) copied, 13.6094 s, 316 MB/s
%   ## dd if=/dev/md52 of=/dev/null bs=1M count=512 iflag=direct
%   536870912 bytes (537 MB, 512 MiB) copied, 6.39903 s, 83.9 MB/s
%   ## dd if=/dev/md52 of=/dev/null bs=4M count=512 iflag=direct
%   2147483648 bytes (2.1 GB, 2.0 GiB) copied, 7.45123 s, 288 MB/s
%   ## dd if=/dev/md52 of=/dev/null bs=16M count=512 iflag=direct
%   8589934592 bytes (8.6 GB, 8.0 GiB) copied, 28.1189 s, 305 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=1M count=128 iflag=direct
%   134217728 bytes (134 MB, 128 MiB) copied, 3.74023 s, 35.9 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=4M count=128 iflag=direct
%   536870912 bytes (537 MB, 512 MiB) copied, 9.96306 s, 53.9 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=16M count=128 iflag=direct
%   2147483648 bytes (2.1 GB, 2.0 GiB) copied, 19.994 s, 107 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=1M count=256 iflag=direct
%   268435456 bytes (268 MB, 256 MiB) copied, 7.25855 s, 37.0 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=4M count=256 iflag=direct
%   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 18.9692 s, 56.6 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=16M count=256 iflag=direct
%   4294967296 bytes (4.3 GB, 4.0 GiB) copied, 40.2443 s, 107 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=1M count=512 iflag=direct
%   536870912 bytes (537 MB, 512 MiB) copied, 14.1076 s, 38.1 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=4M count=512 iflag=direct
%   2147483648 bytes (2.1 GB, 2.0 GiB) copied, 38.6795 s, 55.5 MB/s
%   ## dd if=/dev/md50 of=/dev/null bs=16M count=512 iflag=direct
%   8589934592 bytes (8.6 GB, 8.0 GiB) copied, 81.4364 s, 105 MB/s

Time to run dd tests.  Curiously, my md52 performance changed, in some
cases significantly!  What gives?!?

  diskfarm:~ # for D in 52 50 ; do for C in 128 256 512 ; do for S in 1M 4M 16M ; do CMD="dd if=/dev/md$D of=/dev/null bs=$S count=$C iflag=direct" ; echo "## $CMD" ; $CMD 2>&1 | egrep -v records ; done ; done ; done
  ## dd if=/dev/md52 of=/dev/null bs=1M count=128 iflag=direct
  134217728 bytes (134 MB, 128 MiB) copied, 2.37693 s, 56.5 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=4M count=128 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 1.72798 s, 311 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=16M count=128 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 6.6545 s, 323 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=1M count=256 iflag=direct
  268435456 bytes (268 MB, 256 MiB) copied, 2.45847 s, 109 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=4M count=256 iflag=direct
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.1646 s, 339 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=16M count=256 iflag=direct
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 14.8199 s, 290 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=1M count=512 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 4.32777 s, 124 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=4M count=512 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 8.55706 s, 251 MB/s
  ## dd if=/dev/md52 of=/dev/null bs=16M count=512 iflag=direct
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 31.9199 s, 269 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=1M count=128 iflag=direct
  134217728 bytes (134 MB, 128 MiB) copied, 0.718462 s, 187 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=4M count=128 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 1.48336 s, 362 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=16M count=128 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 6.07036 s, 354 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=1M count=256 iflag=direct
  268435456 bytes (268 MB, 256 MiB) copied, 1.28401 s, 209 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=4M count=256 iflag=direct
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.07217 s, 350 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=16M count=256 iflag=direct
  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 11.4026 s, 377 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=1M count=512 iflag=direct
  536870912 bytes (537 MB, 512 MiB) copied, 2.7174 s, 198 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=4M count=512 iflag=direct
  2147483648 bytes (2.1 GB, 2.0 GiB) copied, 6.22232 s, 345 MB/s
  ## dd if=/dev/md50 of=/dev/null bs=16M count=512 iflag=direct
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 22.6482 s, 379 MB/s

I didn't alter anything in the RAID5 arrays; I only deleted the striped
RAID0 and created a linear array in the same way.  Although I was in normal
multi-user mode for the tests, the system was basically quiet ...


%
% and as expected the difference
%
%   RAID5 / RAID0 performance
%   (speedup)
%
%           1M        4M       16M
%       +---------+---------+---------+
%   128 | 112/036 | 294/054 | 238/107 |
%       | (3.1)   | (5.4)   | (2.2)   |
%       +---------+---------+---------+
%   256 | 101/037 | 288/057 | 316/107 |
%       | (2.7)   | (5.0)   | (3.0)   |
%       +---------+---------+---------+
%   512 | 084/038 | 288/056 | 305/105 |
%       | (2.2)   | (5.1)   | (2.9)   |
%       +---------+---------+---------+

The new chart

  RAID5 / RAID0 performance
  (speedup)

          1M        4M       16M
      +---------+---------+---------+
  128 |  57/187 | 311/362 | 323/354 |
      | (.32)   | (.85)   | (.91)   |
      +---------+---------+---------+
  256 | 109/209 | 339/350 | 290/377 |
      | (.52)   | (.96)   | (.76)   |
      +---------+---------+---------+
  512 | 124/198 | 251/345 | 269/379 |
      | (.62)   | (.72)   | (.70)   |
      +---------+---------+---------+

shows happier results (reading from the new linear md50 is *faster* than
from a single RAID5 slice), but I don't know how much I trust the numbers,
both because they're all over the place and because it shouldn't be
possible to read from the combined device faster than from the underlying
RAID5 slice.
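
If and when I rerun these, something along these lines (just a sketch; the
block size and count are arbitrary) should make the numbers a bit more
comparable by repeating each case and dropping the page cache in between:

  for run in 1 2 3 ; do
    sync ; echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/md50 of=/dev/null bs=4M count=256 iflag=direct 2>&1 | grep copied
  done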


%
% is significant.  So, yeah, I'll be wiping and rebuilding md50 as a
% straight linear.  Watch for more test results when that's done :-)

Ta daaaa!  And it may all be meaningless *sigh*


% Fingers crossed that I get much better results; if not, maybe it'll
% be time to switch to LVM after all.

Still planning to read ... :-)


Thanks again & HANN

:-D
--
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-11-28 15:32             ` Reindl Harald
       [not found]               ` <CAAMCDecXkcmUe=ZFnJ_NndND0C2=D5qSoj1Hohsrty8y1uqdfw@mail.gmail.com>
  2022-11-28 20:45               ` John Stoffel
@ 2022-12-03  5:45               ` David T-G
  2022-12-03 12:20                 ` Reindl Harald
  2 siblings, 1 reply; 62+ messages in thread
From: David T-G @ 2022-12-03  5:45 UTC (permalink / raw)
  To: Linux RAID list

Reindl, et al --

...and then Reindl Harald said...
% 
% Am 28.11.22 um 15:46 schrieb David T-G:
% > I don't at this time have a device free to plug in locally to back up the
% > volume to destroy and rebuild as linear, so that will have to wait.  When
% > I do get that chance, though, will that help me get to the awesome goal
% > of actually INCREASING performance by including a RAID0 layer?
% 
% stacking layers over layers will *never* increase performance - a pure RAID0
% will but if one disk is dead all is lost

True, and we definitely don't want that.


% 
% additional RAID0 on top or below another RAID won't help

I could believe that, because what I don't know about RAID would fill a
book, but I thought that the idea of RAID10 speeding up access was that
the first half of the data is on the FIRST half of the /first/ disk
and the second half of the data is on the FIRST half of the /second/
disk and so the heads only move over half the disk for reads.


% 
% your main problem starts by slicing your drives in dozens of partitions and
% "the idea being that each piece of which should take less time to rebuild if
% something fails"
[snip]

Whoops!  You're on the wrong machine.  This one mirrors two disks; that
one is the one that has a bunch.


HANN

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* stripe size checking (was "Re: about linear and about RAID10")
       [not found]             ` <CAAMCDee_YrhXo+5hp31YXgUHkyuUr-zTXOqi0-HUjMrHpYMkTQ@mail.gmail.com>
@ 2022-12-03  5:52               ` David T-G
  0 siblings, 0 replies; 62+ messages in thread
From: David T-G @ 2022-12-03  5:52 UTC (permalink / raw)
  To: Linux RAID list

Roger, et al --

...and then Roger Heflin said...
% How big is your stripe size set to?   The bigger the stripe size on the
% md40 main raid the closer it gets to linear.

I don't know ... and I don't know how to tell.

  davidtg@jpo:~> sudo mdadm -D /dev/md40
  /dev/md40:
	     Version : 1.2
       Creation Time : Mon Aug  8 12:15:12 2022
	  Raid Level : raid0
	  Array Size : 3906488320 (3.64 TiB 4.00 TB)
	Raid Devices : 2
       Total Devices : 2
	 Persistence : Superblock is persistent
  
	 Update Time : Mon Aug  8 12:15:12 2022
	       State : clean 
      Active Devices : 2
     Working Devices : 2
      Failed Devices : 0
       Spare Devices : 0
  
	      Layout : -unknown-
	  Chunk Size : 512K
  
  Consistency Policy : none
  
		Name : jpo:40  (local to host jpo)
		UUID : 4735f53c:7cdf7758:e212bec6:aa2942e8
	      Events : 0
  
      Number   Major   Minor   RaidDevice State
	 0       9       41        0      active sync   /dev/md/md41
	 1       9       42        1      active sync   /dev/md/md42
  
  davidtg@jpo:~> sudo mdadm -E /dev/md40
  /dev/md40:
     MBR Magic : aa55
  Partition[0] :   4294967295 sectors at            1 (type ee)

Is it the 512k chunk size?

% 
% And I think I did miss one behavior that explains it being faster with
% larger, and not sucking as bad as I expected.  On disks the internal cache
% is supposed to cache the entire track as it goes under the head, so when

Sounds familiar.


% you ask for the data theoriticaly requiring a seek the drive may already
% have the data still in its cache and hence not need to do a seek.   The
% performance would then be ok so long as the data is still in the cache.

Right.  Everything is awesome when the data is cached :-)


% But given your tests md40 at best gets close to the underlying raids
% performance.  When linear the underlying performance should be roughly
% equal.

Now that I understand linear, that makes sense.


Thanks again & HANN

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-11-28 20:45               ` John Stoffel
@ 2022-12-03  5:58                 ` David T-G
  2022-12-03 12:16                   ` Wols Lists
  0 siblings, 1 reply; 62+ messages in thread
From: David T-G @ 2022-12-03  5:58 UTC (permalink / raw)
  To: Linux RAID list

John, et al --

...and then John Stoffel said...
% >>>>> "Reindl" == Reindl Harald <h.reindl@thelounge.net> writes:
% 
...
% > keep it SIMPLE!
% 
% This is my mantra as well here.  For my home system, I prefer
% simplicity and robustness and performance, so I tend to just use RAID1

I've finally convinced The Boss to spring for additional disks so that I
can mirror, so our two servers both have SSD mirroring; yay.  The web
server doesn't need much space, so it has a pair of 4T HDDs mirrored as
well ... but as RAID10 since I thought that that was cool.  Ah, well.


% mirrors of all my disks.  I really don't have all that much stuff I
% need lots of disk space for.  And for that I have a scratch volume.  
[snip]

Heh.  Not only do we have scratch space on diskfarm, but that's where
we have 30T-plus of data that continues to grow with every video made.
Mirroring just won't do there, both because I'd have to have little
chunks of mirror and because I don't know that I can convince her to pay
for THAT much storage (plus plugging in all of those drives).  We need
RAID5 there ... but the devices are just really frickin' huge.


Thanks & HANN

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: md vs LVM and VMs and ... (was "Re: md RAID0 can be grown (was ...")
  2022-12-03  5:41                   ` md vs LVM and VMs and ... (was "Re: md RAID0 can be grown (was ...") David T-G
@ 2022-12-03 12:06                     ` Wols Lists
  2022-12-03 18:04                       ` batches and serial numbers (was "Re: md vs LVM and VMs and ...") David T-G
  0 siblings, 1 reply; 62+ messages in thread
From: Wols Lists @ 2022-12-03 12:06 UTC (permalink / raw)
  To: Linux RAID list

On 03/12/2022 05:41, David T-G wrote:
> % I had disks of different sizes, so to build my raid5 I first joined, for
> % example, a 1TB and a 2TB together with md linear so I could add that as a
> % member to the other 3TB raid5 pool.
> 
> The good news here is that I don't mix disk sizes; all of these are not
> only the same size but, for the foreseeable future, the exact same model.

From the exact same batch?  That's BAD news, actually.

Now that disk sizes have been standardised (and the number of actual 
factories/manufacturers seriously reduced), it should be that a 1TB 
drive is a 1TB drive is a 1TB drive. Decimal, that is, not binary. So 
there *shouldn't* be any problems swapping one random drive out for another.

But if all your drives are the same make, model(, and batch), there is a 
not insignificant risk they will all share the same defects, and fail at 
the same time. It's accepted the risk is small, but it's there.

It's why my raid is composed of a Seagate Barracuda 3TB (slap wrist, 
don't use Barracudas!), 2 x 4TB Seagate Ironwolves, and 1 Toshiba 8TB N300.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-12-03  5:58                 ` David T-G
@ 2022-12-03 12:16                   ` Wols Lists
  2022-12-03 18:27                     ` David T-G
  0 siblings, 1 reply; 62+ messages in thread
From: Wols Lists @ 2022-12-03 12:16 UTC (permalink / raw)
  To: Linux RAID list

On 03/12/2022 05:58, David T-G wrote:
> % This is my mantra as well here.  For my home system, I prefer
> % simplicity and robustness and performance, so I tend to just use RAID1
> 
> I've finally convinced The Boss to spring for additional disks so that I
> can mirror, so our two servers both have SSD mirroring; yay.  The web
> server doesn't need much space, so it has a pair of 4T HDDs mirrored as
> well ... but as RAID10 since I thought that that was cool.  Ah, well.

Raid 10 across two drives? Do I read you right? So you can easily add a 
3rd drive to get 6TB of usable storage, but raid 10 x 2 drives = raid 1 ...

Changing topic slightly, if you have multiple slices per drive, raided 
(which I've done, I wanted /, /home and /var on their own devices), it 
is quite easy to lose just one slice. But that *should* be down to 
mis-configured devices. Get a soft-read error, the *linux* timeout kicks 
in, and the partition gets kicked out.

But that sort of problem should take no time whatsoever to recover from. 
With journalling or bitmap (you'll need to read up on the details) a 
re-add should just replay the lost writes, and you're back in business. 
So - if your aim is speed of recovery - there's no point splitting the 
disk into slices. There are good reasons for doing it, but that isn't 
one of them!
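
The re-add itself is a one-liner; a sketch with placeholder names:

  # with a write-intent bitmap, only the missed writes get resynced
  mdadm /dev/md51 --re-add /dev/sdb51
  cat /proc/mdstat        # the recovery should be over almost immediately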

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-12-03  5:45               ` David T-G
@ 2022-12-03 12:20                 ` Reindl Harald
  0 siblings, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-12-03 12:20 UTC (permalink / raw)
  To: Linux RAID list



Am 03.12.22 um 06:45 schrieb David T-G:
> Reindl, et al --
> 
> ...and then Reindl Harald said...
> %
> % Am 28.11.22 um 15:46 schrieb David T-G:
> % > I don't at this time have a device free to plug in locally to back up the
> % > volume to destroy and rebuild as linear, so that will have to wait.  When
> % > I do get that chance, though, will that help me get to the awesome goal
> % > of actually INCREASING performance by including a RAID0 layer?
> %
> % stacking layers over layers will *never* increase performance - a pure RAID0
> % will but if one disk is dead all is lost
> 
> True, and we definitely don't want that.

but judging by your posts, you do

> % additional RAID0 on top or below another RAID won't help
> 
> I could believe that, because what I don't know about RAID would fill a
> book, but I thought that the idea of RAID10 speeding up access was that
> the first half of the data is on the FIRST half of the /first/ disk
> and the second half of the data is on the FIRST half of the /second/
> disk and so the heads only move over half the disk for reads.

common sense: the moment you already have concurrent reads and span a RAID0
over them, the access pattern is far away from "first half of the first
disk, first half of the second disk"

common sense: you wrote you are dealing mostly with small files - they 
don't gain from striping because they are typically not striped at all

> % your main problem starts by slicing your drives in dozens of partitions and
> % "the idea being that each piece of which should take less time to rebuild if
> % something fails"
> [snip]
> 
> Whoops!  You're on the wrong machine.  This one mirrors two disks; that
> one is the one that has a bunch.

well, when you mix different machines into the same thread, I am out of here

^ permalink raw reply	[flat|nested] 62+ messages in thread

* batches and serial numbers (was "Re: md vs LVM and VMs and ...")
  2022-12-03 12:06                     ` Wols Lists
@ 2022-12-03 18:04                       ` David T-G
  2022-12-03 20:07                         ` Wols Lists
  2022-12-04 13:04                         ` batches and serial numbers (was "Re: md vs LVM and VMs and ...") Reindl Harald
  0 siblings, 2 replies; 62+ messages in thread
From: David T-G @ 2022-12-03 18:04 UTC (permalink / raw)
  To: Linux RAID list

Anthony, et al --

...and then Wols Lists said...
% On 03/12/2022 05:41, David T-G wrote:
% > 
% > The good news here is that I don't mix disk sizes; all of these are not
% > only the same size but, for the foreseeable future, the exact same model.
% 
% From the exact same batch? That's BAD news actually.

I don't know about the same batch.  I got three together, so maybe, and
then I recently added a fourth.

  diskfarm:~ # for D in /dev/sd[bcdk] ; do printf "$D\t" ; smartctl -i $D | grep Serial ; done
  /dev/sdb        Serial Number:    61U0A0HQFBKG
  /dev/sdc        Serial Number:    61U0A0BEFBKG
  /dev/sdd        Serial Number:    61U0A007FBKG
  /dev/sdk        Serial Number:    91C0A03ZFBKG

How close is too close for SNs?  Anyone have a magic decoder ring?

I seriously think I'm going to be asking for another -- plus the
corresponding offsite external drive -- for Christmas, so we'll be
even more homogeneous as time goes on and maybe sooner than later.


% 
% Now that disk sizes have been standardised (and the number of actual
% factories/manufacturers seriously reduced), it should be that a 1TB drive is
% a 1TB drive is a 1TB drive. Decimal, that is, not binary. So there
% *shouldn't* be any problems swapping one random drive out for another.

That would be nice.  But it just feels so ... WRONG! :-)  I don't want
to have to sweat different numbers of sectors or different caches or
different speeds that will just cause hiccups.  I'm not yet sold on
going multi-vendor altogether ...  [In different arrays or machines,
certainly, but not when they're supposed to be identical members.]


% 
% But if all your drives are the same make, model(, and batch), there is a not
% insignificant risk they will all share the same defects, and fail at the
% same time. It's accepted the risk is small, but it's there.

What is the problem?  Is it the manufacturer's firmware?  Is it the day
they were made?  If I order a Tosh, a Seag, and a WD all on the same day
then it sounds like I'm [much more] likely to get clones made at the same
time in those few factories.  But I couldn't just buy a drive a quarter
and wait nearly a year to get up and running; I had to start somewhere ...


% 
% It's why my raid is composed of a Seagate Barracuda 3TB (slap wrist, don't
% use Barracudas!), 2 x 4TB Seagate Ironwolves, and 1 Toshiba 8TB N300.

These are Tosh X300s, FWIW.  Like 'em so far!


% 
% Cheers,
% Wol


HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-12-03 12:16                   ` Wols Lists
@ 2022-12-03 18:27                     ` David T-G
  2022-12-03 23:26                       ` Wol
  2022-12-04 13:08                       ` Reindl Harald
  0 siblings, 2 replies; 62+ messages in thread
From: David T-G @ 2022-12-03 18:27 UTC (permalink / raw)
  To: Linux RAID list

Anthony, et al --

...and then Wols Lists said...
% On 03/12/2022 05:58, David T-G wrote:
% > 
% > I've finally convinced The Boss to spring for additional disks so that I
% > can mirror, so our two servers both have SSD mirroring; yay.  The web
% > server doesn't need much space, so it has a pair of 4T HDDs mirrored as
% > well ... but as RAID10 since I thought that that was cool.  Ah, well.
% 
% Raid 10 across two drives? Do I read you right? So you can easily add a 3rd
% drive to get 6TB of usable storage, but raid 10 x 2 drives = raid 1 ...

Thanks for the aa/bb/cc non-symmetrical layout help recently.  I think I
see where you're going here.  But that isn't what I have in this case.

Each disk is sliced into two large partitions that take up about half:

  davidtg@jpo:~> for D in /dev/sd[bc] ; do sudo parted $D u GiB p free | grep GiB ; done
  Disk /dev/sdb: 3726GiB
          0.00GiB  0.00GiB  0.00GiB  Free Space
   1      0.00GiB  1863GiB  1863GiB               Raid1-1
   2      1863GiB  3726GiB  1863GiB               Raid1-2
   4      3726GiB  3726GiB  0.00GiB  ext2         Seag4000-ZDHB2X37-ext2
  Disk /dev/sdc: 3726GiB
          0.00GiB  0.00GiB  0.00GiB  Free Space
   1      0.00GiB  1863GiB  1863GiB               Raid1-2
   2      1863GiB  3726GiB  1863GiB               Raid1-1
   4      3726GiB  3726GiB  0.00GiB               Seag4000-ZDHBKZTG-ext2

The two halves of each disk are then mirrored across -- BUT in an "X"
layout.  Note that b1 pairs with c2 and c1 pairs with b2.

  davidtg@jpo:~> sudo mdadm -D /dev/md/md4[12] | egrep '/dev/.d|Level'
  /dev/md/md41:
          Raid Level : raid1
         0       8       17        0      active sync   /dev/sdb1
         1       8       34        1      active sync   /dev/sdc2
  /dev/md/md42:
          Raid Level : raid1
         0       8       18        0      active sync   /dev/sdb2
         1       8       33        1      active sync   /dev/sdc1

Finally, the mirrors are striped together (perhaps that should have been
a linear instead) to make the final device.

  davidtg@jpo:~> sudo mdadm -D /dev/md/40 | egrep '/dev/.d|Level'
  /dev/md/40:
          Raid Level : raid0
         0       9       41        0      active sync   /dev/md/md41
         1       9       42        1      active sync   /dev/md/md42

  davidtg@jpo:~> sudo parted /dev/md40 u GiB p free | grep GiB
  Disk /dev/md40: 3726GiB
	  0.00GiB  0.00GiB  0.00GiB  Free Space
   1      0.00GiB  3726GiB  3726GiB  xfs          4TRaid10md
   4      3726GiB  3726GiB  0.00GiB  ext2         4TRaid10md-ntfs

The theory was that each disk would hold half of the total in the first
half of its space and that md would be clever enough to ask the proper
disk for the sector to keep the head in that short run.  Writes cover the
whole disk one way or another, of course, but reads should require less
head movement and be quicker.

Or that's how I understood it in the very many RAID wiki pages and other
docs I read :-)


% 
...
% should just replay the lost writes, and you're back in business. So - if
% your aim is speed of recovery - there's no point splitting the disk into

[Not here; that's the RAID5 system with big disks.]


% slices. There are good reasons for doing it, but that isn't one of them!

How about speed of read?  That was the goal here.  I don't foresee adding
more disks here, although I do actually have one more internal SATA port
and so, maybe, yeah, I might go to a three-disk RAID10 somehow.  But this
server isn't meant to have a lot of content.  [If I can ever find my old
SATA daughter card, though, I could hang some of those leftover <2T disks
on it and shoehorn in more archive disks :-]


% 
% Cheers,
% Wol


Thanks again & HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: batches and serial numbers (was "Re: md vs LVM and VMs and ...")
  2022-12-03 18:04                       ` batches and serial numbers (was "Re: md vs LVM and VMs and ...") David T-G
@ 2022-12-03 20:07                         ` Wols Lists
  2022-12-04  2:47                           ` batches and serial numbers David T-G
  2022-12-04 13:04                         ` batches and serial numbers (was "Re: md vs LVM and VMs and ...") Reindl Harald
  1 sibling, 1 reply; 62+ messages in thread
From: Wols Lists @ 2022-12-03 20:07 UTC (permalink / raw)
  To: Linux RAID list

On 03/12/2022 18:04, David T-G wrote:
> % It's why my raid is composed of a Seagate Barracuda 3TB (slap wrist, don't
> % use Barracudas!), 2 x 4TB Seagate Ironwolves, and 1 Toshiba 8TB N300.
> 
> These are Tosh X300s, FWIW.  Like 'em so far!

OUCH !!!

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

Do the X300s have ERC, and what's the timeout? Barracudas are nice 
drives, I like 'em, but they're not good in raid. And the BarraCudas 
even less so! I've got a nasty feeling your X300s are the same!

I said it's easy to get slices kicked out due to misconfiguration - 
that's exactly what happens with Barracudas, and I suspect your X300s 
suffer the exact same problem ...

Read up, and come back if you've got any problems. The fix is that 
script, but it means if anything goes wrong you're going to be cursing 
"that damn slow computer".

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-12-03 18:27                     ` David T-G
@ 2022-12-03 23:26                       ` Wol
  2022-12-04  2:53                         ` David T-G
  2022-12-04 13:08                       ` Reindl Harald
  1 sibling, 1 reply; 62+ messages in thread
From: Wol @ 2022-12-03 23:26 UTC (permalink / raw)
  To: Linux RAID list

On 03/12/2022 18:27, David T-G wrote:
> Anthony, et al --
> 
> ...and then Wols Lists said...
> % On 03/12/2022 05:58, David T-G wrote:
> % >
> % > I've finally convinced The Boss to spring for additional disks so that I
> % > can mirror, so our two servers both have SSD mirroring; yay.  The web
> % > server doesn't need much space, so it has a pair of 4T HDDs mirrored as
> % > well ... but as RAID10 since I thought that that was cool.  Ah, well.
> %
> % Raid 10 across two drives? Do I read you right? So you can easily add a 3rd
> % drive to get 6TB of usable storage, but raid 10 x 2 drives = raid 1 ...
> 
> Thanks for the aa/bb/cc non-symmetrical layout help recently.  I think I
> see where you're going here.  But that isn't what I have in this case.
> 
> Each disk is sliced into two large partitions that take up about half:
> 
>    davidtg@jpo:~> for D in /dev/sd[bc] ; do sudo parted $D u GiB p free | grep GiB ; done
>    Disk /dev/sdb: 3726GiB
>            0.00GiB  0.00GiB  0.00GiB  Free Space
>     1      0.00GiB  1863GiB  1863GiB               Raid1-1
>     2      1863GiB  3726GiB  1863GiB               Raid1-2
>     4      3726GiB  3726GiB  0.00GiB  ext2         Seag4000-ZDHB2X37-ext2
>    Disk /dev/sdc: 3726GiB
>            0.00GiB  0.00GiB  0.00GiB  Free Space
>     1      0.00GiB  1863GiB  1863GiB               Raid1-2
>     2      1863GiB  3726GiB  1863GiB               Raid1-1
>     4      3726GiB  3726GiB  0.00GiB               Seag4000-ZDHBKZTG-ext2
> 
> The two halves of each disk are then mirrored across -- BUT in an "X"
> layout.  Note that b1 pairs with c2 and c1 pairs with b2.
> 
>    davidtg@jpo:~> sudo mdadm -D /dev/md/md4[12] | egrep '/dev/.d|Level'
>    /dev/md/md41:
>            Raid Level : raid1
>           0       8       17        0      active sync   /dev/sdb1
>           1       8       34        1      active sync   /dev/sdc2
>    /dev/md/md42:
>            Raid Level : raid1
>           0       8       18        0      active sync   /dev/sdb2
>           1       8       33        1      active sync   /dev/sdc1
> 
> Finally, the mirrors are striped together (perhaps that should have been
> a linear instead) to make the final device.
> 
>    davidtg@jpo:~> sudo mdadm -D /dev/md/40 | egrep '/dev/.d|Level'
>    /dev/md/40:
>            Raid Level : raid0
>           0       9       41        0      active sync   /dev/md/md41
>           1       9       42        1      active sync   /dev/md/md42
> 
>    davidtg@jpo:~> sudo parted /dev/md40 u GiB p free | grep GiB
>    Disk /dev/md40: 3726GiB
> 	  0.00GiB  0.00GiB  0.00GiB  Free Space
>     1      0.00GiB  3726GiB  3726GiB  xfs          4TRaid10md
>     4      3726GiB  3726GiB  0.00GiB  ext2         4TRaid10md-ntfs
> 
> The theory was that each disk would hold half of the total in the first
> half of its space and that md would be clever enough to ask the proper
> disk for the sector to keep the head in that short run.  Writes cover the
> whole disk one way or another, of course, but reads should require less
> head movement and be quicker.
> 
> Or that's how I understood it in the very many RAID wiki pages and other
> docs I read :-)
> 
Hmm ... so being pedantic about terminology that's a definite raid-1+0, 
not linux-raid-10.

And if I read you right, you have only 2 disks? Split into 4 partitions 
to give you 1+0? WHY!!! I thought the acronym was Keep It Simple S*****.

Linux and raid try to minimise head movement. So if your reads are all 
over the place, one head will gravitate to the end of its disk, while 
the other will gravitate to the start.

There's not much point messing around and changing it now, but if you 
need to increase the array size, I'd just ditch the fancy layout and 
move to a simple real linux-raid-10. Add a new 4TB to your raid-1 mirror 
and convert it to a true raid-10. You said you had plenty of spare 2TBs, 
so --replace one of your 4TBs onto a pair of 2TBs, then --replace one of 
your raid-0's onto it. Finally, just fail the other raid-0 then replace 
it with the last 4TB. That way you don't ever lose redundancy, despite 
just knocking one drive out.
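
For reference, one --replace step looks something like this; a sketch only,
with made-up device names, and the replacement has to be in the array as a
spare first:

  mdadm /dev/md41 --add /dev/sdd1
  mdadm /dev/md41 --replace /dev/sdb1 --with /dev/sdd1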
> 
> %
> ...
> % should just replay the lost writes, and you're back in business. So - if
> % your aim is speed of recovery - there's no point splitting the disk into
> 
> [Not here; that's the RAID5 system with big disks.]
> 
> 
> % slices. There are good reasons for doing it, but that isn't one of them!
> 
> How about speed of read?  That was the goal here.  I don't foresee adding
> more disks here, although I do actually have one more internal SATA port
> and so, maybe, yeah, I might go to a three-disk RAID10 somehow.  But this
> server isn't meant to have a lot of content.  [If I can ever find my old
> SATA daughter card, though, I could hang some of those leftover <2T disks
> on it and shoehorn in more archive disks :-]
> 
Well, adding more disks will speed up access. And how much is a (cheap) 
SATA card? about £10?

Actually, that might be a good idea, especially if you want to get into 
playing with lvm? Especially if you can temporarily (or permanently) add 
a third 2TB. Free up a 4TB as before, make a single-disk mirror out of 
it, and create an LVM volume and partition. I'm hoping this partition 
will be big enough just to do a dd copy of your mirror partition across.

Or you could just do a cp copy, but my drives are littered with hard 
links so cp is not a wise choice for me ...

Then as before free up your second 4TB without actually destroying the 
array to keep redundancy, add it in to your new mirror, then you can 
just kill all the remaining disks on the old array. Then add two (or 
more) 2TBs into the new array. That'll give you 6TB of usable space with 
the first 4TB mirrored across 4 drives (you wanted speed?) and the last 
2TB mirrored across the second half of the two 4TB drives. If you can 
then add any more 2TB drives it'll add more disk space but more 
importantly from your point of view, it'll spread the data across even 
more drives and make large files read even faster ... Something to think
about.

Just a warning - you're supposed to be able to easily convert
between raids, provided you go via a 2-disk mirror. Raid-10, however, 
seems to be rather tricky to get back into a raid-1 layout ... going to 
raid-5 or 6 or whatever (should you want to) might involve building a 
whole new array.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: batches and serial numbers
  2022-12-03 20:07                         ` Wols Lists
@ 2022-12-04  2:47                           ` David T-G
  2022-12-04 13:54                             ` Wols Lists
  0 siblings, 1 reply; 62+ messages in thread
From: David T-G @ 2022-12-04  2:47 UTC (permalink / raw)
  To: Linux RAID list

Wol --

...and then Wols Lists said...
% On 03/12/2022 18:04, David T-G wrote:
% > % It's why my raid is composed of a Seagate Barracuda 3TB (slap wrist, don't
% > % use Barracudas!), 2 x 4TB Seagate Ironwolves, and 1 Toshiba 8TB N300.
% > 
% > These are Tosh X300s, FWIW.  Like 'em so far!
% 
% OUCH !!!
% 
% https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

No, I'm familiar.


% 
% Do the X300s have ERC, and what's the timeout? Barracudas are nice drives, I

Yep, they have ERC, and the timeout looks good.

  diskfarm:~ # /usr/local/bin/smartctl-disks-timeout.sh 
  Drive timeouts: sda Y ; sdb Y ; sdc Y ; sdd Y ; sde Y ; sdf 180 ; sdg 180 ; sdh Y ; sdi 180 ; sdj Y ; sdk Y ; sdl 180 ; sdm Y ; 

I'll append my little enhancement of your script after my sig in case you
find the tweaks interesting.


% like 'em, but they're not good in raid. And the BarraCudas even less so!
% I've got a nasty feeling your X300s are the same!

They have been good to me so far.  I was originally going to get N300s,
but I couldn't at the time, and the X300s read as the same for everything
I could find.  Does your N300 have ERC with a short timeout enabled by
default?
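
For anyone following along: querying rather than setting the ERC values is
just the following, with sdX as a placeholder:

  smartctl -l scterc /dev/sdX    # reports the current read/write ERC timers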


% 
% I said it's easy to get slices kicked out due to misconfiguration - that's
% exactly what happens with Barracudas, and I suspect your X300s suffer the
% exact same problem ...

Well, perhaps.  I'd love to have been able to pin it down, but I never
saw any errors, and (maybe because it was too late by then) I couldn't
track anything down even with help from folks here.


% 
% Read up, and come back if you've got any problems. The fix is that script,
% but it means if anything goes wrong you're going to be cursing "that damn
% slow computer".

*grin*


% 
% Cheers,
% Wol


Thanks again & HANW

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt

######################################################################

#!/bin/sh

# set drive error-recovery timeouts manually where needed
# (see https://raid.wiki.kernel.org/index.php/Timeout_Mismatch)

# ANSI colour codes, built with printf so the escapes survive copy/paste
CRED=$(printf '\033[31m')	# red (not currently used)
CYLO=$(printf '\033[33m')	# yellow: kernel timeout raised instead
CGRN=$(printf '\033[32m')	# green: SCT ERC set OK
CBLU=$(printf '\033[34m')	# blue: label
CBLK=$(printf '\033[0m')	# reset

# set the timeouts on the local drives
printf "${CBLU}Drive timeouts${CBLK}: "
for DISK in sda sdb sdc sdd sde sdf sdg sdh	# a-d on mobo ; e-h on card
# do i want to apply this to USB drives that show up? hmmm...
do
  printf '%s ' "$DISK"
  # ask the drive to limit error recovery to 7 seconds for reads and writes
  smartctl -q errorsonly -l scterc,70,70 "/dev/$DISK"
  if [ 4 -eq $? ]		# exit status 4: the SCT command failed
  then
    # no ERC support, so raise the kernel's command timeout instead
    echo 180 > "/sys/block/$DISK/device/timeout"
    printf '%s' "${CYLO}180"
  else
    printf '%s' "${CGRN}Y"
  fi
  printf '%s' "${CBLK} ; "
done
echo ''


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-12-03 23:26                       ` Wol
@ 2022-12-04  2:53                         ` David T-G
  2022-12-04 13:13                           ` Reindl Harald
  0 siblings, 1 reply; 62+ messages in thread
From: David T-G @ 2022-12-04  2:53 UTC (permalink / raw)
  To: Linux RAID list

Wol, et al --

Just a couple of quick notes ...


...and then Wol said...
% On 03/12/2022 18:27, David T-G wrote:
% > 
% > ...and then Wols Lists said...
% > % On 03/12/2022 05:58, David T-G wrote:
% > % >
...
% > Or that's how I understood it in the very many RAID wiki pages and other
% > docs I read :-)
% > 
% Hmm ... so being pedantic about terminology that's a definite raid-1+0, not
% linux-raid-10.

OK.  I'll try rereading the wiki pages and see where I went wrong.
Perhaps they can be clarified.


...
% it to a true raid-10. You said you had plenty of spare 2TBs, so --replace
% one of your 4TBs onto a pair of 2TBs, then --replace one of your raid-0's

Er, no.  I have plenty of "less-than-two-terabyte" drives.  They are all
used as one-off drives for archiving content: 1T here, 750G there, and
so on, all as second copies for an ~18T external archive disk.  But, no,
I don't have perfect lovely half-size disks waiting to bring into play
(or they would be holding archive content! :-)

Sorry for any confusion.
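
[Editorial aside: the --replace mechanics Wol mentions rebuild onto a
new member while the old one stays active, so redundancy is never lost.
A rough sketch with placeholder names:]

  # add the new device as a spare, then migrate the old member onto it
  mdadm /dev/md0 --add /dev/sdnew1
  mdadm /dev/md0 --replace /dev/sdold1 --with /dev/sdnew1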


HANW

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: batches and serial numbers (was "Re: md vs LVM and VMs and ...")
  2022-12-03 18:04                       ` batches and serial numbers (was "Re: md vs LVM and VMs and ...") David T-G
  2022-12-03 20:07                         ` Wols Lists
@ 2022-12-04 13:04                         ` Reindl Harald
  1 sibling, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-12-04 13:04 UTC (permalink / raw)
  To: Linux RAID list



Am 03.12.22 um 19:04 schrieb David T-G:
> % But if all your drives are the same make, model(, and batch), there is a not
> % insignificant risk they will all share the same defects, and fail at the
> % same time. It's accepted the risk is small, but it's there.
> 
> What is the problem?  Is it the manufacturer's firmware?  Is it the day
> they were made?  

you simply don't know until the problem hits you

not so long ago some HP enterprise SSDs, for example, had a time bomb in 
their firmware from a 32-bit counter (power-on, I think) that killed the 
whole device unless fixed with a firmware update - when that happens and 
all your drives are the same type and were bought on the same day, your 
RAID is gone forever

it's simply common sense that two drives of different ages and from 
different vendors are unlikely to fail within a few hours of each other 
for the same reason

on a 4-disk RAID10 I prefer two different disk series, with ages 
differing by at least one month, whenever possible

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-12-03 18:27                     ` David T-G
  2022-12-03 23:26                       ` Wol
@ 2022-12-04 13:08                       ` Reindl Harald
  1 sibling, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-12-04 13:08 UTC (permalink / raw)
  To: Linux RAID list



Am 03.12.22 um 19:27 schrieb David T-G:
> Anthony, et al --
> 
> ...and then Wols Lists said...
> % On 03/12/2022 05:58, David T-G wrote:
> % >
> % > I've finally convinced The Boss to spring for additional disks so that I
> % > can mirror, so our two servers both have SSD mirroring; yay.  The web
> % > server doesn't need much space, so it has a pair of 4T HDDs mirrored as
> % > well ... but as RAID10 since I thought that that was cool.  Ah, well.
> %
> % Raid 10 across two drives? Do I read you right? So you can easily add a 3rd
> % drive to get 6TB of usable storage, but raid 10 x 2 drives = raid 1 ...
> 
> Thanks for the aa/bb/cc non-symmetrical layout help recently.  I think I
> see where you're going here.  But that isn't what I have in this case.
> 
> Each disk is sliced into two large partitions that take up about half:
> 
> The two halves of each disk are then mirrored across -- BUT in an "X"
> layout.  Note that b1 pairs with c2 and c1 pairs with b2.
> 
> Finally, the mirrors are striped together (perhaps that should have been
> a linear instead) to make the final device.
> 
> The theory was that each disk would hold half of the total in the first
> half of its space and that md would be clever enough to ask the proper
> disk for the sector to keep the head in that short run.  Writes cover the
> whole disk one way or another, of course, but reads should require less
> head movement and be quicker.
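
[Editorial note: for concreteness, the cross-paired layout described
above could be built roughly like this, with placeholder partitions
sdb1/sdb2 and sdc1/sdc2:]

  # mirror the first half of each disk against the second half of the other
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc2
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdb2

  # then stripe (or linear-append) the two mirrors into the final device
  mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/md1 /dev/md2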

but common sense should tell you that's nonsense with slicing! that 
would only be true if these were separate *physical devices* instead of 
partitions on the same disk

partitions are always nonsense when it comes to performance; they are 
for logically separating data and nothing else



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: about linear and about RAID10
  2022-12-04  2:53                         ` David T-G
@ 2022-12-04 13:13                           ` Reindl Harald
  0 siblings, 0 replies; 62+ messages in thread
From: Reindl Harald @ 2022-12-04 13:13 UTC (permalink / raw)
  To: Linux RAID list



Am 04.12.22 um 03:53 schrieb David T-G:
> % Hmm ... so being pedantic about terminology that's a definite raid-1+0, not
> % linux-raid-10.
> 
> OK.  I'll try rereading the wiki pages and see where I went wrong.
> Perhaps they can be clarified.
it's simple: a *native* Linux mdadm RAID10 is not exactly the same as 
any other RAID10 built by combining RAID1+RAID0
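
[Editorial note: in mdadm terms the difference looks roughly like this;
device names are placeholders.  Native md raid10 is a single array with
its own driver, works with any number of devices (even two or three),
and offers near/far/offset data layouts.]

  # nested raid-1+0: two explicit mirrors striped together
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2

  # alternative: native md raid10 in the default "near 2" layout
  mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1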

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: batches and serial numbers
  2022-12-04  2:47                           ` batches and serial numbers David T-G
@ 2022-12-04 13:54                             ` Wols Lists
  0 siblings, 0 replies; 62+ messages in thread
From: Wols Lists @ 2022-12-04 13:54 UTC (permalink / raw)
  To: Linux RAID list

On 04/12/2022 02:47, David T-G wrote:
> % like 'em, but they're not good in raid. And the BarraCudas even less so!
> % I've got a nasty feeling your X300s are the same!
> 
> They have been good to me so far.  I was originally going to get N300s,
> but I couldn't at the time, and the X300s read as the same for everything
> I could find.  Does your N300 have ERC with a short timeout enabled by
> default?

Well, Phil Turmel (one of the best recovery guys here) has N300s as his 
"go to" drives. I trust his judgement :-)

When I did my search on X300s, they came across as "fast, 
high-performance desktop drives". And manufacturers have pretty much 
deleted ERC from desktop drives!

Given that these are CMR, though, I guess Toshiba thought "high 
performance, we need ERC to prevent disk hangs".

Whereas Barracudas were good desktop workhorse drives, and the 
BarraCudas are aimed at the same market. The BarraCudas are SMR, and 
presumably come with a big enough cache to guarantee decent desktop 
performance, but they aren't aimed at the performance market.

Good to know the X300s aren't going to drop people in it ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2022-12-04 13:54 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-23 22:07 how do i fix these RAID5 arrays? David T-G
2022-11-23 22:28 ` Roman Mamedov
2022-11-24  0:01   ` Roger Heflin
2022-11-24 21:20     ` David T-G
2022-11-24 21:49       ` Wol
2022-11-25 13:36         ` and dm-integrity, too (was "Re: how do i fix these RAID5 arrays?") David T-G
2022-11-24 21:10   ` how do i fix these RAID5 arrays? David T-G
2022-11-24 21:33     ` Wol
2022-11-25  1:16       ` Roger Heflin
2022-11-25 13:22         ` David T-G
     [not found]           ` <CAAMCDed1-4zFgHMS760dO1pThtkrn8K+FMuG-QQ+9W-FE0iq9Q@mail.gmail.com>
2022-11-25 19:49             ` David T-G
2022-11-28 14:24               ` md RAID0 can be grown (was "Re: how do i fix these RAID5 arrays?") David T-G
2022-11-29 21:17                 ` Jani Partanen
2022-11-29 22:22                   ` Roman Mamedov
2022-12-03  5:41                   ` md vs LVM and VMs and ... (was "Re: md RAID0 can be grown (was ...") David T-G
2022-12-03 12:06                     ` Wols Lists
2022-12-03 18:04                       ` batches and serial numbers (was "Re: md vs LVM and VMs and ...") David T-G
2022-12-03 20:07                         ` Wols Lists
2022-12-04  2:47                           ` batches and serial numbers David T-G
2022-12-04 13:54                             ` Wols Lists
2022-12-04 13:04                         ` batches and serial numbers (was "Re: md vs LVM and VMs and ...") Reindl Harald
2022-12-03  5:41                 ` md RAID0 can be grown David T-G
2022-11-25 13:30       ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") David T-G
2022-11-25 14:23         ` Wols Lists
2022-11-25 19:50           ` about linear and about RAID10 David T-G
2022-11-25 18:00         ` about linear and about RAID10 (was "Re: how do i fix these RAID5 arrays?") Roger Heflin
2022-11-28 14:46           ` about linear and about RAID10 David T-G
2022-11-28 15:32             ` Reindl Harald
     [not found]               ` <CAAMCDecXkcmUe=ZFnJ_NndND0C2=D5qSoj1Hohsrty8y1uqdfw@mail.gmail.com>
2022-11-28 17:03                 ` Reindl Harald
2022-11-28 20:45               ` John Stoffel
2022-12-03  5:58                 ` David T-G
2022-12-03 12:16                   ` Wols Lists
2022-12-03 18:27                     ` David T-G
2022-12-03 23:26                       ` Wol
2022-12-04  2:53                         ` David T-G
2022-12-04 13:13                           ` Reindl Harald
2022-12-04 13:08                       ` Reindl Harald
2022-12-03  5:45               ` David T-G
2022-12-03 12:20                 ` Reindl Harald
     [not found]             ` <CAAMCDee_YrhXo+5hp31YXgUHkyuUr-zTXOqi0-HUjMrHpYMkTQ@mail.gmail.com>
2022-12-03  5:52               ` stripe size checking (was "Re: about linear and about RAID10") David T-G
2022-11-25 14:49     ` how do i fix these RAID5 arrays? Wols Lists
2022-11-26 20:02       ` John Stoffel
2022-11-27  9:33         ` Wols Lists
2022-11-27 11:46         ` Reindl Harald
2022-11-27 11:52           ` Wols Lists
2022-11-27 12:06             ` Reindl Harald
2022-11-27 14:33               ` Wol
2022-11-27 18:08                 ` Roman Mamedov
2022-11-27 19:21                   ` Wol
2022-11-28  1:26                     ` Reindl Harald
2022-11-27 18:23                 ` Reindl Harald
2022-11-27 19:30                   ` Wol
2022-11-27 19:51                     ` Reindl Harald
2022-11-27 14:58           ` John Stoffel
2022-11-27 14:10         ` piergiorgio.sartor
2022-11-27 18:21           ` Reindl Harald
2022-11-27 19:37             ` Piergiorgio Sartor
2022-11-27 19:52               ` Reindl Harald
2022-11-27 22:05             ` Wol
2022-11-27 22:08               ` Reindl Harald
2022-11-27 22:11               ` Reindl Harald
2022-11-27 22:17               ` Roman Mamedov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.