* Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
@ 2022-01-09 14:21 Jaromír Cápík
  2022-01-10  9:00 ` Wols Lists
  0 siblings, 1 reply; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-09 14:21 UTC (permalink / raw)
  To: linux-raid

Good morning everyone.

After a discussion on the kernelnewbies IRC channel I'm asking you to take this
feature request into consideration.
I'd like to see a new mdadm switch --assume-all-dirty (or something more
suitable), used together with the --add switch, that would increase the MD RAID5
rebuild speed in the case of rotational drives by avoiding reading and checking
the chunk consistency on the newly added drive. It would change the rebuild
strategy in such a way that it would only read from the N-1 drives containing
valid data and only write to the newly added 'empty' drive during the rebuild.
That would increase the rebuild speed significantly when the array is full
enough that the parity can be considered inconsistent for most of the chunks.
In case of huge arrays (48TB in my case) the array rebuild takes a couple of
days with the current approach even when the array is idle and during that
time any of the drives could fail causing a fatal data loss.
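
To make the request concrete, the usage I have in mind would look something like
this (note: --assume-all-dirty is only the proposed switch, it does not exist in
mdadm today, and the array/device names are just examples):

mdadm /dev/md0 --add --assume-all-dirty /dev/sde1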

Does it make at least a bit of sense or my understanding and assumptions
are wrong?

Thank you,
Jaromir Capik

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-09 14:21 Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild Jaromír Cápík
@ 2022-01-10  9:00 ` Wols Lists
  2022-01-10 13:38   ` Jaromír Cápík
       [not found]   ` <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>
  0 siblings, 2 replies; 14+ messages in thread
From: Wols Lists @ 2022-01-10  9:00 UTC (permalink / raw)
  To: Jaromír Cápík, linux-raid

On 09/01/2022 14:21, Jaromír Cápík wrote:
> In case of huge arrays (48TB in my case) the array rebuild takes a couple of
> days with the current approach even when the array is idle and during that
> time any of the drives could fail causing a fatal data loss.
> 
> Does it make at least a bit of sense or my understanding and assumptions
> are wrong?

It does make sense, but have you read the code to see if it already does it?

And if it doesn't, someone's going to have to write it, in which case it
doesn't make sense not to have that as the default.

Bear in mind that rebuilding the array with a new drive is completely 
different logic to doing an integrity check, so will need its own code, 
so I expect it already works that way.

I think you've got two choices. Firstly, raid or not, you should have 
backups! Raid is for high-availability, not for keeping your data safe! 
And secondly, go raid-6 which gives you that bit extra redundancy.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-10  9:00 ` Wols Lists
@ 2022-01-10 13:38   ` Jaromír Cápík
  2022-01-10 14:07     ` Wols Lists
       [not found]   ` <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>
  1 sibling, 1 reply; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-10 13:38 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid

Nope, I haven't read the code. I only see a low sync speed (fluctuating from 20
to 80MB/s) whilst the drives can perform much better doing sequential reading
and writing (250MB/s per drive and up to 600MB/s all 4 drives in total).
During the sync I hear a high noise caused by heads flying there and back and
that smells.
The chosen drives have poor seeking performance and small caches and are
probably unable to reorder the operations to be more sequential. The whole
solution is 'economic' since the organisation owning the solution is poor and
cannot afford better hardware.
That also means RAID6 is not an option. But we shouldn't be looking for excuses
about what's wrong with the chosen scenario when the code is potentially
suboptimal :] We're trying to make Linux better, right? :]

I'm searching for someone, who knows the code well and can confirm my findings
or who could point me at anything I could try in order to increase the rebuild
speed. So far I've tried changing the readahead, minimum resync speed, stripe
cache size, but it increased the resync speed by few percent only.
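
For reference, these are the knobs I mean (the values are just examples of what
I tried, not recommendations, and md0 stands in for my array):

blockdev --setra 65536 /dev/md0                    # readahead on the MD device
echo 50000 > /proc/sys/dev/raid/speed_limit_min    # minimum resync speed, in KB/s
echo 32768 > /sys/block/md0/md/stripe_cache_size   # stripe cache for raid5/6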

I believe I would be able to write my own userspace application for rebuilding
the array offline with much higher speed ... just doing XOR of bytes at the same
offsets. That would prove the current rebuild strategy is suboptimal.

Of course it would mean new code if it doesn't work as suggested, and I know
it could be difficult, requiring deep knowledge of the linux-raid code that
unfortunately I don't have.

Any chance someone here could find time to look at that?

Thank you,
Jaromir Capik


On 09/01/2022 14:21, Jaromír Cápík wrote:

>> In case of huge arrays (48TB in my case) the array rebuild takes a couple of
>> days with the current approach even when the array is idle and during that
>> time any of the drives could fail causing a fatal data loss.
>>
>> Does it make at least a bit of sense or my understanding and assumptions
>> are wrong?
>
>It does make sense, but have you read the code to see if it already does it?
>
>And if it doesn't, someone's going to have to write it, in which case it
>doesn't make sense not to have that as the default.
>
>Bear in mind that rebuilding the array with a new drive is completely
>different logic to doing an integrity check, so will need its own code,
>so I expect it already works that way.
>
>I think you've got two choices. Firstly, raid or not, you should have
>backups! Raid is for high-availability, not for keeping your data safe!
>And secondly, go raid-6 which gives you that bit extra redundancy.
>
>Cheers,
>Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-10 13:38   ` Jaromír Cápík
@ 2022-01-10 14:07     ` Wols Lists
  2022-01-11 12:18       ` Jaromír Cápík
  0 siblings, 1 reply; 14+ messages in thread
From: Wols Lists @ 2022-01-10 14:07 UTC (permalink / raw)
  To: Jaromír Cápík; +Cc: linux-raid

On 10/01/2022 13:38, Jaromír Cápík wrote:
> Nope, I haven't read the code. I only see a low sync speed (fluctuating from 20
> to 80MB/s) whilst the drives can perform much better doing sequential reading
> and writing (250MB/s per drive and up to 600MB/s all 4 drives in total).
> During the sync I hear a high noise caused by heads flying there and back and
> that smells.

Okay, so read performance from the array is worse than you would expect 
from a single drive. And the heads should not be "flying there and back" 
- they should just be streaming data. That's actually worrying - a VERY 
plausible explanation is that your drives are on the verge of failure!!

> The chosen drives have poor seeking performance and small caches and are
> probably unable to reorder the operations to be more sequential. The whole
> solution is 'economic' since the organisation owning the solution is poor and
> cannot afford better hardware.

The drives shouldn't need to reorder the operations - a rebuild is an 
exercise in pure streaming ... unless there are so many badblocks the 
whole drive is a mess ...

> That also means RAID6 is not an option. But we shouldn't be looking for excuses
> about what's wrong with the chosen scenario when the code is potentially
> suboptimal :] We're trying to make Linux better, right? :]
> 
> I'm searching for someone, who knows the code well and can confirm my findings
> or who could point me at anything I could try in order to increase the rebuild
> speed. So far I've tried changing the readahead, minimum resync speed, stripe
> cache size, but it increased the resync speed by few percent only.

Actually, you might find (counter-intuitive though it sounds) REDUCING 
the max sync speed might be better ... I'd guess from what you say, 
about 60MB/s.
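
Something like this should do it (untested, the value is in KB/s, and md0 is
just an example name):

echo 60000 > /proc/sys/dev/raid/speed_limit_max   # system-wide cap
echo 60000 > /sys/block/md0/md/sync_speed_max     # or per-array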

The other thing is, could you be confusing MB and Mb? Three 250Mb drives 
would peak at about 80MB.
> 
> I believe I would be able to write my own userspace application for rebuilding
> the array offline with much higher speed ... just doing XOR of bytes at the same
> offsets. That would prove the current rebuild strategy is suboptimal.
> 
> Of course it would mean new code if it doesn't work as suggested, and I know
> it could be difficult, requiring deep knowledge of the linux-raid code that
> unfortunately I don't have.
> 
What make/model are your drives? What does smartctl say about them? And 
take a look at

https://www.ept.ca/features/everything-need-know-hard-drive-vibration/

The thing that worries me is your reference to repeated seeks. That 
should NOT be happening. Unless of course the system is in heavy use at 
the same time as the rebuild.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
       [not found]   ` <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>
@ 2022-01-11  9:59     ` Jaromír Cápík
  2022-01-11 16:53       ` Roger Heflin
  0 siblings, 1 reply; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-11  9:59 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

I just ran atop on different and much better hardware while doing mdadm --grow on a raid5 with 4 drives and it shows the following:

DSK | sdl | | busy 90% | read 950  | | write 502 | | KiB/r 1012 | KiB/w 506 | | MBr/s 94.0 | | MBw/s 24.9 | | avq 1.29 | avio 6.22 ms | |
DSK | sdk | | busy 89% | read 968  | | write 499 | | KiB/r 995  | KiB/w 509 | | MBr/s 94.1 | | MBw/s 24.8 | | avq 0.92 | avio 6.09 ms | |
DSK | sdj | | busy 88% | read 1004 | | write 503 | | KiB/r 958  | KiB/w 505 | | MBr/s 94.0 | | MBw/s 24.8 | | avq 0.66 | avio 5.91 ms | |
DSK | sdi | | busy 87% | read 1013 | | write 499 | | KiB/r 949  | KiB/w 509 | | MBr/s 94.0 | | MBw/s 24.8 | | avq 0.65 | avio 5.81 ms | |

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md3 : active raid5 sdi1[5] sdl1[6] sdk1[4] sdj1[2]
      46877237760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [=================>...]  resync = 88.5% (13834588672/15625745920) finish=293.1min speed=101843K/sec
      bitmap: 8/59 pages [32KB], 131072KB chunk

Surprisingly all 4 drives show approximately 94MB/s read and 25MB/s write.
Even when each of the drives can read 270MB/s and write 250MB/s, the sync speed is 100MB/s only, so?

Does --grow differ from --add?

Thanks,
Jaromir



---------- Original e-mail ----------
From: Roger Heflin <rogerheflin@gmail.com>
To: Wols Lists <antlists@youngman.org.uk>
Date: 11. 1. 2022 1:15:17
Subject: Re: Feature request: Add flag for assuming a new clean drive
 completely dirty when adding to a degraded raid5 array in order to increase
 the speed of the array rebuild

I just did a "--add" with sdd on a raid6 array missing a volume and here is what sar shows:

06:08:12 PM       sdb     91.03  34615.97      0.36      0.00    380.26      0.41      4.47     30.31
06:08:12 PM       sdc      0.02      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:08:12 PM       sdd     77.12     26.28  34563.36      0.00    448.54      0.64      8.23     27.40
06:08:12 PM       sde     36.45  34598.82      0.36      0.00    949.22      1.43     38.78     70.37
06:08:12 PM       sdf     46.87  34598.89      0.36      0.00    738.25      1.23     26.13     57.81

06:09:12 PM       sda      5.12      0.93     75.33      0.00     14.91      0.01      1.48      0.39
06:09:12 PM       sdb    122.57  46819.67      0.40      0.00    382.00      0.54      4.38     35.85
06:09:12 PM       sdc      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:09:12 PM       sdd    105.92      0.00  46775.73      0.00    441.63      1.12     10.53     35.80
06:09:12 PM       sde     48.47  46817.53      0.40      0.00    965.98      1.95     40.00     97.89
06:09:12 PM       sdf     56.95  46834.53      0.40      0.00    822.39      1.73     30.32     82.33


06:10:12 PM       sda      4.55      1.20     48.20      0.00     10.86      0.01      0.97      0.27

06:10:12 PM       sdb    123.67  46616.93      0.40      0.00    376.96      0.52      4.15     34.66
06:10:12 PM       sdc      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:10:12 PM       sdd    109.82      0.00  46623.40      0.00    424.56      1.30     11.80     36.15
06:10:12 PM       sde     49.18  46602.00      0.40      0.00    947.52      1.93     39.17     97.27
06:10:12 PM       sdf     54.88  46601.07      0.40      0.00    849.10      1.75     31.82     85.16


06:11:12 PM       sda      4.07      1.00     50.80      0.00     12.74      0.01      1.77      0.30

06:11:12 PM       sdb    121.93  46363.20      0.40      0.00    380.24      0.51      4.10     34.72
06:11:12 PM       sdc      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:11:12 PM       sdd    109.58      0.00  46372.47      0.00    423.17      1.37     12.44     35.69
06:11:12 PM       sde     49.38  46371.00      0.40      0.00    939.01      1.93     38.88     97.09
06:11:12 PM       sdf     55.12  46352.53      0.40      0.00    841.00      1.73     31.39     85.25


06:12:12 PM       sda      5.75     14.20     79.05      0.00     16.22      0.01      1.78      0.40

06:12:12 PM       sdb    120.73  45994.13      0.40      0.00    380.97      0.51      4.20     34.72
06:12:12 PM       sdc      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:12:12 PM       sdd    110.95      0.00  45982.87      0.00    414.45      1.43     12.81     35.39
06:12:12 PM       sde     49.63  46020.46      0.40      0.00    927.37      1.91     38.39     96.18
06:12:12 PM       sdf     54.27  46022.80      0.40      0.00    847.97      1.75     32.14     86.65



So there are very few reads going on for sdd, but a lot of reads of the other disks to recalculate what the data on that disk should be.

This is on raid6, but if raid6 is not doing a pointless check read on a new disk add, I would not expect raid5 to be.


This is on a 5.14 kernel.



On Mon, Jan 10, 2022 at 5:15 PM Wols Lists <antlists@youngman.org.uk> wrote:

On 09/01/2022 14:21, Jaromír Cápík wrote:

> In case of huge arrays (48TB in my case) the array rebuild takes a couple of
> days with the current approach even when the array is idle and during that
> time any of the drives could fail causing a fatal data loss.
>
> Does it make at least a bit of sense or my understanding and assumptions
> are wrong?

It does make sense, but have you read the code to see if it already does it?

And if it doesn't, someone's going to have to write it, in which case it
doesn't make sense not to have that as the default.

Bear in mind that rebuilding the array with a new drive is completely
different logic to doing an integrity check, so will need its own code,
so I expect it already works that way.

I think you've got two choices. Firstly, raid or not, you should have
backups! Raid is for high-availability, not for keeping your data safe!
And secondly, go raid-6 which gives you that bit extra redundancy.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-10 14:07     ` Wols Lists
@ 2022-01-11 12:18       ` Jaromír Cápík
  0 siblings, 0 replies; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-11 12:18 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid


>> Nope, I haven't read the code. I only see a low sync speed (fluctuating from 20
>> to 80MB/s) whilst the drives can perform much better doing sequential reading
>> and writing (250MB/s per drive and up to 600MB/s all 4 drives in total).
>> During the sync I hear a high noise caused by heads flying there and back and
>> that smells.
>
>Okay, so read performance from the array is worse than you would expect 
>from a single drive. And the heads should not be "flying there and back" 
>- they should just be streaming data. That's actually worrying - a VERY 
>plausible explanation is that your drives are on the verge of failure!!

Nope, the drives are new and OK ... of course I did a ton of tests
and the SMART is looking good ... no reallocated sectors, no pending sectors
and the array now (after the rebuild) works at the expected speed and
without noise ... just the resync was a total disaster.

>> The chosen drives have poor seeking performance and small caches and are
>> probably unable to reorder the operations to be more sequential. The whole
>> solution is 'economic' since the organisation owning the solution is poor and
>> cannot afford better hardware.
>
>The drives shouldn't need to reorder the operations - a rebuild is an
>exercise in pure streaming ... unless there are so many badblocks the
>whole drive is a mess ...

Yeah, I would expect that as well, but the reality was different.
As stated above, the drives are perfectly healthy.

>> That also means RAID6 is not an option. But we shouldn't be looking for excuses
>> about what's wrong with the chosen scenario when the code is potentially
>> suboptimal :] We're trying to make Linux better, right? :]
>>
>> I'm searching for someone, who knows the code well and can confirm my findings
>> or who could point me at anything I could try in order to increase the rebuild
>> speed. So far I've tried changing the readahead, minimum resync speed, stripe
>> cache size, but it increased the resync speed by few percent only.
>
>Actually, you might find (counter-intuitive though it sounds) REDUCING 
>the max sync speed might be better ... I'd guess from what you say,
>about 60MB/s.
>The other thing is, could you be confusing MB and Mb? Three 250Mb drives 
>would peak at about 80MB.

Nope, all units were Bytes.

>The thing that worries me is your reference to repeated seeks. That
>should NOT be happening. Unless of course the system is in heavy use at
>the same time as the rebuild.

Nope, the MD device was NOT mounted and no process was touching it.
In the case of this cheap HW I suspect a firmware bug in the SATA bridge triggering
the issue somehow, and therefore I'd like to focus on the second and better HW
I mentioned in my previous email addressed to Roger, where I hear no strange sounds.
But still, the resync speed is far below my expectations, and as far as I can remember
I was never really satisfied with the RAID5 sync speed.

The assembled array can do over 700MB/s when I temporarily freeze the sync,
but the sync speed is only 100MB/s ... why so?
Again, the MD device is completely idle ... not mounted and no process is touching it.
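
For the record, this is roughly how I freeze the sync and measure (commands from
memory, so treat them as a sketch):

echo frozen > /sys/block/md3/md/sync_action   # pause the resync
hdparm -tT /dev/md3                           # measure the assembled array
echo idle > /sys/block/md3/md/sync_action     # let the resync continue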

---
/dev/md3:
 Timing cached reads:   22440 MB in  1.99 seconds = 11285.30 MB/sec
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads: 2144 MB in  3.00 seconds = 713.91 MB/sec
---
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md3 : active raid5 sdi1[5] sdl1[6] sdk1[4] sdj1[2]
      46877237760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [==================>..]  resync = 93.6% (14637814004/15625745920) finish=161.8min speed=101758K/sec
      bitmap: 5/59 pages [20KB], 131072KB chunk
---

So, what's wrong with this picture?

Thx,
Jaromir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-11  9:59     ` Jaromír Cápík
@ 2022-01-11 16:53       ` Roger Heflin
  2022-01-12 18:45         ` Jaromír Cápík
  2022-01-14 14:43         ` Jaromír Cápík
  0 siblings, 2 replies; 14+ messages in thread
From: Roger Heflin @ 2022-01-11 16:53 UTC (permalink / raw)
  To: Jaromír Cápík; +Cc: Linux RAID, Wols Lists

I have noticed on simple tests making arrays with tmpfs that Intel
cpus seem to be able to xor about 2x the speed of AMD.   The speed may
also vary based on cpu generation.

Also, grow differs in the fact that blocks get moved around hence the writes.

On the raid you are building, is there other IO going on to the disks?
 This will cause seeks and the more io there is (outside of the
rebuild) the worse it will be.

Here is everything I set on my arrays:
find /sys -name "*sync_speed_min*" -exec /usr/local/bin/set_sync_speed 15000 {} \;
# MB Intel controller
find /sys/devices/pci0000:00/0000:00:1f.2/ -name "*queue_depth*" -exec /usr/local/bin/set_queue_depth 1 {} \;
find /sys/devices/pci0000:00/0000:00:1f.2/ -name "nr_requests" -exec /usr/local/bin/set_queue_depth 4 {} \;
#
# AMD FM2 MB
find /sys/devices/pci0000:00/0000:00:11.0/ -name "queue_depth" -exec /usr/local/bin/set_queue_depth 8 {} \;
find /sys/devices/pci0000:00/0000:00:11.0/ -name "nr_requests" -exec /usr/local/bin/set_queue_depth 16 {} \;
echo 30000 > /proc/sys/dev/raid/speed_limit_min
for mddev in md13 md14 md15 md16 md17 md18 ; do
  blockdev --setfra 65536 /dev/${mddev}
  blockdev --setra 65536 /dev/${mddev}
  echo 32768 > /sys/block/${mddev}/md/stripe_cache_size
  echo 30000 > /sys/block/${mddev}/md/sync_speed_min
  echo 2 > /sys/block/${mddev}/md/group_thread_cnt
done

You will need to adjust my find/pci* devices to find your device, and
you will need to test some with the queue_depth/nr_requests to see
what is best for your controller/disk combination.   You may want to
also test different values with the group_thread_cnt.

The set_queue_depth file (and the set_sync_speed file) looks like this:
cat /usr/local/bin/set_queue_depth
echo $1 > $2

On mine you will notice I have 6 arrays; 4 of those arrays are 3TB
disks split into 4 750GB partitions to minimize the time for a single
grow to complete.

The other 2 are the remaining 3TB of space split into 2 1.5TB spaces to
also minimize the grow time.

I have also found that when a disk fails, often only a single partition
gets a bad block and fails, and so I only have to --re-add/--add one
device.

And if the disk has not failed you can do a --replace so long as you
can get the old and new devices in the chassis.   With the multiple
partitions it usually means I only have 1 of 4 partitions that failed
in mdadm, and so a re-add gets that one to work and I can then do the
replace which just reads from the disk it is replacing and as such is
much faster.
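
For example, roughly like this (sdX6 is just a placeholder for the new disk, and
the md16/sdh6 names simply match my layout below):

mdadm /dev/md16 --add /dev/sdX6                        # new disk goes in as a spare first
mdadm /dev/md16 --replace /dev/sdh6 --with /dev/sdX6   # then copy the old member onto it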

I also carefully set up the disk partition naming such that the last
digit of the partitions matches the last digit of the md, i.e.:
md16 : active raid6 sdh6[10] sdi6[12] sdj6[7] sdg6[9] sde6[1] sdb6[8] sdf6[11]
      3615495680 blocks super 1.2 level 6, 512k chunk, algorithm 2
[7/7] [UUUUUUU]
      bitmap: 0/6 pages [0KB], 65536KB chunk

as that makes the adding/re-adding simpler as I know which device it always is.


On Tue, Jan 11, 2022 at 3:59 AM Jaromír Cápík <jaromir.capik@email.cz> wrote:
>
> Hello Roger.
>
> I just ran atop on different and much better hardware while doing mdadm --grow on a raid5 with 4 drives and it shows the following:
>
> DSK | sdl | | busy 90% | read 950  | | write 502 | | KiB/r 1012 | KiB/w 506 | | MBr/s 94.0 | | MBw/s 24.9 | | avq 1.29 | avio 6.22 ms | |
> DSK | sdk | | busy 89% | read 968  | | write 499 | | KiB/r 995  | KiB/w 509 | | MBr/s 94.1 | | MBw/s 24.8 | | avq 0.92 | avio 6.09 ms | |
> DSK | sdj | | busy 88% | read 1004 | | write 503 | | KiB/r 958  | KiB/w 505 | | MBr/s 94.0 | | MBw/s 24.8 | | avq 0.66 | avio 5.91 ms | |
> DSK | sdi | | busy 87% | read 1013 | | write 499 | | KiB/r 949  | KiB/w 509 | | MBr/s 94.0 | | MBw/s 24.8 | | avq 0.65 | avio 5.81 ms | |
>
> Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
> md3 : active raid5 sdi1[5] sdl1[6] sdk1[4] sdj1[2]
>       46877237760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
>       [=================>...]  resync = 88.5% (13834588672/15625745920) finish=293.1min speed=101843K/sec
>       bitmap: 8/59 pages [32KB], 131072KB chunk
>
> Surprisingly all 4 drives show approximately 94MB/s read and 25MB/s write.
> Even when each of the drives can read 270MB/s and write 250MB/s, the sync speed is 100MB/s only, so?
>
> Does --grow differ from --add?
>
> Thanks,
> Jaromir
>
>
>
> ---------- Original e-mail ----------
> From: Roger Heflin <rogerheflin@gmail.com>
> To: Wols Lists <antlists@youngman.org.uk>
> Date: 11. 1. 2022 1:15:17
> Subject: Re: Feature request: Add flag for assuming a new clean drive
>  completely dirty when adding to a degraded raid5 array in order to increase
>  the speed of the array rebuild
>
> I just did a "--add" with sdd on a raid6 array missing a volume and here is what sar shows:
>
> 06:08:12 PM       sdb     91.03  34615.97      0.36      0.00    380.26      0.41      4.47     30.31
> 06:08:12 PM       sdc      0.02      0.00      0.00      0.00      0.00      0.00      0.00      0.00
> 06:08:12 PM       sdd     77.12     26.28  34563.36      0.00    448.54      0.64      8.23     27.40
> 06:08:12 PM       sde     36.45  34598.82      0.36      0.00    949.22      1.43     38.78     70.37
> 06:08:12 PM       sdf     46.87  34598.89      0.36      0.00    738.25      1.23     26.13     57.81
>
> 06:09:12 PM       sda      5.12      0.93     75.33      0.00     14.91      0.01      1.48      0.39
> 06:09:12 PM       sdb    122.57  46819.67      0.40      0.00    382.00      0.54      4.38     35.85
> 06:09:12 PM       sdc      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
> 06:09:12 PM       sdd    105.92      0.00  46775.73      0.00    441.63      1.12     10.53     35.80
> 06:09:12 PM       sde     48.47  46817.53      0.40      0.00    965.98      1.95     40.00     97.89
> 06:09:12 PM       sdf     56.95  46834.53      0.40      0.00    822.39      1.73     30.32     82.33
>
>
> 06:10:12 PM       sda      4.55      1.20     48.20      0.00     10.86      0.01      0.97      0.27
>
> 06:10:12 PM       sdb    123.67  46616.93      0.40      0.00    376.96      0.52      4.15     34.66
> 06:10:12 PM       sdc      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
> 06:10:12 PM       sdd    109.82      0.00  46623.40      0.00    424.56      1.30     11.80     36.15
> 06:10:12 PM       sde     49.18  46602.00      0.40      0.00    947.52      1.93     39.17     97.27
> 06:10:12 PM       sdf     54.88  46601.07      0.40      0.00    849.10      1.75     31.82     85.16
>
>
> 06:11:12 PM       sda      4.07      1.00     50.80      0.00     12.74      0.01      1.77      0.30
>
> 06:11:12 PM       sdb    121.93  46363.20      0.40      0.00    380.24      0.51      4.10     34.72
> 06:11:12 PM       sdc      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
> 06:11:12 PM       sdd    109.58      0.00  46372.47      0.00    423.17      1.37     12.44     35.69
> 06:11:12 PM       sde     49.38  46371.00      0.40      0.00    939.01      1.93     38.88     97.09
> 06:11:12 PM       sdf     55.12  46352.53      0.40      0.00    841.00      1.73     31.39     85.25
>
>
> 06:12:12 PM       sda      5.75     14.20     79.05      0.00     16.22      0.01      1.78      0.40
>
> 06:12:12 PM       sdb    120.73  45994.13      0.40      0.00    380.97      0.51      4.20     34.72
> 06:12:12 PM       sdc      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
> 06:12:12 PM       sdd    110.95      0.00  45982.87      0.00    414.45      1.43     12.81     35.39
> 06:12:12 PM       sde     49.63  46020.46      0.40      0.00    927.37      1.91     38.39     96.18
> 06:12:12 PM       sdf     54.27  46022.80      0.40      0.00    847.97      1.75     32.14     86.65
>
>
>
> So there are very few reads going on for sdd, but a lot of reads of the other disks to recalculate what the data on that disk should be.
>
> This is on raid6, but if raid6 is not doing a pointless check read on a new disk add, I would not expect raid5 to be.
>
>
> This is on a 5.14 kernel.
>
>
>
> On Mon, Jan 10, 2022 at 5:15 PM Wols Lists <antlists@youngman.org.uk> wrote:
>
> On 09/01/2022 14:21, Jaromír Cápík wrote:
>
> > In case of huge arrays (48TB in my case) the array rebuild takes a couple of
> > days with the current approach even when the array is idle and during that
> > time any of the drives could fail causing a fatal data loss.
> >
> > Does it make at least a bit of sense or my understanding and assumptions
> > are wrong?
>
> It does make sense, but have you read the code to see if it already does it?
>
> And if it doesn't, someone's going to have to write it, in which case it
> doesn't make sense not to have that as the default.
>
> Bear in mind that rebuilding the array with a new drive is completely
> different logic to doing an integrity check, so will need its own code,
> so I expect it already works that way.
>
> I think you've got two choices. Firstly, raid or not, you should have
> backups! Raid is for high-availability, not for keeping your data safe!
> And secondly, go raid-6 which gives you that bit extra redundancy.
>
> Cheers,
> Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-11 16:53       ` Roger Heflin
@ 2022-01-12 18:45         ` Jaromír Cápík
  2022-01-14 14:10           ` Jaromír Cápík
  2022-01-14 14:43         ` Jaromír Cápík
  1 sibling, 1 reply; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-12 18:45 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

>I have noticed on simple tests making arrays with tmpfs that Intel
>cpus seem to be able to xor about 2x the speed of AMD.   The speed may
>also vary based on cpu generation.

It was Intel in both cases and the CPU loads were low.


> Also, grow differs in the fact that blocks get moved around hence the writes.

Of course, but even so the speed was poor :]


> On the raid you are building, is there other IO going on to the disks?

Nope, the array was not mounted and no process was touching the MD device
during the rebuild.


> And if the disk has not failed you can do a --replace so long as you

In the first case there was no space left for another drive ... however,
sometimes I can do the rebuild offline and in such cases I'd rather
clone the content of the old drive with dd. It's much faster.
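
Something like this, just as an illustration (device names are placeholders
and the array is stopped first):

dd if=/dev/sdX1 of=/dev/sdY1 bs=64M status=progress   # old member -> new drive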

> I also carefully setup the disk partition naming such that the last
> digit of the partitions matches the last digit of the md as that makes
> the adding/re-adding simpler as I know which device it always is.

I usually do that on servers where I have multiple raid partitions.

Thx,
Jaromir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-12 18:45         ` Jaromír Cápík
@ 2022-01-14 14:10           ` Jaromír Cápík
  2022-01-14 17:53             ` Roger Heflin
  0 siblings, 1 reply; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-14 14:10 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

>> Also, grow differs in the fact that blocks get moved around hence the writes.
>
> Of course, but even so the speed was poor :]

Just for info, I just tried to dd data on the second (faster) hardware
from one RAID drive to an empty one with MD offline, and the transfer
speed is the following ...

25423+0 records in
25422+0 records out
26656899072 bytes (27 GB, 25 GiB) copied, 105.227 s, 253 MB/s
25903+0 records in
25902+0 records out
27160215552 bytes (27 GB, 25 GiB) copied, 107.235 s, 253 MB/s
26386+0 records in
26385+0 records out
27666677760 bytes (28 GB, 26 GiB) copied, 109.245 s, 253 MB/s

as you can see, the write speed really matches the speed from the drive datasheet,
so ... why was the sync speed only 100MB/s when the CPU load was low?
Can you explain that?

Thanks,
Jaromir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-11 16:53       ` Roger Heflin
  2022-01-12 18:45         ` Jaromír Cápík
@ 2022-01-14 14:43         ` Jaromír Cápík
  1 sibling, 0 replies; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-14 14:43 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger

> I have noticed on simple tests making arrays with tmpfs that Intel
> cpus seem to be able to xor about 2x the speed of AMD.   The speed may
> also vary based on cpu generation.

I forgot to mention I did some tests here with -O2 optimized C code
and 1 core of the CPU can XOR approximately 7.11GB of data per second,
so ... not really a bottleneck here.

Regards,
Jaromir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-14 14:10           ` Jaromír Cápík
@ 2022-01-14 17:53             ` Roger Heflin
  2022-01-17 17:59               ` Jaromír Cápík
  0 siblings, 1 reply; 14+ messages in thread
From: Roger Heflin @ 2022-01-14 17:53 UTC (permalink / raw)
  To: Jaromír Cápík; +Cc: Linux RAID, Wols Lists

That is typically 100MB/sec per disk as it is reported, and that is a
typical speed I have seen for a rebuild and/or grow.

There are almost certainly algorithm sync points that constrain the
speed to less than the full streaming speed of all disks.

The algorithm may well be: read the stripe, process the stripe and
write out the new stripe, and start over (in a linear manner). I would
expect that to be the easiest to keep track of, and that would roughly
give your speed (it costs a read to each old disk + a write to the new
disk + bookkeeping writes + parity calc). Setting up the code such
that it overlaps the operations is going to complicate the code, and
as such was likely not done.

And regardless of the client's only being able to run raid5, there are
significant risks to running raid5. If on the rebuild you find a bad
block on one of the other disks then you have lost data, and that is
very likely to happen -- that exact failure was the first raid failure I
saw 28+ years ago.

How often are you replacing/rebuilding the disks and why?

On Fri, Jan 14, 2022 at 8:10 AM Jaromír Cápík <jaromir.capik@email.cz> wrote:
>
> Hello Roger.
>
> >> Also, grow differs in the fact that blocks get moved around hence the writes.
> >
> > Of course, but even so the speed was poor :]
>
> Just for info, I just tried to dd data on the second (faster) hardware
> from one RAID drive to an empty one with MD offline, and the transfer
> speed is the following ...
>
> 25423+0 records in
> 25422+0 records out
> 26656899072 bytes (27 GB, 25 GiB) copied, 105.227 s, 253 MB/s
> 25903+0 records in
> 25902+0 records out
> 27160215552 bytes (27 GB, 25 GiB) copied, 107.235 s, 253 MB/s
> 26386+0 records in
> 26385+0 records out
> 27666677760 bytes (28 GB, 26 GiB) copied, 109.245 s, 253 MB/s
>
> as you can see, the write speed really matches the speed from the drive datasheet,
> so ... why was the sync speed only 100MB/s when the CPU load was low?
> Can you explain that?
>
> Thanks,
> Jaromir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-14 17:53             ` Roger Heflin
@ 2022-01-17 17:59               ` Jaromír Cápík
  2022-01-17 19:59                 ` Wol
  0 siblings, 1 reply; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-17 17:59 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

>That is typically 100MB/sec per disk as it is reported, and that is a
>typical speed I have seen for a rebuild and/or grow.
>
>There are almost certainly algorithm sync points that constrain the
>speed to less than the full streaming speed of all disks.
>
>The algorithm may well be: read the stripe, process the stripe and
>write out the new stripe, and start over (in a linear manner). I would
>expect that to be the easiest to keep track of, and that would roughly
>give your speed (it costs a read to each old disk + a write to the new
>disk + bookkeeping writes + parity calc). Setting up the code such
>that it overlaps the operations is going to complicate the code, and
>as such was likely not done.

Yeah, I'm pretty sure the current behavior is suboptimal just because
it was easier to implement. And ... surprisingly ... this
feature request is my clumsy attempt to convince someone amazing and clever
to change that, because ... we love Linux and wanna see it rock! Right? :D


>And regardless of the client's only being able to run raid5, there are
>significant risks to running raid5. If on the rebuild you find a bad
>block on one of the other disks then you have lost data, and that is
>very likely to happen -- that exact failure was the first raid failure I
>saw 28+ years ago.

I'm aware of the risks ... but losing a file or two is still much better
than losing the whole array just because of the low sync speed when you
need to operate the array in degraded mode for 3 days instead of 1 day.
Making it faster seems quite important / reasonable to me.


>How often are you replacing/rebuilding the disks and why?

A few times a year, for different reasons. Usually requests for higher capacity,
where I need to replace all drives one by one and then grow the array.
Sometimes reallocated sectors appear in the SMART output and I never
leave such drives in the array, considering them unreliable. The --replace
feature is nice, but often there's no room for one more drive in the
chassis, and going that way requires an external USB3 rack and a bit of
magic if the operation cannot be done offline.
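
For completeness, the cycle I mean looks roughly like this per drive (device
names are placeholders and the commands are just a sketch):

mdadm /dev/md3 --fail /dev/sdX1 --remove /dev/sdX1   # retire the old drive
# ... physically swap the disk ...
mdadm /dev/md3 --add /dev/sdX1                       # then wait for the (slow) rebuild
# once all members have been replaced with bigger ones:
mdadm --grow /dev/md3 --size=max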

So, I still hope someone will find enough courage one day to implement
the new optional sync strategy :)

BR, J.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-17 17:59               ` Jaromír Cápík
@ 2022-01-17 19:59                 ` Wol
  2022-01-18 12:45                   ` Jaromír Cápík
  0 siblings, 1 reply; 14+ messages in thread
From: Wol @ 2022-01-17 19:59 UTC (permalink / raw)
  To: Jaromír Cápík, Roger Heflin; +Cc: Linux RAID

On 17/01/2022 17:59, Jaromír Cápík wrote:
> A few times a year, for different reasons. Usually requests for higher capacity,
> where I need to replace all drives one by one and then grow the array.
> Sometimes reallocated sectors appear in the SMART output and I never
> leave such drives in the array, considering them unreliable. The --replace
> feature is nice, but often there's no room for one more drive in the
> chassis, and going that way requires an external USB3 rack and a bit of
> magic if the operation cannot be done offline.
> 
You seen the stuff about running raid over USB? Very unwise?

Do you have room for an eSATA card? If so, get an external SATA cage, 
and you can swap a drive out into the cage, rebuild the array with 
--replace, and repeat. Much safer, probably much quicker, and no extra 
work shutting down the system to replace each drive in turn. And if your 
chassis is hot-swap, then all the better, no downtime at all. Just a few 
moments' danger each time you physically swap a drive.

> So, I still hope someone will find enough courage one day to implement
> the new optional sync strategy:)

There is the argument that increasing the load on the drive increases 
the risk to the drive ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-17 19:59                 ` Wol
@ 2022-01-18 12:45                   ` Jaromír Cápík
  0 siblings, 0 replies; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-18 12:45 UTC (permalink / raw)
  To: Wol; +Cc: Linux RAID, Roger Heflin


>> chasis and going that way requires an external USB3 rack and a bit of
>> magic if the operation cannot be done offline.
>
> You seen the stuff about running raid over USB? Very unwise?

Nope, URL?

I mirror the drive in my Intel NUC router to a USB3 drive and it has worked
for years with no problems. Of course I'm aware of the lower reliability,
but I hope I'll get I/O errors if something fails. Am I wrong?

Thx, J.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-01-18 12:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-09 14:21 Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild Jaromír Cápík
2022-01-10  9:00 ` Wols Lists
2022-01-10 13:38   ` Jaromír Cápík
2022-01-10 14:07     ` Wols Lists
2022-01-11 12:18       ` Jaromír Cápík
     [not found]   ` <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>
2022-01-11  9:59     ` Jaromír Cápík
2022-01-11 16:53       ` Roger Heflin
2022-01-12 18:45         ` Jaromír Cápík
2022-01-14 14:10           ` Jaromír Cápík
2022-01-14 17:53             ` Roger Heflin
2022-01-17 17:59               ` Jaromír Cápík
2022-01-17 19:59                 ` Wol
2022-01-18 12:45                   ` Jaromír Cápík
2022-01-14 14:43         ` Jaromír Cápík
