* Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
@ 2022-01-09 14:21 Jaromír Cápík
  2022-01-10  9:00 ` Wols Lists
  0 siblings, 1 reply; 14+ messages in thread

From: Jaromír Cápík @ 2022-01-09 14:21 UTC (permalink / raw)
To: linux-raid

Good morning everyone.

After a discussion on the kernelnewbies IRC channel, I'd like to ask you to
take this feature request into consideration.

I'd like to see a new mdadm switch --assume-all-dirty (or something more
suitable), used together with the --add switch, that would increase the MD
RAID5 rebuild speed in the case of rotational drives by avoiding reading and
checking chunk consistency on the newly added drive. It would change the
rebuild strategy in such a way that it would only read from the N-1 drives
containing valid data and only write to the newly added 'empty' drive during
the rebuild. That would increase the rebuild speed significantly when the
array is full enough that the parity can be considered inconsistent for most
of the chunks.

In case of huge arrays (48TB in my case) the array rebuild takes a couple of
days with the current approach even when the array is idle and during that
time any of the drives could fail causing a fatal data loss.

Does it make at least a bit of sense or my understanding and assumptions
are wrong?

Thank you,
Jaromir Capik

^ permalink raw reply	[flat|nested] 14+ messages in thread
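[Editor's aside: the proposed rebuild strategy — reconstruct the missing member purely from the N-1 surviving drives — boils down to a sequential XOR over chunks. A minimal sketch in Python of that reconstruction rule (toy in-memory chunks; a real MD rebuild works on pages inside the kernel, so this only illustrates the arithmetic, not the md code):]

```python
from functools import reduce

def rebuild_chunk(surviving_chunks):
    """Reconstruct the missing RAID5 member chunk as the XOR of the
    N-1 surviving chunks (data and parity are interchangeable under XOR)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  surviving_chunks)

# Toy 4-drive example: three members survive, the fourth is being rebuilt.
d0 = bytes([0x11, 0x22])
d1 = bytes([0x33, 0x44])
d2 = bytes([0x55, 0x66])
parity = rebuild_chunk([d0, d1, d2])          # what the new drive receives
assert rebuild_chunk([d0, d1, parity]) == d2  # XOR recovers any one member
```

Note the symmetry: the same XOR that writes the replacement drive also recovers any single lost data chunk, which is why the rebuild never needs to read the new drive at all.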
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-09 14:21 Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild Jaromír Cápík
@ 2022-01-10  9:00 ` Wols Lists
  2022-01-10 13:38 ` Jaromír Cápík
  [not found] ` <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>
  0 siblings, 2 replies; 14+ messages in thread

From: Wols Lists @ 2022-01-10 9:00 UTC (permalink / raw)
To: Jaromír Cápík, linux-raid

On 09/01/2022 14:21, Jaromír Cápík wrote:
> In case of huge arrays (48TB in my case) the array rebuild takes a couple of
> days with the current approach even when the array is idle and during that
> time any of the drives could fail causing a fatal data loss.
>
> Does it make at least a bit of sense or my understanding and assumptions
> are wrong?

It does make sense, but have you read the code to see if it already does it?

And if it doesn't, someone's going to have to write it, in which case it
doesn't make sense not to have that as the default.

Bear in mind that rebuilding the array with a new drive is completely
different logic to doing an integrity check, so will need its own code,
so I expect it already works that way.

I think you've got two choices. Firstly, raid or not, you should have
backups! Raid is for high-availability, not for keeping your data safe!
And secondly, go raid-6 which gives you that bit extra redundancy.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-10  9:00 ` Wols Lists
@ 2022-01-10 13:38 ` Jaromír Cápík
  2022-01-10 14:07 ` Wols Lists
  [not found] ` <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>
  1 sibling, 1 reply; 14+ messages in thread

From: Jaromír Cápík @ 2022-01-10 13:38 UTC (permalink / raw)
To: Wols Lists; +Cc: linux-raid

Nope, I haven't read the code. I only see a low sync speed (fluctuating from 20
to 80MB/s) whilst the drives can perform much better doing sequential reading
and writing (250MB/s per drive and up to 600MB/s for all 4 drives in total).
During the sync I hear a lot of noise caused by the heads flying back and
forth, and that smells.

The chosen drives have poor seeking performance and small caches and are
probably unable to reorder the operations to be more sequential. The whole
solution is 'economic' since the organisation owning the solution is poor and
cannot afford better hardware. That also means RAID6 is not an option. But we
shouldn't search for excuses about what's wrong with the chosen scenario when
the code is potentially suboptimal :] We're trying to make Linux better,
right? :]

I'm searching for someone who knows the code well and can confirm my findings,
or who could point me at anything I could try in order to increase the rebuild
speed. So far I've tried changing the readahead, minimum resync speed and
stripe cache size, but that increased the resync speed by a few percent only.

I believe I would be able to write my own userspace application for rebuilding
the array offline with much higher speed ... just doing XOR of bytes at the
same offsets. That would prove the current rebuild strategy is suboptimal.

Of course it would mean new code if it doesn't work as suggested, and I know
it could be difficult, requiring a deep knowledge of the linux-raid code that
unfortunately I don't have.

Any chance someone here could find time to look at that?

Thank you,
Jaromir Capik

On 09/01/2022 14:21, Jaromír Cápík wrote:
>> In case of huge arrays (48TB in my case) the array rebuild takes a couple of
>> days with the current approach even when the array is idle and during that
>> time any of the drives could fail causing a fatal data loss.
>>
>> Does it make at least a bit of sense or my understanding and assumptions
>> are wrong?
>
> It does make sense, but have you read the code to see if it already does it?
>
> And if it doesn't, someone's going to have to write it, in which case it
> doesn't make sense not to have that as the default.
>
> Bear in mind that rebuilding the array with a new drive is completely
> different logic to doing an integrity check, so will need its own code,
> so I expect it already works that way.
>
> I think you've got two choices. Firstly, raid or not, you should have
> backups! Raid is for high-availability, not for keeping your data safe!
> And secondly, go raid-6 which gives you that bit extra redundancy.
>
> Cheers,
> Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-10 13:38 ` Jaromír Cápík
@ 2022-01-10 14:07 ` Wols Lists
  2022-01-11 12:18 ` Jaromír Cápík
  0 siblings, 1 reply; 14+ messages in thread

From: Wols Lists @ 2022-01-10 14:07 UTC (permalink / raw)
To: Jaromír Cápík; +Cc: linux-raid

On 10/01/2022 13:38, Jaromír Cápík wrote:
> Nope, I haven't read the code. I only see a low sync speed (fluctuating from 20
> to 80MB/s) whilst the drives can perform much better doing sequential reading
> and writing (250MB/s per drive and up to 600MB/s for all 4 drives in total).
> During the sync I hear a lot of noise caused by the heads flying back and
> forth, and that smells.

Okay, so read performance from the array is worse than you would expect
from a single drive. And the heads should not be "flying there and back"
- they should just be streaming data. That's actually worrying - a VERY
plausible explanation is that your drives are on the verge of failure!!

> The chosen drives have poor seeking performance and small caches and are
> probably unable to reorder the operations to be more sequential. The whole
> solution is 'economic' since the organisation owning the solution is poor and
> cannot afford better hardware.

The drives shouldn't need to reorder the operations - a rebuild is an
exercise in pure streaming ... unless there are so many badblocks the
whole drive is a mess ...

> That also means RAID6 is not an option. But we shouldn't search for excuses
> about what's wrong with the chosen scenario when the code is potentially
> suboptimal :] We're trying to make Linux better, right? :]
>
> I'm searching for someone who knows the code well and can confirm my findings,
> or who could point me at anything I could try in order to increase the rebuild
> speed. So far I've tried changing the readahead, minimum resync speed and
> stripe cache size, but that increased the resync speed by a few percent only.

Actually, you might find (counter-intuitive though it sounds) REDUCING
the max sync speed might be better ... I'd guess from what you say,
about 60MB/s.

The other thing is, could you be confusing MB and Mb? Three 250Mb drives
would peak at about 80MB.

> I believe I would be able to write my own userspace application for rebuilding
> the array offline with much higher speed ... just doing XOR of bytes at the
> same offsets. That would prove the current rebuild strategy is suboptimal.
>
> Of course it would mean new code if it doesn't work as suggested, and I know
> it could be difficult, requiring a deep knowledge of the linux-raid code that
> unfortunately I don't have.

What make/model are your drives? What does smartctl say about them? And
take a look at
https://www.ept.ca/features/everything-need-know-hard-drive-vibration/

The thing that worries me is your reference to repeated seeks. That
should NOT be happening. Unless of course the system is in heavy use at
the same time as the rebuild.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-10 14:07 ` Wols Lists
@ 2022-01-11 12:18 ` Jaromír Cápík
  0 siblings, 0 replies; 14+ messages in thread

From: Jaromír Cápík @ 2022-01-11 12:18 UTC (permalink / raw)
To: Wols Lists; +Cc: linux-raid

>> Nope, I haven't read the code. I only see a low sync speed (fluctuating from 20
>> to 80MB/s) whilst the drives can perform much better doing sequential reading
>> and writing (250MB/s per drive and up to 600MB/s for all 4 drives in total).
>> During the sync I hear a lot of noise caused by the heads flying back and
>> forth, and that smells.
>
> Okay, so read performance from the array is worse than you would expect
> from a single drive. And the heads should not be "flying there and back"
> - they should just be streaming data. That's actually worrying - a VERY
> plausible explanation is that your drives are on the verge of failure!!

Nope, the drives are new and OK ... of course I did a ton of tests and the
SMART data is looking good ... no reallocated sectors, no pending sectors, and
the array now (after the rebuild) works at the expected speed and without
noise ... just the resync was a total disaster.

>> The chosen drives have poor seeking performance and small caches and are
>> probably unable to reorder the operations to be more sequential. The whole
>> solution is 'economic' since the organisation owning the solution is poor and
>> cannot afford better hardware.
>
> The drives shouldn't need to reorder the operations - a rebuild is an
> exercise in pure streaming ... unless there are so many badblocks the
> whole drive is a mess ...

Yeah, I would expect that as well, but the reality was different. As stated
above, the drives are perfectly healthy.

> The thing that worries me is your reference to repeated seeks. That
> should NOT be happening. Unless of course the system is in heavy use at
> the same time as the rebuild.

Nope, the MD device was NOT mounted and no process was touching it. In case of
this cheap HW I suspect a firmware bug in the SATA bridge triggering the issue
somehow, and therefore I'd like to focus on the second and better HW I
mentioned in my previous email addressed to Roger, where I hear no strange
sounds, but still, the resync speed is far below my expectations, and as far
as I can remember I was never really satisfied with the RAID5 sync speed.

The assembled array can do over 700MB/s when I temporarily freeze the sync,
but the sync speed is only 100MB/s ... why so? Again, the MD device is
completely idle ... not mounted and no process is touching it.

---
/dev/md3:
 Timing cached reads:   22440 MB in 1.99 seconds = 11285.30 MB/sec
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads: 2144 MB in 3.00 seconds = 713.91 MB/sec
---
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md3 : active raid5 sdi1[5] sdl1[6] sdk1[4] sdj1[2]
      46877237760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [==================>..]  resync = 93.6% (14637814004/15625745920) finish=161.8min speed=101758K/sec
      bitmap: 5/59 pages [20KB], 131072KB chunk
---

So, what's wrong with this picture?

Thx,
Jaromir.

^ permalink raw reply	[flat|nested] 14+ messages in thread
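[Editor's aside: the finish estimate in the /proc/mdstat output above is self-consistent — the remaining 1K blocks divided by the reported speed give the quoted minutes. A quick check, using only the numbers from the mdstat line:]

```python
# Values copied from the /proc/mdstat resync line above.
done, total = 14637814004, 15625745920   # 1K blocks
speed = 101758                           # K/sec

remaining_min = (total - done) / speed / 60
print(round(remaining_min, 1))  # matches the reported finish=161.8min
```

So mdstat's own accounting is fine; the open question is why the sustained per-drive rate sits near 100MB/s rather than the ~250MB/s streaming rate.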
[parent not found: <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>]
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  [not found] ` <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>
@ 2022-01-11  9:59 ` Jaromír Cápík
  2022-01-11 16:53 ` Roger Heflin
  0 siblings, 1 reply; 14+ messages in thread

From: Jaromír Cápík @ 2022-01-11 9:59 UTC (permalink / raw)
To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

I just ran atop on a different and much better hardware doing mdadm --grow on
raid5 with 4 drives and it shows the following:

DSK | sdl | busy 90% | read 950 | write 502 | KiB/r 1012 | KiB/w 506 | MBr/s 94.0 | MBw/s 24.9 | avq 1.29 | avio 6.22 ms |
DSK | sdk | busy 89% | read 968 | write 499 | KiB/r 995 | KiB/w 509 | MBr/s 94.1 | MBw/s 24.8 | avq 0.92 | avio 6.09 ms |
DSK | sdj | busy 88% | read 1004 | write 503 | KiB/r 958 | KiB/w 505 | MBr/s 94.0 | MBw/s 24.8 | avq 0.66 | avio 5.91 ms |
DSK | sdi | busy 87% | read 1013 | write 499 | KiB/r 949 | KiB/w 509 | MBr/s 94.0 | MBw/s 24.8 | avq 0.65 | avio 5.81 ms |

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md3 : active raid5 sdi1[5] sdl1[6] sdk1[4] sdj1[2]
      46877237760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [=================>...]  resync = 88.5% (13834588672/15625745920) finish=293.1min speed=101843K/sec
      bitmap: 8/59 pages [32KB], 131072KB chunk

Surprisingly, all 4 drives show approximately 94MB/s read and 25MB/s write.
Even though each of the drives can read 270MB/s and write 250MB/s, the sync
speed is only 100MB/s, so?

Does --grow differ from --add?

Thanks,
Jaromir

---------- Original e-mail ----------
From: Roger Heflin <rogerheflin@gmail.com>
To: Wols Lists <antlists@youngman.org.uk>
Date: 11. 1. 2022 1:15:17
Subject: Re: Feature request: Add flag for assuming a new clean drive
completely dirty when adding to a degraded raid5 array in order to increase
the speed of the array rebuild

I just did a "--add" with sdd on a raid6 array missing a volume and here is
what sar shows:

06:08:12 PM sdb 91.03 34615.97 0.36 0.00 380.26 0.41 4.47 30.31
06:08:12 PM sdc 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:08:12 PM sdd 77.12 26.28 34563.36 0.00 448.54 0.64 8.23 27.40
06:08:12 PM sde 36.45 34598.82 0.36 0.00 949.22 1.43 38.78 70.37
06:08:12 PM sdf 46.87 34598.89 0.36 0.00 738.25 1.23 26.13 57.81

06:09:12 PM sda 5.12 0.93 75.33 0.00 14.91 0.01 1.48 0.39
06:09:12 PM sdb 122.57 46819.67 0.40 0.00 382.00 0.54 4.38 35.85
06:09:12 PM sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:09:12 PM sdd 105.92 0.00 46775.73 0.00 441.63 1.12 10.53 35.80
06:09:12 PM sde 48.47 46817.53 0.40 0.00 965.98 1.95 40.00 97.89
06:09:12 PM sdf 56.95 46834.53 0.40 0.00 822.39 1.73 30.32 82.33

06:10:12 PM sda 4.55 1.20 48.20 0.00 10.86 0.01 0.97 0.27
06:10:12 PM sdb 123.67 46616.93 0.40 0.00 376.96 0.52 4.15 34.66
06:10:12 PM sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:10:12 PM sdd 109.82 0.00 46623.40 0.00 424.56 1.30 11.80 36.15
06:10:12 PM sde 49.18 46602.00 0.40 0.00 947.52 1.93 39.17 97.27
06:10:12 PM sdf 54.88 46601.07 0.40 0.00 849.10 1.75 31.82 85.16

06:11:12 PM sda 4.07 1.00 50.80 0.00 12.74 0.01 1.77 0.30
06:11:12 PM sdb 121.93 46363.20 0.40 0.00 380.24 0.51 4.10 34.72
06:11:12 PM sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:11:12 PM sdd 109.58 0.00 46372.47 0.00 423.17 1.37 12.44 35.69
06:11:12 PM sde 49.38 46371.00 0.40 0.00 939.01 1.93 38.88 97.09
06:11:12 PM sdf 55.12 46352.53 0.40 0.00 841.00 1.73 31.39 85.25

06:12:12 PM sda 5.75 14.20 79.05 0.00 16.22 0.01 1.78 0.40
06:12:12 PM sdb 120.73 45994.13 0.40 0.00 380.97 0.51 4.20 34.72
06:12:12 PM sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:12:12 PM sdd 110.95 0.00 45982.87 0.00 414.45 1.43 12.81 35.39
06:12:12 PM sde 49.63 46020.46 0.40 0.00 927.37 1.91 38.39 96.18
06:12:12 PM sdf 54.27 46022.80 0.40 0.00 847.97 1.75 32.14 86.65

So there are very few reads going on for sdd, but a lot of reads of the other
disks to recalculate what the data on that disk should be.

This is on raid6, but if raid6 is not doing a pointless check read on a new
disk add, I would not expect raid5 to be.

This is on a 5.14 kernel.

On Mon, Jan 10, 2022 at 5:15 PM Wols Lists <antlists@youngman.org.uk> wrote:

On 09/01/2022 14:21, Jaromír Cápík wrote:
> In case of huge arrays (48TB in my case) the array rebuild takes a couple of
> days with the current approach even when the array is idle and during that
> time any of the drives could fail causing a fatal data loss.
>
> Does it make at least a bit of sense or my understanding and assumptions
> are wrong?

It does make sense, but have you read the code to see if it already does it?

And if it doesn't, someone's going to have to write it, in which case it
doesn't make sense not to have that as the default.

Bear in mind that rebuilding the array with a new drive is completely
different logic to doing an integrity check, so will need its own code,
so I expect it already works that way.

I think you've got two choices. Firstly, raid or not, you should have
backups! Raid is for high-availability, not for keeping your data safe!
And secondly, go raid-6 which gives you that bit extra redundancy.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread
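[Editor's aside: Roger's sar numbers already show the behaviour the feature request asks for — during the --add rebuild the new disk sdd is essentially write-only, and each surviving member is read at the same rate sdd is written. A quick sanity check on one interval, assuming the sar -d column order tps, rkB/s, wkB/s (values copied from the 06:09:12 sample above):]

```python
# 06:09:12 sample: rkB/s of surviving members vs wkB/s of the new member sdd.
reads  = {"sdb": 46819.67, "sde": 46817.53, "sdf": 46834.53}
writes = {"sdd": 46775.73}

# Every surviving disk is read at the rate the new disk is written, i.e.
# the rebuild streams: read N-1 members, write the reconstruction.
for rate in reads.values():
    assert abs(rate - writes["sdd"]) / writes["sdd"] < 0.01  # within 1%
```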
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-11  9:59 ` Jaromír Cápík
@ 2022-01-11 16:53 ` Roger Heflin
  2022-01-12 18:45 ` Jaromír Cápík
  2022-01-14 14:43 ` Jaromír Cápík
  0 siblings, 2 replies; 14+ messages in thread

From: Roger Heflin @ 2022-01-11 16:53 UTC (permalink / raw)
To: Jaromír Cápík; +Cc: Linux RAID, Wols Lists

I have noticed on simple tests making arrays with tmpfs that Intel cpus seem
to be able to xor at about 2x the speed of AMD. The speed may also vary based
on cpu generation.

Also, grow differs in that blocks get moved around, hence the writes.

On the raid you are building, is there other IO going on to the disks? This
will cause seeks, and the more io there is (outside of the rebuild) the worse
it will be.

Here is everything I set on my arrays:

find /sys -name "*sync_speed_min*" -exec /usr/local/bin/set_sync_speed 15000 {} \;
# MB Intel controller
find /sys/devices/pci0000:00/0000:00:1f.2/ -name "*queue_depth*" -exec /usr/local/bin/set_queue_depth 1 {} \;
find /sys/devices/pci0000:00/0000:00:1f.2/ -name "nr_requests" -exec /usr/local/bin/set_queue_depth 4 {} \;
#
# AMD FM2 MB
find /sys/devices/pci0000:00/0000:00:11.0/ -name "queue_depth" -exec /usr/local/bin/set_queue_depth 8 {} \;
find /sys/devices/pci0000:00/0000:00:11.0/ -name "nr_requests" -exec /usr/local/bin/set_queue_depth 16 {} \;

echo 30000 > /proc/sys/dev/raid/speed_limit_min

for mddev in md13 md14 md15 md16 md17 md18 ; do
    blockdev --setfra 65536 /dev/${mddev}
    blockdev --setra 65536 /dev/${mddev}
    echo 32768 > /sys/block/${mddev}/md/stripe_cache_size
    echo 30000 > /sys/block/${mddev}/md/sync_speed_min
    echo 2 > /sys/block/${mddev}/md/group_thread_cnt
done

You will need to adjust my find/pci* devices to find your device, and you will
need to test some with the queue_depth/nr_requests to see what is best for
your controller/disk combination. You may want to also test different values
with the group_thread_cnt.

The set_queue_depth file (and the set_sync_speed file) looks like this:

cat /usr/local/bin/set_queue_depth
echo $1 > $2

On mine you will notice I have 6 arrays; 4 of those arrays are 3TB disks split
into 4 750GB partitions to minimize the time for a single grow to complete.
The other 2 are the remaining 3TB of space split into 2 1.5TB spaces, also to
minimize the grow time.

I have also found that when a disk fails, often only a single partition gets a
bad block and fails, and so I only have to --re-add/--add one device. And if
the disk has not failed you can do a --replace, so long as you can get the old
and new devices in the chassis. With the multiple partitions it usually means
I only have 1 of 4 partitions that failed in mdadm, and so a re-add gets that
one to work and I can then do the replace, which just reads from the disk it
is replacing and as such is much faster.

I also carefully set up the disk partition naming such that the last digit of
the partitions matches the last digit of the md, ie:

md16 : active raid6 sdh6[10] sdi6[12] sdj6[7] sdg6[9] sde6[1] sdb6[8] sdf6[11]
      3615495680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
      bitmap: 0/6 pages [0KB], 65536KB chunk

as that makes the adding/re-adding simpler, as I know which device it always is.

On Tue, Jan 11, 2022 at 3:59 AM Jaromír Cápík <jaromir.capik@email.cz> wrote:
>
> Hello Roger.
>
> I just ran atop on a different and much better hardware doing mdadm --grow on
> raid5 with 4 drives and it shows the following:
>
> DSK | sdl | busy 90% | read 950 | write 502 | KiB/r 1012 | KiB/w 506 | MBr/s 94.0 | MBw/s 24.9 | avq 1.29 | avio 6.22 ms |
> DSK | sdk | busy 89% | read 968 | write 499 | KiB/r 995 | KiB/w 509 | MBr/s 94.1 | MBw/s 24.8 | avq 0.92 | avio 6.09 ms |
> DSK | sdj | busy 88% | read 1004 | write 503 | KiB/r 958 | KiB/w 505 | MBr/s 94.0 | MBw/s 24.8 | avq 0.66 | avio 5.91 ms |
> DSK | sdi | busy 87% | read 1013 | write 499 | KiB/r 949 | KiB/w 509 | MBr/s 94.0 | MBw/s 24.8 | avq 0.65 | avio 5.81 ms |
>
> Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
> md3 : active raid5 sdi1[5] sdl1[6] sdk1[4] sdj1[2]
>       46877237760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
>       [=================>...]  resync = 88.5% (13834588672/15625745920) finish=293.1min speed=101843K/sec
>       bitmap: 8/59 pages [32KB], 131072KB chunk
>
> Surprisingly, all 4 drives show approximately 94MB/s read and 25MB/s write.
> Even though each of the drives can read 270MB/s and write 250MB/s, the sync
> speed is only 100MB/s, so?
>
> Does --grow differ from --add?
>
> Thanks,
> Jaromir

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-11 16:53 ` Roger Heflin
@ 2022-01-12 18:45 ` Jaromír Cápík
  2022-01-14 14:10 ` Jaromír Cápík
  2022-01-14 14:43 ` Jaromír Cápík
  1 sibling, 1 reply; 14+ messages in thread

From: Jaromír Cápík @ 2022-01-12 18:45 UTC (permalink / raw)
To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

> I have noticed on simple tests making arrays with tmpfs that Intel
> cpus seem to be able to xor at about 2x the speed of AMD. The speed may
> also vary based on cpu generation.

It was Intel in both cases and the CPU loads were low.

> Also, grow differs in that blocks get moved around, hence the writes.

Of course, but even so the speed was poor :]

> On the raid you are building, is there other IO going on to the disks?

Nope, the array was not mounted and no process was touching the MD device
during the rebuild.

> And if the disk has not failed you can do a --replace, so long as you

In the first case there was no space left for another drive ... however,
sometimes I can do the rebuild offline and in such cases I'd rather clone the
content of the old drive with DD. It's much faster.

> I also carefully set up the disk partition naming such that the last
> digit of the partitions matches the last digit of the md, as that makes
> the adding/re-adding simpler as I know which device it always is.

I usually do that on servers where I have multiple raid partitions.

Thx,
Jaromir.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-12 18:45 ` Jaromír Cápík
@ 2022-01-14 14:10 ` Jaromír Cápík
  2022-01-14 17:53 ` Roger Heflin
  0 siblings, 1 reply; 14+ messages in thread

From: Jaromír Cápík @ 2022-01-14 14:10 UTC (permalink / raw)
To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

>> Also, grow differs in that blocks get moved around, hence the writes.
>
> Of course, but even so the speed was poor :]

Just for info, I just tried to DD data on the second (faster) hardware from
one RAID drive to an empty one with MD offline, and the transfer speed is the
following (output translated from the czech locale :) ...

25423+0 records in
25422+0 records out
26656899072 bytes (27 GB, 25 GiB) copied, 105.227 s, 253 MB/s
25903+0 records in
25902+0 records out
27160215552 bytes (27 GB, 25 GiB) copied, 107.235 s, 253 MB/s
26386+0 records in
26385+0 records out
27666677760 bytes (28 GB, 26 GiB) copied, 109.245 s, 253 MB/s

as you can see, the write speed really matches the speed from the drive
datasheet, so ... why was the sync speed only 100MB/s when the CPU load was
low? Can you explain that?

Thanks,
Jaromir

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-14 14:10 ` Jaromír Cápík
@ 2022-01-14 17:53 ` Roger Heflin
  2022-01-17 17:59 ` Jaromír Cápík
  0 siblings, 1 reply; 14+ messages in thread

From: Roger Heflin @ 2022-01-14 17:53 UTC (permalink / raw)
To: Jaromír Cápík; +Cc: Linux RAID, Wols Lists

That is typically 100MB/sec per disk as it is reported, and that is a typical
speed I have seen for a rebuild and/or grow.

There are almost certainly algorithm sync points that constrain the speed to
less than the full streaming speed of all disks.

The algorithm may well be: read the stripe, process the stripe, write out the
new stripe, and start over (in a linear manner). I would expect that to be
the easiest to keep track of, and that would roughly get your speed (it costs
a read from each old disk + a write to the new disk + bookkeeping writes +
the parity calc). Setting up the code such that it overlaps the operations is
going to complicate the code, and as such was likely not done.

And regardless of the client only being able to run raid5, there are
significant risks to running raid5. If on the rebuild you find a bad block on
one of the other disks then you have lost data, and that is very likely to
happen (that exact failure was the first raid failure I saw 28+ years ago).

How often are you replacing/rebuilding the disks and why?

On Fri, Jan 14, 2022 at 8:10 AM Jaromír Cápík <jaromir.capik@email.cz> wrote:
>
> Hello Roger.
>
> >> Also, grow differs in that blocks get moved around, hence the writes.
> >
> > Of course, but even so the speed was poor :]
>
> Just for info, I just tried to DD data on the second (faster) hardware from
> one RAID drive to an empty one with MD offline, and the transfer speed is
> the following (output translated from the czech locale :) ...
>
> 25423+0 records in
> 25422+0 records out
> 26656899072 bytes (27 GB, 25 GiB) copied, 105.227 s, 253 MB/s
> 25903+0 records in
> 25902+0 records out
> 27160215552 bytes (27 GB, 25 GiB) copied, 107.235 s, 253 MB/s
> 26386+0 records in
> 26385+0 records out
> 27666677760 bytes (28 GB, 26 GiB) copied, 109.245 s, 253 MB/s
>
> as you can see, the write speed really matches the speed from the drive
> datasheet, so ... why was the sync speed only 100MB/s when the CPU load was
> low? Can you explain that?
>
> Thanks,
> Jaromir

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-14 17:53 ` Roger Heflin
@ 2022-01-17 17:59 ` Jaromír Cápík
  2022-01-17 19:59   ` Wol
  0 siblings, 1 reply; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-17 17:59 UTC (permalink / raw)
To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

> That is typically 100 MB/sec per disk as it is reported, and that is a
> typical speed I have seen for a rebuild and/or grow.
>
> There are almost certainly algorithm sync points that constrain the
> speed to less than the full streaming speed of all disks.
>
> The algorithm may well be: read the stripe, process the stripe, write
> out the new stripe, and start over (in a linear manner). I would expect
> that to be the easiest to keep track of, and it would roughly explain
> your speed (it costs a read from each old disk + a write to the new
> disk + bookkeeping writes + the parity calc). Setting up the code so
> that it overlaps those operations would complicate it, and as such it
> was likely not done.

Yeah, I'm pretty sure the current behavior is suboptimal just because
it was easier to implement. And ... surprisingly ... this feature
request is my clumsy attempt to convince someone amazing and clever to
change that, because ... we love Linux and wanna see it rock! Right? :D

> And regardless of the client only being able to run raid5, there are
> significant risks to running raid5. If during the rebuild you find a
> bad block on one of the other disks, then you have lost data, and that
> is very likely to happen (that exact failure was the first RAID failure
> I saw 28+ years ago).

I'm aware of the risks ... but losing a file or two is still much better
than losing the whole array just because the low sync speed forces you
to operate the array in degraded mode for 3 days instead of 1 day.
Making it faster seems quite important / reasonable to me.

> How often are you replacing/rebuilding the disks, and why?

A few times a year, for different reasons. Usually requests for higher
capacity, where I need to replace all drives one by one and then grow
the array. Sometimes reallocated sectors appear in the SMART output, and
I never leave such drives in the array, considering them unreliable. The
--replace feature is nice, but often there's no room for one more drive
in the chassis, and going that way requires an external USB3 rack and a
bit of magic if the operation cannot be done offline.

So, I still hope someone will find enough courage one day to implement
the new optional sync strategy :)

BR,
J.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-17 17:59 ` Jaromír Cápík
@ 2022-01-17 19:59 ` Wol
  2022-01-18 12:45   ` Jaromír Cápík
  0 siblings, 1 reply; 14+ messages in thread
From: Wol @ 2022-01-17 19:59 UTC (permalink / raw)
To: Jaromír Cápík, Roger Heflin; +Cc: Linux RAID

On 17/01/2022 17:59, Jaromír Cápík wrote:
> A few times a year, for different reasons. Usually requests for higher
> capacity, where I need to replace all drives one by one and then grow
> the array. Sometimes reallocated sectors appear in the SMART output, and
> I never leave such drives in the array, considering them unreliable. The
> --replace feature is nice, but often there's no room for one more drive
> in the chassis, and going that way requires an external USB3 rack and a
> bit of magic if the operation cannot be done offline.

Have you seen the stuff about running raid over USB? Very unwise.

Do you have room for an eSATA card? If so, get an external SATA cage:
you can swap a drive out into the cage, rebuild the array with
--replace, and repeat. Much safer, probably much quicker, and no extra
work shutting down the system to replace each drive in turn.

And if your chassis is hot-swap, then all the better, no downtime at
all. Just a few moments' danger each time you physically swap a drive.

> So, I still hope someone will find enough courage one day to implement
> the new optional sync strategy :)

There is the argument that increasing the load on the drive increases
the risk to the drive ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-17 19:59 ` Wol
@ 2022-01-18 12:45 ` Jaromír Cápík
  0 siblings, 0 replies; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-18 12:45 UTC (permalink / raw)
To: Wol; +Cc: Linux RAID, Roger Heflin

>> chassis, and going that way requires an external USB3 rack and a bit of
>> magic if the operation cannot be done offline.
>
> Have you seen the stuff about running raid over USB? Very unwise.

Nope, URL? I mirror the drive in my Intel NUC router to a USB3 drive,
and it has worked for years with no problems. Of course I'm aware of the
lower reliability, but I hope I'll get I/O errors if something fails.
Am I wrong?

Thx,
J.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild
  2022-01-11 16:53 ` Roger Heflin
  2022-01-12 18:45 ` Jaromír Cápík
@ 2022-01-14 14:43 ` Jaromír Cápík
  1 sibling, 0 replies; 14+ messages in thread
From: Jaromír Cápík @ 2022-01-14 14:43 UTC (permalink / raw)
To: Roger Heflin; +Cc: Linux RAID, Wols Lists

Hello Roger.

> I have noticed on simple tests making arrays with tmpfs that Intel
> cpus seem to be able to xor about 2x the speed of AMD. The speed may
> also vary based on cpu generation.

I forgot to mention I did some tests here with -O2 optimized C code, and
one core of the CPU can XOR approximately 7.11 GB of data per second,
so ... not really a bottleneck here.

Regards,
Jaromir

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2022-01-18 12:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-09 14:21 Feature request: Add flag for assuming a new clean drive completely dirty when adding to a degraded raid5 array in order to increase the speed of the array rebuild Jaromír Cápík
2022-01-10  9:00 ` Wols Lists
2022-01-10 13:38   ` Jaromír Cápík
2022-01-10 14:07     ` Wols Lists
2022-01-11 12:18       ` Jaromír Cápík
     [not found] ` <CAAMCDec5kcK62enZCOh=SJZu0fecSV60jW8QjMierC147HE5bA@mail.gmail.com>
2022-01-11  9:59   ` Jaromír Cápík
2022-01-11 16:53     ` Roger Heflin
2022-01-12 18:45       ` Jaromír Cápík
2022-01-14 14:10         ` Jaromír Cápík
2022-01-14 17:53           ` Roger Heflin
2022-01-17 17:59             ` Jaromír Cápík
2022-01-17 19:59               ` Wol
2022-01-18 12:45                 ` Jaromír Cápík
2022-01-14 14:43       ` Jaromír Cápík