* mdcheck: slow system issues
@ 2020-03-30 12:18 Paul Menzel
  2020-03-30 13:27 ` Reindl Harald
  2020-03-31 10:53 ` Peter Grandi
  0 siblings, 2 replies; 6+ messages in thread
From: Paul Menzel @ 2020-03-30 12:18 UTC (permalink / raw)
  To: linux-raid

Dear Linux folks,


When `mdcheck` runs on two 100 TB software RAIDs our users complain 
about being unable to open files in a reasonable time.

> $ uname -a
> Linux handsomejack.molgen.mpg.de 4.19.57.mx64.276 #1 SMP Wed Jul 3 15:15:22 CEST 2019 x86_64 GNU/Linux

> $ more /proc/mdstat 
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] 
> md1 : active raid6 sdab[0] sdac[15] sdad[14] sdae[13] sdag[12] sdah[11] sdaf[10] sdai[9] sdu[8] sdt[7] sdv[6] sdw[5] sdx[4] sdy[3] sdaa[2] sdz[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       bitmap: 0/59 pages [0KB], 65536KB chunk
> 
> md0 : active raid6 sde[0] sds[15] sdr[14] sdp[13] sdq[12] sdo[11] sdn[10] sdl[9] sdm[8] sdk[7] sdj[6] sdh[5] sdi[4] sdg[3] sdf[2] sdd[1]
>       109394532352 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       bitmap: 2/59 pages [8KB], 65536KB chunk
> 
> unused devices: <none>

> $ lspci -nn | grep -i RAID
> 03:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] [1000:005d] (rev 02)

> $ sysctl dev.raid.speed_limit_min
> dev.raid.speed_limit_min = 1000
> $ sysctl dev.raid.speed_limit_max
> dev.raid.speed_limit_max = 200000

> $ more /etc/cron.d/mdcheck
> 0 18 * * Fri              root /usr/bin/mdcheck --duration "Mon 06:00"
> 0 18 * * Mon,Tue,Wed,Thu  root /usr/bin/mdcheck --continue --duration "Tomorrow 06:00"

> $ dmesg | tail -4
> [Fri Mar 27 17:58:58 2020] md: data-check of RAID array md1
> [Fri Mar 27 17:58:58 2020] md: data-check of RAID array md0
> [Sat Mar 28 18:50:20 2020] md: md1: data-check done.
> [Sat Mar 28 22:33:33 2020] md: md0: data-check done.

During that time only four threads of the CPU are used.

The article *Software RAID check - slow system issues* [1] recommends 
lowering `dev.raid.speed_limit_max`, but the RAID should easily be able 
to sustain 200 MB/s, as our tests showed over 600 MB/s in some benchmarks.
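
Back-of-the-envelope arithmetic makes the trade-off concrete. A check
scans each member device once, so the per-device "Used Dev Size" from
`mdadm -D` (7813895168 KiB, quoted below) divided by the sync speed
gives a lower bound on the check duration; an illustrative Python
sketch, using the values from this thread:

```python
# Lower-bound estimate of a full data-check's duration at a given
# sustained sync speed. md scans each member device once, so the
# relevant size is the per-device "Used Dev Size", not the array size.

USED_DEV_SIZE_KIB = 7813895168  # from `mdadm -D /dev/md0`, in KiB

def check_duration_hours(speed_kib_per_s: int) -> float:
    """Hours needed to scan one member device at the given speed."""
    return USED_DEV_SIZE_KIB / speed_kib_per_s / 3600

# At the configured dev.raid.speed_limit_max of 200000 KiB/s:
print(f"{check_duration_hours(200_000):.1f} h")  # ~10.9 h, best case
# Lowering the cap to 50000 KiB/s quadruples the minimum duration:
print(f"{check_duration_hours(50_000):.1f} h")   # ~43.4 h
```

The ~29 h that `dmesg` reports for md0 suggests the check already runs
well below the configured ceiling on average.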

How do you run `mdcheck` in production without noticeably affecting the 
system?


Kind regards,

Paul


[1]: 
https://www.alttechnical.com/knowledge-base/linux/126-software-raid-check-slow-system-issues


PS: Details:

> $ sudo mdadm -D /dev/md0
> /dev/md0:
>            Version : 1.2
>      Creation Time : Mon Jul 30 11:44:29 2018
>         Raid Level : raid6
>         Array Size : 109394532352 (104326.76 GiB 112020.00 GB)
>      Used Dev Size : 7813895168 (7451.91 GiB 8001.43 GB)
>       Raid Devices : 16
>      Total Devices : 16
>        Persistence : Superblock is persistent
> 
>      Intent Bitmap : Internal
> 
>        Update Time : Mon Mar 30 13:51:44 2020
>              State : active 
>     Active Devices : 16
>    Working Devices : 16
>     Failed Devices : 0
>      Spare Devices : 0
> 
>             Layout : left-symmetric
>         Chunk Size : 512K
> 
> Consistency Policy : bitmap
> 
>               Name : M8015
>               UUID : 0569ef24:5868e228:ca17105b:ba673204
>             Events : 446871
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       64        0      active sync   /dev/sde
>        1       8       48        1      active sync   /dev/sdd
>        2       8       80        2      active sync   /dev/sdf
>        3       8       96        3      active sync   /dev/sdg
>        4       8      128        4      active sync   /dev/sdi
>        5       8      112        5      active sync   /dev/sdh
>        6       8      144        6      active sync   /dev/sdj
>        7       8      160        7      active sync   /dev/sdk
>        8       8      192        8      active sync   /dev/sdm
>        9       8      176        9      active sync   /dev/sdl
>       10       8      208       10      active sync   /dev/sdn
>       11       8      224       11      active sync   /dev/sdo
>       12      65        0       12      active sync   /dev/sdq
>       13       8      240       13      active sync   /dev/sdp
>       14      65       16       14      active sync   /dev/sdr
>       15      65       32       15      active sync   /dev/sds

> $ sudo mdadm -D /dev/md1
> /dev/md1:
>            Version : 1.2
>      Creation Time : Wed Mar  6 13:56:48 2019
>         Raid Level : raid6
>         Array Size : 109394518016 (104326.74 GiB 112019.99 GB)
>      Used Dev Size : 7813894144 (7451.91 GiB 8001.43 GB)
>       Raid Devices : 16
>      Total Devices : 16
>        Persistence : Superblock is persistent
> 
>      Intent Bitmap : Internal
> 
>        Update Time : Mon Mar 30 03:49:21 2020
>              State : clean 
>     Active Devices : 16
>    Working Devices : 16
>     Failed Devices : 0
>      Spare Devices : 0
> 
>             Layout : left-symmetric
>         Chunk Size : 512K
> 
> Consistency Policy : bitmap
> 
>               Name : M8027
>               UUID : fdb36dce:6e2dfdaa:853cb1a1:402a9a9a
>             Events : 48917
> 
>     Number   Major   Minor   RaidDevice State
>        0      65      176        0      active sync   /dev/sdab
>        1      65      144        1      active sync   /dev/sdz
>        2      65      160        2      active sync   /dev/sdaa
>        3      65      128        3      active sync   /dev/sdy
>        4      65      112        4      active sync   /dev/sdx
>        5      65       96        5      active sync   /dev/sdw
>        6      65       80        6      active sync   /dev/sdv
>        7      65       48        7      active sync   /dev/sdt
>        8      65       64        8      active sync   /dev/sdu
>        9      66       32        9      active sync   /dev/sdai
>       10      65      240       10      active sync   /dev/sdaf
>       11      66       16       11      active sync   /dev/sdah
>       12      66        0       12      active sync   /dev/sdag
>       13      65      224       13      active sync   /dev/sdae
>       14      65      208       14      active sync   /dev/sdad
>       15      65      192       15      active sync   /dev/sdac

> $ lscpu
> Architecture:                    x86_64
> CPU op-mode(s):                  32-bit, 64-bit
> Byte Order:                      Little Endian
> Address sizes:                   46 bits physical, 48 bits virtual
> CPU(s):                          12
> On-line CPU(s) list:             0-11
> Thread(s) per core:              1
> Core(s) per socket:              6
> Socket(s):                       2
> NUMA node(s):                    2
> Vendor ID:                       GenuineIntel
> CPU family:                      6
> Model:                           79
> Model name:                      Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
> Stepping:                        1
> CPU MHz:                         1698.649
> CPU max MHz:                     1700.0000
> CPU min MHz:                     1200.0000
> BogoMIPS:                        3396.26
> Virtualization:                  VT-x
> L1d cache:                       384 KiB
> L1i cache:                       384 KiB
> L2 cache:                        3 MiB
> L3 cache:                        30 MiB
> NUMA node0 CPU(s):               0,2,4,6,8,10
> NUMA node1 CPU(s):               1,3,5,7,9,11
> Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
> Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
> Vulnerability Meltdown:          Mitigation; PTI
> Vulnerability Spec store bypass: Vulnerable
> Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
> Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
> Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm p
>                                  be syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmpe
>                                  rf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic mo
>                                  vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpc
>                                  id_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid r
>                                  tm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mdcheck: slow system issues
  2020-03-30 12:18 mdcheck: slow system issues Paul Menzel
@ 2020-03-30 13:27 ` Reindl Harald
  2020-03-30 13:38   ` Roman Mamedov
  2020-03-31 10:53 ` Peter Grandi
  1 sibling, 1 reply; 6+ messages in thread
From: Reindl Harald @ 2020-03-30 13:27 UTC (permalink / raw)
  To: Paul Menzel, linux-raid



On 2020-03-30 14:18, Paul Menzel wrote:
> How do you run `mdcheck` in production without noticeably affecting the
> system?

You can't, and haven't been able to for years.

Either lower "dev.raid.speed_limit_max" and wait ages for the RAID
check, or cripple system performance.

I remember that ten years ago "dev.raid.speed_limit_max" did what it is
supposed to do: run the check at that speed when the system is idle, but
slow it down in case of heavy user I/O.

But the current drama has existed for many years.


* Re: mdcheck: slow system issues
  2020-03-30 13:27 ` Reindl Harald
@ 2020-03-30 13:38   ` Roman Mamedov
  0 siblings, 0 replies; 6+ messages in thread
From: Roman Mamedov @ 2020-03-30 13:38 UTC (permalink / raw)
  To: Reindl Harald; +Cc: Paul Menzel, linux-raid

On Mon, 30 Mar 2020 15:27:13 +0200
Reindl Harald <h.reindl@thelounge.net> wrote:

> 
> 
> On 2020-03-30 14:18, Paul Menzel wrote:
> > How do you run `mdcheck` in production without noticeably affecting the
> > system?
> 
> You can't, and haven't been able to for years.
> 
> Either lower "dev.raid.speed_limit_max" and wait ages for the RAID
> check, or cripple system performance.
> 
> I remember that ten years ago "dev.raid.speed_limit_max" did what it is
> supposed to do: run the check at that speed when the system is idle, but
> slow it down in case of heavy user I/O.
> 
> But the current drama has existed for many years.

This still reverts cleanly on current kernels, and I believe it is the
cause of what you are describing. It would be nice if you could check
whether that's indeed the case. (Apply "patch -R" with the patch below.)

---


From ac8fa4196d205ac8fff3f8932bddbad4f16e4110 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 19 Feb 2015 16:55:00 +1100
Subject: md: allow resync to go faster when there is competing IO.

When md notices non-sync IO happening while it is trying
to resync (or reshape or recover) it slows down to the
set minimum.

The default minimum might have made sense many years ago
but the drives have become faster.  Changing the default
to match the times isn't really a long term solution.

This patch changes the code so that instead of waiting until the speed
has dropped to the target, it just waits until pending requests
have completed.
This means that the delay inserted is a function of the speed
of the devices.

Testing shows that:
 - for some loads, the resync speed is unchanged.  For those loads
   increasing the minimum doesn't change the speed either.
   So this is a good result.  To increase resync speed under such
   loads we would probably need to increase the resync window
   size.

 - for other loads, resync speed does increase to a reasonable
   fraction (e.g. 20%) of maximum possible, and throughput of
   the load only drops a little bit (e.g. 10%)

 - for other loads, throughput of the non-sync load drops quite a bit
   more.  These seem to be latency-sensitive loads.

So it isn't a perfect solution, but it is mostly an improvement.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3b9b032..d4f31e1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7880,11 +7880,18 @@ void md_do_sync(struct md_thread *thread)
 			/((jiffies-mddev->resync_mark)/HZ +1) +1;
 
 		if (currspeed > speed_min(mddev)) {
-			if ((currspeed > speed_max(mddev)) ||
-					!is_mddev_idle(mddev, 0)) {
+			if (currspeed > speed_max(mddev)) {
 				msleep(500);
 				goto repeat;
 			}
+			if (!is_mddev_idle(mddev, 0)) {
+				/*
+				 * Give other IO more of a chance.
+				 * The faster the devices, the less we wait.
+				 */
+				wait_event(mddev->recovery_wait,
+					   !atomic_read(&mddev->recovery_active));
+			}
 		}
 	}
 	printk(KERN_INFO "md: %s: %s %s.\n",mdname(mddev), desc,
-- 
cgit v1.1


* Re: mdcheck: slow system issues
  2020-03-30 12:18 mdcheck: slow system issues Paul Menzel
  2020-03-30 13:27 ` Reindl Harald
@ 2020-03-31 10:53 ` Peter Grandi
  2020-03-31 12:14   ` Phil Turmel
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Grandi @ 2020-03-31 10:53 UTC (permalink / raw)
  To: Linux RAID

> Dear Linux folks, When `mdcheck` runs on two 100 TB software
> RAIDs our users complain about being unable to open files in a
> reasonable time. [...]
>       109394518016 blocks super 1.2 level 6, 512k chunk,
> algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]

Unsurprisingly it is a 16-wide RAID6 of 8TB HDDs.

> [...] The article *Software RAID check - slow system issues*
> [1] recommends to lower `dev.raid.speed_limit_max`, but the
> RAID should easily be able to do 200 MB/s as our tests show
> over 600 MB/s during some benchmarks.

Many people have to find out the hard way that on HDDs
sequential and random IO rates differ by "up to" two orders of
magnitude, and that RAID6 gives an "interesting" tradeoff
between read and write speed with random vs. sequential access.

> How do you run `mdcheck` in production without noticeably
> affecting the system?

Fortunately the only solution that works well is quite simple:
replace the storage system with one with much increased
IOPS-per-TB (that is SSDs or much smaller HDDs, 1TB or less)
*and* switch from RAID6 to RAID10.


* Re: mdcheck: slow system issues
  2020-03-31 10:53 ` Peter Grandi
@ 2020-03-31 12:14   ` Phil Turmel
  2020-04-01 19:50     ` Peter Grandi
  0 siblings, 1 reply; 6+ messages in thread
From: Phil Turmel @ 2020-03-31 12:14 UTC (permalink / raw)
  To: Peter Grandi, Linux RAID

On 3/31/20 6:53 AM, Peter Grandi wrote:
>> Dear Linux folks, When `mdcheck` runs on two 100 TB software
>> RAIDs our users complain about being unable to open files in a
>> reasonable time. [...]
>>        109394518016 blocks super 1.2 level 6, 512k chunk,
>> algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
> 
> Unsurprisingly it is a 16-wide RAID6 of 8TB HDDs.

With a 512k chunk.  Definitely not suitable for anything but large media 
file streaming.

>> [...] The article *Software RAID check - slow system issues*
>> [1] recommends to lower `dev.raid.speed_limit_max`, but the
>> RAID should easily be able to do 200 MB/s as our tests show
>> over 600 MB/s during some benchmarks.
> 
> Many people have to find out the hard way that on HDDs
> sequential and random IO rates differ by "up to" two orders of
> magnitude, and that RAID6 gives an "interesting" tradeoff
> between read and write speed with random vs. sequential access.

The random/streaming threshold is proportional to the address stride on 
one device: the gap in array sector numbers between one chunk and the 
next chunk stored on that same device, which is approximately 
chunk * (n-2). With so many member devices, the transition from 
random-access performance to streaming performance therefore requires 
much larger accesses.

I configure any raid6 that might have some random loads with a 16k or 
32k chunk size.
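
The stride figure above can be worked out for the arrays in this thread
(16 devices, RAID6, so n-2 = 14 data chunks per stripe); an illustrative
calculation:

```python
# Address stride on one member device, in the sense used above: the
# gap in array addresses between one chunk stored on a device and the
# next chunk stored on that same device, roughly chunk * (n - 2) for
# an n-device RAID6.

def stride_kib(chunk_kib: int, n_devices: int) -> int:
    return chunk_kib * (n_devices - 2)

print(stride_kib(512, 16))  # 7168 KiB = 7 MiB with the current 512k chunk
print(stride_kib(32, 16))   # 448 KiB with a 32k chunk
print(stride_kib(16, 16))   # 224 KiB with a 16k chunk
```

So with the 512k chunk, an access has to span on the order of 7 MiB
before it starts to look sequential to every member device.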

Finally, the stripe cache size should be optimized on the system in 
question.  More is generally better, unless it starves the OS of 
buffers.  Adjust and test, with real loads.
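
The memory cost of a larger stripe cache can be budgeted with the
commonly cited rule of thumb of one page per member device per cache
entry (an approximation; the kernel's exact accounting may differ
slightly):

```python
# Approximate memory consumed by the md stripe cache:
# stripe_cache_size entries x PAGE_SIZE x number of member devices.
# (Rule-of-thumb accounting, not exact kernel bookkeeping.)

PAGE_SIZE = 4096  # bytes, typical on x86_64

def stripe_cache_bytes(stripe_cache_size: int, n_devices: int) -> int:
    return stripe_cache_size * PAGE_SIZE * n_devices

# Default of 256 entries on a 16-device array:
print(stripe_cache_bytes(256, 16) // 2**20, "MiB")   # 16 MiB
# A much larger value of 8192 entries:
print(stripe_cache_bytes(8192, 16) // 2**20, "MiB")  # 512 MiB
```

That is the kind of budget to keep in mind while adjusting and testing,
so the cache does not starve the OS of buffers.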

>> How do you run `mdcheck` in production without noticeably
>> affecting the system?
> 
> Fortunately the only solution that works well is quite simple:
> replace the storage system with one with much increased
> IOPS-per-TB (that is SSDs or much smaller HDDs, 1TB or less)
> *and* switch from RAID6 to RAID10.

These are good choices too, though not cheap.

Phil

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mdcheck: slow system issues
  2020-03-31 12:14   ` Phil Turmel
@ 2020-04-01 19:50     ` Peter Grandi
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Grandi @ 2020-04-01 19:50 UTC (permalink / raw)
  To: Linux RAID

>> Unsurprisingly it is a 16-wide RAID6 of 8TB HDDs.

> With a 512k chunk.  Definitely not suitable for anything but
> large media file streaming. [...] The random/streaming
> threshold is proportional to the address stride on one
> device--the raid sector number gap between one chunk and the
> next chunk on that (approximately). [...] I configure any
> raid6 that might have some random loads with a 16k or 32k
> chunk size.

That is actually rather controversial: I have read both arguments like
this one and the opposite argument, that sequential performance is much
better with small chunk sizes, because then even sequential access is
striped:

* Consider a 512KiB chunk size with 64KiB reads: 8 successive
  reads will be sequentially from the same disk, so top speed
  will be that of a single disk.

* Consider a 16KiB chunk size with 4 data disks with 64KiB
  reads: each read will be spread in parallel over all 4 disks.

The rationale for large chunk sizes is that they minimize time wasted
on rotational latency: when reading 64KiB from 4 drives with a 16KiB
chunk size, the 64KiB block only becomes available once all four chunks
have finished reading. Because in most RAID types the drives are not
synchronized, each chunk will on average be at a different rotational
position, potentially a full rotation apart but often half a rotation
apart; that is, each read carries an overhead of up to ~8ms of extra
rotational latency, which is pretty huge. Some more detailed discussion
here:

  http://www.sabi.co.uk/blog/12-thr.html?120310#120310
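
The overhead described above can be sketched with a toy model: if a
read completes only after k unsynchronized drives have each reached
their chunk, and each drive's wait is an independent, uniformly random
fraction of a rotation, the expected worst-case wait is k/(k+1) of a
rotation. This is an illustrative model, not a measurement:

```python
# Expected extra rotational latency when a read must wait for k
# unsynchronized drives. Modelling each per-drive wait as uniform on
# [0, T] (T = one rotation), the expected maximum of k such waits is
# T * k / (k + 1).

def expected_max_rotational_wait_ms(rpm: float, k: int) -> float:
    rotation_ms = 60_000 / rpm
    return rotation_ms * k / (k + 1)

# One 7200 rpm drive: half a rotation on average, ~4.2 ms.
print(f"{expected_max_rotational_wait_ms(7200, 1):.1f} ms")
# Four drives (16KiB chunks serving one 64KiB read): ~6.7 ms,
# approaching a full rotation (~8.3 ms at 7200 rpm).
print(f"{expected_max_rotational_wait_ms(7200, 4):.1f} ms")
```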

Multithreading, block device read-ahead, various types of alternative
RAID layouts, etc. complicate things, and in some small experiments I
have done over the years the results were inconclusive, except that
really large chunk sizes seemed worse than smaller ones.

> Finally, the stripe cache size should be optimized on the
> system in question.  More is generally better, unless it
> starves the OS of buffers.

Indeed the stripe cache size matters a great deal to a 16-wide RAID6,
and that's a good point, but it is secondary to the storage system
having been designed for high latency during mixed read-write workloads
with even a minimal degree of "random" access or multithreading.

As to other secondary palliatives, the "unable to open files in
a reasonable time" case often can be made less bad in two other
ways:

* Often the (terrible) Linux block layer has default settings
  that result in enormous amounts of unsynced data in memory,
  and when that eventually is synced to disk, it can create huge
  congestion. This can also happen with hw RAID host adapters
  with onboard caches (in many cases very badly managed by their
  firmware).

* The default disk schedulers (in particular 'cfq') tend to prefer
  reads over writes, and this can result in large delays, especially if
  'atime' is set (impacting 'open's) or 'mtime' updates hit directories
  when 'creat'ing files. Using 'deadline' with tighter settings for
  "write_expire" and/or "writes_starved" might help.

But nothing other than a simple, quick replacement of the
storage system can work around a storage system designed to
minimize the IOPS-per-TB rate below the combined requirements of
the workload of 'mdcheck' (or backup) and the live workloads.

