linux-raid.vger.kernel.org archive mirror
* unexpected 'mdadm -S' hang with I/O pressure testing
@ 2020-09-12 14:06 Coly Li
  2020-09-12 14:21 ` Coly Li
  2020-09-14  0:03 ` Roger Heflin
  0 siblings, 2 replies; 4+ messages in thread
From: Coly Li @ 2020-09-12 14:06 UTC (permalink / raw)
  To: linux-raid; +Cc: Guoqing Jiang, Song Liu, xiao ni

Unexpected Behavior:
- With Linux v5.9-rc4 mainline kernel and latest mdadm upstream code
- After running fio with 10 jobs, iodepth 16 and 64K block size for a
while, trying to stop the fio process with 'Ctrl + c' leaves the main
fio process hanging.
- Then trying to stop the md raid5 array with 'mdadm -S /dev/md0' makes
the mdadm process hang as well.
- After rebooting the system with 'echo b > /proc/sysrq-trigger', the md
raid5 array is assembled but inactive. /proc/mdstat shows:
	Personalities : [raid6] [raid5] [raid4]
	md127 : inactive sdc[0] sde[3] sdd[1]
	      35156259840 blocks super 1.2

Expectation:
- The fio process can stop with 'Ctrl + c'
- The raid5 array can be stopped by 'mdadm -S /dev/md0'
- The md raid5 array should continue to work (resync and become
active) after the reboot


How to reproduce:
1) Create an md raid5 array with 3 hard drives (each a 12TB SATA spinning disk)
  # mdadm -C /dev/md0 -l 5 -n 3 /dev/sd{c,d,e}
  # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde[3] sdd[1] sdc[0]
      23437506560 blocks super 1.2 level 5, 512k chunk, algorithm 2
[3/2] [UU_]
      [>....................]  recovery =  0.0% (2556792/11718753280)
finish=5765844.7min speed=33K/sec
      bitmap: 2/88 pages [8KB], 65536KB chunk
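
For reference, the array state during the initial recovery can also be
checked with the following (just a sketch, assuming the same /dev/md0
device as above):
  # mdadm --detail /dev/md0
  # watch -n 5 cat /proc/mdstat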

2) Run fio for random write on the raid5 array
  fio job file content:
[global]
thread=1
ioengine=libaio
random_generator=tausworthe64

[job]
filename=/dev/md0
readwrite=randwrite
blocksize=64K
numjobs=10
iodepth=16
runtime=1m
  # fio ./raid5.fio
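
(The same job can also be expressed as a single fio command line; a
rough, untested equivalent of the job file above:)
  # fio --name=job --thread=1 --ioengine=libaio \
        --random_generator=tausworthe64 --filename=/dev/md0 \
        --rw=randwrite --bs=64K --numjobs=10 --iodepth=16 --runtime=1m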

3) Wait for 10 seconds after the above fio runs, then type 'Ctrl + c' to
stop the fio process:
x:/home/colyli/fio_test/raid5 # fio ./raid5.fio
job: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB,
(T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=16
...
fio-3.23-10-ge007
Starting 12 threads
^Cbs: 12 (f=12): [w(12)][3.3%][w=6080KiB/s][w=95 IOPS][eta 14m:30s]
fio: terminating on signal 2
^C
fio: terminating on signal 2
^C
fio: terminating on signal 2
Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][7.5%][eta 14m:20s]
^C
fio: terminating on signal 2
Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][70.5%][eta 15m:00s]

Now the fio process hangs forever.

4) Try to stop the md raid5 array with mdadm
  # mdadm -S /dev/md0
  Now the mdadm process hangs forever
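
(While the processes are stuck like this, the following may help show
where they are blocked in the kernel; just a sketch, assuming root and
that sysrq is enabled:)
  # echo w > /proc/sysrq-trigger       # dump blocked (D state) tasks to dmesg
  # cat /proc/$(pidof mdadm)/stack     # kernel stack of the hung mdadm process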


Kernel versions to reproduce:
- Use the latest upstream mdadm source code
- I tried Linux v5.9-rc4 and Linux v4.12; both of them stably
reproduce the above unexpected behavior.
  Therefore I assume kernels at least from v4.12 to v5.9 may have this issue.

Just for your information; I hope you can have a look into it. Thanks in
advance.

Coly Li



* Re: unexpected 'mdadm -S' hang with I/O pressure testing
  2020-09-12 14:06 unexpected 'mdadm -S' hang with I/O pressure testing Coly Li
@ 2020-09-12 14:21 ` Coly Li
  2020-09-12 14:59   ` Xiao Ni
  2020-09-14  0:03 ` Roger Heflin
  1 sibling, 1 reply; 4+ messages in thread
From: Coly Li @ 2020-09-12 14:21 UTC (permalink / raw)
  To: linux-raid; +Cc: Guoqing Jiang, Song Liu, xiao ni

One thing to correct: the hang is not forever - after I posted the
previous email, all the commands returned and the array stopped. It took
around 40 minutes -- still quite unexpected and suspicious.
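
In hindsight it might be worth watching whether the member disks keep
writing while the commands appear hung, to tell a real stall from slow
draining; a sketch, assuming sysstat's iostat and the same disks:
  # iostat -x sdc sdd sde 5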

Thanks.

Coly Li

On 2020/9/12 22:06, Coly Li wrote:
> Unexpected Behavior:
> - With Linux v5.9-rc4 mainline kernel and latest mdadm upstream code
> - After running fio with 10 jobs, iodepth 16 and 64K block size for a
> while, trying to stop the fio process with 'Ctrl + c' leaves the main
> fio process hanging.
> - Then trying to stop the md raid5 array with 'mdadm -S /dev/md0' makes
> the mdadm process hang as well.
> - After rebooting the system with 'echo b > /proc/sysrq-trigger', the md
> raid5 array is assembled but inactive. /proc/mdstat shows:
> 	Personalities : [raid6] [raid5] [raid4]
> 	md127 : inactive sdc[0] sde[3] sdd[1]
> 	      35156259840 blocks super 1.2
> 
> Expectation:
> - The fio process can stop with 'Ctrl + c'
> - The raid5 array can be stopped by 'mdadm -S /dev/md0'
> - The md raid5 array should continue to work (resync and become
> active) after the reboot
> 
> 
> How to reproduce:
> 1) Create an md raid5 array with 3 hard drives (each a 12TB SATA spinning disk)
>   # mdadm -C /dev/md0 -l 5 -n 3 /dev/sd{c,d,e}
>   # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 sde[3] sdd[1] sdc[0]
>       23437506560 blocks super 1.2 level 5, 512k chunk, algorithm 2
> [3/2] [UU_]
>       [>....................]  recovery =  0.0% (2556792/11718753280)
> finish=5765844.7min speed=33K/sec
>       bitmap: 2/88 pages [8KB], 65536KB chunk
> 
> 2) Run fio for random write on the raid5 array
>   fio job file content:
> [global]
> thread=1
> ioengine=libaio
> random_generator=tausworthe64
> 
> [job]
> filename=/dev/md0
> readwrite=randwrite
> blocksize=64K
> numjobs=10
> iodepth=16
> runtime=1m
>   # fio ./raid5.fio
> 
> 3) Wait for 10 seconds after the above fio runs, then type 'Ctrl + c' to
> stop the fio process:
> x:/home/colyli/fio_test/raid5 # fio ./raid5.fio
> job: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB,
> (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=16
> ...
> fio-3.23-10-ge007
> Starting 12 threads
> ^Cbs: 12 (f=12): [w(12)][3.3%][w=6080KiB/s][w=95 IOPS][eta 14m:30s]
> fio: terminating on signal 2
> ^C
> fio: terminating on signal 2
> ^C
> fio: terminating on signal 2
> Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][7.5%][eta 14m:20s]
> ^C
> fio: terminating on signal 2
> Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][70.5%][eta 15m:00s]
> 
> Now the fio process hangs forever.
> 
> 4) Try to stop the md raid5 array with mdadm
>   # mdadm -S /dev/md0
>   Now the mdadm process hangs forever
> 
> 
> Kernel versions to reproduce:
> - Use the latest upstream mdadm source code
> - I tried Linux v5.9-rc4 and Linux v4.12; both of them stably
> reproduce the above unexpected behavior.
>   Therefore I assume kernels at least from v4.12 to v5.9 may have this issue.
> 
> Just for your information; I hope you can have a look into it. Thanks in
> advance.
> 
> Coly Li
> 



* Re: unexpected 'mdadm -S' hang with I/O pressure testing
  2020-09-12 14:21 ` Coly Li
@ 2020-09-12 14:59   ` Xiao Ni
  0 siblings, 0 replies; 4+ messages in thread
From: Xiao Ni @ 2020-09-12 14:59 UTC (permalink / raw)
  To: Coly Li, linux-raid; +Cc: Guoqing Jiang, Song Liu

Hi all

I did a test on a single disk (just replacing /dev/md0 with /dev/sde),
and it had the same problem.
Ctrl+c can't stop the fio process. There is still write I/O going to
/dev/sde after Ctrl+c; it needs to wait for all the I/O to finish.
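
For reference, the single-disk variant is just the same job file with
the filename swapped; a sketch (the file names here are only examples):
  # sed 's|/dev/md0|/dev/sde|' raid5.fio > sde.fio
  # fio ./sde.fio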

Regards
Xiao

On 09/12/2020 10:21 PM, Coly Li wrote:
> One thing to correct: the hang is not forever - after I posted the
> previous email, all the commands returned and the array stopped. It took
> around 40 minutes -- still quite unexpected and suspicious.
>
> Thanks.
>
> Coly Li
>
> On 2020/9/12 22:06, Coly Li wrote:
>> Unexpected Behavior:
>> - With Linux v5.9-rc4 mainline kernel and latest mdadm upstream code
>> - After running fio with 10 jobs, iodepth 16 and 64K block size for a
>> while, trying to stop the fio process with 'Ctrl + c' leaves the main
>> fio process hanging.
>> - Then trying to stop the md raid5 array with 'mdadm -S /dev/md0' makes
>> the mdadm process hang as well.
>> - After rebooting the system with 'echo b > /proc/sysrq-trigger', the md
>> raid5 array is assembled but inactive. /proc/mdstat shows:
>> 	Personalities : [raid6] [raid5] [raid4]
>> 	md127 : inactive sdc[0] sde[3] sdd[1]
>> 	      35156259840 blocks super 1.2
>>
>> Expectation:
>> - The fio process can stop with 'Ctrl + c'
>> - The raid5 array can be stopped by 'mdadm -S /dev/md0'
>> - The md raid5 array should continue to work (resync and become
>> active) after the reboot
>>
>>
>> How to reproduce:
>> 1) Create an md raid5 array with 3 hard drives (each a 12TB SATA spinning disk)
>>    # mdadm -C /dev/md0 -l 5 -n 3 /dev/sd{c,d,e}
>>    # cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid5 sde[3] sdd[1] sdc[0]
>>        23437506560 blocks super 1.2 level 5, 512k chunk, algorithm 2
>> [3/2] [UU_]
>>        [>....................]  recovery =  0.0% (2556792/11718753280)
>> finish=5765844.7min speed=33K/sec
>>        bitmap: 2/88 pages [8KB], 65536KB chunk
>>
>> 2) Run fio for random write on the raid5 array
>>    fio job file content:
>> [global]
>> thread=1
>> ioengine=libaio
>> random_generator=tausworthe64
>>
>> [job]
>> filename=/dev/md0
>> readwrite=randwrite
>> blocksize=64K
>> numjobs=10
>> iodepth=16
>> runtime=1m
>>    # fio ./raid5.fio
>>
>> 3) Wait for 10 seconds after the above fio runs, then type 'Ctrl + c' to
>> stop the fio process:
>> x:/home/colyli/fio_test/raid5 # fio ./raid5.fio
>> job: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB,
>> (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=16
>> ...
>> fio-3.23-10-ge007
>> Starting 12 threads
>> ^Cbs: 12 (f=12): [w(12)][3.3%][w=6080KiB/s][w=95 IOPS][eta 14m:30s]
>> fio: terminating on signal 2
>> ^C
>> fio: terminating on signal 2
>> ^C
>> fio: terminating on signal 2
>> Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][7.5%][eta 14m:20s]
>> ^C
>> fio: terminating on signal 2
>> Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][70.5%][eta 15m:00s]
>>
>> Now the fio process hangs forever.
>>
>> 4) Try to stop the md raid5 array with mdadm
>>    # mdadm -S /dev/md0
>>    Now the mdadm process hangs forever
>>
>>
>> Kernel versions to reproduce:
>> - Use the latest upstream mdadm source code
>> - I tried Linux v5.9-rc4 and Linux v4.12; both of them stably
>> reproduce the above unexpected behavior.
>>    Therefore I assume kernels at least from v4.12 to v5.9 may have this issue.
>>
>> Just for your information; I hope you can have a look into it. Thanks in
>> advance.
>>
>> Coly Li
>>



* Re: unexpected 'mdadm -S' hang with I/O pressure testing
  2020-09-12 14:06 unexpected 'mdadm -S' hang with I/O pressure testing Coly Li
  2020-09-12 14:21 ` Coly Li
@ 2020-09-14  0:03 ` Roger Heflin
  1 sibling, 0 replies; 4+ messages in thread
From: Roger Heflin @ 2020-09-14  0:03 UTC (permalink / raw)
  To: Coly Li; +Cc: Linux RAID, Guoqing Jiang, Song Liu, xiao ni

You might want to see how high Dirty is in /proc/meminfo. That is how
much write I/O has to finish if fio runs a sync call, and it would also
apply to the stop, since that should also call a sync.

On most OSes the default is some percentage of your total RAM and can
be quite high, especially if the I/O rates are low. For example, if you
had 1GB of dirty data and your write rate was 1MB/sec (random I/O),
flushing would take about 1000 seconds.

If you test and this is the issue, you should be able to watch Dirty
slowly go down from another window.
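
Something like this (just a sketch) shows the Dirty and Writeback
counters every few seconds:
  # watch -n 5 'grep -E "^(Dirty|Writeback):" /proc/meminfo'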

I set mine low, as I don't ever want there to be a significant amount
of outstanding writes. I use these values (5MB and 3MB):

vm.dirty_background_bytes = 3000000
vm.dirty_background_ratio = 0
vm.dirty_bytes = 5000000
vm.dirty_ratio = 0
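
(To try those values on a running system, something like this should
work; a sketch. Note that writing the *_bytes sysctls is what makes the
corresponding *_ratio values read back as 0.)
  # sysctl -w vm.dirty_background_bytes=3000000
  # sysctl -w vm.dirty_bytes=5000000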

