linux-raid.vger.kernel.org archive mirror
* unexpected 'mdadm -S' hang with I/O pressure testing
@ 2020-09-12 14:06 Coly Li
  2020-09-12 14:21 ` Coly Li
  2020-09-14  0:03 ` Roger Heflin
  0 siblings, 2 replies; 4+ messages in thread
From: Coly Li @ 2020-09-12 14:06 UTC (permalink / raw)
  To: linux-raid; +Cc: Guoqing Jiang, Song Liu, xiao ni

Unexpected Behavior:
- With Linux v5.9-rc4 mainline kernel and latest mdadm upstream code
- After running fio with 10 jobs, iodepth 16 and 64K block size for a
while, trying to stop the fio process with 'Ctrl + c' leaves the main
fio process hanging.
- Then trying to stop the md raid5 array with 'mdadm -S /dev/md0' makes
the mdadm process hang as well.
- After rebooting the system with 'echo b > /proc/sysrq-trigger', the md
raid5 array is assembled but inactive. /proc/mdstat shows:
	Personalities : [raid6] [raid5] [raid4]
	md127 : inactive sdc[0] sde[3] sdd[1]
	      35156259840 blocks super 1.2

Expectation:
- The fio process can stop with 'Ctrl + c'
- The raid5 array can be stopped by 'mdadm -S /dev/md0'
- The md raid5 array should continue to work (resync and become
active) after the reboot


How to reproduce:
1) Create an md raid5 array with 3 hard drives (each a 12TB SATA spinning disk)
  # mdadm -C /dev/md0 -l 5 -n 3 /dev/sd{c,d,e}
  # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde[3] sdd[1] sdc[0]
      23437506560 blocks super 1.2 level 5, 512k chunk, algorithm 2
[3/2] [UU_]
      [>....................]  recovery =  0.0% (2556792/11718753280)
finish=5765844.7min speed=33K/sec
      bitmap: 2/88 pages [8KB], 65536KB chunk
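
For reference, the array state during the initial recovery can also be
checked with the following (just a sketch, assuming the same /dev/md0
device as above):
  # mdadm --detail /dev/md0
  # watch -n 5 cat /proc/mdstat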

2) Run fio for random write on the raid5 array
  fio job file content:
[global]
thread=1
ioengine=libaio
random_generator=tausworthe64

[job]
filename=/dev/md0
readwrite=randwrite
blocksize=64K
numjobs=10
iodepth=16
runtime=1m
  # fio ./raid5.fio
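
(The same job can also be expressed as a single fio command line; a
rough, untested equivalent of the job file above:)
  # fio --name=job --thread=1 --ioengine=libaio \
        --random_generator=tausworthe64 --filename=/dev/md0 \
        --rw=randwrite --bs=64K --numjobs=10 --iodepth=16 --runtime=1m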

3) Wait for 10 seconds after the above fio runs, then type 'Ctrl + c' to
stop the fio process:
x:/home/colyli/fio_test/raid5 # fio ./raid5.fio
job: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB,
(T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=16
...
fio-3.23-10-ge007
Starting 12 threads
^Cbs: 12 (f=12): [w(12)][3.3%][w=6080KiB/s][w=95 IOPS][eta 14m:30s]
fio: terminating on signal 2
^C
fio: terminating on signal 2
^C
fio: terminating on signal 2
Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][7.5%][eta 14m:20s]
^C
fio: terminating on signal 2
Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][70.5%][eta 15m:00s]

Now the fio process hangs forever.

4) Try to stop the md raid5 array with mdadm
  # mdadm -S /dev/md0
  Now the mdadm process hangs forever
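
(While the processes are stuck like this, the following may help show
where they are blocked in the kernel; just a sketch, assuming root and
that sysrq is enabled:)
  # echo w > /proc/sysrq-trigger       # dump blocked (D state) tasks to dmesg
  # cat /proc/$(pidof mdadm)/stack     # kernel stack of the hung mdadm process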


Kernel versions to reproduce:
- Use the latest upstream mdadm source code
- I tried Linux v5.9-rc4 and Linux v4.12; both of them stably
reproduce the above unexpected behavior.
  Therefore I assume kernels at least from v4.12 to v5.9 may have this issue.

Just for your information; I hope you can have a look into it. Thanks in
advance.

Coly Li



* Re: unexpected 'mdadm -S' hang with I/O pressure testing
  2020-09-12 14:06 unexpected 'mdadm -S' hang with I/O pressure testing Coly Li
@ 2020-09-12 14:21 ` Coly Li
  2020-09-12 14:59   ` Xiao Ni
  2020-09-14  0:03 ` Roger Heflin
  1 sibling, 1 reply; 4+ messages in thread
From: Coly Li @ 2020-09-12 14:21 UTC (permalink / raw)
  To: linux-raid; +Cc: Guoqing Jiang, Song Liu, xiao ni

One thing to correct: the hang is not forever - after I posted the
previous email, all the commands returned and the array stopped. It took
around 40 minutes -- still quite unexpected and suspicious.
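
In hindsight it might be worth watching whether the member disks keep
writing while the commands appear hung, to tell a real stall from slow
draining; a sketch, assuming sysstat's iostat and the same disks:
  # iostat -x sdc sdd sde 5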

Thanks.

Coly Li

On 2020/9/12 22:06, Coly Li wrote:
> Unexpected Behavior:
> - With Linux v5.9-rc4 mainline kernel and latest mdadm upstream code
> - After running fio with 10 jobs, iodepth 16 and 64K block size for a
> while, trying to stop the fio process with 'Ctrl + c' leaves the main
> fio process hanging.
> - Then trying to stop the md raid5 array with 'mdadm -S /dev/md0' makes
> the mdadm process hang as well.
> - After rebooting the system with 'echo b > /proc/sysrq-trigger', the md
> raid5 array is assembled but inactive. /proc/mdstat shows:
> 	Personalities : [raid6] [raid5] [raid4]
> 	md127 : inactive sdc[0] sde[3] sdd[1]
> 	      35156259840 blocks super 1.2
> 
> Expectation:
> - The fio process can stop with 'Ctrl + c'
> - The raid5 array can be stopped by 'mdadm -S /dev/md0'
> - The md raid5 array should continue to work (resync and become
> active) after the reboot
> 
> 
> How to reproduce:
> 1) Create an md raid5 array with 3 hard drives (each a 12TB SATA spinning disk)
>   # mdadm -C /dev/md0 -l 5 -n 3 /dev/sd{c,d,e}
>   # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 sde[3] sdd[1] sdc[0]
>       23437506560 blocks super 1.2 level 5, 512k chunk, algorithm 2
> [3/2] [UU_]
>       [>....................]  recovery =  0.0% (2556792/11718753280)
> finish=5765844.7min speed=33K/sec
>       bitmap: 2/88 pages [8KB], 65536KB chunk
> 
> 2) Run fio for random write on the raid5 array
>   fio job file content:
> [global]
> thread=1
> ioengine=libaio
> random_generator=tausworthe64
> 
> [job]
> filename=/dev/md0
> readwrite=randwrite
> blocksize=64K
> numjobs=10
> iodepth=16
> runtime=1m
>   # fio ./raid5.fio
> 
> 3) Wait for 10 seconds after the above fio runs, then type 'Ctrl + c' to
> stop the fio process:
> x:/home/colyli/fio_test/raid5 # fio ./raid5.fio
> job: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB,
> (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=16
> ...
> fio-3.23-10-ge007
> Starting 12 threads
> ^Cbs: 12 (f=12): [w(12)][3.3%][w=6080KiB/s][w=95 IOPS][eta 14m:30s]
> fio: terminating on signal 2
> ^C
> fio: terminating on signal 2
> ^C
> fio: terminating on signal 2
> Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][7.5%][eta 14m:20s]
> ^C
> fio: terminating on signal 2
> Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][70.5%][eta 15m:00s]
> 
> Now the fio process hangs forever.
> 
> 4) Try to stop the md raid5 array with mdadm
>   # mdadm -S /dev/md0
>   Now the mdadm process hangs forever
> 
> 
> Kernel versions to reproduce:
> - Use the latest upstream mdadm source code
> - I tried Linux v5.9-rc4 and Linux v4.12; both of them stably
> reproduce the above unexpected behavior.
>   Therefore I assume kernels at least from v4.12 to v5.9 may have this issue.
> 
> Just for your information; I hope you can have a look into it. Thanks in
> advance.
> 
> Coly Li
> 



* Re: unexpected 'mdadm -S' hang with I/O pressure testing
  2020-09-12 14:21 ` Coly Li
@ 2020-09-12 14:59   ` Xiao Ni
  0 siblings, 0 replies; 4+ messages in thread
From: Xiao Ni @ 2020-09-12 14:59 UTC (permalink / raw)
  To: Coly Li, linux-raid; +Cc: Guoqing Jiang, Song Liu

Hi all

I did a test on a single disk (just replacing /dev/md0 with /dev/sde),
and it had the same problem.
Ctrl+c can't stop the fio process. There is still write I/O going to
/dev/sde after Ctrl+c; it needs to wait for all the I/O to finish.
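
For reference, the single-disk variant is just the same job file with
the filename swapped; a sketch (the file names here are only examples):
  # sed 's|/dev/md0|/dev/sde|' raid5.fio > sde.fio
  # fio ./sde.fio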

Regards
Xiao

On 09/12/2020 10:21 PM, Coly Li wrote:
> One thing to correct: the hang is not forever - after I posted the
> previous email, all the commands returned and the array stopped. It took
> around 40 minutes -- still quite unexpected and suspicious.
>
> Thanks.
>
> Coly Li
>
> On 2020/9/12 22:06, Coly Li wrote:
>> Unexpected Behavior:
>> - With Linux v5.9-rc4 mainline kernel and latest mdadm upstream code
>> - After running fio with 10 jobs, iodepth 16 and 64K block size for a
>> while, trying to stop the fio process with 'Ctrl + c' leaves the main
>> fio process hanging.
>> - Then trying to stop the md raid5 array with 'mdadm -S /dev/md0' makes
>> the mdadm process hang as well.
>> - After rebooting the system with 'echo b > /proc/sysrq-trigger', the md
>> raid5 array is assembled but inactive. /proc/mdstat shows:
>> 	Personalities : [raid6] [raid5] [raid4]
>> 	md127 : inactive sdc[0] sde[3] sdd[1]
>> 	      35156259840 blocks super 1.2
>>
>> Expectation:
>> - The fio process can stop with 'Ctrl + c'
>> - The raid5 array can be stopped by 'mdadm -S /dev/md0'
>> - The md raid5 array should continue to work (resync and become
>> active) after the reboot
>>
>>
>> How to reproduce:
>> 1) Create an md raid5 array with 3 hard drives (each a 12TB SATA spinning disk)
>>    # mdadm -C /dev/md0 -l 5 -n 3 /dev/sd{c,d,e}
>>    # cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid5 sde[3] sdd[1] sdc[0]
>>        23437506560 blocks super 1.2 level 5, 512k chunk, algorithm 2
>> [3/2] [UU_]
>>        [>....................]  recovery =  0.0% (2556792/11718753280)
>> finish=5765844.7min speed=33K/sec
>>        bitmap: 2/88 pages [8KB], 65536KB chunk
>>
>> 2) Run fio for random write on the raid5 array
>>    fio job file content:
>> [global]
>> thread=1
>> ioengine=libaio
>> random_generator=tausworthe64
>>
>> [job]
>> filename=/dev/md0
>> readwrite=randwrite
>> blocksize=64K
>> numjobs=10
>> iodepth=16
>> runtime=1m
>>    # fio ./raid5.fio
>>
>> 3) Wait for 10 seconds after the above fio runs, then type 'Ctrl + c' to
>> stop the fio process:
>> x:/home/colyli/fio_test/raid5 # fio ./raid5.fio
>> job: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB,
>> (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=16
>> ...
>> fio-3.23-10-ge007
>> Starting 12 threads
>> ^Cbs: 12 (f=12): [w(12)][3.3%][w=6080KiB/s][w=95 IOPS][eta 14m:30s]
>> fio: terminating on signal 2
>> ^C
>> fio: terminating on signal 2
>> ^C
>> fio: terminating on signal 2
>> Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][7.5%][eta 14m:20s]
>> ^C
>> fio: terminating on signal 2
>> Jobs: 11 (f=11): [w(5),_(1),w(4),f(1),w(1)][70.5%][eta 15m:00s]
>>
>> Now the fio process hangs forever.
>>
>> 4) Try to stop the md raid5 array with mdadm
>>    # mdadm -S /dev/md0
>>    Now the mdadm process hangs forever
>>
>>
>> Kernel versions to reproduce:
>> - Use the latest upstream mdadm source code
>> - I tried Linux v5.9-rc4 and Linux v4.12; both of them stably
>> reproduce the above unexpected behavior.
>>    Therefore I assume kernels at least from v4.12 to v5.9 may have this issue.
>>
>> Just for your information; I hope you can have a look into it. Thanks in
>> advance.
>>
>> Coly Li
>>



* Re: unexpected 'mdadm -S' hang with I/O pressure testing
  2020-09-12 14:06 unexpected 'mdadm -S' hang with I/O pressure testing Coly Li
  2020-09-12 14:21 ` Coly Li
@ 2020-09-14  0:03 ` Roger Heflin
  1 sibling, 0 replies; 4+ messages in thread
From: Roger Heflin @ 2020-09-14  0:03 UTC (permalink / raw)
  To: Coly Li; +Cc: Linux RAID, Guoqing Jiang, Song Liu, xiao ni

You might want to see how high Dirty is in /proc/meminfo. That is how
much write I/O has to finish if fio runs a sync call, and it would also
apply to the stop, since that should also call a sync.

On most OSes the default is some percentage of your total RAM and can
be quite high, especially if the I/O rates are low. For example, if you
had 1GB of dirty data and your write rate was 1MB/sec (random I/O),
flushing would take about 1000 seconds.

If you test and this is the issue, you should be able to watch Dirty
slowly go down from another window.
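
Something like this (just a sketch) shows the Dirty and Writeback
counters every few seconds:
  # watch -n 5 'grep -E "^(Dirty|Writeback):" /proc/meminfo'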

I set mine low, as I don't ever want there to be a significant amount
of outstanding writes. I use these values (5MB and 3MB):

vm.dirty_background_bytes = 3000000
vm.dirty_background_ratio = 0
vm.dirty_bytes = 5000000
vm.dirty_ratio = 0
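
(To try those values on a running system, something like this should
work; a sketch. Note that writing the *_bytes sysctls is what makes the
corresponding *_ratio values read back as 0.)
  # sysctl -w vm.dirty_background_bytes=3000000
  # sysctl -w vm.dirty_bytes=5000000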

