* 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
@ 2019-08-19  7:08 Marc MERLIN
  2019-08-19  9:18 ` Paolo Valente
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Marc MERLIN @ 2019-08-19  7:08 UTC (permalink / raw)
  To: linux-block, linux-raid

(Please Cc me on replies so that I can see them more quickly)

Dear Block Folks,

I just inherited a Dell 2950 with a Perc 5/i.
I really don't want to use that Perc 5/i card, but from all the reading
I did, there is no IT/unraid mode for it, so I was stuck setting the 6
2TB drives as 6 independent raid0 drives using the card.
I wish I could just bypass the card and connect the drives directly to a
sata card, but the case and backplane do not seem to make this possible.
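
For reference, each disk got exported as its own single-drive raid0 VD with
something roughly like this, repeated for slots 0 through 5 (enclosure 8 as
reported in the -LdPdInfo output further down):
newmagic:~# megacli -CfgLdAdd -r0 [8:0] -a0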

I'm getting very weird and effectively unusable I/O performance if I
do a swraid resync, which is throttled to 5MB/s.

By bad, I mean really bad; see this (more details below):
 Timing buffered disk reads:   2 MB in 36.15 seconds =  56.65 kB/sec

Dear linux-raid folks,

I realize I have a Perc 5/i card underneath that I'd very much like to
remove, but can't on that system.
Still, I'm hitting some quite unexpectedly bad swraid performance, including
a kernel warning and an unclean raid shutdown on sysrq poweroff.


So, the 6 Perc 5/i raid0 drives show up in Linux as 6 drives. I
partitioned them and created various software raid slices on top
(raid1, raid5 and raid6).  They work fine, but there is something very
wrong with the block layer somewhere. If I send a bunch of writes, the
I/O scheduler seems to introduce terrible latency: my whole system
hangs for a few seconds trying to read simple binaries while, from what
I can tell, the platters spend all their time writing the backlog of
what's being sent.
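
Nothing exotic about how the arrays themselves were built, just plain mdadm;
for illustration, the raid5 would have been created with something along the
lines of this (partitions and chunk size as shown in the mdstat output below):
mdadm --create /dev/md3 --level=5 --raid-devices=6 --chunk=512 /dev/sd[a-f]6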

You'll see below that I have a swraid6 running on those 6 drives
and it seems to run at an OK speed. But I have a bigger swraid5 across the
same 6 drives, and that one runs at a terrible speed right now.


I tried to disable the drives' write cache so that Linux and its 32GB of
RAM can do the caching instead, but I didn't see a real improvement:
newmagic:~# megacli -LDSetProp -DisDskCache -L0 -a0   (repeated for -L0 through -L5)
newmagic:~# megacli -LDGetProp -DskCache -Lall -a0
> Adapter 0-VD 0(target id: 0): Disk Write Cache : Disabled
> Adapter 0-VD 1(target id: 1): Disk Write Cache : Disabled
> Adapter 0-VD 2(target id: 2): Disk Write Cache : Disabled
> Adapter 0-VD 3(target id: 3): Disk Write Cache : Disabled
> Adapter 0-VD 4(target id: 4): Disk Write Cache : Disabled
> Adapter 0-VD 5(target id: 5): Disk Write Cache : Disabled
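
Note that this only disables the disks' own write cache; the controller-level
VD cache policy still shows "WriteBack" (see the megacli -LdPdInfo output for
VD 0 at the bottom). If it's worth trying, I believe something like this would
flip the VDs to write-through, though I haven't done it yet:
newmagic:~# megacli -LDSetProp WT -LALL -a0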

For the raid card, I installed the latest BIOS/firmware I could find, and here is what the driver reports:
> megasas: 07.707.51.00-rc1
> megaraid_sas 0000:02:0e.0: PCI IRQ 78 -> rerouted to legacy IRQ 18
> megaraid_sas 0000:02:0e.0: FW now in Ready state
> megaraid_sas 0000:02:0e.0: 63 bit DMA mask and 32 bit consistent mask
> megaraid_sas 0000:02:0e.0: firmware supports msix	: (0)
> megaraid_sas 0000:02:0e.0: current msix/online cpus	: (0/4)
> megaraid_sas 0000:02:0e.0: RDPQ mode	: (disabled)
> megaraid_sas 0000:02:0e.0: controller type	: MR(256MB)
> megaraid_sas 0000:02:0e.0: Online Controller Reset(OCR)	: Enabled
> megaraid_sas 0000:02:0e.0: Secure JBOD support	: No
> megaraid_sas 0000:02:0e.0: NVMe passthru support	: No
> megaraid_sas 0000:02:0e.0: FW provided TM TaskAbort/Reset timeout	: 0 secs/0 secs
> megaraid_sas 0000:02:0e.0: megasas_init_mfi: fw_support_ieee=0
> megaraid_sas 0000:02:0e.0: INIT adapter done
> megaraid_sas 0000:02:0e.0: fw state:c0000000
> megaraid_sas 0000:02:0e.0: Jbod map is not supported megasas_setup_jbod_map 5388
> megaraid_sas 0000:02:0e.0: fwstate:c0000000, dis_OCR=0
> megaraid_sas 0000:02:0e.0: MR_DCMD_PD_LIST_QUERY not supported by firmware
> megaraid_sas 0000:02:0e.0: DCMD not supported by firmware - megasas_ld_list_query 4590
> megaraid_sas 0000:02:0e.0: pci id		: (0x1028)/(0x0015)/(0x1028)/(0x1f03)
> megaraid_sas 0000:02:0e.0: unevenspan support	: no
> megaraid_sas 0000:02:0e.0: firmware crash dump	: no
> megaraid_sas 0000:02:0e.0: jbod sync map		: no

I'm also only getting about 5MB/s sustained write speed, which is
pathetic. I have lots of servers with normal SATA cards and software
raid, and I normally get 50 to 100MB/s.
I'm hoping the Perc 5/i card is not _that_ bad?  See below.
md0 : active raid1 sde1[4] sdb1[1] sdd1[3] sda1[0] sdc1[2] sdf1[5]
      975872 blocks super 1.2 [6/6] [UUUUUU]
md1 : active raid6 sde3[4] sdb3[1] sdd3[3] sdf3[5] sda3[0] sdc3[2]
      419164160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]

md2 : active raid6 sde5[4] sdb5[1] sdf5[5] sdd5[3] sdc5[2] sda5[0]
      1677193216 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      bitmap: 1/4 pages [4KB], 65536KB chunk

md3 : active raid5 sde6[4] sdb6[1] sdd6[3] sdf6[6] sdc6[2] sda6[0]
      7118330880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/5] [UUUUU_]
      [=>...................]  recovery =  7.7% (109702192/1423666176) finish=5790.5min speed=3781K/sec
      bitmap: 0/11 pages [0KB], 65536KB chunk

If I access drives plugged directly into the motherboard's SATA ports, I
get full speed. I've also added an SSD with bcache in front of the
slowest raid array, and sure enough, it becomes usable.
When my system is crawling due to this issue, I can still do full-speed
I/O to a different drive plugged into the motherboard's SATA chip (though
due to the case, that drive is literally sitting on the motherboard; there
is nowhere to mount it).
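
The bcache setup itself is nothing special, roughly the standard recipe, with
<ssd-partition> standing in for whatever partition I gave it on that SSD and
md2 as the backing device:
make-bcache -C /dev/<ssd-partition>
make-bcache -B /dev/md2
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach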

The main problem is that all my raids use the same 6 devices, so if
anything spams them with a huge write queue, I/O for the other arrays is
completely starved.
The terrible write performance, on top of being bad in itself, prevents
pretty much any other I/O to those drives.

After the unclean shutdown explained below, a resync of the other raid
arrays on the same drives is much faster and does not make the system
unresponsive.
md1 : active raid6 sda3[0] sdb3[1] sdf3[5] sdc3[2] sde3[4] sdd3[3]
      419164160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [==================>..]  resync = 91.1% (95553272/104791040) finish=1.7min speed=86952K/sec


If I start the recovery or a big copy/rsync towards md2, things get so slow that other I/O
hangs for multiple seconds, or sometimes even 2 minutes or more. Yes, the trace below is from
the stock Debian kernel, but I see similar problems with 5.1.21:
> [13900.007277] INFO: task sendmail:30862 blocked for more than 120 seconds.
> [13900.030181]       Not tainted 4.19.0-5-amd64 #1 Debian 4.19.37-5+deb10u2
> [13900.053131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [13900.078495] sendmail        D    0 30862  30812 0x00000000
> [13900.099272] Call Trace:
> [13900.113941]  ? __schedule+0x2a2/0x870
> [13900.131022]  ? lookup_fast+0xc8/0x2e0
> [13900.148085]  schedule+0x28/0x80
> [13900.163959]  rwsem_down_write_failed+0x183/0x3a0
> [13900.182741]  ? inode_permission+0xbe/0x180
> [13900.200431]  call_rwsem_down_write_failed+0x13/0x20
> [13900.219731]  down_write+0x29/0x40
> [13900.235849]  path_openat+0x615/0x15c0
> [13900.252665]  ? mem_cgroup_commit_charge+0x7a/0x560
> [13900.271680]  do_filp_open+0x93/0x100
> [13900.288163]  ? __check_object_size+0x15d/0x189
> [13900.306276]  do_sys_open+0x186/0x210
> [13900.322529]  do_syscall_64+0x53/0x110
> [13900.338867]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [13900.358047] RIP: 0033:0x7fa715212c8b
> [13900.374306] Code: Bad RIP value.
> [13900.389850] RSP: 002b:00007ffc26ba42a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
> [13900.414289] RAX: ffffffffffffffda RBX: 00005584ee809978 RCX: 00007fa715212c8b
> [13900.437957] RDX: 00000000000000c2 RSI: 00005584ee8198f0 RDI: 00000000ffffff9c
> [13900.461660] RBP: 00005584ee8198f0 R08: 0000000000007fdd R09: 0000000000000000
> [13900.485361] R10: 00000000000001a0 R11: 0000000000000246 R12: 0000000000000000
> [13900.509096] R13: 0000000000000000 R14: 000000000000000a R15: 0000000000000000

I know I can slow down the raid recovery; to be able to use the system, I actually have to do this:
echo 1000 > /proc/sys/dev/raid/speed_limit_min
Of course, at 1MB/s, it will take weeks to resync...
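
The matching knob is /proc/sys/dev/raid/speed_limit_max; if I remember the
defaults right, they are 1000 and 200000 KB/s, so lowering the max is another
way to rein in a resync without touching the guaranteed minimum, e.g.:
echo 50000 > /proc/sys/dev/raid/speed_limit_max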

At this point, you could ask if my drives are OK speed-wise, but the raid6
resync I showed above already runs at over 80MB/s.

I did some basic I/O read-write tests when the resync wasn't running:
> dd if=/dev/mdx of=/dev/null bs=1M count=40000
> f=/var/space/test; dd if=/dev/zero of=$f bs=1M count=3000 conv=fdatasync; \rm $f
> 
> dd read test: /dev/md0 419430400 bytes (419 MB, 400 MiB) copied, 3.13387 s, 134 MB/s, hdparm -t 208.18MB/s
> dd write test: 104857600 bytes (105 MB, 100 MiB) copied, 16.1961 s, 6.5 MB/s
> 
> /dev/md1 419430400 bytes (419 MB, 400 MiB) copied, 1.58549 s, 265 MB/s, hdparm -t 335.11MB/s
> 3145728000 bytes (3.1 GB, 2.9 GiB) copied, 6.51223 s, 483 MB/s
> 
> /dev/md2 419430400 bytes (419 MB, 400 MiB) copied, 1.75172 s, 239 MB/s, hdparm -t 256.08MB/s
> 3145728000 bytes (3.1 GB, 2.9 GiB) copied, 5.25801 s, 598 MB/s
> 
> /dev/md3 419430400 bytes (419 MB, 400 MiB) copied, 1.81613 s, 231 MB/s, hdparm -t 382.33MB/s
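
Those dd numbers go through the page cache; if it helps, I can redo them with
direct I/O to take caching out of the picture, something along the lines of:
dd if=/dev/mdx of=/dev/null bs=1M count=4000 iflag=direct
f=/var/space/test; dd if=/dev/zero of=$f bs=1M count=3000 oflag=direct conv=fdatasync; \rm $f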

Then, when the recovery is running at a mere 4MB/s and apparently hogging all the available I/O:
newmagic:~# for i in md0 md1 md2 md3; do hdparm -t /dev/$i; done

/dev/md0:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads: 190 MB in  3.00 seconds =  63.26 MB/sec

/dev/md1:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads:   4 MB in  3.21 seconds =   1.25 MB/sec
/dev/md2:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads:   6 MB in  9.08 seconds = 676.33 kB/sec

/dev/md3:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads:   2 MB in 36.15 seconds =  56.65 kB/sec
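
In case the I/O scheduler is a factor here, this dumps the current choice on
the underlying disks (I can switch it if someone has a suggestion):
for d in sda sdb sdc sdd sde sdf; do echo -n "$d: "; cat /sys/block/$d/queue/scheduler; done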


I also may have found a bug in software raid during shutdown:
> [14847.171978] sysrq: SysRq : Power Off
> [14852.341924] WARNING: CPU: 0 PID: 2530 at drivers/md/md.c:8180 md_write_inc+0x15/0x40 [md_mod]
> [14852.359192] Modules linked in: fuse ufs qnx4 hfsplus hfs minix vfat msdos fat jfs xfs dm_mod cpuid ipt_MASQUERADE ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack nf_log_ipv4 nf_log_common xt_LOG nft_compat nft_counter nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_chain_route_ipv4 nf_tables nfnetlink binfmt_misc ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 ipmi_ssif radeon coretemp ttm drm_kms_helper kvm drm evdev dcdbas iTCO_wdt iTCO_vendor_support serio_raw irqbypass sg pcspkr rng_core i2c_algo_bit ipmi_si i5000_edac ipmi_devintf i5k_amb ipmi_msghandler button ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress xxhash raid10 raid0 multipath linear sata_sil24 e1000e r8169 realtek libphy mii uas usb_storage
> [14852.502352]  raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid1 raid6_pq libcrc32c crc32c_generic hid_generic bcache crc64 usbhid md_mod hid ses enclosure sr_mod scsi_transport_sas cdrom sd_mod ata_generic uhci_hcd ehci_pci ehci_hcd ata_piix libata psmouse lpc_ich megaraid_sas usbcore scsi_mod usb_common bnx2
> [14852.562340] CPU: 0 PID: 2530 Comm: sendmail Not tainted 4.19.0-5-amd64 #1 Debian 4.19.37-5+deb10u2
> [14852.580463] Hardware name: Dell Inc. PowerEdge 2950/0DT021, BIOS 2.7.0 10/30/2010
> [14852.595607] RIP: 0010:md_write_inc+0x15/0x40 [md_mod]
> [14852.605820] Code: ff e8 9f 54 32 f3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 66 66 66 66 90 f6 46 10 01 74 1b 8b 97 c4 01 00 00 85 d2 74 12 <0f> 0b 48 8b 87 e0 02 00 00 a8 03 75 0e 65 48 ff 00 c3 8b 47 40 85
> [14852.643807] RSP: 0000:ffffb1c287767ac0 EFLAGS: 00010002
> [14852.654378] RAX: ffff9615c93a4cf8 RBX: ffff9615c93a4910 RCX: 0000000000000001
> [14852.668807] RDX: 0000000000000001 RSI: ffff96162aa17f00 RDI: ffff961625000000
> [14852.683235] RBP: ffff9615c93a4978 R08: 0000000000000000 R09: ffff961624c3a918
> [14852.697661] R10: 0000000000000000 R11: ffff961625a1f800 R12: 0000000000000001
> [14852.712089] R13: 0000000000000001 R14: ffff961623b6e000 R15: ffff96162aa17f00
> [14852.726518] FS:  00007f2ca54d3f40(0000) GS:ffff96162fa00000(0000) knlGS:0000000000000000
> [14852.742891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [14852.754505] CR2: 00007f8e63e501c0 CR3: 000000031fb6e000 CR4: 00000000000006f0
> [14852.768931] Call Trace:
> [14852.773891]  add_stripe_bio+0x205/0x7c0 [raid456]
> [14852.783405]  raid5_make_request+0x1bd/0xb60 [raid456]
> [14852.793619]  ? finish_wait+0x80/0x80
> [14852.800851]  ? finish_wait+0x80/0x80
> [14852.808093]  md_handle_request+0x119/0x190 [md_mod]
> [14852.817964]  md_make_request+0x78/0x160 [md_mod]
> [14852.827311]  generic_make_request+0x1a4/0x410
> [14852.836116]  submit_bio+0x45/0x140
> [14852.842991]  ? guard_bio_eod+0x32/0x100
> [14852.850747]  submit_bh_wbc+0x163/0x190
> [14852.858377]  write_all_supers+0x22f/0xa60 [btrfs]
> [14852.867905]  btrfs_commit_transaction+0x581/0x870 [btrfs]
> [14852.878819]  ? finish_wait+0x80/0x80
> [14852.886071]  btrfs_sync_file+0x380/0x3d0 [btrfs]
> [14852.895415]  do_fsync+0x38/0x70
> [14852.901764]  __x64_sys_fsync+0x10/0x20
> [14852.909342]  do_syscall_64+0x53/0x110
> [14852.916742]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [14852.926952] RIP: 0033:0x7f2ca6944a71
> [14852.934185] Code: 6d a5 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b1 0f 1f 80 00 00 00 00 8b 05 da e9 00 00 85 c0 75 16 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f c3 66 0f 1f 44 00 00 53 89 fb 48 83 ec 10
> [14852.972172] RSP: 002b:00007fffe32a0368 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> [14852.987483] RAX: ffffffffffffffda RBX: 000056297ca540d0 RCX: 00007f2ca6944a71
> [14853.001908] RDX: 0000000000000000 RSI: 000056297ca541b0 RDI: 0000000000000004
> [14853.016334] RBP: 00000000000001d7 R08: 000056297ca541b0 R09: 00007f2ca54d3f40
> [14853.030760] R10: 7541203831202c6e R11: 0000000000000246 R12: 000056297bfbe369
> [14853.045189] R13: 00007fffe32a03b0 R14: 000000000000000a R15: 0000000000000000
> [14853.059617] ---[ end trace 407005be9d52ae9f ]---
> [14854.715315] md: md3: recovery interrupted.
> [14877.083807] bcache: bcache_reboot() Stopping all devices:
> [14879.097334] bcache: bcache_reboot() Timeout waiting for devices to be closed
> [14879.111948] sd 4:0:0:0: [sdh] Synchronizing SCSI cache
> [14879.122617] sd 4:0:0:0: [sdh] Stopping disk
> [14879.615609] sd 3:0:0:0: [sdg] Synchronizing SCSI cache
> [14879.626667] sd 3:0:0:0: [sdg] Stopping disk
> [14881.520158] sd 0:2:2:0: [sdc] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14881.538216] sd 0:2:2:0: [sdc] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
> [14881.553614] print_req_error: I/O error, dev sdc, sector 320282600
> [14881.566001] sd 0:2:4:0: [sde] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14881.583982] sd 0:2:4:0: [sde] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
> [14881.599303] print_req_error: I/O error, dev sde, sector 320282600
> [14881.611638] sd 0:2:5:0: [sdf] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14881.629587] sd 0:2:5:0: [sdf] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
> [14881.661536] print_req_error: I/O error, dev sdf, sector 320282600
> [14881.690648] sd 0:2:5:0: [sdf] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14881.725455] sd 0:2:5:0: [sdf] tag#684 CDB: Write(10) 2a 00 13 17 20 00 00 02 80 00
> [14881.757640] print_req_error: I/O error, dev sdf, sector 320282624
> [14881.786840] sd 0:2:3:0: [sdd] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14881.821202] sd 0:2:3:0: [sdd] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
> [14881.852497] print_req_error: I/O error, dev sdd, sector 320282600
> [14881.880392] sd 0:2:0:0: [sda] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14881.913429] sd 0:2:0:0: [sda] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
> [14881.943303] print_req_error: I/O error, dev sda, sector 320282600
> [14881.969675] sd 0:2:1:0: [sdb] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14882.001626] sd 0:2:1:0: [sdb] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
> [14882.030904] print_req_error: I/O error, dev sdb, sector 320282600
> [14882.057411] sd 0:2:4:0: [sde] tag#299 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14882.088845] sd 0:2:4:0: [sde] tag#299 CDB: Write(10) 2a 00 13 17 20 00 00 02 80 00
> [14882.117051] print_req_error: I/O error, dev sde, sector 320282624
> [14882.142352] sd 0:2:5:0: [sdf] tag#299 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14882.142430] sd 0:2:4:0: [sde] tag#300 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [14882.173313] sd 0:2:5:0: [sdf] tag#299 CDB: Write(10) 2a 00 13 17 22 80 00 01 80 00
> [14882.173315] print_req_error: I/O error, dev sdf, sector 320283264
> [14882.257818] sd 0:2:4:0: [sde] tag#300 CDB: Write(10) 2a 00 13 17 22 80 00 01 80 00
> [14882.286196] print_req_error: I/O error, dev sde, sector 320283264
> [14882.372678] md: super_written gets error=10
> [14882.394226] md/raid:md2: Disk failure on sdc5, disabling device.
> [14882.394226] md/raid:md2: Operation continuing on 5 devices.
> [14882.396634] md: super_written gets error=10
> [14882.443706] md: super_written gets error=10
> [14882.465231] md/raid:md2: Disk failure on sde5, disabling device.
> [14882.465231] md/raid:md2: Operation continuing on 4 devices.
> [14885.396071] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.423090] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.450404] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.476946] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.503344] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.530389] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.563027] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.589494] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.615995] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.642142] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.667968] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.693224] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.717937] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.743191] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.767407] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14885.792214] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
> [14890.416424] btrfs_dev_stat_print_on_error: 1409 callbacks suppressed
> [14890.416429] BTRFS error (device bcache0): bdev /dev/bcache0 errs: wr 1417, rd 0, flush 0, corrupt 0, gen 0
> [14890.460838] BTRFS error (device bcache0): bdev /dev/bcache0 errs: wr 1418, rd 0, flush 0, corrupt 0, gen 0
> [14890.486347] BTRFS error (device bcache0): bdev /dev/bcache0 errs: wr 1419, rd 0, flush 0, corrupt 0, gen 0
> [14890.511308] BTRFS error (device bcache0): bdev /dev/bcache0 errs: wr 1420, rd 0, flush 0, corrupt 0, gen 0
> [14890.536129] Emergency Sync complete
> [14891.398791] ACPI: Preparing to enter system sleep state S5
> [14891.460410] reboot: Power down
> [14891.471830] acpi_power_off called


megacli -LdPdInfo -a0 output for the first drive is below.
> Number of Virtual Disks: 6
> Virtual Drive: 0 (Target Id: 0)
> Name                :0
> RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
> Size                : 1.818 TB
> Sector Size         : 512
> Parity Size         : 0
> State               : Optimal
> Strip Size          : 64 KB
> Number Of Drives    : 1
> Span Depth          : 1
> Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
> Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
> Default Access Policy: Read/Write
> Current Access Policy: Read/Write
> Disk Cache Policy   : Disabled
> Encryption Type     : None
> Is VD Cached: No
> Number of Spans: 1
> Span: 0 - Number of PDs: 1
> 
> PD: 0 Information
> Enclosure Device ID: 8
> Slot Number: 0
> Drive's position: DiskGroup: 0, Span: 0, Arm: 0
> Enclosure position: N/A
> Device Id: 0
> WWN: 
> Sequence Number: 2
> Media Error Count: 0
> Other Error Count: 1
> Predictive Failure Count: 0
> Last Predictive Failure Event Seq Number: 0
> PD Type: SATA
> 
> Raw Size: 1.819 TB [0xe8e088b0 Sectors]
> Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
> Coerced Size: 1.818 TB [0xe8d00000 Sectors]
> Sector Size:  0
> Firmware state: Online, Spun Up
> Device Firmware Level: AB50
> Shield Counter: 0
> Successful diagnostics completion on :  N/A
> SAS Address(0):
>  0x1221000000000000
> Connected Port Number: 0 
> Inquiry Data:      WD-WMAZA0374092WDC WD20EARS-00MVWB0                    50.0AB50
> FDE Capable: Not Capable
> FDE Enable: Disable
> Secured: Unsecured
> Locked: Unlocked
> Needs EKM Attention: No
> Foreign State: None 
> Device Speed: Unknown 
> Link Speed: Unknown 
> Media Type: Hard Disk Device
> Drive Temperature : N/A
> PI Eligibility:  No 
> Drive is formatted for PI information:  No
> PI: No PI
> Port-0 :
> Port status: Active
> Port's Linkspeed: Unknown 
> Drive has flagged a S.M.A.R.T alert : No

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19  7:08 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod Marc MERLIN
@ 2019-08-19  9:18 ` Paolo Valente
  2019-08-19 12:02   ` Paolo Valente
  2019-08-19 16:40   ` Marc MERLIN
  2019-08-19 11:42 ` o1bigtenor
  2019-08-19 18:37 ` Roman Mamedov
  2 siblings, 2 replies; 11+ messages in thread
From: Paolo Valente @ 2019-08-19  9:18 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-block, linux-raid



> On 19 Aug 2019, at 09:08, Marc MERLIN <marc@merlins.org> wrote:
> 
> (Please Cc me on replies so that I can see them more quickly)
> 
> Dear Block Folks,
> 

Hi Marc,

> I just inherited a Dell 2950 with a Perc 5/i.
> I really don't want to use that Perc 5/i card, but from all the reading
> I did, there is no IT/unraid mode for it, so I was stuck setting the 6
> 2TB drives as 6 independent raid0 drives using the card.
> I wish I could just bypass the card and connect the drives directly to a
> sata card, but the case and backplane do not seem to make this possible.
> 
> I'm getting very weird and effectively unusable I/O performance if I
> do a swraid resync, which is throttled to 5MB/s.
>
> By bad, I mean really bad; see this (more details below):
> Timing buffered disk reads:   2 MB in 36.15 seconds =  56.65 kB/sec
> 
> Dear linux-raid folks,
> 
> I realize I have a Perc 5/i card underneath that I'd very much like to
> remove, but can't on that system.
> Still, I'm hitting some quite unexpectedly bad swraid performance, including
> a kernel warning and an unclean raid shutdown on sysrq poweroff.
> 
> 
> So, the 6 Perc 5/i raid0 drives show up in Linux as 6 drives. I
> partitioned them and created various software raid slices on top
> (raid1, raid5 and raid6).  They work fine, but there is something very
> wrong with the block layer somewhere. If I send a bunch of writes, the
> I/O scheduler seems to introduce terrible latency: my whole system
> hangs for a few seconds trying to read simple binaries while, from what
> I can tell, the platters spend all their time writing the backlog of
> what's being sent.
> 

Solving this kind of problem is one of the goals of the BFQ I/O scheduler [1].
Have you tried?  If you want to, then start by swathing to BFQ in both the
physical and the virtual block devices in your stack.
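
Assuming bfq is built into or loaded on your kernel, switching is just a sysfs
write, for each device in the stack that exposes a scheduler, e.g.:
  cat /sys/block/sda/queue/scheduler
  echo bfq > /sys/block/sda/queue/scheduler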

Thanks,
Paolo

[1] https://algo.ing.unimo.it/people/paolo/BFQ/



* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19  7:08 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod Marc MERLIN
  2019-08-19  9:18 ` Paolo Valente
@ 2019-08-19 11:42 ` o1bigtenor
  2019-08-19 16:24   ` Marc MERLIN
  2019-08-20  5:49   ` Marc MERLIN
  2019-08-19 18:37 ` Roman Mamedov
  2 siblings, 2 replies; 11+ messages in thread
From: o1bigtenor @ 2019-08-19 11:42 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-block, Linux-RAID

On Mon, Aug 19, 2019 at 2:35 AM Marc MERLIN <marc@merlins.org> wrote:
>
> (Please Cc me on replies so that I can see them more quickly)
>
> Dear Block Folks,
>
> I just inherited a Dell 2950 with a Perc 5/i.
> I really don't want to use that Perc 5/i card, but from all the reading
> I did, there is no IT/unraid mode for it, so I was stuck setting the 6
> 2TB drives as 6 independent raid0 drives using the card.
> I wish I could just bypass the card and connect the drives directly to a
> sata card, but the case and backplane do not seem to make this possible.
>

Not to discourage you from a possibly interesting and fruitful endeavor,
but when I bought myself a slightly newer Dell server, I traded out the
PERC card for a newer version (model 700, IIRC) and then things were
quite a bit different. Said board, bought used, wasn't very
expensive. YMMV

Regards


* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19  9:18 ` Paolo Valente
@ 2019-08-19 12:02   ` Paolo Valente
  2019-08-19 16:40   ` Marc MERLIN
  1 sibling, 0 replies; 11+ messages in thread
From: Paolo Valente @ 2019-08-19 12:02 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-block, linux-raid



> On 19 Aug 2019, at 11:18, Paolo Valente <paolo.valente@linaro.org> wrote:
> Solving this kind of problem is one of the goals of the BFQ I/O scheduler [1].
> Have you tried?  If you want to, then start by swathing

switching, sorry

> to BFQ in both the
> physical and the virtual block devices in your stack.
> 
> Thanks,
> Paolo
> 
> [1] https://algo.ing.unimo.it/people/paolo/BFQ/
> 
>> You'll read below that somehow I have a swraid6 running on those 6 drives
>> and that seems to run at ok speed. But I have a bigger swraid5 across the
>> same 6 drives, and that runs at terrible speed right now.
>> 
>> 
>> I tried to disable the card's write cache to let linux and its 32GB of
>> RAM, do it better, but I didn't see a real improvement:
>> newmagic:~# megacli -LDSetProp -DisDskCache -L0 -a0  (0,1,2,3,4,5)
>> newmagic:~# megacli -LDGetProp -DskCache -Lall -a0
>>> Adapter 0-VD 0(target id: 0): Disk Write Cache : Disabled
>>> Adapter 0-VD 1(target id: 1): Disk Write Cache : Disabled
>>> Adapter 0-VD 2(target id: 2): Disk Write Cache : Disabled
>>> Adapter 0-VD 3(target id: 3): Disk Write Cache : Disabled
>>> Adapter 0-VD 4(target id: 4): Disk Write Cache : Disabled
>>> Adapter 0-VD 5(target id: 5): Disk Write Cache : Disabled
>> 
>> For the raid card, I installed the last bios I could find, and here is what it says.
>>> megasas: 07.707.51.00-rc1
>>> megaraid_sas 0000:02:0e.0: PCI IRQ 78 -> rerouted to legacy IRQ 18
>>> megaraid_sas 0000:02:0e.0: FW now in Ready state
>>> megaraid_sas 0000:02:0e.0: 63 bit DMA mask and 32 bit consistent mask
>>> megaraid_sas 0000:02:0e.0: firmware supports msix	: (0)
>>> megaraid_sas 0000:02:0e.0: current msix/online cpus	: (0/4)
>>> megaraid_sas 0000:02:0e.0: RDPQ mode	: (disabled)
>>> megaraid_sas 0000:02:0e.0: controller type	: MR(256MB)
>>> megaraid_sas 0000:02:0e.0: Online Controller Reset(OCR)	: Enabled
>>> megaraid_sas 0000:02:0e.0: Secure JBOD support	: No
>>> megaraid_sas 0000:02:0e.0: NVMe passthru support	: No
>>> megaraid_sas 0000:02:0e.0: FW provided TM TaskAbort/Reset timeout	: 0 secs/0 secs
>>> megaraid_sas 0000:02:0e.0: megasas_init_mfi: fw_support_ieee=0
>>> megaraid_sas 0000:02:0e.0: INIT adapter done
>>> megaraid_sas 0000:02:0e.0: fw state:c0000000
>>> megaraid_sas 0000:02:0e.0: Jbod map is not supported megasas_setup_jbod_map 5388
>>> megaraid_sas 0000:02:0e.0: fwstate:c0000000, dis_OCR=0
>>> megaraid_sas 0000:02:0e.0: MR_DCMD_PD_LIST_QUERY not supported by firmware
>>> megaraid_sas 0000:02:0e.0: DCMD not supported by firmware - megasas_ld_list_query 4590
>>> megaraid_sas 0000:02:0e.0: pci id		: (0x1028)/(0x0015)/(0x1028)/(0x1f03)
>>> megaraid_sas 0000:02:0e.0: unevenspan support	: no
>>> megaraid_sas 0000:02:0e.0: firmware crash dump	: no
>>> megaraid_sas 0000:02:0e.0: jbod sync map		: no
>> 
>> I'm also only getting about 5MB/s sustained write speed, which is
>> pathetic. I have lots of servers with normal sata cards, software raid,
>> and I get 50 to 100MB/s normally.
>> I'm hoping the Perc 5/i card is not _that_ bad?  See below.
>> md0 : active raid1 sde1[4] sdb1[1] sdd1[3] sda1[0] sdc1[2] sdf1[5]
>>     975872 blocks super 1.2 [6/6] [UUUUUU]
>> md1 : active raid6 sde3[4] sdb3[1] sdd3[3] sdf3[5] sda3[0] sdc3[2]
>>     419164160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
>> 
>> md2 : active raid6 sde5[4] sdb5[1] sdf5[5] sdd5[3] sdc5[2] sda5[0]
>>     1677193216 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
>>     bitmap: 1/4 pages [4KB], 65536KB chunk
>> 
>> md3 : active raid5 sde6[4] sdb6[1] sdd6[3] sdf6[6] sdc6[2] sda6[0]
>>     7118330880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/5] [UUUUU_]
>>     [=>...................]  recovery =  7.7% (109702192/1423666176) finish=5790.5min speed=3781K/sec
>>     bitmap: 0/11 pages [0KB], 65536KB chunk
>> 
>> If I access drives plugged directly into the motherboard's sata port, I
>> get perfect speed. I've also added an SSD with bcache to frontload one
>> of the raid arrays that is so slow, and sure enough, it becomes usuable.
>> When my system is slow as crap due to this issue, I can do full speed
>> I/O to a different drive plugged into the motherboard's Sata chip (but due 
>> to the case, the drive is actually sitting on the motherboard, there is 
>> nowhere to mount it).
>> 
>> The main problem is all my raids are using the same 6 devices, so if
>> anything spams them with a huge queue, I/O is completely starved for the
>> other devices.
>> The terrible write performance, which on top of being bad, prevents
>> pretty much any other I/O to those drives.
>> 
>> After an unclean shutdown explained below, a resync on the same drives but the other 2 raid arrays,
>> is much faster and does not make the system unresponsive.
>> md1 : active raid6 sda3[0] sdb3[1] sdf3[5] sdc3[2] sde3[4] sdd3[3]
>>     419164160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
>>     [==================>..]  resync = 91.1% (95553272/104791040) finish=1.7min speed=86952K/sec
>> 
>> 
>> If I start the recovery or a big copy/rsync towards md2, things get so slow that other IO
>> hangs for multiple seconds or even 2mn or more sometimes. Yes, that was the stock debian 
>> kernel, but similar problems with 5.1.21:
>>> [13900.007277] INFO: task sendmail:30862 blocked for more than 120 seconds.
>>> [13900.030181]       Not tainted 4.19.0-5-amd64 #1 Debian 4.19.37-5+deb10u2
>>> [13900.053131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [13900.078495] sendmail        D    0 30862  30812 0x00000000
>>> [13900.099272] Call Trace:
>>> [13900.113941]  ? __schedule+0x2a2/0x870
>>> [13900.131022]  ? lookup_fast+0xc8/0x2e0
>>> [13900.148085]  schedule+0x28/0x80
>>> [13900.163959]  rwsem_down_write_failed+0x183/0x3a0
>>> [13900.182741]  ? inode_permission+0xbe/0x180
>>> [13900.200431]  call_rwsem_down_write_failed+0x13/0x20
>>> [13900.219731]  down_write+0x29/0x40
>>> [13900.235849]  path_openat+0x615/0x15c0
>>> [13900.252665]  ? mem_cgroup_commit_charge+0x7a/0x560
>>> [13900.271680]  do_filp_open+0x93/0x100
>>> [13900.288163]  ? __check_object_size+0x15d/0x189
>>> [13900.306276]  do_sys_open+0x186/0x210
>>> [13900.322529]  do_syscall_64+0x53/0x110
>>> [13900.338867]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [13900.358047] RIP: 0033:0x7fa715212c8b
>>> [13900.374306] Code: Bad RIP value.
>>> [13900.389850] RSP: 002b:00007ffc26ba42a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
>>> [13900.414289] RAX: ffffffffffffffda RBX: 00005584ee809978 RCX: 00007fa715212c8b
>>> [13900.437957] RDX: 00000000000000c2 RSI: 00005584ee8198f0 RDI: 00000000ffffff9c
>>> [13900.461660] RBP: 00005584ee8198f0 R08: 0000000000007fdd R09: 0000000000000000
>>> [13900.485361] R10: 00000000000001a0 R11: 0000000000000246 R12: 0000000000000000
>>> [13900.509096] R13: 0000000000000000 R14: 000000000000000a R15: 0000000000000000
>> 
>> I know I can slow down raid recovery speed, to be able to use the system I actually have to do this:
>> echo 1000 > /proc/sys/dev/raid/speed_limit_min
>> of course, at 1MB/s, it will take weeks to resync...
>> 
>> At this point, you could ask if my drives are ok speed wise, and we already have the raid6 resync
>> I showed above at over 80MB/s
>> 
>> I did some basic I/O read-write tests when the resync wasn't running:
>>> dd if=/dev/mdx of=/dev/null bs=1M count=40000
>>> f=/var/space/test; dd if=/dev/zero of=$f bs=1M count=3000 conv=fdatasync; \rm $f
>>> 
>>> dd read test: /dev/md0 419430400 bytes (419 MB, 400 MiB) copied, 3.13387 s, 134 MB/s, hdparm -t 208.18MB/s
>>> dd write test: 104857600 bytes (105 MB, 100 MiB) copied, 16.1961 s, 6.5 MB/s
>>> 
>>> /dev/md1 419430400 bytes (419 MB, 400 MiB) copied, 1.58549 s, 265 MB/s, hdparm -t 335.11MB/s
>>> 3145728000 bytes (3.1 GB, 2.9 GiB) copied, 6.51223 s, 483 MB/s
>>> 
>>> /dev/md2 419430400 bytes (419 MB, 400 MiB) copied, 1.75172 s, 239 MB/s, hdparm -t 256.08MB/s
>>> 3145728000 bytes (3.1 GB, 2.9 GiB) copied, 5.25801 s, 598 MB/s
>>> 
>>> /dev/md3 419430400 bytes (419 MB, 400 MiB) copied, 1.81613 s, 231 MB/s, hdparm -t 382.33MB/s
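>> 
>> (A variant with the page cache taken out of the picture would look something
>> like this; the sizes and the scratch path are arbitrary:
>>   # direct reads from the array, bypassing the page cache
>>   dd if=/dev/md2 of=/dev/null bs=1M count=4000 iflag=direct
>>   # direct writes to a scratch file on the filesystem on top of it
>>   dd if=/dev/zero of=/var/space/ddtest bs=1M count=1000 oflag=direct; rm /var/space/ddtest
>> )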
>> 
>> Then, when the resync is running at a mere 4MB/s and apparently hogging all the available I/O:
>> newmagic:~# for i in md0 md1 md2 md3; do hdparm -t /dev/$i; done
>> 
>> /dev/md0:
>> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
>> Timing buffered disk reads: 190 MB in  3.00 seconds =  63.26 MB/sec
>> 
>> /dev/md1:
>> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
>> Timing buffered disk reads:   4 MB in  3.21 seconds =   1.25 MB/sec
>>
>> /dev/md2:
>> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
>> Timing buffered disk reads:   6 MB in  9.08 seconds = 676.33 kB/sec
>> 
>> /dev/md3:
>> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
>> Timing buffered disk reads:   2 MB in 36.15 seconds =  56.65 kB/sec
>> 
>> 
>> I also may have found a bug in software raid during shutdown:
>>> [14847.171978] sysrq: SysRq : Power Off
>>> [14852.341924] WARNING: CPU: 0 PID: 2530 at drivers/md/md.c:8180 md_write_inc+0x15/0x40 [md_mod]
>>> [14852.359192] Modules linked in: fuse ufs qnx4 hfsplus hfs minix vfat msdos fat jfs xfs dm_mod cpuid ipt_MASQUERADE ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack nf_log_ipv4 nf_log_common xt_LOG nft_compat nft_counter nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_chain_route_ipv4 nf_tables nfnetlink binfmt_misc ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 ipmi_ssif radeon coretemp ttm drm_kms_helper kvm drm evdev dcdbas iTCO_wdt iTCO_vendor_support serio_raw irqbypass sg pcspkr rng_core i2c_algo_bit ipmi_si i5000_edac ipmi_devintf i5k_amb ipmi_msghandler button ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress xxhash raid10 raid0 multipath linear sata_sil24 e1000e r8169 realtek libphy mii uas usb_storage
>>> [14852.502352]  raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid1 raid6_pq libcrc32c crc32c_generic hid_generic bcache crc64 usbhid md_mod hid ses enclosure sr_mod scsi_transport_sas cdrom sd_mod ata_generic uhci_hcd ehci_pci ehci_hcd ata_piix libata psmouse lpc_ich megaraid_sas usbcore scsi_mod usb_common bnx2
>>> [14852.562340] CPU: 0 PID: 2530 Comm: sendmail Not tainted 4.19.0-5-amd64 #1 Debian 4.19.37-5+deb10u2
>>> [14852.580463] Hardware name: Dell Inc. PowerEdge 2950/0DT021, BIOS 2.7.0 10/30/2010
>>> [14852.595607] RIP: 0010:md_write_inc+0x15/0x40 [md_mod]
>>> [14852.605820] Code: ff e8 9f 54 32 f3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 66 66 66 66 90 f6 46 10 01 74 1b 8b 97 c4 01 00 00 85 d2 74 12 <0f> 0b 48 8b 87 e0 02 00 00 a8 03 75 0e 65 48 ff 00 c3 8b 47 40 85
>>> [14852.643807] RSP: 0000:ffffb1c287767ac0 EFLAGS: 00010002
>>> [14852.654378] RAX: ffff9615c93a4cf8 RBX: ffff9615c93a4910 RCX: 0000000000000001
>>> [14852.668807] RDX: 0000000000000001 RSI: ffff96162aa17f00 RDI: ffff961625000000
>>> [14852.683235] RBP: ffff9615c93a4978 R08: 0000000000000000 R09: ffff961624c3a918
>>> [14852.697661] R10: 0000000000000000 R11: ffff961625a1f800 R12: 0000000000000001
>>> [14852.712089] R13: 0000000000000001 R14: ffff961623b6e000 R15: ffff96162aa17f00
>>> [14852.726518] FS:  00007f2ca54d3f40(0000) GS:ffff96162fa00000(0000) knlGS:0000000000000000
>>> [14852.742891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [14852.754505] CR2: 00007f8e63e501c0 CR3: 000000031fb6e000 CR4: 00000000000006f0
>>> [14852.768931] Call Trace:
>>> [14852.773891]  add_stripe_bio+0x205/0x7c0 [raid456]
>>> [14852.783405]  raid5_make_request+0x1bd/0xb60 [raid456]
>>> [14852.793619]  ? finish_wait+0x80/0x80
>>> [14852.800851]  ? finish_wait+0x80/0x80
>>> [14852.808093]  md_handle_request+0x119/0x190 [md_mod]
>>> [14852.817964]  md_make_request+0x78/0x160 [md_mod]
>>> [14852.827311]  generic_make_request+0x1a4/0x410
>>> [14852.836116]  submit_bio+0x45/0x140
>>> [14852.842991]  ? guard_bio_eod+0x32/0x100
>>> [14852.850747]  submit_bh_wbc+0x163/0x190
>>> [14852.858377]  write_all_supers+0x22f/0xa60 [btrfs]
>>> [14852.867905]  btrfs_commit_transaction+0x581/0x870 [btrfs]
>>> [14852.878819]  ? finish_wait+0x80/0x80
>>> [14852.886071]  btrfs_sync_file+0x380/0x3d0 [btrfs]
>>> [14852.895415]  do_fsync+0x38/0x70
>>> [14852.901764]  __x64_sys_fsync+0x10/0x20
>>> [14852.909342]  do_syscall_64+0x53/0x110
>>> [14852.916742]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [14852.926952] RIP: 0033:0x7f2ca6944a71
>>> [14852.934185] Code: 6d a5 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b1 0f 1f 80 00 00 00 00 8b 05 da e9 00 00 85 c0 75 16 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f c3 66 0f 1f 44 00 00 53 89 fb 48 83 ec 10
>>> [14852.972172] RSP: 002b:00007fffe32a0368 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
>>> [14852.987483] RAX: ffffffffffffffda RBX: 000056297ca540d0 RCX: 00007f2ca6944a71
>>> [14853.001908] RDX: 0000000000000000 RSI: 000056297ca541b0 RDI: 0000000000000004
>>> [14853.016334] RBP: 00000000000001d7 R08: 000056297ca541b0 R09: 00007f2ca54d3f40
>>> [14853.030760] R10: 7541203831202c6e R11: 0000000000000246 R12: 000056297bfbe369
>>> [14853.045189] R13: 00007fffe32a03b0 R14: 000000000000000a R15: 0000000000000000
>>> [14853.059617] ---[ end trace 407005be9d52ae9f ]---
>>> [14854.715315] md: md3: recovery interrupted.
>>> [14877.083807] bcache: bcache_reboot() Stopping all devices:
>>> [14879.097334] bcache: bcache_reboot() Timeout waiting for devices to be closed
>>> [14879.111948] sd 4:0:0:0: [sdh] Synchronizing SCSI cache
>>> [14879.122617] sd 4:0:0:0: [sdh] Stopping disk
>>> [14879.615609] sd 3:0:0:0: [sdg] Synchronizing SCSI cache
>>> [14879.626667] sd 3:0:0:0: [sdg] Stopping disk
>>> [14881.520158] sd 0:2:2:0: [sdc] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14881.538216] sd 0:2:2:0: [sdc] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
>>> [14881.553614] print_req_error: I/O error, dev sdc, sector 320282600
>>> [14881.566001] sd 0:2:4:0: [sde] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14881.583982] sd 0:2:4:0: [sde] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
>>> [14881.599303] print_req_error: I/O error, dev sde, sector 320282600
>>> [14881.611638] sd 0:2:5:0: [sdf] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14881.629587] sd 0:2:5:0: [sdf] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
>>> [14881.661536] print_req_error: I/O error, dev sdf, sector 320282600
>>> [14881.690648] sd 0:2:5:0: [sdf] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14881.725455] sd 0:2:5:0: [sdf] tag#684 CDB: Write(10) 2a 00 13 17 20 00 00 02 80 00
>>> [14881.757640] print_req_error: I/O error, dev sdf, sector 320282624
>>> [14881.786840] sd 0:2:3:0: [sdd] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14881.821202] sd 0:2:3:0: [sdd] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
>>> [14881.852497] print_req_error: I/O error, dev sdd, sector 320282600
>>> [14881.880392] sd 0:2:0:0: [sda] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14881.913429] sd 0:2:0:0: [sda] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
>>> [14881.943303] print_req_error: I/O error, dev sda, sector 320282600
>>> [14881.969675] sd 0:2:1:0: [sdb] tag#684 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14882.001626] sd 0:2:1:0: [sdb] tag#684 CDB: Write(10) 2a 00 13 17 1f e8 00 00 08 00
>>> [14882.030904] print_req_error: I/O error, dev sdb, sector 320282600
>>> [14882.057411] sd 0:2:4:0: [sde] tag#299 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14882.088845] sd 0:2:4:0: [sde] tag#299 CDB: Write(10) 2a 00 13 17 20 00 00 02 80 00
>>> [14882.117051] print_req_error: I/O error, dev sde, sector 320282624
>>> [14882.142352] sd 0:2:5:0: [sdf] tag#299 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14882.142430] sd 0:2:4:0: [sde] tag#300 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> [14882.173313] sd 0:2:5:0: [sdf] tag#299 CDB: Write(10) 2a 00 13 17 22 80 00 01 80 00
>>> [14882.173315] print_req_error: I/O error, dev sdf, sector 320283264
>>> [14882.257818] sd 0:2:4:0: [sde] tag#300 CDB: Write(10) 2a 00 13 17 22 80 00 01 80 00
>>> [14882.286196] print_req_error: I/O error, dev sde, sector 320283264
>>> [14882.372678] md: super_written gets error=10
>>> [14882.394226] md/raid:md2: Disk failure on sdc5, disabling device.
>>> [14882.394226] md/raid:md2: Operation continuing on 5 devices.
>>> [14882.396634] md: super_written gets error=10
>>> [14882.443706] md: super_written gets error=10
>>> [14882.465231] md/raid:md2: Disk failure on sde5, disabling device.
>>> [14882.465231] md/raid:md2: Operation continuing on 4 devices.
>>> [14885.396071] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.423090] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.450404] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.476946] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.503344] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.530389] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.563027] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.589494] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.615995] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.642142] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.667968] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.693224] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.717937] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.743191] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.767407] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14885.792214] bcache: bch_count_backing_io_errors() md2: IO error on backing device, unrecoverable
>>> [14890.416424] btrfs_dev_stat_print_on_error: 1409 callbacks suppressed
>>> [14890.416429] BTRFS error (device bcache0): bdev /dev/bcache0 errs: wr 1417, rd 0, flush 0, corrupt 0, gen 0
>>> [14890.460838] BTRFS error (device bcache0): bdev /dev/bcache0 errs: wr 1418, rd 0, flush 0, corrupt 0, gen 0
>>> [14890.486347] BTRFS error (device bcache0): bdev /dev/bcache0 errs: wr 1419, rd 0, flush 0, corrupt 0, gen 0
>>> [14890.511308] BTRFS error (device bcache0): bdev /dev/bcache0 errs: wr 1420, rd 0, flush 0, corrupt 0, gen 0
>>> [14890.536129] Emergency Sync complete
>>> [14891.398791] ACPI: Preparing to enter system sleep state S5
>>> [14891.460410] reboot: Power down
>>> [14891.471830] acpi_power_off called
>> 
>> 
>> Output of megacli -LdPdInfo -a0 for the first drive is below.
>>> Number of Virtual Disks: 6
>>> Virtual Drive: 0 (Target Id: 0)
>>> Name                :0
>>> RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
>>> Size                : 1.818 TB
>>> Sector Size         : 512
>>> Parity Size         : 0
>>> State               : Optimal
>>> Strip Size          : 64 KB
>>> Number Of Drives    : 1
>>> Span Depth          : 1
>>> Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
>>> Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
>>> Default Access Policy: Read/Write
>>> Current Access Policy: Read/Write
>>> Disk Cache Policy   : Disabled
>>> Encryption Type     : None
>>> Is VD Cached: No
>>> Number of Spans: 1
>>> Span: 0 - Number of PDs: 1
>>> 
>>> PD: 0 Information
>>> Enclosure Device ID: 8
>>> Slot Number: 0
>>> Drive's position: DiskGroup: 0, Span: 0, Arm: 0
>>> Enclosure position: N/A
>>> Device Id: 0
>>> WWN: 
>>> Sequence Number: 2
>>> Media Error Count: 0
>>> Other Error Count: 1
>>> Predictive Failure Count: 0
>>> Last Predictive Failure Event Seq Number: 0
>>> PD Type: SATA
>>> 
>>> Raw Size: 1.819 TB [0xe8e088b0 Sectors]
>>> Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
>>> Coerced Size: 1.818 TB [0xe8d00000 Sectors]
>>> Sector Size:  0
>>> Firmware state: Online, Spun Up
>>> Device Firmware Level: AB50
>>> Shield Counter: 0
>>> Successful diagnostics completion on :  N/A
>>> SAS Address(0):
>>> 0x1221000000000000
>>> Connected Port Number: 0 
>>> Inquiry Data:      WD-WMAZA0374092WDC WD20EARS-00MVWB0                    50.0AB50
>>> FDE Capable: Not Capable
>>> FDE Enable: Disable
>>> Secured: Unsecured
>>> Locked: Unlocked
>>> Needs EKM Attention: No
>>> Foreign State: None 
>>> Device Speed: Unknown 
>>> Link Speed: Unknown 
>>> Media Type: Hard Disk Device
>>> Drive Temperature : N/A
>>> PI Eligibility:  No 
>>> Drive is formatted for PI information:  No
>>> PI: No PI
>>> Port-0 :
>>> Port status: Active
>>> Port's Linkspeed: Unknown 
>>> Drive has flagged a S.M.A.R.T alert : No
>> 
>> Thanks,
>> Marc
>> -- 
>> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>> Microsoft is to operating systems ....
>>                                     .... what McDonalds is to gourmet cooking
>> Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08



* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19 11:42 ` o1bigtenor
@ 2019-08-19 16:24   ` Marc MERLIN
  2019-08-20  5:49   ` Marc MERLIN
  1 sibling, 0 replies; 11+ messages in thread
From: Marc MERLIN @ 2019-08-19 16:24 UTC (permalink / raw)
  To: o1bigtenor; +Cc: linux-block, Linux-RAID

On Mon, Aug 19, 2019 at 06:42:23AM -0500, o1bigtenor wrote:
> On Mon, Aug 19, 2019 at 2:35 AM Marc MERLIN <marc@merlins.org> wrote:
> >
> > (Please Cc me on replies so that I can see them more quickly)
> >
> > Dear Block Folks,
> >
> > I just inherited a Dell 2950 with a Perc 5/i.
> > I really don't want to use that Perc 5/i card, but from all the reading
> > I did, there is no IT/unraid mode for it, so I was stuck setting the 6
> > 2TB drives as 6 independent raid0 drives using the card.
> > I wish I could just bypass the card and connect the drives directly to a
> > sata card, but the case and backplane do not seem to make this possible.
> 
> Not to discourage you from a possibly interesting and fruitful endeavor,
> but when I bought myself a slightly newer Dell server I traded out the
> PERC card for a newer version (model 700 IIRC), and then things
> were quite a bit different. Said board, bought used, wasn't very
> expensive. YMMV

Thanks for that suggestion. Indeed, I'd like nothing more than to get rid of
that Perc 5/i card, even if it can't be as bad as what I'm seeing.
That said, from some reading, an H700 isn't just a drop-in replacement; it
doesn't use the same cables from what I can tell.

https://www.serversupply.com/products/part_search/pid_lookup.asp?pid=312363&gclid=CjwKCAjwkenqBRBgEiwA-bZVtsIuJabv8vB5F8teo0XxgozYWxwNCS7N5Ar1fVQjvaBkRsQtelRlBhoC3f0QAvD_BwE
Perc 6/i does use the same cables, but it's barely a better card than 5/i
https://www.newegg.com/p/14G-000T-001F5?item=9SIA9AX8NB4437&source=region&nm_mc=knc-googlemkp-pc&cm_mmc=knc-googlemkp-pc-_-pla-splus+technologies-_-hard+drive+controllers+%2f+raid+cards-_-9SIA9AX8NB4437&gclid=CjwKCAjwkenqBRBgEiwA-bZVtnNFqB4fWVPzKkZn_utZwhgYrnBDyKRrafTgRdX2AlHK9NiXr4sxGxoCDqgQAvD_BwE&gclsrc=aw.ds

Best,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19  9:18 ` Paolo Valente
  2019-08-19 12:02   ` Paolo Valente
@ 2019-08-19 16:40   ` Marc MERLIN
  2019-08-19 17:05     ` Paolo Valente
  1 sibling, 1 reply; 11+ messages in thread
From: Marc MERLIN @ 2019-08-19 16:40 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, linux-raid

On Mon, Aug 19, 2019 at 11:18:13AM +0200, Paolo Valente wrote:
> Solving this kind of problem is one of the goals of the BFQ I/O scheduler [1].
> Have you tried?  If you want to, then start by switching to BFQ in both the
> physical and the virtual block devices in your stack.
 
I sure was not aware of it, thank you for pointing it out.

> Thanks,
> Paolo
> 
> [1] https://algo.ing.unimo.it/people/paolo/BFQ/

I did the following below and when the swraid is rebuilding, I'm still
getting terrible overall throughput:
newmagic:~# hdparm -t /dev/md2
/dev/md2:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
  Timing buffered disk reads:   2 MB in  5.76 seconds = 355.42 kB/sec

I think things hang a bit less, which I suppose is good, but the system is
still unusable overall.

 
newmagic:~# modprobe bfq
newmagic:~# for i in /sys/block/*/queue/scheduler; do echo $i; echo bfq > $i; cat $i; done
/sys/block/bcache0/queue/scheduler
none
/sys/block/md0/queue/scheduler
none
/sys/block/md1/queue/scheduler
none
/sys/block/md2/queue/scheduler
none
/sys/block/md3/queue/scheduler
none                     
/sys/block/sda/queue/scheduler
[bfq] none
/sys/block/sdb/queue/scheduler
[bfq] none
/sys/block/sdc/queue/scheduler
[bfq] none
/sys/block/sdd/queue/scheduler
[bfq] none
/sys/block/sde/queue/scheduler
[bfq] none
/sys/block/sdf/queue/scheduler
[bfq] none
/sys/block/sdg/queue/scheduler
[bfq] none
/sys/block/sdh/queue/scheduler
[bfq] none
/sys/block/sdi/queue/scheduler
[bfq] none
/sys/block/sr0/queue/scheduler
[bfq] none
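
The md* and bcache0 entries showing "none" should be expected, I believe: they
are stacked bio-based devices without a request queue elevator of their own, so
the scheduler choice only takes effect on the underlying sd* queues. If bfq does
end up helping, a udev rule along these lines should make it stick across
reboots (the file name is arbitrary; this is just a sketch):

cat > /etc/udev/rules.d/60-ioscheduler.rules << 'EOF'
# use bfq on all rotational sd* disks as they appear
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
EOF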


Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19 16:40   ` Marc MERLIN
@ 2019-08-19 17:05     ` Paolo Valente
  2019-08-19 17:26       ` Marc MERLIN
  0 siblings, 1 reply; 11+ messages in thread
From: Paolo Valente @ 2019-08-19 17:05 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-block, linux-raid



> Il giorno 19 ago 2019, alle ore 18:40, Marc MERLIN <marc@merlins.org> ha scritto:
> 
> On Mon, Aug 19, 2019 at 11:18:13AM +0200, Paolo Valente wrote:
>> Solving this kind of problem is one of the goals of the BFQ I/O scheduler [1].
>> Have you tried?  If you want to, then start by switching to BFQ in both the
>> physical and the virtual block devices in your stack.
> 
> I sure was not aware of it, thank you for pointing it out.
> 
>> Thanks,
>> Paolo
>> 
>> [1] https://algo.ing.unimo.it/people/paolo/BFQ/
> 
> I did the following below and when the swraid is rebuilding, I'm still
> getting terrible overall throughput:
> newmagic:~# hdparm -t /dev/md2
> /dev/md2:
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
>  Timing buffered disk reads:   2 MB in  5.76 seconds = 355.42 kB/sec
> 
> I think things hang a bit less, which I suppose is good, but the system is
> still unusable overall.
> 

Ok, I'm sorry it didn't help.  Unless someone spots the problem
somewhere outside BFQ, I'm willing to analyze this apparently tough
scenario, as an opportunity to improve BFQ.  If that's fine with you, just
contact me offline.

Good luck!
Paolo


> 
> newmagic:~# modprobe bfq
> newmagic:~# for i in /sys/block/*/queue/scheduler; do echo $i; echo bfq > $i; cat $i; done
> /sys/block/bcache0/queue/scheduler
> none
> /sys/block/md0/queue/scheduler
> none
> /sys/block/md1/queue/scheduler
> none
> /sys/block/md2/queue/scheduler
> none
> /sys/block/md3/queue/scheduler
> none                     
> /sys/block/sda/queue/scheduler
> [bfq] none
> /sys/block/sdb/queue/scheduler
> [bfq] none
> /sys/block/sdc/queue/scheduler
> [bfq] none
> /sys/block/sdd/queue/scheduler
> [bfq] none
> /sys/block/sde/queue/scheduler
> [bfq] none
> /sys/block/sdf/queue/scheduler
> [bfq] none
> /sys/block/sdg/queue/scheduler
> [bfq] none
> /sys/block/sdh/queue/scheduler
> [bfq] none
> /sys/block/sdi/queue/scheduler
> [bfq] none
> /sys/block/sr0/queue/scheduler
> [bfq] none
> 
> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                      .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/  



* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19 17:05     ` Paolo Valente
@ 2019-08-19 17:26       ` Marc MERLIN
  0 siblings, 0 replies; 11+ messages in thread
From: Marc MERLIN @ 2019-08-19 17:26 UTC (permalink / raw)
  To: Paolo Valente; +Cc: o1bigtenor, linux-block, Linux-RAID

On Mon, Aug 19, 2019 at 07:05:42PM +0200, Paolo Valente wrote:
> > newmagic:~# hdparm -t /dev/md2
> > /dev/md2:
> > HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> >  Timing buffered disk reads:   2 MB in  5.76 seconds = 355.42 kB/sec
> > 
> > I think things hang a bit less, which I suppose it good, but the system is
> > still unusable overall.
> 
> Ok, I'm sorry it didn't help.  Unless someone spots the problem
> somewhere outside BFQ, I'm willing to analyze this apparently tough
> scenario, as an opportunity to improve BFQ.  If that's fine with you, just
> contact me offline.

I'm sure there is something very wrong somewhere, and that it's not BFQ's
fault. I just haven't been able to pinpoint the problem.

I ended up finding an H700 card with cables that should fit, so I'm going to
try that first and see what happens; thanks o1bigtenor for the suggestion.

Linux-raid folks, the original post still has a warning likely worth looking
into:
[14852.341924] WARNING: CPU: 0 PID: 2530 at drivers/md/md.c:8180 md_write_inc+0x15/0x40 [md_mod]
which in turn put the array in dirty mode.

Best,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19  7:08 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod Marc MERLIN
  2019-08-19  9:18 ` Paolo Valente
  2019-08-19 11:42 ` o1bigtenor
@ 2019-08-19 18:37 ` Roman Mamedov
  2019-08-19 19:16   ` Marc MERLIN
  2 siblings, 1 reply; 11+ messages in thread
From: Roman Mamedov @ 2019-08-19 18:37 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-block, linux-raid

On Mon, 19 Aug 2019 00:08:23 -0700
Marc MERLIN <marc@merlins.org> wrote:

> > Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
> > Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
> > Default Access Policy: Read/Write
> > Current Access Policy: Read/Write
> > Disk Cache Policy   : Disabled

So does it have a BBU? (Battery backup unit)
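(Something like "megacli -AdpBbuCmd -GetBbuStatus -a0" should show its state,
if I remember the MegaCli syntax correctly.)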

> I tried to disable the card's write cache to let linux and its 32GB of
> RAM, do it better, but I didn't see a real improvement:

I'd expect that, on the contrary, you should look for ways to enable it, and
even force-enable it without a BBU (if the card lacks one), because it feels
like what you did was disable the disks' own write buffering, and not (only) the
card's!
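Something along these lines should do it with megacli, I think (the exact option
spelling varies a bit between MegaCli versions, so treat this as a sketch):
  # re-enable the controller's write-back cache on all logical drives
  megacli -LDSetProp WB -Lall -a0
  # keep write-back even if the BBU is bad or missing (unsafe on power loss)
  megacli -LDSetProp CachedBadBBU -Lall -a0
  # re-enable the disks' own write cache, which was turned off earlier
  megacli -LDSetProp -EnDskCache -Lall -a0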

What you are observing seems to me like what "dd" does with "oflag=dsync" (and
the comparable performance it gets). It definitely feels like it's in some
"extra safe mode" where, say, every 4KB piece of data leads to a full flush to
disk before the next 4KB is accepted.
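A quick way to check for that pattern would be something like the following
(the sizes and the scratch path are arbitrary):
  # buffered write, flushed once at the end
  dd if=/dev/zero of=/var/space/t1 bs=1M count=500 conv=fdatasync
  # O_DSYNC write: every 1MB block must reach stable storage before the next
  dd if=/dev/zero of=/var/space/t2 bs=1M count=500 oflag=dsync
  rm /var/space/t1 /var/space/t2
If a healthy setup shows a big gap between the two, but the loaded array behaves
like the dsync case even without dsync, that would point at something forcing
per-request flushes.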

More things to try: check if it's possible to set up the disks not as 1-member
RAID0, but as 1-member "linear" ("JBOD"), or even 1-member RAID1; who knows,
maybe some of this would work better.

-- 
With respect,
Roman


* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19 18:37 ` Roman Mamedov
@ 2019-08-19 19:16   ` Marc MERLIN
  0 siblings, 0 replies; 11+ messages in thread
From: Marc MERLIN @ 2019-08-19 19:16 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-block, linux-raid

On Mon, Aug 19, 2019 at 11:37:09PM +0500, Roman Mamedov wrote:
> > > Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
> > > Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
> > > Default Access Policy: Read/Write
> > > Current Access Policy: Read/Write
> > > Disk Cache Policy   : Disabled
> 
> So does it have a BBU? (Battery backup unit)
 
Yes, it does, and it's working.
But with write caching enabled, the card seemed to take all the writes from the
raid rebuild into a big queue, starving I/O for other requests that I wanted to
happen "right now" (like /bin/ls actually being loaded and run). Between that
and reading about how the Perc 5/i is a crap card, I turned off its I/O caching,
leaving the work to Linux's block cache and the 32GB of RAM in the server,
which is mostly allocated to disk I/O caching.
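(Related knobs that might keep that write backlog small are the kernel's dirty
writeback limits -- the numbers below are purely illustrative:
  # start background writeback after ~256MB of dirty data
  sysctl -w vm.dirty_background_bytes=$((256*1024*1024))
  # block writers once ~1GB is dirty, instead of the default percentage of RAM
  sysctl -w vm.dirty_bytes=$((1024*1024*1024))
)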

> > I tried to disable the card's write cache to let linux and its 32GB of
> > RAM, do it better, but I didn't see a real improvement:
> 
> I'd expect that, on the contrary, you should look for ways to enable it, and
> even force-enable it without a BBU (if the card lacks one), because it feels
> like what you did was disable the disks' own write buffering, and not (only) the
> card's!

Yes, you may be correct on that. I can re-enable it, but it was terrible
with it on, too.

> What you are observing seems to me like what "dd" does with "oflag=dsync" (and
> the comparable performance it gets). It definitely feels like it's in some
> "extra safe mode" where, say, every 4KB piece of data leads to a full flush to
> disk before the next 4KB is accepted.

That sounds plausible indeed.

> More things to try: check if it's possible to set up the disks not as 1-member
> RAID0, but as 1-member "linear" ("JBOD"), or even 1-member RAID1; who knows,
> maybe some of this would work better.

Assuming I can do this without losing the entire filesystem, I can try, but
if the card can't handle single-drive raid0 properly, I doubt changing to
single-drive raid1 would make things much better.
Then again, once you're hitting things that aren't working as they should...

I have an H700 in the mail that should arrive tonight; I'll try swapping
that in first and see what happens.

Thanks for the answer,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


* Re: 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod
  2019-08-19 11:42 ` o1bigtenor
  2019-08-19 16:24   ` Marc MERLIN
@ 2019-08-20  5:49   ` Marc MERLIN
  1 sibling, 0 replies; 11+ messages in thread
From: Marc MERLIN @ 2019-08-20  5:49 UTC (permalink / raw)
  To: o1bigtenor; +Cc: linux-block, Linux-RAID

On Mon, Aug 19, 2019 at 06:42:23AM -0500, o1bigtenor wrote:
> Not to discourage you from a possibly interesting and fruitful endeavor,
> but when I bought myself a slightly newer Dell server I traded out the
> PERC card for a newer version (model 700 IIRC), and then things
> were quite a bit different. Said board, bought used, wasn't very
> expensive. YMMV

Well, I just received my H700 today and plugged it in. Not everything is perfect
because I'm missing a cable, but sure enough, my rebuild speeds got 8 times
faster and the system stopped hanging all the time when I do any I/O.

I still think there was something fairly unexpected/wrong with the Perc 5/i
beyond it simply being a cheap card, but I have too many other things to worry
about to spend more time finding out :)
That said, if anyone wants the card for the cost of shipping (US only please,
shipping abroad is too much of a pain), let me know and I'll give it to you.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


end of thread, other threads:[~2019-08-20  5:49 UTC | newest]

Thread overview: 11+ messages
2019-08-19  7:08 5.1.21 Dell 2950 terrible swraid5 I/O performance with swraid on top of Perc 5/i raid0/jbod Marc MERLIN
2019-08-19  9:18 ` Paolo Valente
2019-08-19 12:02   ` Paolo Valente
2019-08-19 16:40   ` Marc MERLIN
2019-08-19 17:05     ` Paolo Valente
2019-08-19 17:26       ` Marc MERLIN
2019-08-19 11:42 ` o1bigtenor
2019-08-19 16:24   ` Marc MERLIN
2019-08-20  5:49   ` Marc MERLIN
2019-08-19 18:37 ` Roman Mamedov
2019-08-19 19:16   ` Marc MERLIN
