* MDRAID NVMe performance question, but I don't know what I don't know
@ 2022-01-11 16:03 Finlayson, James M CIV (USA)
2022-01-11 19:40 ` Geoff Back
2022-01-11 20:34 ` Phil Turmel
0 siblings, 2 replies; 5+ messages in thread
From: Finlayson, James M CIV (USA) @ 2022-01-11 16:03 UTC (permalink / raw)
To: linux-raid
Hi,
Sorry, this is a long read; if you want the gist of it, look for "<KEY>" marking the key points. I'm having trouble finding information on troubleshooting mdraid performance issues. The latest rathole I'm going down: I have two identically configured mdraid arrays, one per NUMA node, on a dual-socket AMD Rome with "NUMAs per socket" set to 1 in the BIOS. Things are cranking at a 64K block size, but there is a substantial throughput disparity between NUMA 0's array and NUMA 1's.
[root@hornet04 block]# uname -r
<KEY> 5.15.13-1.el8.elrepo.x86_64
<KEY> [root@hornet04 block]# cat /proc/mdstat (md127 is NUMA 0, md126 is NUMA 1).
Personalities : [raid6] [raid5] [raid4]
md126 : active raid5 nvme22n1p1[10] nvme20n1p1[7] nvme21n1p1[8] nvme18n1p1[5] nvme19n1p1[6] nvme17n1p1[4] nvme15n1p1[3] nvme14n1p1[2] nvme12n1p1[0] nvme13n1p1[1]
135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
bitmap: 0/112 pages [0KB], 65536KB chunk
md127 : active raid5 nvme9n1p1[10] nvme8n1p1[8] nvme7n1p1[7] nvme6n1p1[6] nvme5n1p1[5] nvme3n1p1[3] nvme4n1p1[4] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
bitmap: 0/112 pages [0KB], 65536KB chunk
unused devices: <none>
I'm running identical NUMA-aware fio jobs against each array, but iostat shows the NUMA 0 mdraid outperforming the NUMA 1 mdraid by roughly 12 GB/s:
[root@hornet04 ~]# iostat -xkz 1
avg-cpu: %user %nice %system %iowait %steal %idle
0.20 0.00 3.35 0.00 0.00 96.45
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme2c2n1 72856.00 0.00 4662784.00 0.00 0.00 0.00 0.00 0.00 0.68 0.00 49.50 64.00 0.00 0.01 100.00
nvme3c3n1 73077.00 0.00 4676928.00 0.00 0.00 0.00 0.00 0.00 0.68 0.00 49.94 64.00 0.00 0.01 100.00
nvme4c4n1 73013.00 0.00 4672896.00 0.00 0.00 0.00 0.00 0.00 0.69 0.00 50.35 64.00 0.00 0.01 100.00
<KEY> nvme18c18n1 54384.00 0.00 3480576.00 0.00 0.00 0.00 0.00 0.00 144.80 0.00 7874.85 64.00 0.00 0.02 100.00
nvme5c5n1 72841.00 0.00 4661824.00 0.00 0.00 0.00 0.00 0.00 0.70 0.00 51.01 64.00 0.00 0.01 100.00
nvme7c7n1 72220.00 0.00 4622080.00 0.00 0.00 0.00 0.00 0.00 0.67 0.00 48.61 64.00 0.00 0.01 100.00
nvme22c22n1 54652.00 0.00 3497728.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 34.73 64.00 0.00 0.02 100.00
nvme12c12n1 54756.00 0.00 3504384.00 0.00 0.00 0.00 0.00 0.00 0.66 0.00 36.34 64.00 0.00 0.02 100.00
nvme14c14n1 54517.00 0.00 3489088.00 0.00 0.00 0.00 0.00 0.00 0.65 0.00 35.66 64.00 0.00 0.02 100.00
nvme6c6n1 72721.00 0.00 4654144.00 0.00 0.00 0.00 0.00 0.00 0.68 0.00 49.77 64.00 0.00 0.01 100.00
nvme21c21n1 54731.00 0.00 3502784.00 0.00 0.00 0.00 0.00 0.00 0.67 0.00 36.46 64.00 0.00 0.02 100.00
nvme9c9n1 72661.00 0.00 4650304.00 0.00 0.00 0.00 0.00 0.00 0.71 0.00 51.35 64.00 0.00 0.01 100.00
nvme17c17n1 54462.00 0.00 3485568.00 0.00 0.00 0.00 0.00 0.00 0.66 0.00 36.09 64.00 0.00 0.02 100.00
nvme20c20n1 54463.00 0.00 3485632.00 0.00 0.00 0.00 0.00 0.00 0.66 0.00 36.10 64.00 0.00 0.02 100.10
nvme13c13n1 54910.00 0.00 3514240.00 0.00 0.00 0.00 0.00 0.00 0.61 0.00 33.45 64.00 0.00 0.02 100.00
nvme8c8n1 72622.00 0.00 4647808.00 0.00 0.00 0.00 0.00 0.00 0.67 0.00 48.52 64.00 0.00 0.01 100.00
nvme15c15n1 54543.00 0.00 3490752.00 0.00 0.00 0.00 0.00 0.00 0.61 0.00 33.28 64.00 0.00 0.02 100.00
nvme0c0n1 73215.00 0.00 4685760.00 0.00 0.00 0.00 0.00 0.00 0.67 0.00 49.41 64.00 0.00 0.01 100.00
nvme19c19n1 55034.00 0.00 3522176.00 0.00 0.00 0.00 0.00 0.00 0.67 0.00 36.93 64.00 0.00 0.02 100.10
<KEY> nvme1c1n1 72672.00 0.00 4650944.00 0.00 0.00 0.00 0.00 0.00 106.98 0.00 7774.54 64.00 0.00 0.01 100.00
<KEY> md127 727871.00 0.00 46583744.00 0.00 0.00 0.00 0.00 0.00 11.30 0.00 8221.92 64.00 0.00 0.00 100.00
<KEY> md126 546553.00 0.00 34979392.00 0.00 0.00 0.00 0.00 0.00 14.99 0.00 8194.91 64.00 0.00 0.00 100.10
<KEY> I started chasing the aqu-sz and r_await figures to see whether I have a device issue or whether these are known mdraid "features". When I then went looking for the kernel worker threads, it became apparent that I DON'T KNOW WHAT I'M DOING OR WHAT TO DO NEXT. Any guidance is appreciated. Given that exactly one drive on each NUMA node is showing the bad behavior, I'm reluctant to point the finger at hardware.
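For scale, here is the arithmetic behind that gap, using the md rkB/s values from the iostat capture above (a sketch; iostat -k reports kB/s, so decimal GB/s is kB/s divided by 1,000,000):

```shell
# Read throughput of each array, taken from the iostat rkB/s column above.
md127_rkBs=46583744
md126_rkBs=34979392

# iostat -k reports kB/s; divide by 1e6 for (decimal) GB/s.
md127_gbs=$(awk -v v="$md127_rkBs" 'BEGIN { printf "%.1f", v / 1000000 }')
md126_gbs=$(awk -v v="$md126_rkBs" 'BEGIN { printf "%.1f", v / 1000000 }')
gap=$(awk -v a="$md127_rkBs" -v b="$md126_rkBs" 'BEGIN { printf "%.1f", (a - b) / 1000000 }')

echo "md127: ${md127_gbs} GB/s  md126: ${md126_gbs} GB/s  gap: ${gap} GB/s"
# -> md127: 46.6 GB/s  md126: 35.0 GB/s  gap: 11.6 GB/s
```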
[root@hornet04 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
BIOS Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 3243.803
BogoMIPS: 4491.53
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
<KEY> NUMA node0 CPU(s): 0-63,128-191
<KEY> NUMA node1 CPU(s): 64-127,192-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca
<KEY> When I start doing some basic debugging (I'm not a Linux ninja by far), what throws me is that the workers I suspect belong to md are all running on NUMA node 1. This caught me by surprise. Are there other workers I'm missing?
ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm | egrep 'md|raid' | grep -v systemd | grep -v mlx
PID TID CLS RTPRIO NI PRI NUMA PSR %CPU STAT WCHAN COMMAND
1522 1522 TS - 5 14 1 208 0.0 SN - ksmd
1590 1590 TS - -20 39 1 220 0.0 I< - md
3688 3688 TS - -20 39 1 198 0.0 I< - raid5wq
3693 3693 TS - 0 19 1 234 0.0 S - md126_raid5
3694 3694 TS - 0 19 1 95 0.0 S - md127_raid5
3788 3788 TS - 0 19 1 240 0.0 Ss - lsmdcat /
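A sketch of how one might confirm that placement and experiment with moving the threads (the PIDs and CPU ranges below are the ones from this box, and the raid5wq cpumask knob exists only if the kernel exposes the workqueue in sysfs — both are assumptions to verify locally):

```shell
# Pull the NUMA column (field 7 of the ps format above) for md-related
# threads. A saved sample is used here; on a live box, pipe ps straight in:
#   ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm
ps_sample='1590 1590 TS - -20 39 1 220 0.0 I< - md
3688 3688 TS - -20 39 1 198 0.0 I< - raid5wq
3693 3693 TS - 0 19 1 234 0.0 S - md126_raid5
3694 3694 TS - 0 19 1 95 0.0 S - md127_raid5'

nodes=$(echo "$ps_sample" | awk '{ print $7 }' | sort -u)
echo "NUMA node(s) hosting md threads: $nodes"

# To try pinning md127_raid5 onto node 0's CPUs (0-63,128-191 on this box):
#   taskset -cp 0-63,128-191 3694
# raid5wq is a workqueue, not a plain thread; if the kernel exposes it,
# its placement is steered via /sys/devices/virtual/workqueue/raid5wq/cpumask.
```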
Jim Finlayson
U.S. Department of Defense
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: MDRAID NVMe performance question, but I don't know what I don't know
2022-01-11 16:03 MDRAID NVMe performance question, but I don't know what I don't know Finlayson, James M CIV (USA)
@ 2022-01-11 19:40 ` Geoff Back
2022-01-11 20:31 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2022-01-11 20:34 ` Phil Turmel
1 sibling, 1 reply; 5+ messages in thread
From: Geoff Back @ 2022-01-11 19:40 UTC (permalink / raw)
To: Finlayson, James M CIV (USA), linux-raid
Hi James,
My first thought would be: how sure are you about which physical socket
(and hence NUMA node) each NVME drive is connected to?
Regards,
Geoff.
On 11/01/2022 16:03, Finlayson, James M CIV (USA) wrote:
> Hi,
> Sorry, this is a long read; if you want the gist of it, look for "<KEY>" marking the key points. I'm having trouble finding information on troubleshooting mdraid performance issues. The latest rathole I'm going down: I have two identically configured mdraid arrays, one per NUMA node, on a dual-socket AMD Rome with "NUMAs per socket" set to 1 in the BIOS. Things are cranking at a 64K block size, but there is a substantial throughput disparity between NUMA 0's array and NUMA 1's.
[trim /]
--
Geoff Back
What if we're all just characters in someone's nightmares?
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: [Non-DoD Source] Re: MDRAID NVMe performance question, but I don't know what I don't know
2022-01-11 19:40 ` Geoff Back
@ 2022-01-11 20:31 ` Finlayson, James M CIV (USA)
0 siblings, 0 replies; 5+ messages in thread
From: Finlayson, James M CIV (USA) @ 2022-01-11 20:31 UTC (permalink / raw)
To: 'Geoff Back', linux-raid
Unless I did something completely foolish:
[root@hornet04 ~]# for i in /sys/class/nvme/nvme* ; do echo $i `cat $i/numa_node` `ls -d $i/nvme*` ; done
/sys/class/nvme/nvme0 0 /sys/class/nvme/nvme0/nvme0c0n1
/sys/class/nvme/nvme1 0 /sys/class/nvme/nvme1/nvme1c1n1
/sys/class/nvme/nvme10 0 /sys/class/nvme/nvme10/nvme10c10n1
/sys/class/nvme/nvme11 0 /sys/class/nvme/nvme11/nvme11c11n1
/sys/class/nvme/nvme12 1 /sys/class/nvme/nvme12/nvme12c12n1
/sys/class/nvme/nvme13 1 /sys/class/nvme/nvme13/nvme13c13n1
/sys/class/nvme/nvme14 1 /sys/class/nvme/nvme14/nvme14c14n1
/sys/class/nvme/nvme15 1 /sys/class/nvme/nvme15/nvme15c15n1
/sys/class/nvme/nvme16 1 /sys/class/nvme/nvme16/nvme16c16n1
/sys/class/nvme/nvme17 1 /sys/class/nvme/nvme17/nvme17c17n1
/sys/class/nvme/nvme18 1 /sys/class/nvme/nvme18/nvme18c18n1
/sys/class/nvme/nvme19 1 /sys/class/nvme/nvme19/nvme19c19n1
/sys/class/nvme/nvme2 0 /sys/class/nvme/nvme2/nvme2c2n1
/sys/class/nvme/nvme20 1 /sys/class/nvme/nvme20/nvme20c20n1
/sys/class/nvme/nvme21 1 /sys/class/nvme/nvme21/nvme21c21n1
/sys/class/nvme/nvme22 1 /sys/class/nvme/nvme22/nvme22c22n1
/sys/class/nvme/nvme23 1 /sys/class/nvme/nvme23/nvme23c23n1
/sys/class/nvme/nvme24 1 /sys/class/nvme/nvme24/nvme24c24n1
/sys/class/nvme/nvme3 0 /sys/class/nvme/nvme3/nvme3c3n1
/sys/class/nvme/nvme4 0 /sys/class/nvme/nvme4/nvme4c4n1
/sys/class/nvme/nvme5 0 /sys/class/nvme/nvme5/nvme5c5n1
/sys/class/nvme/nvme6 0 /sys/class/nvme/nvme6/nvme6c6n1
/sys/class/nvme/nvme7 0 /sys/class/nvme/nvme7/nvme7c7n1
/sys/class/nvme/nvme8 0 /sys/class/nvme/nvme8/nvme8c8n1
/sys/class/nvme/nvme9 0 /sys/class/nvme/nvme9/nvme9c9n1
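Tallying that listing per node makes the split easy to eyeball; a sketch with an abbreviated sample (on a live system, feed it the full "controller numa_node" output of the loop above):

```shell
# Abbreviated "controller numa_node" pairs from the listing above.
sample='nvme0 0
nvme1 0
nvme2 0
nvme12 1
nvme13 1
nvme14 1'

# Count controllers per NUMA node.
echo "$sample" \
  | awk '{ count[$2]++ } END { for (n in count) printf "node %s: %d controllers\n", n, count[n] }' \
  | sort
# -> node 0: 3 controllers
#    node 1: 3 controllers
```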
-----Original Message-----
From: Geoff Back <geoff@demonlair.co.uk>
Sent: Tuesday, January 11, 2022 2:40 PM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>; linux-raid@vger.kernel.org
Subject: [Non-DoD Source] Re: MDRAID NVMe performance question, but I don't know what I don't know
Hi James,
My first thought would be: how sure are you about which physical socket (and hence NUMA node) each NVME drive is connected to?
Regards,
Geoff.
On 11/01/2022 16:03, Finlayson, James M CIV (USA) wrote:
> Hi,
> Sorry, this is a long read; if you want the gist of it, look for "<KEY>" marking the key points. I'm having trouble finding information on troubleshooting mdraid performance issues. The latest rathole I'm going down: I have two identically configured mdraid arrays, one per NUMA node, on a dual-socket AMD Rome with "NUMAs per socket" set to 1 in the BIOS. Things are cranking at a 64K block size, but there is a substantial throughput disparity between NUMA 0's array and NUMA 1's.
[trim /]
--
Geoff Back
What if we're all just characters in someone's nightmares?
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: MDRAID NVMe performance question, but I don't know what I don't know
2022-01-11 16:03 MDRAID NVMe performance question, but I don't know what I don't know Finlayson, James M CIV (USA)
2022-01-11 19:40 ` Geoff Back
@ 2022-01-11 20:34 ` Phil Turmel
2022-01-11 20:38 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
1 sibling, 1 reply; 5+ messages in thread
From: Phil Turmel @ 2022-01-11 20:34 UTC (permalink / raw)
To: Finlayson, James M CIV (USA), linux-raid
Hi James,
On 1/11/22 11:03 AM, Finlayson, James M CIV (USA) wrote:
> Hi,
> Sorry, this is a long read; if you want the gist of it, look for "<KEY>" marking the key points. I'm having trouble finding information on troubleshooting mdraid performance issues. The latest rathole I'm going down: I have two identically configured mdraid arrays, one per NUMA node, on a dual-socket AMD Rome with "NUMAs per socket" set to 1 in the BIOS. Things are cranking at a 64K block size, but there is a substantial throughput disparity between NUMA 0's array and NUMA 1's.
[trim /]
Is there any chance your NVMe devices are installed asymmetrically on your PCIe bus(es)?
try:
# lspci -tv
Might be illuminating. In my office server, the PCIe slots are routed
through one of the two sockets. The slots routed through socket 1
simply don't work when the second processor is not installed. Devices
in a socket 0 slot have to route through that CPU when the other CPU
talks to them, and vice versa.
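A sketch of one way to cross-check this from sysfs: map each NVMe controller to its PCI address and the numa_node the kernel recorded for that PCI device (paths follow the usual sysfs layout; verify them on the target box). The sysfs root is a parameter so the function can also be exercised against a fake tree:

```shell
nvme_numa_map() {
    # $1: sysfs root, default /sys (parameterized so it can run on a fake tree)
    local root=${1:-/sys} ctrl dev
    for ctrl in "$root"/class/nvme/nvme*; do
        [ -e "$ctrl" ] || continue
        # The class device's "device" link points at the underlying PCI device.
        dev=$(readlink -f "$ctrl/device")
        printf '%s %s node=%s\n' \
            "$(basename "$ctrl")" "$(basename "$dev")" "$(cat "$dev/numa_node")"
    done
}

# Usage: nvme_numa_map     # prints one "nvmeN <pci-addr> node=N" line per controller
```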
Phil
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: [Non-DoD Source] Re: MDRAID NVMe performance question, but I don't know what I don't know
2022-01-11 20:34 ` Phil Turmel
@ 2022-01-11 20:38 ` Finlayson, James M CIV (USA)
0 siblings, 0 replies; 5+ messages in thread
From: Finlayson, James M CIV (USA) @ 2022-01-11 20:38 UTC (permalink / raw)
To: 'Phil Turmel', linux-raid
[root@hornet04 ~]# lstopo -v | egrep -i 'numa|pci|bridge'
NUMANode L#0 (P#0 local=263873404KB total=263873404KB)
HostBridge L#0 (buses=0000:[00-06])
PCIBridge L#1 (busid=0000:00:01.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[03-03])
PCI L#0 (busid=0000:03:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=15)
PCIBridge L#2 (busid=0000:00:01.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[04-04])
PCI L#1 (busid=0000:04:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=16)
PCIBridge L#3 (busid=0000:00:01.5 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[05-05])
PCI L#2 (busid=0000:05:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=17)
PCIBridge L#4 (busid=0000:00:01.6 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[06-06])
PCI L#3 (busid=0000:06:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=18)
HostBridge L#5 (buses=0000:[20-27])
PCIBridge L#6 (busid=0000:20:01.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[23-23])
PCI L#4 (busid=0000:23:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=7)
PCIBridge L#7 (busid=0000:20:01.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[24-24])
PCI L#5 (busid=0000:24:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=8-1)
PCIBridge L#8 (busid=0000:20:01.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[25-25])
PCI L#6 (busid=0000:25:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=9)
PCIBridge L#9 (busid=0000:20:01.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[26-26])
PCI L#7 (busid=0000:26:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=10-1)
PCIBridge L#10 (busid=0000:20:03.1 id=1022:1483 class=0604(PCIBridge) link=15.75GB/s buses=0000:[27-27])
PCI L#8 (busid=0000:27:00.0 id=15b3:1017 class=0200(Ethernet) link=15.75GB/s PCISlot=1)
PCI L#9 (busid=0000:27:00.1 id=15b3:1017 class=0200(Ethernet) link=15.75GB/s PCISlot=1)
HostBridge L#11 (buses=0000:[40-45])
PCIBridge L#12 (busid=0000:40:01.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[43-43])
PCI L#10 (busid=0000:43:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=3)
PCIBridge L#13 (busid=0000:40:01.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[44-44])
PCI L#11 (busid=0000:44:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=4)
PCIBridge L#14 (busid=0000:40:01.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[45-45])
PCI L#12 (busid=0000:45:00.0 id=15b3:1017 class=0200(Ethernet) link=7.88GB/s PCISlot=10)
PCI L#13 (busid=0000:45:00.1 id=15b3:1017 class=0200(Ethernet) link=7.88GB/s PCISlot=10)
HostBridge L#15 (buses=0000:[60-65])
PCIBridge L#16 (busid=0000:60:03.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[64-64])
PCI L#14 (busid=0000:64:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=1-1)
PCIBridge L#17 (busid=0000:60:03.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[65-65])
PCI L#15 (busid=0000:65:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=2)
PCIBridge L#18 (busid=0000:60:05.2 id=1022:1483 class=0604(PCIBridge) link=0.50GB/s buses=0000:[61-61])
PCI L#16 (busid=0000:61:00.1 id=102b:0538 class=0300(VGA) link=0.50GB/s)
NUMANode L#1 (P#1 local=264165280KB total=264165280KB)
HostBridge L#19 (buses=0000:[a0-a6])
PCIBridge L#20 (busid=0000:a0:03.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[a3-a3])
PCI L#17 (busid=0000:a3:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=31)
PCIBridge L#21 (busid=0000:a0:03.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[a4-a4])
PCI L#18 (busid=0000:a4:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=32)
PCIBridge L#22 (busid=0000:a0:03.5 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[a5-a5])
PCI L#19 (busid=0000:a5:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=33)
PCIBridge L#23 (busid=0000:a0:03.6 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[a6-a6])
PCI L#20 (busid=0000:a6:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=34)
HostBridge L#24 (buses=0000:[c0-c8])
PCIBridge L#25 (busid=0000:c0:01.1 id=1022:1483 class=0604(PCIBridge) link=3.94GB/s buses=0000:[c3-c3])
PCI L#21 (busid=0000:c3:00.0 id=1b4b:2241 class=0108(NVMExp) link=3.94GB/s PCISlot=8)
PCIBridge L#26 (busid=0000:c0:03.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[c5-c5])
PCI L#22 (busid=0000:c5:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=23)
PCIBridge L#27 (busid=0000:c0:03.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[c6-c6])
PCI L#23 (busid=0000:c6:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=24)
PCIBridge L#28 (busid=0000:c0:03.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[c7-c7])
PCI L#24 (busid=0000:c7:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=25)
PCIBridge L#29 (busid=0000:c0:03.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[c8-c8])
PCI L#25 (busid=0000:c8:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=26)
HostBridge L#30 (buses=0000:[e0-e6])
PCIBridge L#31 (busid=0000:e0:03.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[e5-e5])
PCI L#26 (busid=0000:e5:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=21)
PCIBridge L#32 (busid=0000:e0:03.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[e6-e6])
PCI L#27 (busid=0000:e6:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=22)
PCIBridge L#33 (busid=0000:e0:03.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[e3-e3])
PCI L#28 (busid=0000:e3:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=19)
PCIBridge L#34 (busid=0000:e0:03.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[e4-e4])
PCI L#29 (busid=0000:e4:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=20)
-----Original Message-----
From: Phil Turmel <philip@turmel.org>
Sent: Tuesday, January 11, 2022 3:35 PM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>; linux-raid@vger.kernel.org
Subject: [Non-DoD Source] Re: MDRAID NVMe performance question, but I don't know what I don't know
Hi James,
On 1/11/22 11:03 AM, Finlayson, James M CIV (USA) wrote:
> Hi,
> Sorry, this is a long read; if you want the gist of it, look for "<KEY>" marking the key points. I'm having trouble finding information on troubleshooting mdraid performance issues. The latest rathole I'm going down: I have two identically configured mdraid arrays, one per NUMA node, on a dual-socket AMD Rome with "NUMAs per socket" set to 1 in the BIOS. Things are cranking at a 64K block size, but there is a substantial throughput disparity between NUMA 0's array and NUMA 1's.
[trim /]
Is there any chance your NVMe devices are installed asymmetrically on your PCIe bus(es)?
try:
# lspci -tv
Might be illuminating. In my office server, the PCIe slots are routed through one of the two sockets. The slots routed through socket 1 simply don't work when the second processor is not installed. Devices in a socket 0 slot have to route through that CPU when the other CPU talks to them, and vice versa.
Phil
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2022-01-11 20:49 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-11 16:03 MDRAID NVMe performance question, but I don't know what I don't know Finlayson, James M CIV (USA)
2022-01-11 19:40 ` Geoff Back
2022-01-11 20:31 ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2022-01-11 20:34 ` Phil Turmel
2022-01-11 20:38 ` [Non-DoD Source] " Finlayson, James M CIV (USA)