* MDRAID NVMe performance question, but I don't know what I don't know
@ 2022-01-11 16:03 Finlayson, James M CIV (USA)
  2022-01-11 19:40 ` Geoff Back
  2022-01-11 20:34 ` Phil Turmel
  0 siblings, 2 replies; 5+ messages in thread
From: Finlayson, James M CIV (USA) @ 2022-01-11 16:03 UTC (permalink / raw)
  To: linux-raid

Hi,
Sorry this is a long read.  If you want to get to the gist of it, look for "<KEY>" for key points.  I'm having trouble finding information on how to troubleshoot mdraid performance issues.  The latest "rathole" I'm going down: I have two identically configured mdraid arrays, one per NUMA node, on a dual-socket AMD Rome server with "NUMA nodes per socket" set to 1 in the BIOS.  Things are cranking along with a 64K block size, but there is a substantial throughput disparity between NUMA node 0's mdraid and NUMA node 1's.

[root@hornet04 block]# uname -r
<KEY> 5.15.13-1.el8.elrepo.x86_64

<KEY>  [root@hornet04 block]# cat /proc/mdstat  (md127 is NUMA 0, md126 is NUMA 1).
Personalities : [raid6] [raid5] [raid4] 
md126 : active raid5 nvme22n1p1[10] nvme20n1p1[7] nvme21n1p1[8] nvme18n1p1[5] nvme19n1p1[6] nvme17n1p1[4] nvme15n1p1[3] nvme14n1p1[2] nvme12n1p1[0] nvme13n1p1[1]
      135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      bitmap: 0/112 pages [0KB], 65536KB chunk

md127 : active raid5 nvme9n1p1[10] nvme8n1p1[8] nvme7n1p1[7] nvme6n1p1[6] nvme5n1p1[5] nvme3n1p1[3] nvme4n1p1[4] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
      135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      bitmap: 0/112 pages [0KB], 65536KB chunk

unused devices: <none>


I'm running identical, NUMA-aware fio jobs against each array, but I'm getting the following in iostat (the NUMA node 0 mdraid outperforms the NUMA node 1 mdraid by about 12 GB/s).
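
Roughly, each per-node job is pinned with numactl and looks something like this (the queue depth, job count, and runtime below are illustrative placeholders, not my exact job file):

numactl --cpunodebind=0 --membind=0 fio --name=md127-read --filename=/dev/md127 \
    --direct=1 --rw=randread --bs=64k --ioengine=libaio --iodepth=32 \
    --numjobs=16 --group_reporting --time_based --runtime=60
numactl --cpunodebind=1 --membind=1 fio --name=md126-read --filename=/dev/md126 \
    --direct=1 --rw=randread --bs=64k --ioengine=libaio --iodepth=32 \
    --numjobs=16 --group_reporting --time_based --runtime=60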

[root@hornet04 ~]#  iostat -xkz 1 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.20    0.00    3.35    0.00    0.00   96.45

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme2c2n1     72856.00    0.00 4662784.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.50    64.00     0.00   0.01 100.00
nvme3c3n1     73077.00    0.00 4676928.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.94    64.00     0.00   0.01 100.00
nvme4c4n1     73013.00    0.00 4672896.00      0.00     0.00     0.00   0.00   0.00    0.69    0.00  50.35    64.00     0.00   0.01 100.00
<KEY> nvme18c18n1   54384.00    0.00 3480576.00      0.00     0.00     0.00   0.00   0.00  144.80    0.00 7874.85    64.00     0.00   0.02 100.00
nvme5c5n1     72841.00    0.00 4661824.00      0.00     0.00     0.00   0.00   0.00    0.70    0.00  51.01    64.00     0.00   0.01 100.00
nvme7c7n1     72220.00    0.00 4622080.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  48.61    64.00     0.00   0.01 100.00
nvme22c22n1   54652.00    0.00 3497728.00      0.00     0.00     0.00   0.00   0.00    0.64    0.00  34.73    64.00     0.00   0.02 100.00
nvme12c12n1   54756.00    0.00 3504384.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.34    64.00     0.00   0.02 100.00
nvme14c14n1   54517.00    0.00 3489088.00      0.00     0.00     0.00   0.00   0.00    0.65    0.00  35.66    64.00     0.00   0.02 100.00
nvme6c6n1     72721.00    0.00 4654144.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.77    64.00     0.00   0.01 100.00
nvme21c21n1   54731.00    0.00 3502784.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  36.46    64.00     0.00   0.02 100.00
nvme9c9n1     72661.00    0.00 4650304.00      0.00     0.00     0.00   0.00   0.00    0.71    0.00  51.35    64.00     0.00   0.01 100.00
nvme17c17n1   54462.00    0.00 3485568.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.09    64.00     0.00   0.02 100.00
nvme20c20n1   54463.00    0.00 3485632.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.10    64.00     0.00   0.02 100.10
nvme13c13n1   54910.00    0.00 3514240.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00  33.45    64.00     0.00   0.02 100.00
nvme8c8n1     72622.00    0.00 4647808.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  48.52    64.00     0.00   0.01 100.00
nvme15c15n1   54543.00    0.00 3490752.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00  33.28    64.00     0.00   0.02 100.00
nvme0c0n1     73215.00    0.00 4685760.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  49.41    64.00     0.00   0.01 100.00
nvme19c19n1   55034.00    0.00 3522176.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  36.93    64.00     0.00   0.02 100.10
<KEY> nvme1c1n1     72672.00    0.00 4650944.00      0.00     0.00     0.00   0.00   0.00  106.98    0.00 7774.54    64.00     0.00   0.01 100.00
<KEY> md127         727871.00    0.00 46583744.00      0.00     0.00     0.00   0.00   0.00   11.30    0.00 8221.92    64.00     0.00   0.00 100.00
<KEY> md126         546553.00    0.00 34979392.00      0.00     0.00     0.00   0.00   0.00   14.99    0.00 8194.91    64.00     0.00   0.00 100.10


<KEY> I started chasing the aqu_sz and r_await to see if I have a device issue or if these are known mdraid "features" when I started to try to find the kernel workers and start chasing kernel workers when it became apparent to me that I DON'T KNOW WHAT I'M DOING OR WHAT TO DO NEXT. Any guidance is appreciated.  Given 1 drive per NUMA is showing the bad behavior, I'm reluctant to point the finger at hardware.


[root@hornet04 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
BIOS Vendor ID:      Advanced Micro Devices, Inc.
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
BIOS Model name:     AMD EPYC 7742 64-Core Processor                
Stepping:            0
CPU MHz:             3243.803
BogoMIPS:            4491.53
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
<KEY> NUMA node0 CPU(s):   0-63,128-191
<KEY>  NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca


<KEY> When I start doing some basic debugging - not a Linux ninja by far, I see the following, but what is throwing me is seeing (at least these workers that I suspect have to do with md, all running on NUMA node 1.   This is catching me by surprise.   Are there other workers that I'm missing?????

ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm  | egrep 'md|raid' | grep -v systemd | grep -v mlx
    PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN          COMMAND
   1522    1522 TS       -   5  14    1 208  0.0 SN   -              ksmd
   1590    1590 TS       - -20  39    1 220  0.0 I<   -              md
   3688    3688 TS       - -20  39    1 198  0.0 I<   -              raid5wq
   3693    3693 TS       -   0  19    1 234  0.0 S    -              md126_raid5
   3694    3694 TS       -   0  19    1  95  0.0 S    -              md127_raid5
   3788    3788 TS       -   0  19    1 240  0.0 Ss   -              lsmdcat /
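
If it's worth experimenting, I assume I could force one of the raid5 threads back onto node 0 CPUs with something like the following (untested on my part; PID taken from the ps output above):

taskset -cp 0-63,128-191 3694    # md127_raid5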



Jim Finlayson
U.S. Department of Defense



* Re: MDRAID NVMe performance question, but I don't know what I don't know
  2022-01-11 16:03 MDRAID NVMe performance question, but I don't know what I don't know Finlayson, James M CIV (USA)
@ 2022-01-11 19:40 ` Geoff Back
  2022-01-11 20:31   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
  2022-01-11 20:34 ` Phil Turmel
  1 sibling, 1 reply; 5+ messages in thread
From: Geoff Back @ 2022-01-11 19:40 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA), linux-raid

Hi James,

My first thought would be: how sure are you about which physical socket
(and hence NUMA node) each NVME drive is connected to?
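
Something along these lines should show what the kernel thinks (quick sketch; the
numa_node attribute should be present on any reasonably recent kernel):

grep -H . /sys/class/nvme/nvme*/numa_node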

Regards,

Geoff.


On 11/01/2022 16:03, Finlayson, James M CIV (USA) wrote:
> Hi,
> Sorry this is a long read.   If you want to get to the gist of it, look for "<KEY>" for key points.   I'm having some issues with where to find information to troubleshoot mdraid performance issues.   The latest "rathole" I'm going down is that I have two identically configured mdraids, 1 per NUMA node on a dual socket AMD Rome with "numas per socket" set to 1 in the BIOS.   Things are cranking with a 64K blocksize but I have a substantial disparity between NUMA0's mdraid and NUMA1's.     
>
> [root@hornet04 block]# uname -r
> <KEY> 5.15.13-1.el8.elrepo.x86_64
>
> <KEY>  [root@hornet04 block]# cat /proc/mdstat  (md127 is NUMA 0, md126 is NUMA 1).
> Personalities : [raid6] [raid5] [raid4] 
> md126 : active raid5 nvme22n1p1[10] nvme20n1p1[7] nvme21n1p1[8] nvme18n1p1[5] nvme19n1p1[6] nvme17n1p1[4] nvme15n1p1[3] nvme14n1p1[2] nvme12n1p1[0] nvme13n1p1[1]
>       135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>       bitmap: 0/112 pages [0KB], 65536KB chunk
>
> md127 : active raid5 nvme9n1p1[10] nvme8n1p1[8] nvme7n1p1[7] nvme6n1p1[6] nvme5n1p1[5] nvme3n1p1[3] nvme4n1p1[4] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
>       135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>       bitmap: 0/112 pages [0KB], 65536KB chunk
>
> unused devices: <none>
>
>
> I'm running numa aware identical FIOs, but getting the following in iostat (numa 0 mdraid, outperforms numa 1 mdraid by 12GB/s)
>
> [root@hornet04 ~]#  iostat -xkz 1 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.20    0.00    3.35    0.00    0.00   96.45
>
> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> nvme2c2n1     72856.00    0.00 4662784.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.50    64.00     0.00   0.01 100.00
> nvme3c3n1     73077.00    0.00 4676928.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.94    64.00     0.00   0.01 100.00
> nvme4c4n1     73013.00    0.00 4672896.00      0.00     0.00     0.00   0.00   0.00    0.69    0.00  50.35    64.00     0.00   0.01 100.00
> <KEY> nvme18c18n1   54384.00    0.00 3480576.00      0.00     0.00     0.00   0.00   0.00  144.80    0.00 7874.85    64.00     0.00   0.02 100.00
> nvme5c5n1     72841.00    0.00 4661824.00      0.00     0.00     0.00   0.00   0.00    0.70    0.00  51.01    64.00     0.00   0.01 100.00
> nvme7c7n1     72220.00    0.00 4622080.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  48.61    64.00     0.00   0.01 100.00
> nvme22c22n1   54652.00    0.00 3497728.00      0.00     0.00     0.00   0.00   0.00    0.64    0.00  34.73    64.00     0.00   0.02 100.00
> nvme12c12n1   54756.00    0.00 3504384.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.34    64.00     0.00   0.02 100.00
> nvme14c14n1   54517.00    0.00 3489088.00      0.00     0.00     0.00   0.00   0.00    0.65    0.00  35.66    64.00     0.00   0.02 100.00
> nvme6c6n1     72721.00    0.00 4654144.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.77    64.00     0.00   0.01 100.00
> nvme21c21n1   54731.00    0.00 3502784.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  36.46    64.00     0.00   0.02 100.00
> nvme9c9n1     72661.00    0.00 4650304.00      0.00     0.00     0.00   0.00   0.00    0.71    0.00  51.35    64.00     0.00   0.01 100.00
> nvme17c17n1   54462.00    0.00 3485568.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.09    64.00     0.00   0.02 100.00
> nvme20c20n1   54463.00    0.00 3485632.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.10    64.00     0.00   0.02 100.10
> nvme13c13n1   54910.00    0.00 3514240.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00  33.45    64.00     0.00   0.02 100.00
> nvme8c8n1     72622.00    0.00 4647808.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  48.52    64.00     0.00   0.01 100.00
> nvme15c15n1   54543.00    0.00 3490752.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00  33.28    64.00     0.00   0.02 100.00
> nvme0c0n1     73215.00    0.00 4685760.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  49.41    64.00     0.00   0.01 100.00
> nvme19c19n1   55034.00    0.00 3522176.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  36.93    64.00     0.00   0.02 100.10
> <KEY> nvme1c1n1     72672.00    0.00 4650944.00      0.00     0.00     0.00   0.00   0.00  106.98    0.00 7774.54    64.00     0.00   0.01 100.00
> <KEY> md127         727871.00    0.00 46583744.00      0.00     0.00     0.00   0.00   0.00   11.30    0.00 8221.92    64.00     0.00   0.00 100.00
> <KEY> md126         546553.00    0.00 34979392.00      0.00     0.00     0.00   0.00   0.00   14.99    0.00 8194.91    64.00     0.00   0.00 100.10
>
>
> <KEY> I started chasing the aqu_sz and r_await to see if I have a device issue or if these are known mdraid "features" when I started to try to find the kernel workers and start chasing kernel workers when it became apparent to me that I DON'T KNOW WHAT I'M DOING OR WHAT TO DO NEXT. Any guidance is appreciated.  Given 1 drive per NUMA is showing the bad behavior, I'm reluctant to point the finger at hardware.
>
>
> [root@hornet04 ~]# lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              256
> On-line CPU(s) list: 0-255
> Thread(s) per core:  2
> Core(s) per socket:  64
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           AuthenticAMD
> BIOS Vendor ID:      Advanced Micro Devices, Inc.
> CPU family:          23
> Model:               49
> Model name:          AMD EPYC 7742 64-Core Processor
> BIOS Model name:     AMD EPYC 7742 64-Core Processor                
> Stepping:            0
> CPU MHz:             3243.803
> BogoMIPS:            4491.53
> Virtualization:      AMD-V
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            512K
> L3 cache:            16384K
> <KEY> NUMA node0 CPU(s):   0-63,128-191
> <KEY>  NUMA node1 CPU(s):   64-127,192-255
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca
>
>
> <KEY> When I start doing some basic debugging - not a Linux ninja by far, I see the following, but what is throwing me is seeing (at least these workers that I suspect have to do with md, all running on NUMA node 1.   This is catching me by surprise.   Are there other workers that I'm missing?????
>
> ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm  | egrep 'md|raid' | grep -v systemd | grep -v mlx
>     PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN          COMMAND
>    1522    1522 TS       -   5  14    1 208  0.0 SN   -              ksmd
>    1590    1590 TS       - -20  39    1 220  0.0 I<   -              md
>    3688    3688 TS       - -20  39    1 198  0.0 I<   -              raid5wq
>    3693    3693 TS       -   0  19    1 234  0.0 S    -              md126_raid5
>    3694    3694 TS       -   0  19    1  95  0.0 S    -              md127_raid5
>    3788    3788 TS       -   0  19    1 240  0.0 Ss   -              lsmdcat /
>
>
>
> Jim Finlayson
> U.S. Department of Defense
>

-- 
Geoff Back
What if we're all just characters in someone's nightmares?



* RE: [Non-DoD Source] Re: MDRAID NVMe performance question, but I don't know what I don't know
  2022-01-11 19:40 ` Geoff Back
@ 2022-01-11 20:31   ` Finlayson, James M CIV (USA)
  0 siblings, 0 replies; 5+ messages in thread
From: Finlayson, James M CIV (USA) @ 2022-01-11 20:31 UTC (permalink / raw)
  To: 'Geoff Back', linux-raid

Unless I did something completely foolish:

[root@hornet04 ~]# for i in /sys/class/nvme/nvme* ; do echo $i `cat $i/numa_node` `ls -d $i/nvme*` ; done
/sys/class/nvme/nvme0 0 /sys/class/nvme/nvme0/nvme0c0n1
/sys/class/nvme/nvme1 0 /sys/class/nvme/nvme1/nvme1c1n1
/sys/class/nvme/nvme10 0 /sys/class/nvme/nvme10/nvme10c10n1
/sys/class/nvme/nvme11 0 /sys/class/nvme/nvme11/nvme11c11n1
/sys/class/nvme/nvme12 1 /sys/class/nvme/nvme12/nvme12c12n1
/sys/class/nvme/nvme13 1 /sys/class/nvme/nvme13/nvme13c13n1
/sys/class/nvme/nvme14 1 /sys/class/nvme/nvme14/nvme14c14n1
/sys/class/nvme/nvme15 1 /sys/class/nvme/nvme15/nvme15c15n1
/sys/class/nvme/nvme16 1 /sys/class/nvme/nvme16/nvme16c16n1
/sys/class/nvme/nvme17 1 /sys/class/nvme/nvme17/nvme17c17n1
/sys/class/nvme/nvme18 1 /sys/class/nvme/nvme18/nvme18c18n1
/sys/class/nvme/nvme19 1 /sys/class/nvme/nvme19/nvme19c19n1
/sys/class/nvme/nvme2 0 /sys/class/nvme/nvme2/nvme2c2n1
/sys/class/nvme/nvme20 1 /sys/class/nvme/nvme20/nvme20c20n1
/sys/class/nvme/nvme21 1 /sys/class/nvme/nvme21/nvme21c21n1
/sys/class/nvme/nvme22 1 /sys/class/nvme/nvme22/nvme22c22n1
/sys/class/nvme/nvme23 1 /sys/class/nvme/nvme23/nvme23c23n1
/sys/class/nvme/nvme24 1 /sys/class/nvme/nvme24/nvme24c24n1
/sys/class/nvme/nvme3 0 /sys/class/nvme/nvme3/nvme3c3n1
/sys/class/nvme/nvme4 0 /sys/class/nvme/nvme4/nvme4c4n1
/sys/class/nvme/nvme5 0 /sys/class/nvme/nvme5/nvme5c5n1
/sys/class/nvme/nvme6 0 /sys/class/nvme/nvme6/nvme6c6n1
/sys/class/nvme/nvme7 0 /sys/class/nvme/nvme7/nvme7c7n1
/sys/class/nvme/nvme8 0 /sys/class/nvme/nvme8/nvme8c8n1
/sys/class/nvme/nvme9 0 /sys/class/nvme/nvme9/nvme9c9n1



-----Original Message-----
From: Geoff Back <geoff@demonlair.co.uk> 
Sent: Tuesday, January 11, 2022 2:40 PM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>; linux-raid@vger.kernel.org
Subject: [Non-DoD Source] Re: MDRAID NVMe performance question, but I don't know what I don't know

Hi James,

My first thought would be: how sure are you about which physical socket (and hence NUMA node) each NVME drive is connected to?

Regards,

Geoff.


On 11/01/2022 16:03, Finlayson, James M CIV (USA) wrote:
> Hi,
> Sorry this is a long read.   If you want to get to the gist of it, look for "<KEY>" for key points.   I'm having some issues with where to find information to troubleshoot mdraid performance issues.   The latest "rathole" I'm going down is that I have two identically configured mdraids, 1 per NUMA node on a dual socket AMD Rome with "numas per socket" set to 1 in the BIOS.   Things are cranking with a 64K blocksize but I have a substantial disparity between NUMA0's mdraid and NUMA1's.     
>
> [root@hornet04 block]# uname -r
> <KEY> 5.15.13-1.el8.elrepo.x86_64
>
> <KEY>  [root@hornet04 block]# cat /proc/mdstat  (md127 is NUMA 0, md126 is NUMA 1).
> Personalities : [raid6] [raid5] [raid4]
> md126 : active raid5 nvme22n1p1[10] nvme20n1p1[7] nvme21n1p1[8] nvme18n1p1[5] nvme19n1p1[6] nvme17n1p1[4] nvme15n1p1[3] nvme14n1p1[2] nvme12n1p1[0] nvme13n1p1[1]
>       135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>       bitmap: 0/112 pages [0KB], 65536KB chunk
>
> md127 : active raid5 nvme9n1p1[10] nvme8n1p1[8] nvme7n1p1[7] nvme6n1p1[6] nvme5n1p1[5] nvme3n1p1[3] nvme4n1p1[4] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
>       135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>       bitmap: 0/112 pages [0KB], 65536KB chunk
>
> unused devices: <none>
>
>
> I'm running numa aware identical FIOs, but getting the following in 
> iostat (numa 0 mdraid, outperforms numa 1 mdraid by 12GB/s)
>
> [root@hornet04 ~]#  iostat -xkz 1 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.20    0.00    3.35    0.00    0.00   96.45
>
> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> nvme2c2n1     72856.00    0.00 4662784.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.50    64.00     0.00   0.01 100.00
> nvme3c3n1     73077.00    0.00 4676928.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.94    64.00     0.00   0.01 100.00
> nvme4c4n1     73013.00    0.00 4672896.00      0.00     0.00     0.00   0.00   0.00    0.69    0.00  50.35    64.00     0.00   0.01 100.00
> <KEY> nvme18c18n1   54384.00    0.00 3480576.00      0.00     0.00     0.00   0.00   0.00  144.80    0.00 7874.85    64.00     0.00   0.02 100.00
> nvme5c5n1     72841.00    0.00 4661824.00      0.00     0.00     0.00   0.00   0.00    0.70    0.00  51.01    64.00     0.00   0.01 100.00
> nvme7c7n1     72220.00    0.00 4622080.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  48.61    64.00     0.00   0.01 100.00
> nvme22c22n1   54652.00    0.00 3497728.00      0.00     0.00     0.00   0.00   0.00    0.64    0.00  34.73    64.00     0.00   0.02 100.00
> nvme12c12n1   54756.00    0.00 3504384.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.34    64.00     0.00   0.02 100.00
> nvme14c14n1   54517.00    0.00 3489088.00      0.00     0.00     0.00   0.00   0.00    0.65    0.00  35.66    64.00     0.00   0.02 100.00
> nvme6c6n1     72721.00    0.00 4654144.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.77    64.00     0.00   0.01 100.00
> nvme21c21n1   54731.00    0.00 3502784.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  36.46    64.00     0.00   0.02 100.00
> nvme9c9n1     72661.00    0.00 4650304.00      0.00     0.00     0.00   0.00   0.00    0.71    0.00  51.35    64.00     0.00   0.01 100.00
> nvme17c17n1   54462.00    0.00 3485568.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.09    64.00     0.00   0.02 100.00
> nvme20c20n1   54463.00    0.00 3485632.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.10    64.00     0.00   0.02 100.10
> nvme13c13n1   54910.00    0.00 3514240.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00  33.45    64.00     0.00   0.02 100.00
> nvme8c8n1     72622.00    0.00 4647808.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  48.52    64.00     0.00   0.01 100.00
> nvme15c15n1   54543.00    0.00 3490752.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00  33.28    64.00     0.00   0.02 100.00
> nvme0c0n1     73215.00    0.00 4685760.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  49.41    64.00     0.00   0.01 100.00
> nvme19c19n1   55034.00    0.00 3522176.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  36.93    64.00     0.00   0.02 100.10
> <KEY> nvme1c1n1     72672.00    0.00 4650944.00      0.00     0.00     0.00   0.00   0.00  106.98    0.00 7774.54    64.00     0.00   0.01 100.00
> <KEY> md127         727871.00    0.00 46583744.00      0.00     0.00     0.00   0.00   0.00   11.30    0.00 8221.92    64.00     0.00   0.00 100.00
> <KEY> md126         546553.00    0.00 34979392.00      0.00     0.00     0.00   0.00   0.00   14.99    0.00 8194.91    64.00     0.00   0.00 100.10
>
>
> <KEY> I started chasing the aqu_sz and r_await to see if I have a device issue or if these are known mdraid "features" when I started to try to find the kernel workers and start chasing kernel workers when it became apparent to me that I DON'T KNOW WHAT I'M DOING OR WHAT TO DO NEXT. Any guidance is appreciated.  Given 1 drive per NUMA is showing the bad behavior, I'm reluctant to point the finger at hardware.
>
>
> [root@hornet04 ~]# lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              256
> On-line CPU(s) list: 0-255
> Thread(s) per core:  2
> Core(s) per socket:  64
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           AuthenticAMD
> BIOS Vendor ID:      Advanced Micro Devices, Inc.
> CPU family:          23
> Model:               49
> Model name:          AMD EPYC 7742 64-Core Processor
> BIOS Model name:     AMD EPYC 7742 64-Core Processor                
> Stepping:            0
> CPU MHz:             3243.803
> BogoMIPS:            4491.53
> Virtualization:      AMD-V
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            512K
> L3 cache:            16384K
> <KEY> NUMA node0 CPU(s):   0-63,128-191
> <KEY>  NUMA node1 CPU(s):   64-127,192-255
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca
>
>
> <KEY> When I start doing some basic debugging - not a Linux ninja by far, I see the following, but what is throwing me is seeing (at least these workers that I suspect have to do with md, all running on NUMA node 1.   This is catching me by surprise.   Are there other workers that I'm missing?????
>
> ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm  | egrep 'md|raid' | grep -v systemd | grep -v mlx
>     PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN          COMMAND
>    1522    1522 TS       -   5  14    1 208  0.0 SN   -              ksmd
>    1590    1590 TS       - -20  39    1 220  0.0 I<   -              md
>    3688    3688 TS       - -20  39    1 198  0.0 I<   -              raid5wq
>    3693    3693 TS       -   0  19    1 234  0.0 S    -              md126_raid5
>    3694    3694 TS       -   0  19    1  95  0.0 S    -              md127_raid5
>    3788    3788 TS       -   0  19    1 240  0.0 Ss   -              lsmdcat /
>
>
>
> Jim Finlayson
> U.S. Department of Defense
>

--
Geoff Back
What if we're all just characters in someone's nightmares?



* Re: MDRAID NVMe performance question, but I don't know what I don't know
  2022-01-11 16:03 MDRAID NVMe performance question, but I don't know what I don't know Finlayson, James M CIV (USA)
  2022-01-11 19:40 ` Geoff Back
@ 2022-01-11 20:34 ` Phil Turmel
  2022-01-11 20:38   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
  1 sibling, 1 reply; 5+ messages in thread
From: Phil Turmel @ 2022-01-11 20:34 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA), linux-raid

Hi James,

On 1/11/22 11:03 AM, Finlayson, James M CIV (USA) wrote:
> Hi,
> Sorry this is a long read.   If you want to get to the gist of it, look for "<KEY>" for key points.   I'm having some issues with where to find information to troubleshoot mdraid performance issues.   The latest "rathole" I'm going down is that I have two identically configured mdraids, 1 per NUMA node on a dual socket AMD Rome with "numas per socket" set to 1 in the BIOS.   Things are cranking with a 64K blocksize but I have a substantial disparity between NUMA0's mdraid and NUMA1's.

[trim /]

Is there any chance your NVMe devices are installed asymmetrically on 
your PCIe bus(ses) ?

try:

# lspci -tv

Might be illuminating.  In my office server, the PCIe slots are routed 
through one of the two sockets.  The slots routed through socket 1 
simply don't work when the second processor is not installed.  Devices 
in a socket 0 slot have to route through that CPU when the other CPU 
talks to them, and vice versa.
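
You can also ask sysfs directly which node a given device hangs off of,
e.g. (substitute a real bus address from the lspci output):

# cat /sys/bus/pci/devices/<domain:bus:dev.fn>/numa_node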

Phil


* RE: [Non-DoD Source] Re: MDRAID NVMe performance question, but I don't know what I don't know
  2022-01-11 20:34 ` Phil Turmel
@ 2022-01-11 20:38   ` Finlayson, James M CIV (USA)
  0 siblings, 0 replies; 5+ messages in thread
From: Finlayson, James M CIV (USA) @ 2022-01-11 20:38 UTC (permalink / raw)
  To: 'Phil Turmel', linux-raid

[root@hornet04 ~]# lstopo -v | egrep -i 'numa|pci|bridge'
    NUMANode L#0 (P#0 local=263873404KB total=263873404KB)
    HostBridge L#0 (buses=0000:[00-06])
      PCIBridge L#1 (busid=0000:00:01.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[03-03])
        PCI L#0 (busid=0000:03:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=15)
      PCIBridge L#2 (busid=0000:00:01.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[04-04])
        PCI L#1 (busid=0000:04:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=16)
      PCIBridge L#3 (busid=0000:00:01.5 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[05-05])
        PCI L#2 (busid=0000:05:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=17)
      PCIBridge L#4 (busid=0000:00:01.6 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[06-06])
        PCI L#3 (busid=0000:06:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=18)
    HostBridge L#5 (buses=0000:[20-27])
      PCIBridge L#6 (busid=0000:20:01.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[23-23])
        PCI L#4 (busid=0000:23:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=7)
      PCIBridge L#7 (busid=0000:20:01.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[24-24])
        PCI L#5 (busid=0000:24:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=8-1)
      PCIBridge L#8 (busid=0000:20:01.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[25-25])
        PCI L#6 (busid=0000:25:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=9)
      PCIBridge L#9 (busid=0000:20:01.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[26-26])
        PCI L#7 (busid=0000:26:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=10-1)
      PCIBridge L#10 (busid=0000:20:03.1 id=1022:1483 class=0604(PCIBridge) link=15.75GB/s buses=0000:[27-27])
        PCI L#8 (busid=0000:27:00.0 id=15b3:1017 class=0200(Ethernet) link=15.75GB/s PCISlot=1)
        PCI L#9 (busid=0000:27:00.1 id=15b3:1017 class=0200(Ethernet) link=15.75GB/s PCISlot=1)
    HostBridge L#11 (buses=0000:[40-45])
      PCIBridge L#12 (busid=0000:40:01.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[43-43])
        PCI L#10 (busid=0000:43:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=3)
      PCIBridge L#13 (busid=0000:40:01.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[44-44])
        PCI L#11 (busid=0000:44:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=4)
      PCIBridge L#14 (busid=0000:40:01.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[45-45])
        PCI L#12 (busid=0000:45:00.0 id=15b3:1017 class=0200(Ethernet) link=7.88GB/s PCISlot=10)
        PCI L#13 (busid=0000:45:00.1 id=15b3:1017 class=0200(Ethernet) link=7.88GB/s PCISlot=10)
    HostBridge L#15 (buses=0000:[60-65])
      PCIBridge L#16 (busid=0000:60:03.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[64-64])
        PCI L#14 (busid=0000:64:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=1-1)
      PCIBridge L#17 (busid=0000:60:03.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[65-65])
        PCI L#15 (busid=0000:65:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=2)
      PCIBridge L#18 (busid=0000:60:05.2 id=1022:1483 class=0604(PCIBridge) link=0.50GB/s buses=0000:[61-61])
        PCI L#16 (busid=0000:61:00.1 id=102b:0538 class=0300(VGA) link=0.50GB/s)
    NUMANode L#1 (P#1 local=264165280KB total=264165280KB)
    HostBridge L#19 (buses=0000:[a0-a6])
      PCIBridge L#20 (busid=0000:a0:03.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[a3-a3])
        PCI L#17 (busid=0000:a3:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=31)
      PCIBridge L#21 (busid=0000:a0:03.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[a4-a4])
        PCI L#18 (busid=0000:a4:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=32)
      PCIBridge L#22 (busid=0000:a0:03.5 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[a5-a5])
        PCI L#19 (busid=0000:a5:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=33)
      PCIBridge L#23 (busid=0000:a0:03.6 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[a6-a6])
        PCI L#20 (busid=0000:a6:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=34)
    HostBridge L#24 (buses=0000:[c0-c8])
      PCIBridge L#25 (busid=0000:c0:01.1 id=1022:1483 class=0604(PCIBridge) link=3.94GB/s buses=0000:[c3-c3])
        PCI L#21 (busid=0000:c3:00.0 id=1b4b:2241 class=0108(NVMExp) link=3.94GB/s PCISlot=8)
      PCIBridge L#26 (busid=0000:c0:03.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[c5-c5])
        PCI L#22 (busid=0000:c5:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=23)
      PCIBridge L#27 (busid=0000:c0:03.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[c6-c6])
        PCI L#23 (busid=0000:c6:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=24)
      PCIBridge L#28 (busid=0000:c0:03.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[c7-c7])
        PCI L#24 (busid=0000:c7:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=25)
      PCIBridge L#29 (busid=0000:c0:03.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[c8-c8])
        PCI L#25 (busid=0000:c8:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=26)
    HostBridge L#30 (buses=0000:[e0-e6])
      PCIBridge L#31 (busid=0000:e0:03.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[e5-e5])
        PCI L#26 (busid=0000:e5:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=21)
      PCIBridge L#32 (busid=0000:e0:03.2 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[e6-e6])
        PCI L#27 (busid=0000:e6:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=22)
      PCIBridge L#33 (busid=0000:e0:03.3 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[e3-e3])
        PCI L#28 (busid=0000:e3:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=19)
      PCIBridge L#34 (busid=0000:e0:03.4 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[e4-e4])
        PCI L#29 (busid=0000:e4:00.0 id=144d:a824 class=0108(NVMExp) link=7.88GB/s PCISlot=20)
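
If it helps, something like this should map those bus IDs back to the nvme
controller names (sketch; each controller's "device" symlink resolves to its
PCI address in sysfs):

for i in /sys/class/nvme/nvme*; do echo "$i -> $(readlink -f $i/device)"; done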

-----Original Message-----
From: Phil Turmel <philip@turmel.org> 
Sent: Tuesday, January 11, 2022 3:35 PM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@mail.mil>; linux-raid@vger.kernel.org
Subject: [Non-DoD Source] Re: MDRAID NVMe performance question, but I don't know what I don't know

Hi James,

On 1/11/22 11:03 AM, Finlayson, James M CIV (USA) wrote:
> Hi,
> Sorry this is a long read.   If you want to get to the gist of it, look for "<KEY>" for key points.   I'm having some issues with where to find information to troubleshoot mdraid performance issues.   The latest "rathole" I'm going down is that I have two identically configured mdraids, 1 per NUMA node on a dual socket AMD Rome with "numas per socket" set to 1 in the BIOS.   Things are cranking with a 64K blocksize but I have a substantial disparity between NUMA0's mdraid and NUMA1's.

[trim /]

Is there any chance your NVMe devices are installed asymmetrically on your PCIe bus(ses) ?

try:

# lspci -tv

Might be illuminating.  In my office server, the PCIe slots are routed through one of the two sockets.  The slots routed through socket 1 simply don't work when the second processor is not installed.  Devices in a socket 0 slot have to route through that CPU when the other CPU talks to them, and vice versa.

Phil


end of thread

Thread overview: 5+ messages
2022-01-11 16:03 MDRAID NVMe performance question, but I don't know what I don't know Finlayson, James M CIV (USA)
2022-01-11 19:40 ` Geoff Back
2022-01-11 20:31   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
2022-01-11 20:34 ` Phil Turmel
2022-01-11 20:38   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
