* RDMA performance comparison: IBNBD, SCST, NVMEoF
@ 2017-04-18 17:33 ` Roman Penyaev
0 siblings, 0 replies; 5+ messages in thread
From: Roman Penyaev @ 2017-04-18 17:33 UTC (permalink / raw)
To: Bart Van Assche, Sagi Grimberg, Doug Ledford, Jens Axboe,
Christoph Hellwig, Fabian Holler, Milind Dumbare, Michael Wang,
Danil Kipnis, Jinpu Wang, linux-block, linux-rdma
Hi Bart, Sagi and all,
With this email I would like to share some fresh RDMA performance
results for IBNBD, SCST and NVMEoF, based on the 4.10 kernel and a
variety of configurations.
All fio runs are grouped by project name, crucial config
differences (e.g. CPU pinning or register_always=N) and two testing
modes: MANY-DISKS and MANY-JOBS. In each group of results the number
of simultaneous fio jobs increases from 1 up to 128. In MANY-DISKS
mode 1 fio job is dedicated to 1 disk, and the number of jobs (and
disks) grows; in MANY-JOBS mode, in its turn, each fio job produces
IO for the same disk, i.e.:
MANY-DISKS:
x1:
numjobs=1
[job1]
filename=/dev/nvme0n1
...
x128:
numjobs=1
[job1]
filename=/dev/nvme0n1
[job2]
filename=/dev/nvme0n2
...
[job128]
filename=/dev/nvme0n128
MANY-JOBS:
x1:
numjobs=1
[job1]
filename=/dev/nvme0n1
...
x128:
numjobs=128
[job1]
filename=/dev/nvme0n1
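The two layouts above can be generated mechanically. A minimal Python
sketch of how such job-file bodies could be produced (a hypothetical
helper for illustration, not the actual fio-runner.py [1]):

```python
def gen_fio_config(mode, n, dev_pattern="/dev/nvme0n{}"):
    """Build a fio job file body for the x<n> step of either testing mode."""
    if mode == "many-disks":
        # n jobs, each one bound to its own disk
        lines = ["numjobs=1"]
        for i in range(1, n + 1):
            lines += ["[job%d]" % i, "filename=" + dev_pattern.format(i)]
    elif mode == "many-jobs":
        # n identical jobs, all hammering the same disk
        lines = ["numjobs=%d" % n,
                 "[job1]",
                 "filename=" + dev_pattern.format(1)]
    else:
        raise ValueError("unknown mode: " + mode)
    return "\n".join(lines)

print(gen_fio_config("many-disks", 2))
```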
Each group of results is a performance measurement which can be
easily plotted, taking the number of jobs as the X axis and iops,
overall IO latencies or anything else extracted from the fio json
result files as the Y axis.
FIO configurations were generated and saved, along with the produced
fio json results, by the fio-runner.py script [1]. A complete archive
with the FIO configs and results can be downloaded here [2].
The following metrics were taken from fio json results:
write/iops - IOPS
write/lat/mean - average latency (μs)
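For illustration, pulling these two fields out of one fio json result
file (fio run with --output-format=json) takes only a few lines of
Python; the field names below follow the fio 2.x json layout used for
these runs and may differ in other fio versions:

```python
import json

def extract_metrics(path):
    """Return (write iops, write mean latency in usec) from a fio json file."""
    with open(path) as f:
        data = json.load(f)
    job = data["jobs"][0]   # group_reporting -> one aggregated job entry
    wr = job["write"]
    return wr["iops"], wr["lat"]["mean"]
```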
Here I would like to present a reduced results table, taking into
account only the runs with CPU pinning in MANY-DISKS mode, since CPU
pinning makes more sense in terms of performance and the MANY-DISKS
and MANY-JOBS results look very similar:
write/iops (MANY-DISKS)
IBNBD_pin NVME_noreg_pin NVME_pin SCST_noreg_pin SCST_pin
x1 80398.96 75577.24 54110.19 59555.04 48446.05
x2 109018.60 96478.45 69176.77 73925.81 55557.59
x4 169164.56 140558.75 93700.96 75419.91 56294.61
x8 197725.44 159887.33 99773.05 79750.92 55938.84
x16 176782.36 150448.33 99644.05 92964.23 56463.14
x32 139666.00 123198.38 81845.30 81287.98 50590.86
x64 125666.16 82231.77 72117.67 72023.32 45121.17
x128 120253.63 73911.97 65665.08 74642.27 47268.46
write/lat/mean (MANY-DISKS)
IBNBD_pin NVME_noreg_pin NVME_pin SCST_noreg_pin SCST_pin
x1 647.78 697.91 1032.97 925.51 1173.04
x2 973.20 1104.38 1612.75 1462.18 2047.11
x4 1279.49 1528.09 2452.22 3188.41 4235.95
x8 2356.92 2929.87 4891.70 6248.85 8907.10
x16 5605.62 6575.70 10046.4 10830.50 17945.57
x32 14489.54 16516.60 24849.16 24984.26 40335.09
x64 32364.39 49481.42 56615.23 56559.02 90590.84
x128 67570.88 110768.70 124249.4 109321.84 171390.00
* Where suffixes mean:
_pin - CPU pinning
_noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded
with 'register_always=N' param
Complete table results and corresponding graphs are presented on Google
sheet [3].
Conclusion:
On average, IBNBD outperforms the others by:
NVME_noreg_pin NVME_pin SCST_noreg_pin SCST_pin
iops 41% 72% 61% 155%
lat/mean 28% 42% 38% 60%
* The complete result tables [3] were taken into account for the
average percentage calculation.
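The averaging itself is straightforward; a sketch of the calculation,
applied here to the reduced MANY-DISKS iops columns above (so its
output will not exactly match the averages over the complete tables
[3]):

```python
def avg_improvement(ibnbd, other):
    """Average percentage by which the ibnbd values exceed the other values."""
    gains = [(a / b - 1.0) * 100.0 for a, b in zip(ibnbd, other)]
    return sum(gains) / len(gains)

# write/iops, MANY-DISKS: columns IBNBD_pin and NVME_pin from the table above
ibnbd_pin = [80398.96, 109018.60, 169164.56, 197725.44,
             176782.36, 139666.00, 125666.16, 120253.63]
nvme_pin = [54110.19, 69176.77, 93700.96, 99773.05,
            99644.05, 81845.30, 72117.67, 65665.08]
print("IBNBD_pin over NVME_pin: %.0f%%" % avg_improvement(ibnbd_pin, nvme_pin))
```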
Test setup is the following:
Initiator and target HW configuration:
AMD Opteron 6386 SE, 64 CPUs, 128 GB RAM
InfiniBand: Mellanox Technologies MT26428
[ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE]
Initiator and target SW configuration:
vanilla Linux 4.10
+ IBNBD patches
+ SCST from https://github.com/bvanassche/scst, master branch
Initiator side:
IBNBD and NVME: MQ mode
SRP: default RQ; IO hangs on an attempt to set 'use_blk_mq=Y'.
FIO generic configuration pattern:
bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
fadvise_hint=0
rw=randrw:2
direct=1
random_distribution=zipf:1.2
time_based=1
runtime=10
ioengine=libaio
iodepth=128
iodepth_batch_submit=128
iodepth_batch_complete=128
group_reporting
Target side:
128 null_blk devices with default configuration, opened as blockio.
NVMEoF configuration script [4].
SCST configuration script [5].
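For reference, a target-side null_blk setup like the one above can be
created as follows (a sketch; parameter name as in the in-tree
null_blk module, run as root):

```shell
# Load null_blk with 128 devices in their default configuration;
# the devices then show up as /dev/nullb0 .. /dev/nullb127.
modprobe null_blk nr_devices=128
ls /dev/nullb*
```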
It would be great to receive any feedback. I am open to further perf
tuning and testing with other possible configurations and options.
Thanks.
--
Roman
[1] FIO runner and results extractor script:
https://drive.google.com/open?id=0B8_SivzwHdgSS2RKcmc4bWg0YjA
[2] Archive with FIO configurations and results:
https://drive.google.com/open?id=0B8_SivzwHdgSaDlhMXV6THhoRXc
[3] Google sheet with performance measurements:
https://drive.google.com/open?id=1sCTBKLA5gbhhkgd2USZXY43VL3zLidzdqDeObZn9Edc
[4] NVMEoF configuration:
https://drive.google.com/open?id=0B8_SivzwHdgSTzRjbGtmaVR6LWM
[5] SCST configuration:
https://drive.google.com/open?id=0B8_SivzwHdgSM1B5eGpKWmFJMFk
* Re: RDMA performance comparison: IBNBD, SCST, NVMEoF
@ 2017-04-18 18:22 ` Bart Van Assche
0 siblings, 0 replies; 5+ messages in thread
From: Bart Van Assche @ 2017-04-18 18:22 UTC (permalink / raw)
To: linux-block, linux-rdma, roman.penyaev, mail, sagi, jinpu.wang,
yun.wang, hch, axboe, danil.kipnis, Milind.dumbare, dledford
On Tue, 2017-04-18 at 19:33 +0200, Roman Penyaev wrote:
> By current email I would like to share some fresh RDMA performance
> results of IBNBD, SCST and NVMEof, based on 4.10 kernel and variety
> of configurations.
Hello Roman,
Thank you for having shared these results. But please do not expect me
to have another look at IBNBD before the design bugs in the driver and
also in the protocol get fixed. The presentation during Vault 2017 made
it clear that the driver does not scale if more than two CPUs submit I/O
simultaneously at the initiator side. The comments Sagi posted should be
addressed but I haven't seen any progress from the IBNBD authors with
regard to these comments ...
See also:
* Danil Kipnis, Infiniband Network Block Device (IBNBD), Vault 2017
(https://vault2017.sched.com/event/9Xjw/infiniband-network-block-device-ibnbd-danil-kipnis-profitbricks-gmbh).
* Sagi Grimberg, Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK
DEVICE (IBNBD), March 27th, 2017
(https://www.spinics.net/lists/linux-rdma/msg47879.html).
Best regards,
Bart.
* Re: RDMA performance comparison: IBNBD, SCST, NVMEoF
2017-04-18 18:22 ` Bart Van Assche
@ 2017-04-19 6:02 ` Roman Penyaev
-1 siblings, 0 replies; 5+ messages in thread
From: Roman Penyaev @ 2017-04-19 6:02 UTC (permalink / raw)
To: Bart Van Assche
Cc: linux-block, linux-rdma, mail, sagi, jinpu.wang, yun.wang, hch,
axboe, danil.kipnis, Milind.dumbare, dledford
Hello Bart,
On Tue, Apr 18, 2017 at 8:22 PM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Tue, 2017-04-18 at 19:33 +0200, Roman Penyaev wrote:
>> By current email I would like to share some fresh RDMA performance
>> results of IBNBD, SCST and NVMEof, based on 4.10 kernel and variety
>> of configurations.
>
> Hello Roman,
>
> Thank you for having shared these results. But please do not expect me
> to have another look at IBNBD before the design bugs in the driver and
> also in the protocol get fixed.
I only expected that you might find the results interesting; with
them I targeted the following:
1) retest on latest kernel
2) compare against NVMEoF
3) retest using register_always=N
> The presentation during Vault 2017 made
> it clear that the driver does not scale if more than two CPUs submit I/O
> simultaneously at the initiator side.
On the iops graph, where I increase the number of simultaneous fio
jobs up to 128 (the initiator has 64 CPUs), NVMEoF tends to repeat
the same curve, always staying below IBNBD. So even if this is a
scalability problem, it can be seen in the NVMEoF runs as well.
That's why I posted these results, to draw someone's attention.
--
Roman