* RDMA performance comparison: IBNBD, SCST, NVMEoF
@ 2017-04-18 17:33 ` Roman Penyaev
  0 siblings, 0 replies; 5+ messages in thread
From: Roman Penyaev @ 2017-04-18 17:33 UTC (permalink / raw)
  To: Bart Van Assche, Sagi Grimberg, Doug Ledford, Jens Axboe,
	Christoph Hellwig, Fabian Holler, Milind Dumbare, Michael Wang,
	Danil Kipnis, Jinpu Wang, linux-block, linux-rdma

Hi Bart, Sagi and all,

With this email I would like to share some fresh RDMA performance
results for IBNBD, SCST and NVMEoF, based on the 4.10 kernel and a
variety of configurations.

All fio runs are grouped by project name, crucial config differences
(e.g. CPU pinning or register_always=N) and two testing modes: MANY-DISKS
and MANY-JOBS.  In each group of results the number of simultaneous fio
jobs increases from 1 up to 128.  In MANY-DISKS mode one fio job is
dedicated to one disk, so the number of jobs and disks grows together; in
MANY-JOBS mode all fio jobs produce IO for the same disk, i.e. (a small
config-generation sketch follows these examples):

  MANY-DISKS:
     x1:
         numjobs=1
         [job1]
         filename=/dev/nvme0n1
     ...
     x128:
         numjobs=1
         [job1]
         filename=/dev/nvme0n1
         [job2]
         filename=/dev/nvme0n2
         ...
         [job128]
         filename=/dev/nvme0n128

  MANY-JOBS:
     x1:
         numjobs=1
         [job1]
         filename=/dev/nvme0n1
     ...
     x128:
         numjobs=128
         [job1]
         filename=/dev/nvme0n1
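
For illustration, here is a minimal sketch of how such job sections could
be generated.  This is not fio-runner.py [1] itself; the helper name and
the /dev/nvme0n* device naming are placeholders matching the examples
above:

    def make_jobs(mode, n, dev_prefix="/dev/nvme0n"):
        """Return the fio job sections for the xN step of a testing mode."""
        if mode == "MANY-DISKS":
            # One job per disk: numjobs stays 1, the number of [jobN]
            # sections (and disks) grows together with n.
            lines = ["numjobs=1"]
            for i in range(1, n + 1):
                lines += ["[job%d]" % i, "filename=%s%d" % (dev_prefix, i)]
        else:  # MANY-JOBS: all jobs produce IO for the same disk.
            lines = ["numjobs=%d" % n, "[job1]", "filename=%s1" % dev_prefix]
        return "\n".join(lines)

    print(make_jobs("MANY-DISKS", 4))
    print(make_jobs("MANY-JOBS", 128))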

Each group of results is a performance measurement that can be easily
plotted, taking the number of jobs as the X axis and iops, overall IO
latency or anything else extracted from the fio json result files as the
Y axis.

The FIO configurations were generated, and saved together with the
produced fio json results, by the fio-runner.py script [1].  The complete
archive with FIO configs and results can be downloaded from [2].

The following metrics were taken from the fio json results (an extraction
sketch follows the list):

    write/iops     - IOPS
    write/lat/mean - average latency (μs)
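
As an illustration, a small extraction sketch (assuming the fio 2.x json
layout with a top-level "jobs" array and a "write" section per job, plus
group_reporting so each run yields a single aggregated entry; the result
file names below are hypothetical):

    import json

    def extract_metrics(path):
        with open(path) as f:
            result = json.load(f)
        job = result["jobs"][0]                  # aggregated entry (group_reporting)
        iops = job["write"]["iops"]              # write/iops
        lat_mean = job["write"]["lat"]["mean"]   # write/lat/mean, microseconds
        return iops, lat_mean

    # One (jobs, iops, latency) point per xN run, ready for plotting:
    points = [(n,) + extract_metrics("ibnbd_pin_x%d.json" % n)
              for n in (1, 2, 4, 8, 16, 32, 64, 128)]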

Here I would like to present a reduced results table that takes into
account only the runs with CPU pinning in MANY-DISKS mode, since CPU
pinning makes more sense in terms of performance and the MANY-DISKS and
MANY-JOBS results look very similar:

write/iops (MANY-DISKS)
      IBNBD_pin   NVME_noreg_pin  NVME_pin    SCST_noreg_pin  SCST_pin
x1    80398.96    75577.24        54110.19    59555.04        48446.05
x2    109018.60   96478.45        69176.77    73925.81        55557.59
x4    169164.56   140558.75       93700.96    75419.91        56294.61
x8    197725.44   159887.33       99773.05    79750.92        55938.84
x16   176782.36   150448.33       99644.05    92964.23        56463.14
x32   139666.00   123198.38       81845.30    81287.98        50590.86
x64   125666.16   82231.77        72117.67    72023.32        45121.17
x128  120253.63   73911.97        65665.08    74642.27        47268.46

write/lat/mean (MANY-DISKS)
      IBNBD_pin   NVME_noreg_pin  NVME_pin    SCST_noreg_pin  SCST_pin
x1    647.78      697.91          1032.97     925.51          1173.04
x2    973.20      1104.38         1612.75     1462.18         2047.11
x4    1279.49     1528.09         2452.22     3188.41         4235.95
x8    2356.92     2929.87         4891.70     6248.85         8907.10
x16   5605.62     6575.70         10046.4     10830.50        17945.57
x32   14489.54    16516.60        24849.16    24984.26        40335.09
x64   32364.39    49481.42        56615.23    56559.02        90590.84
x128  67570.88    110768.70       124249.4    109321.84       171390.00

    * Suffix meanings:

     _pin   - CPU pinning
     _noreg - modules on the initiator side (ib_srp, nvme_rdma) were loaded
              with the 'register_always=N' parameter

Complete result tables and corresponding graphs are presented in a Google
sheet [3].

Conclusion:
    On average IBNBD outperforms by:

                     NVME_noreg_pin  NVME_pin  SCST_noreg_pin  SCST_pin
       iops          41%             72%       61%             155%
       lat/mean      28%             42%       38%             60%

       * The complete result tables [3] were taken into account for the
         average percentage calculation (a sketch of one possible averaging
         follows below).
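
The exact averaging method is not spelled out here, so the following is
only a sketch of one plausible interpretation: the arithmetic mean, over
all job counts, of the per-row percentage improvement of IBNBD over a
competitor.  The numbers below reuse the MANY-DISKS iops columns from the
table above:

    def avg_improvement(ibnbd, other):
        """Average percentage by which ibnbd beats other, row by row."""
        gains = [(a / b - 1.0) * 100.0 for a, b in zip(ibnbd, other)]
        return sum(gains) / len(gains)

    ibnbd_pin = [80398.96, 109018.60, 169164.56, 197725.44,
                 176782.36, 139666.00, 125666.16, 120253.63]
    nvme_pin  = [54110.19, 69176.77, 93700.96, 99773.05,
                 99644.05, 81845.30, 72117.67, 65665.08]

    # Prints roughly 74% for this subset; the 72% quoted above is based on
    # the complete tables [3], not only on this reduced table.
    print("%.0f%%" % avg_improvement(ibnbd_pin, nvme_pin))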

Test setup is the following:

Initiator and target HW configuration:

    AMD Opteron 6386 SE, 64 CPUs, 128 GB RAM
    InfiniBand: Mellanox Technologies MT26428
                [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE]

Initiator and target SW configuration:

    vanilla Linux 4.10
    + IBNBD patches
    + SCST from https://github.com/bvanassche/scst, master branch

Initiator side:

    IBNBD and NVME: MQ mode
    SRP: default RQ mode; on an attempt to set 'use_blk_mq=Y', IO hangs.

    FIO generic configuration pattern:

        bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
        fadvise_hint=0
        rw=randrw:2
        direct=1
        random_distribution=zipf:1.2
        time_based=1
        runtime=10
        ioengine=libaio
        iodepth=128
        iodepth_batch_submit=128
        iodepth_batch_complete=128
        group_reporting

Target side:

    128 null_blk devices with default configuration, opened as blockio.

NVMEoF configuration script [4].
SCST configuration script [5].


It would be great to receive any feedback.  I am open to further perf
tuning and testing with other possible configurations and options.

Thanks.

--
Roman

[1] FIO runner and results extractor script:
    https://drive.google.com/open?id=0B8_SivzwHdgSS2RKcmc4bWg0YjA

[2] Archive with FIO configurations and results:
    https://drive.google.com/open?id=0B8_SivzwHdgSaDlhMXV6THhoRXc

[3] Google sheet with performance measurements:
    https://drive.google.com/open?id=1sCTBKLA5gbhhkgd2USZXY43VL3zLidzdqDeObZn9Edc

[4] NVMEoF configuration:
    https://drive.google.com/open?id=0B8_SivzwHdgSTzRjbGtmaVR6LWM

[5] SCST configuration:
    https://drive.google.com/open?id=0B8_SivzwHdgSM1B5eGpKWmFJMFk

* Re: RDMA performance comparison: IBNBD, SCST, NVMEoF
@ 2017-04-18 18:22   ` Bart Van Assche
  0 siblings, 0 replies; 5+ messages in thread
From: Bart Van Assche @ 2017-04-18 18:22 UTC (permalink / raw)
  To: linux-block, linux-rdma, roman.penyaev, mail, sagi, jinpu.wang,
	yun.wang, hch, axboe, danil.kipnis, Milind.dumbare, dledford

On Tue, 2017-04-18 at 19:33 +0200, Roman Penyaev wrote:
> By current email I would like to share some fresh RDMA performance
> results of IBNBD, SCST and NVMEof, based on 4.10 kernel and variety
> of configurations.

Hello Roman,

Thank you for having shared these results. But please do not expect me
to have another look at IBNBD before the design bugs in the driver and
also in the protocol get fixed. The presentation during Vault 2017 made
it clear that the driver does not scale if more than two CPUs submit I/O
simultaneously at the initiator side. The comments Sagi posted should be
addressed but I haven't seen any progress from the IBNBD authors with
regard to these comments ...

See also:
* Danil Kipnis, Infiniband Network Block Device (IBNBD), Vault 2017
(https://vault2017.sched.com/event/9Xjw/infiniband-network-block-device-ibnbd-danil-kipnis-profitbricks-gmbh).
* Sagi Grimberg, Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK
DEVICE (IBNBD), March 27th, 2017
(https://www.spinics.net/lists/linux-rdma/msg47879.html).

Best regards,

Bart.


* Re: RDMA performance comparison: IBNBD, SCST, NVMEoF
  2017-04-18 18:22   ` Bart Van Assche
  (?)
@ 2017-04-19  6:02   ` Roman Penyaev
  -1 siblings, 0 replies; 5+ messages in thread
From: Roman Penyaev @ 2017-04-19  6:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, linux-rdma, mail, sagi, jinpu.wang, yun.wang, hch,
	axboe, danil.kipnis, Milind.dumbare, dledford

Hello Bart,

On Tue, Apr 18, 2017 at 8:22 PM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Tue, 2017-04-18 at 19:33 +0200, Roman Penyaev wrote:
>> By current email I would like to share some fresh RDMA performance
>> results of IBNBD, SCST and NVMEof, based on 4.10 kernel and variety
>> of configurations.
>
> Hello Roman,
>
> Thank you for having shared these results. But please do not expect me
> to have another look at IBNBD before the design bugs in the driver and
> also in the protocol get fixed.

I only expected that you might find the results interesting, since what I
targeted was the following:

    1) retest on latest kernel
    2) compare against NVMEoF
    3) retest using register_always=N


> The presentation during Vault 2017 made
> it clear that the driver does not scale if more than two CPUs submit I/O
> simultaneously at the initiator side.

On the iops graph, where I increase the number of simultaneous fio jobs up to 128
(the initiator has 64 CPUs), NVMEoF tends to repeat the same curve, always staying
below IBNBD.  So even if this is a scalability problem, it can be seen on the NVMEoF
runs as well.  That is why I posted these results, to draw someone's attention.


--
Roman

