From: Chuck Lever <chuck.lever@oracle.com>
To: linux-rdma <linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	idanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	hch <hch-jcswGhMUV9g@public.gmane.org>,
	Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>,
	Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: [PATCH 2/2] nvme-rdma: Add remote_invalidation module parameter
Date: Mon, 30 Oct 2017 14:18:42 -0400	[thread overview]
Message-ID: <2FB68C09-F811-44C9-9F48-826C074DDE14@oracle.com> (raw)
In-Reply-To: <87A0B150-CE67-4C8C-914E-53F66411E1BB@oracle.com>


> On Oct 29, 2017, at 2:24 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> 
>> On Oct 29, 2017, at 12:38 PM, idanb@mellanox.com wrote:
>> 
>> From: Idan Burstein <idanb@mellanox.com>
>> 
>> NVMe over Fabrics in its secure "register_always" mode
>> registers and invalidates the user buffer upon each IO.
>> The protocol enables the host to request that the subsystem
>> use a SEND WITH INVALIDATE operation while returning the
>> response capsule, invalidating the local key
>> (remote_invalidation).
>> In some HW implementations, the local network adapter may
>> perform better using local invalidation operations.
>> 
>> The results below show that running with local invalidation
>> rather than remote invalidation improves the IOPS achievable
>> with a ConnectX-5 Ex network adapter by a factor of 1.36.
>> Nevertheless, local invalidation induces more CPU overhead
>> than having the target invalidate remotely. Because of this
>> CPU% vs. IOPS tradeoff, we propose a module parameter to
>> control whether remote invalidation is requested.
>> 
>> The following results were taken against a single nvme over fabrics
>> subsystem with a single namespace backed by null_blk:
>> 
>> Block Size       s/g reg_wr      inline reg_wr    inline reg_wr + local inv
>> ++++++++++++   ++++++++++++++   ++++++++++++++++ +++++++++++++++++++++++++++
>> 512B            1446.6K/8.57%    5224.2K/76.21%   7143.3K/79.72%
>> 1KB             1390.6K/8.5%     4656.7K/71.69%   5860.6K/55.45%
>> 2KB             1343.8K/8.6%     3410.3K/38.96%   4106.7K/55.82%
>> 4KB             1254.8K/8.39%    2033.6K/15.86%   2165.3K/17.48%
>> 8KB             1079.5K/7.7%     1143.1K/7.27%    1158.2K/7.33%
>> 16KB            603.4K/3.64%     593.8K/3.4%      588.9K/3.77%
>> 32KB            294.8K/2.04%     293.7K/1.98%     294.4K/2.93%
>> 64KB            138.2K/1.32%     141.6K/1.26%     135.6K/1.34%
> 
> Are the units reported here KIOPS and %CPU? What was the benchmark?
> 
> Was any root cause analysis done to understand why the IOPS
> rate changes without RI? Reduction in avg RTT? Fewer long-
> running outliers? Lock contention in the ULP?
> 
> I am curious enough to add a similar setting to NFS/RDMA,
> now that I have mlx5 devices.

I see the plan is to change the NVMeoF initiator to wait for
invalidation to complete, and then test again. However, I am
still curious what we're dealing with for NFS/RDMA (which
always waits for invalidation before allowing an RPC to
complete).

I tested here with a pair of EDR CX-4s in RoCE mode. I found
that there is a 15-20us latency penalty when Remote
Invalidation is used with RDMA Read, but when used with RDMA
Write, Remote Invalidation behaves as desired. The RDMA Read
penalty vanishes once payloads exceed about 8kB.

This was a simple QD=1 iozone test with direct I/O on NFSv3,
and a tmpfs share on the NFS server. In this test, memory
registration is used for all data payloads.

Remote Invalidation enabled, reclen in kB, output in
microseconds:

              kB  reclen    write  rewrite    read    reread
          131072       1       47       54       27       27
          131072       2       61       62       27       28
          131072       4       63       62       28       28
          131072       8       59       65       29       29
          131072      16       67       66       31       32
          131072      32       75       73       42       42
          131072      64       92       87       64       44

Remote Invalidation disabled, reclen in kB, output in
microseconds:

              kB  reclen    write  rewrite    read    reread
          131072       1       43       43       32       32
          131072       2       45       52       32       32
          131072       4       48       45       32       32
          131072       8       68       64       33       33
          131072      16       74       60       35       35
          131072      32       85       82       49       41
          131072      64      102       98       63       52

I would expect a similar ~5us latency benefit for both RDMA
Reads and RDMA Writes when Remote Invalidation is in use.
Small I/Os involving RDMA Read are precisely where NFS/RDMA
gains the most benefit from Remote Invalidation, so this
result is disappointing.

Unfortunately, the current version of RPC-over-RDMA does not
have the ability to convey an Rkey for the target to
invalidate remotely, though that is one of the features being
considered for the next version.

IOPS results (fio 70/30, multi-thread iozone, etc) are very
clearly worse without Remote Invalidation, so I currently
don't plan to submit a patch that allows RI to be disabled
for some NFS/RDMA mounts. Currently the Linux NFS client
indicates to servers that RI is supported whenever FRWR is
supported on the client's HCA. The server is then free to
decide whether to use Send With Invalidate with any Rkey
in the current RPC.


--
Chuck Lever



