From: Chuck Lever <chuck.lever@oracle.com>
To: Tom Talpey <tom@talpey.com>
Cc: linux-rdma@vger.kernel.org,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
Date: Mon, 10 Jun 2019 13:50:27 -0400
Message-ID: <721DF459-ECAE-4FDD-A016-AFB193BA1C65@oracle.com>
In-Reply-To: <4b05cdf7-2c2d-366f-3a29-1034bfec2941@talpey.com>
Hi Tom-
> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
>
> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>> Hi Tom-
>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>>
>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>> The problem is that the new connection uses a different source port,
>>>> which defeats DRC hash.
>>>>
>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>> The transport layer typically sets the source port value on the
>>>> connection to a random ephemeral port. The server already ignores it
>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>> client's source port on RDMA transports").
>>>
>>> Where does the entropy come from, then, for the server to not
>>> match other requests from other mount points on this same client?
>> The first ~200 bytes of each RPC Call message.
>> [ Note that this has some fun ramifications for calls with small
>> RPC headers that use Read chunks. ]
>
> Ok, good to know. I forgot that the Linux server implemented this.
> I have some concerns about it, honestly, and it's important to remember
> that it's not the same on all servers. But for the problem you're
> fixing, it's ok I guess and certainly better than today. Still, the
> errors are going to be completely silent, and can lead to data being
> corrupted. Well, welcome to the world of NFSv3.
I don't see another option.
Some regard this checksum as more robust than using the client's
IP source port. After all, the same argument can be made that
the server cannot depend on clients to reuse their source port.
That is simply a convention that many clients adopted before
servers used a stronger DRC hash mechanism.
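To make the argument concrete, here is a minimal sketch of a DRC lookup key, not the actual nfsd code. The field set, the `drc_key` helper, the 200-byte checksum window, and the use of CRC32 as the checksum are all assumptions chosen for illustration. It shows that a retransmit arriving on a new connection misses when the source port is part of the key, and hits when the port is ignored and the checksum over the start of the Call carries the entropy.

```python
import zlib
from typing import NamedTuple

# Assumption: checksum window taken from "the first ~200 bytes" mentioned above.
CSUM_LEN = 200


class DrcKey(NamedTuple):
    """Hypothetical DRC key; the real server's field set may differ."""
    xid: int
    client_addr: str
    client_port: int
    csum: int


def drc_key(xid: int, addr: str, port: int, call: bytes, ignore_port: bool) -> DrcKey:
    # CRC32 stands in for whatever checksum the server really uses.
    csum = zlib.crc32(call[:CSUM_LEN])
    # When the port is ignored, store a constant so it cannot affect matching.
    return DrcKey(xid, addr, 0 if ignore_port else port, csum)


# Hypothetical RPC Call bytes, sent once and then retransmitted after a reconnect.
call = b"hypothetical RPC Call header and arguments"
xid, addr = 0x1234, "192.0.2.1"

# Port included: the reconnect picked a new ephemeral port, so the keys differ.
miss = drc_key(xid, addr, 50001, call, False) == drc_key(xid, addr, 50002, call, False)

# Port ignored: the retransmit matches the cached entry.
hit = drc_key(xid, addr, 50001, call, True) == drc_key(xid, addr, 50002, call, True)

# A different Call reusing the same XID still gets a distinct key via the checksum.
other = b"a different hypothetical RPC Call"
distinct = drc_key(xid, addr, 50001, call, True) != drc_key(xid, addr, 50001, other, True)

print(miss, hit, distinct)
```

With the port dropped from the key, two different requests collide only if they share the XID, the client address, and the checksum over the leading bytes of the Call.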
>>> And since RDMA is capable of
>>> such high IOPS, the likelihood seems rather high.
>> Only when the server's durable storage is slow enough to cause
>> some RPC requests to have extremely high latency.
>> And, most clients use an atomic counter for their XIDs, so they
>> are also likely to wrap that counter over some long-pending RPC
>> request.
>> The only real answer here is NFSv4 sessions.
>>> Missing the cache
>>> might actually be safer than hitting, in this case.
>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>> connection, that includes NFSv3, to reset credit accounting
>> due to the lost half of the RPC Call/Reply pair.
>> I can very quickly reproduce bad (non-deterministic) behavior
>> by running a software build on an NFSv3 on RDMA mount point
>> with disconnect injection. If the DRC issue is addressed, the
>> software build runs to completion.
>
> Ok, good. But I have a better test.
>
> In the Connectathon suite, there's a "Special" test called "nfsidem".
> I wrote this test in, like, 1989 so I remember it :-)
>
> This test performs all the non-idempotent NFSv3 operations in a loop,
> and each loop element depends on the previous one, so if there's
> any failure, the test immediately bombs.
>
> Nobody seems to understand it, usually when it gets run people will
> run it without injecting errors, and it "passes" so they decide
> everything is ok.
>
> So my suggestion is to run your flakeway packet-drop harness while
> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
> owing to the expensive operations it performs, so you'll need to
> run it for a long time.
>
> You'll almost definitely get a failure or two, since the NFSv3
> protocol is flawed by design. But you can compare the behaviors,
> and even compute a likelihood. I'd love to see some actual numbers.
I configured the client to disconnect after 23711 RPCs have completed.
(I can re-run these with more frequent disconnects if you think that
would be useful).
Here's a run with the DRC modification:
[cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
[cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
testing 100000 idempotencies in directory "./TEST"
[cel@manet ~]$ sudo umount /mnt
Here's a run with the stock v5.1 Linux server:
[cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
[cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
testing 100000 idempotencies in directory "./TEST"
[cel@manet ~]$
This test reported no errors in either case. We can see that the
disconnects did trigger retransmits:
RPC statistics:
1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found)
average backlog queue length: 119
ACCESS:
300001 ops (15%) 44 retrans (0%) 0 major timeouts
avg bytes sent per op: 132 avg bytes received per op: 120
backlog wait: 0.591118 RTT: 0.017463 total execute time: 0.614795 (milliseconds)
REMOVE:
300000 ops (15%) 40 retrans (0%) 0 major timeouts
avg bytes sent per op: 136 avg bytes received per op: 144
backlog wait: 0.531667 RTT: 0.018973 total execute time: 0.556927 (milliseconds)
MKDIR:
200000 ops (10%) 26 retrans (0%) 0 major timeouts
avg bytes sent per op: 158 avg bytes received per op: 272
backlog wait: 0.518940 RTT: 0.019755 total execute time: 0.545230 (milliseconds)
RMDIR:
200000 ops (10%) 24 retrans (0%) 0 major timeouts
avg bytes sent per op: 130 avg bytes received per op: 144
backlog wait: 0.512320 RTT: 0.018580 total execute time: 0.537095 (milliseconds)
LOOKUP:
188533 ops (9%) 21 retrans (0%) 0 major timeouts
avg bytes sent per op: 136 avg bytes received per op: 174
backlog wait: 0.455925 RTT: 0.017721 total execute time: 0.480011 (milliseconds)
SETATTR:
100000 ops (5%) 11 retrans (0%) 0 major timeouts
avg bytes sent per op: 160 avg bytes received per op: 144
backlog wait: 0.371960 RTT: 0.019470 total execute time: 0.398330 (milliseconds)
WRITE:
100000 ops (5%) 9 retrans (0%) 0 major timeouts
avg bytes sent per op: 180 avg bytes received per op: 136
backlog wait: 0.399190 RTT: 0.022860 total execute time: 0.436610 (milliseconds)
CREATE:
100000 ops (5%) 9 retrans (0%) 0 major timeouts
avg bytes sent per op: 168 avg bytes received per op: 272
backlog wait: 0.365290 RTT: 0.019560 total execute time: 0.391140 (milliseconds)
SYMLINK:
100000 ops (5%) 18 retrans (0%) 0 major timeouts
avg bytes sent per op: 188 avg bytes received per op: 272
backlog wait: 0.750470 RTT: 0.020150 total execute time: 0.786410 (milliseconds)
RENAME:
100000 ops (5%) 14 retrans (0%) 0 major timeouts
avg bytes sent per op: 180 avg bytes received per op: 260
backlog wait: 0.461650 RTT: 0.020710 total execute time: 0.489670 (milliseconds)
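For anyone who wants a rough number out of the statistics above, here is a small sketch that totals the per-operation counts. The ops and retrans values are copied from the output above; the disconnect estimate assumes the injection fired once every 23711 completed RPCs across all of the counted traffic, which is my assumption and not something the counters report.

```python
# (ops, retrans) per operation, copied from the statistics above.
stats = {
    "ACCESS": (300001, 44),
    "REMOVE": (300000, 40),
    "MKDIR": (200000, 26),
    "RMDIR": (200000, 24),
    "LOOKUP": (188533, 21),
    "SETATTR": (100000, 11),
    "WRITE": (100000, 9),
    "CREATE": (100000, 9),
    "SYMLINK": (100000, 18),
    "RENAME": (100000, 14),
}

DISCONNECT_INTERVAL = 23711  # RPCs completed between injected disconnects
REPLIES_RECEIVED = 1888581   # from the "RPC statistics" summary line

total_ops = sum(ops for ops, _ in stats.values())
total_retrans = sum(retrans for _, retrans in stats.values())

# Fraction of the listed operations that were retransmitted at least once.
retrans_rate = total_retrans / total_ops

# Rough count of injected disconnects, under the assumption stated above.
approx_disconnects = REPLIES_RECEIVED // DISCONNECT_INTERVAL

print(f"listed ops:         {total_ops}")
print(f"retransmits:        {total_retrans}")
print(f"retransmit rate:    {retrans_rate:.6%}")
print(f"approx disconnects: {approx_disconnects}")
```

Only the operations listed above are counted, so the totals are somewhat lower than the 1888819 requests in the summary line.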
--
Chuck Lever