From: Chuck Lever <chuck.lever@oracle.com>
To: Tom Talpey <tom@talpey.com>
Cc: linux-rdma@vger.kernel.org,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
Date: Tue, 11 Jun 2019 10:23:28 -0400 [thread overview]
Message-ID: <7DCE8406-AF05-44B3-9EEC-447A7D7E919F@oracle.com> (raw)
In-Reply-To: <c00bea9a-97eb-2161-acce-6bbc5249e733@talpey.com>
> On Jun 10, 2019, at 6:13 PM, Tom Talpey <tom@talpey.com> wrote:
>
> On 6/10/2019 5:57 PM, Chuck Lever wrote:
>>> On Jun 10, 2019, at 3:14 PM, Tom Talpey <tom@talpey.com> wrote:
>>>
>>> On 6/10/2019 1:50 PM, Chuck Lever wrote:
>>>> Hi Tom-
>>>>> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>>
>>>>> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>>>>>> Hi Tom-
>>>>>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>>>>>>
>>>>>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>>>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>>>>>> The problem is that the new connection uses a different source port,
>>>>>>>> which defeats DRC hash.
>>>>>>>>
>>>>>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>>>>>> The transport layer typically sets the source port value on the
>>>>>>>> connection to a random ephemeral port. The server already ignores it
>>>>>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>>>>>> client's source port on RDMA transports").
>>>>>>>
>>>>>>> Where does the entropy come from, then, for the server to not
>>>>>>> match other requests from other mount points on this same client?
>>>>>> The first ~200 bytes of each RPC Call message.
>>>>>> [ Note that this has some fun ramifications for calls with small
>>>>>> RPC headers that use Read chunks. ]
>>>>>
>>>>> Ok, good to know. I forgot that the Linux server implemented this.
>>>>> I have some concerns abot it, honestly, and it's important to remember
>>>>> that it's not the same on all servers. But for the problem you're
>>>>> fixing, it's ok I guess and certainly better than today. Still, the
>>>>> errors are goingto be completely silent, and can lead to data being
>>>>> corrupted. Well, welcome to the world of NFSv3.
>>>> I don't see another option.
>>>> Some regard this checksum as more robust than using the client's
>>>> IP source port. After all, the same argument can be made that
>>>> the server cannot depend on clients to reuse their source port.
>>>> That is simply a convention that many clients adopted before
>>>> servers used a stronger DRC hash mechanism.
>>>>>>> And since RDMA is capable of
>>>>>>> such high IOPS, the likelihood seems rather high.
>>>>>> Only when the server's durable storage is slow enough to cause
>>>>>> some RPC requests to have extremely high latency.
>>>>>> And, most clients use an atomic counter for their XIDs, so they
>>>>>> are also likely to wrap that counter over some long-pending RPC
>>>>>> request.
>>>>>> The only real answer here is NFSv4 sessions.
>>>>>>> Missing the cache
>>>>>>> might actually be safer than hitting, in this case.
>>>>>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>>>>>> connection, that includes NFSv3, to reset credit accounting
>>>>>> due to the lost half of the RPC Call/Reply pair.
>>>>>> I can very quickly reproduce bad (non-deterministic) behavior
>>>>>> by running a software build on an NFSv3 on RDMA mount point
>>>>>> with disconnect injection. If the DRC issue is addressed, the
>>>>>> software build runs to completion.
>>>>>
>>>>> Ok, good. But I have a better test.
>>>>>
>>>>> In the Connectathon suite, there's a "Special" test called "nfsidem".
>>>>> I wrote this test in, like, 1989 so I remember it :-)
>>>>>
>>>>> This test performs all the non-idempotent NFv3 operations in a loop,
>>>>> and each loop element depends on the previous one, so if there's
>>>>> any failure, the test imemdiately bombs.
>>>>>
>>>>> Nobody seems to understand it, usually when it gets run people will
>>>>> run it without injecting errors, and it "passes" so they decide
>>>>> everything is ok.
>>>>>
>>>>> So my suggestion is to run your flakeway packet-drop harness while
>>>>> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
>>>>> owing to the expensive operations it performs, so you'll need to
>>>>> run it for a long time.
>>>>>
>>>>> You'll almost definitely get a failure or two, since the NFSv3
>>>>> protocol is flawed by design. But you can compare the behaviors,
>>>>> and even compute a likelihood. I'd love to see some actual numbers.
>>>> I configured the client to disconnect after 23711 RPCs have completed.
>>>> (I can re-run these with more frequent disconnects if you think that
>>>> would be useful).
>>>> Here's a run with the DRC modification:
>>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>>> testing 100000 idempotencies in directory "./TEST"
>>>> [cel@manet ~]$ sudo umount /mnt
>>>> Here's a run with the stock v5.1 Linux server:
>>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>>> testing 100000 idempotencies in directory "./TEST"
>>>> [cel@manet ~]$
>>>> This test reported no errors in either case. We can see that the
>>>> disconnects did trigger retransmits:
>>>> RPC statistics:
>>>> 1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found)
>>>> average backlog queue length: 119
>>>
>>> Ok, well, that's 1.2% error rate, which IMO could be cranked up much
>>> higher for testing purposes. I'd also be sure the server was serving
>>> other workloads during the same time, putting at least some pressure
>>> on the DRC. The op rate of a single nfsidem test is pretty low so I
>>> doubt it's ever evicting anything.
>>>
>>> Ideally, it would be best to
>>> 1) increase the error probability
>>> 2) run several concurrent nfsidem tests, on different connections
>>> 3) apply some other load to the server, e.g. several cthon basics
>>>
>>> The idea being to actually get the needle off of zero and measure some
>>> kind of difference. Otherwise it really isn't giving any information
>>> apart from a slight didn't-break-it confidence. Honestly, I'm surprised
>>> you couldn't evince a failure from stock. On paper, these results don't
>>> actually tell us the patch is doing anything.
>> I boosted the disconnect injection rate to once every 11353 RPCs,
>> and mounted a second share with "nosharecache", running nfsidem on
>> both mounts. Both mounts are subject to disconnect injection.
>> With the current v5.2-rc Linux server, both nfsidem jobs fail within
>> 30 seconds.
>> With my DRC fix applied, both jobs run to completion with no errors.
>> It takes more than an hour.
>
> Great, that definitely proves it. It's relatively low risk, and
> should go in.
>
> I think it's worth thinking through the best way to address this
> across multiple transports, especially RDMA, where the port and
> address behaviors may differ from TCP/UDP. Perhaps tossing the whole
> idea of using the 5-tuple should simply be set aside, after all
> these years. Even though NFSv4.1 already has!
>
> Thanks for humoring me with the extra testing, by the way :-)
Sho 'nuf. Worth it to ensure we are all on the same page now.
I'll send this to Bruce with some clarifications to the description.
> Tom.
>
>> Here are the op metrics for one of the mounts during a run that
>> completes successfully:
>> RPC statistics:
>> 4091380 RPC requests sent, 4088143 RPC replies received (0 XIDs not found)
>> average backlog queue length: 1800
>> ACCESS:
>> 300395 ops (7%) 301 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 132 avg bytes received per op: 120
>> backlog wait: 4.289199 RTT: 0.019821 total execute time: 4.315092 (milliseconds)
>> REMOVE:
>> 300390 ops (7%) 168 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 136 avg bytes received per op: 144
>> backlog wait: 2.368664 RTT: 0.070106 total execute time: 2.445148 (milliseconds)
>> MKDIR:
>> 200262 ops (4%) 193 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 158 avg bytes received per op: 271
>> backlog wait: 4.207034 RTT: 0.075421 total execute time: 4.289101 (milliseconds)
>> RMDIR:
>> 200262 ops (4%) 100 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 130 avg bytes received per op: 144
>> backlog wait: 2.050749 RTT: 0.071676 total execute time: 2.128801 (milliseconds)
>> LOOKUP:
>> 194664 ops (4%) 233 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 136 avg bytes received per op: 176
>> backlog wait: 5.365984 RTT: 0.020615 total execute time: 5.392769 (milliseconds)
>> SETATTR:
>> 100130 ops (2%) 86 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 160 avg bytes received per op: 144
>> backlog wait: 3.520603 RTT: 0.066863 total execute time: 3.594327 (milliseconds)
>> WRITE:
>> 100130 ops (2%) 82 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 180 avg bytes received per op: 136
>> backlog wait: 3.331249 RTT: 0.118316 total execute time: 3.459563 (milliseconds)
>> CREATE:
>> 100130 ops (2%) 95 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 168 avg bytes received per op: 272
>> backlog wait: 4.099451 RTT: 0.071437 total execute time: 4.177479 (milliseconds)
>> SYMLINK:
>> 100130 ops (2%) 83 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 188 avg bytes received per op: 271
>> backlog wait: 3.727704 RTT: 0.073534 total execute time: 3.807700 (milliseconds)
>> RENAME:
>> 100130 ops (2%) 68 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 180 avg bytes received per op: 260
>> backlog wait: 2.659982 RTT: 0.070518 total execute time: 2.738979 (milliseconds)
>> LINK:
>> 100130 ops (2%) 85 retrans (0%) 0 major timeouts
>> avg bytes sent per op: 172 avg bytes received per op: 232
>> backlog wait: 3.676680 RTT: 0.066773 total execute time: 3.749785 (milliseconds)
>> GETATTR:
>> 230 ops (0%) 81 retrans (35%) 0 major timeouts
>> avg bytes sent per op: 170 avg bytes received per op: 112
>> backlog wait: 1584.026087 RTT: 0.043478 total execute time: 1584.082609 (milliseconds)
>> READDIRPLUS:
>> 10 ops (0%)
>> avg bytes sent per op: 149 avg bytes received per op: 1726
>> backlog wait: 0.000000 RTT: 0.300000 total execute time: 0.400000 (milliseconds)
>> FSINFO:
>> 2 ops (0%)
>> avg bytes sent per op: 112 avg bytes received per op: 80
>> backlog wait: 0.000000 RTT: 0.000000 total execute time: 0.000000 (milliseconds)
>> NULL:
>> 1 ops (0%)
>> avg bytes sent per op: 40 avg bytes received per op: 24
>> backlog wait: 2.000000 RTT: 0.000000 total execute time: 3.000000 (milliseconds)
>> READLINK:
>> 1 ops (0%)
>> avg bytes sent per op: 128 avg bytes received per op: 1140
>> backlog wait: 0.000000 RTT: 0.000000 total execute time: 0.000000 (milliseconds)
>> PATHCONF:
>> 1 ops (0%)
>> avg bytes sent per op: 112 avg bytes received per op: 56
>> backlog wait: 0.000000 RTT: 0.000000 total execute time: 0.000000 (milliseconds)
>>> Tom.
>>>
>>>> ACCESS:
>>>> 300001 ops (15%) 44 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 132 avg bytes received per op: 120
>>>> backlog wait: 0.591118 RTT: 0.017463 total execute time: 0.614795 (milliseconds)
>>>> REMOVE:
>>>> 300000 ops (15%) 40 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 136 avg bytes received per op: 144
>>>> backlog wait: 0.531667 RTT: 0.018973 total execute time: 0.556927 (milliseconds)
>>>> MKDIR:
>>>> 200000 ops (10%) 26 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 158 avg bytes received per op: 272
>>>> backlog wait: 0.518940 RTT: 0.019755 total execute time: 0.545230 (milliseconds)
>>>> RMDIR:
>>>> 200000 ops (10%) 24 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 130 avg bytes received per op: 144
>>>> backlog wait: 0.512320 RTT: 0.018580 total execute time: 0.537095 (milliseconds)
>>>> LOOKUP:
>>>> 188533 ops (9%) 21 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 136 avg bytes received per op: 174
>>>> backlog wait: 0.455925 RTT: 0.017721 total execute time: 0.480011 (milliseconds)
>>>> SETATTR:
>>>> 100000 ops (5%) 11 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 160 avg bytes received per op: 144
>>>> backlog wait: 0.371960 RTT: 0.019470 total execute time: 0.398330 (milliseconds)
>>>> WRITE:
>>>> 100000 ops (5%) 9 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 180 avg bytes received per op: 136
>>>> backlog wait: 0.399190 RTT: 0.022860 total execute time: 0.436610 (milliseconds)
>>>> CREATE:
>>>> 100000 ops (5%) 9 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 168 avg bytes received per op: 272
>>>> backlog wait: 0.365290 RTT: 0.019560 total execute time: 0.391140 (milliseconds)
>>>> SYMLINK:
>>>> 100000 ops (5%) 18 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 188 avg bytes received per op: 272
>>>> backlog wait: 0.750470 RTT: 0.020150 total execute time: 0.786410 (milliseconds)
>>>> RENAME:
>>>> 100000 ops (5%) 14 retrans (0%) 0 major timeouts
>>>> avg bytes sent per op: 180 avg bytes received per op: 260
>>>> backlog wait: 0.461650 RTT: 0.020710 total execute time: 0.489670 (milliseconds)
>>>> --
>>>> Chuck Lever
>> --
>> Chuck Lever
--
Chuck Lever
next prev parent reply other threads:[~2019-06-11 14:23 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-06-05 12:15 [PATCH RFC] svcrdma: Ignore source port when computing DRC hash Chuck Lever
2019-06-05 15:57 ` Olga Kornievskaia
2019-06-05 17:28 ` Chuck Lever
2019-06-06 18:13 ` Olga Kornievskaia
2019-06-06 18:33 ` Chuck Lever
2019-06-07 15:43 ` Olga Kornievskaia
2019-06-10 14:38 ` Tom Talpey
2019-06-05 16:43 ` Tom Talpey
2019-06-05 17:25 ` Chuck Lever
2019-06-10 14:50 ` Tom Talpey
2019-06-10 17:50 ` Chuck Lever
2019-06-10 19:14 ` Tom Talpey
2019-06-10 21:57 ` Chuck Lever
2019-06-10 22:13 ` Tom Talpey
2019-06-11 0:07 ` Tom Talpey
2019-06-11 14:25 ` Chuck Lever
2019-06-11 14:23 ` Chuck Lever [this message]
2019-06-06 13:08 ` Sasha Levin
2019-06-06 13:24 ` Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7DCE8406-AF05-44B3-9EEC-447A7D7E919F@oracle.com \
--to=chuck.lever@oracle.com \
--cc=linux-nfs@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=tom@talpey.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).