All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tom Talpey <tom@talpey.com>
To: Olga Kornievskaia <aglo@umich.edu>, Chuck Lever <chuck.lever@oracle.com>
Cc: NeilBrown <neilb@suse.com>,
	Anna Schumaker <Anna.Schumaker@netapp.com>,
	Trond Myklebust <trondmy@hammerspace.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 0/9] Multiple network connections for a single NFS mount.
Date: Tue, 11 Jun 2019 17:35:20 -0400	[thread overview]
Message-ID: <e33a22ef-b335-36e1-ea7f-2d2b5b2ed390@talpey.com> (raw)
In-Reply-To: <CAN-5tyFP9qK9Tjv-FCeZJGMnhhnsZh0+VCguuRaDOG2kB9A-OQ@mail.gmail.com>

On 6/11/2019 5:10 PM, Olga Kornievskaia wrote:
> On Tue, Jun 11, 2019 at 4:09 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>>
>>
>>> On Jun 11, 2019, at 4:02 PM, Tom Talpey <tom@talpey.com> wrote:
>>>
>>> On 6/11/2019 3:13 PM, Olga Kornievskaia wrote:
>>>> On Tue, Jun 11, 2019 at 1:47 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Jun 11, 2019, at 11:34 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>>
>>>>>> On Tue, Jun 11, 2019 at 10:52 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>>>
>>>>>>> Hi Neil-
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>>>
>>>>>>>> On Fri, May 31 2019, Chuck Lever wrote:
>>>>>>>>
>>>>>>>>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, May 30 2019, Chuck Lever wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Neil-
>>>>>>>>>>>
>>>>>>>>>>> Thanks for chasing this a little further.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>>>>>>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to add my voice to those supporting this work and wanting to
>>>>>>>>>>>> see it land.
>>>>>>>>>>>> We have had customers/partners wanting this sort of functionality for
>>>>>>>>>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>>>>>>>>>> "nosharetransport" mount option, so that several filesystem could be
>>>>>>>>>>>> mounted from the same server and each would get its own TCP
>>>>>>>>>>>> connection.
>>>>>>>>>>>
>>>>>>>>>>> Is it well understood why splitting up the TCP connections result
>>>>>>>>>>> in better performance?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>>>>>>>>>>
>>>>>>>>>>>> Partners have assured us that it improves total throughput,
>>>>>>>>>>>> particularly with bonded networks, but we haven't had any concrete
>>>>>>>>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>>>>>>>>>> Olga!
>>>>>>>>>>>>
>>>>>>>>>>>> My understanding, as I explain in one of the patches, is that parallel
>>>>>>>>>>>> hardware is normally utilized by distributing flows, rather than
>>>>>>>>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>>>>>>>>>> So multiple flows are needed to utilizes parallel hardware.
>>>>>>>>>>>
>>>>>>>>>>> Indeed.
>>>>>>>>>>>
>>>>>>>>>>> However I think one of the problems is what happens in simpler scenarios.
>>>>>>>>>>> We had reports that using nconnect > 1 on virtual clients made things
>>>>>>>>>>> go slower. It's not always wise to establish multiple connections
>>>>>>>>>>> between the same two IP addresses. It depends on the hardware on each
>>>>>>>>>>> end, and the network conditions.
>>>>>>>>>>
>>>>>>>>>> This is a good argument for leaving the default at '1'.  When
>>>>>>>>>> documentation is added to nfs(5), we can make it clear that the optimal
>>>>>>>>>> number is dependant on hardware.
>>>>>>>>>
>>>>>>>>> Is there any visibility into the NIC hardware that can guide this setting?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I doubt it, partly because there is more than just the NIC hardware at issue.
>>>>>>>> There is also the server-side hardware and possibly hardware in the middle.
>>>>>>>
>>>>>>> So the best guidance is YMMV. :-)
>>>>>>>
>>>>>>>
>>>>>>>>>>> What about situations where the network capabilities between server and
>>>>>>>>>>> client change? Problem is that neither endpoint can detect that; TCP
>>>>>>>>>>> usually just deals with it.
>>>>>>>>>>
>>>>>>>>>> Being able to manually change (-o remount) the number of connections
>>>>>>>>>> might be useful...
>>>>>>>>>
>>>>>>>>> Ugh. I have problems with the administrative interface for this feature,
>>>>>>>>> and this is one of them.
>>>>>>>>>
>>>>>>>>> Another is what prevents your client from using a different nconnect=
>>>>>>>>> setting on concurrent mounts of the same server? It's another case of a
>>>>>>>>> per-mount setting being used to control a resource that is shared across
>>>>>>>>> mounts.
>>>>>>>>
>>>>>>>> I think that horse has well and truly bolted.
>>>>>>>> It would be nice to have a "server" abstraction visible to user-space
>>>>>>>> where we could adjust settings that make sense server-wide, and then a way
>>>>>>>> to mount individual filesystems from that "server" - but we don't.
>>>>>>>
>>>>>>> Even worse, there will be some resource sharing between containers that
>>>>>>> might be undesirable. The host should have ultimate control over those
>>>>>>> resources.
>>>>>>>
>>>>>>> But that is neither here nor there.
>>>>>>>
>>>>>>>
>>>>>>>> Probably the best we can do is to document (in nfs(5)) which options are
>>>>>>>> per-server and which are per-mount.
>>>>>>>
>>>>>>> Alternately, the behavior of this option could be documented this way:
>>>>>>>
>>>>>>> The default value is one. To resolve conflicts between nconnect settings on
>>>>>>> different mount points to the same server, the value set on the first mount
>>>>>>> applies until there are no more mounts of that server, unless nosharecache
>>>>>>> is specified. When following a referral to another server, the nconnect
>>>>>>> setting is inherited, but the effective value is determined by other mounts
>>>>>>> of that server that are already in place.
>>>>>>>
>>>>>>> I hate to say it, but the way to make this work deterministically is to
>>>>>>> ask administrators to ensure that the setting is the same on all mounts
>>>>>>> of the same server. Again I'd rather this take care of itself, but it
>>>>>>> appears that is not going to be possible.
>>>>>>>
>>>>>>>
>>>>>>>>> Adding user tunables has never been known to increase the aggregate
>>>>>>>>> amount of happiness in the universe. I really hope we can come up with
>>>>>>>>> a better administrative interface... ideally, none would be best.
>>>>>>>>
>>>>>>>> I agree that none would be best.  It isn't clear to me that that is
>>>>>>>> possible.
>>>>>>>> At present, we really don't have enough experience with this
>>>>>>>> functionality to be able to say what the trade-offs are.
>>>>>>>> If we delay the functionality until we have the perfect interface,
>>>>>>>> we may never get that experience.
>>>>>>>>
>>>>>>>> We can document "nconnect=" as a hint, and possibly add that
>>>>>>>> "nconnect=1" is a firm guarantee that more will not be used.
>>>>>>>
>>>>>>> Agree that 1 should be the default. If we make this setting a
>>>>>>> hint, then perhaps it should be renamed; nconnect makes it sound
>>>>>>> like the client will always open N connections. How about "maxconn" ?
>>>>>>
>>>>>> "maxconn" sounds to me like it's possible that the code would choose a
>>>>>> number that's less than that which I think would be misleading given
>>>>>> that the implementation (as is now) will open the specified number of
>>>>>> connection (bounded by the hard coded default we currently have set at
>>>>>> some value X which I'm in favor is increasing from 16 to 32).
>>>>>
>>>>> Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
>>>>> like the long term plan is to allow "up to N" connections with some
>>>>> mechanism to create new connections on-demand." maxconn fits that idea
>>>>> better, though I'd prefer no new mount options... the point being that
>>>>> eventually, this setting is likely to be an upper bound rather than a
>>>>> fixed value.
>>>> Fair enough. If the dynamic connection management is in the cards,
>>>> then "maxconn" would be an appropriate name but I also agree with you
>>>> that if we are doing dynamic management then we shouldn't need a mount
>>>> option at all. I, for one, am skeptical that we'll gain benefits from
>>>> dynamic connection management given that cost of tearing and starting
>>>> the new connection.
>>>> I would argue that since now no dynamic management is implemented then
>>>> we stay with the "nconnect" mount option and if and when such feature
>>>> is found desirable then we get rid of the mount option all together.
>>>>>>> Then, to better define the behavior:
>>>>>>>
>>>>>>> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
>>>>>>> count of the client’s NUMA nodes? I’d be in favor of a small number
>>>>>>> to start with. Solaris' experience with multiple connections is that
>>>>>>> there is very little benefit past 8.
>>>>>>
>>>>>> My linux to linux experience has been that there is benefit of having
>>>>>> more than 8 connections. I have previously posted results that went
>>>>>> upto 10 connection (it's on my list of thing to test uptown 16). With
>>>>>> the Netapp performance lab they have maxed out 25G connection setup
>>>>>> they were using with so they didn't experiment with nconnect=8 but no
>>>>>> evidence that with a larger network pipe performance would stop
>>>>>> improving.
>>>>>>
>>>>>> Given the existing performance studies, I would like to argue that
>>>>>> having such low values are not warranted.
>>>>>
>>>>> They are warranted until we have a better handle on the risks of a
>>>>> performance regression occurring with large nconnect settings. The
>>>>> maximum number can always be raised once we are confident the
>>>>> behaviors are well understood.
>>>>>
>>>>> Also, I'd like to see some careful studies that demonstrate why
>>>>> you don't see excellent results with just two or three connections.
>>>>> Nearly full link bandwidth has been achieved with MP-TCP and two or
>>>>> three subflows on one NIC. Why is it not possible with NFS/TCP ?
>>>> Performance tests that do simple buffer to buffer measurements are one
>>>> thing but doing a complicated system that involves a filesystem is
>>>> another thing. The closest we can get to this network performance
>>>> tests is NFSoRDMA which saves various copies and as you know with that
>>>> we can get close to network link capacity.
>>
>> Yes, in certain circumstances, but there are still areas that can
>> benefit or need substantial improvement (NFS WRITE performance is
>> one such area).
>>
>>
>>> I really hope nconnect is not just a workaround for some undiscovered
>>> performance issue. All that does is kick the can down the road.
>>>
>>> But a word of experience from SMB3 multichannel - more connections also
>>> bring more issues for customers. Inevitably, with many connections
>>> active under load, one or more will experience disconnects or slowdowns.
>>> When this happens, some very unpredictable and hard to diagnose
>>> behaviors start to occur. For example, all that careful load balancing
>>> immediately goes out the window, and retries start to take over the
>>> latencies. Some IOs sail through (the ones on the good connections) and
>>> others delay for many seconds (while the connection is reestablished).
>>> I don't recommend starting this effort with such a lofty goal as 8, 10
>>> or 16 connections, especially with a protocol such as NFSv3.
>>
>> +1
>>
>> Learn to crawl then walk then run.
> 
> Neil,
> 
> What's your experience with providing "nosharedtransport" option to
> the SLE customers? Were you are having customers coming back and
> complain about the multiple connections issues?
> 
> When the connection is having issues, because we have to retransmit
> from the same port, there isn't anything to be done but wait for the
> new connection to be established and add to the latency of the
> operation over the bad connection. There could be smarts added to the
> (new) scheduler to grade the connections and if connection is having
> issues not assign tasks to it until it recovers but all that are
> additional improvement and I don't think we should restrict
> connections right of the bet. This is an option that allows for 8, 10,
> 16 (32) connections but it doesn't mean customer have to set such high
> value and we can recommend for low values.
> 
> Solaris has it, Microsoft has it and linux has been deprived of it,
> let's join the party.

Let me be clear about one thing - SMB3 has it because the protocol
is designed for it. Multichannel leverages SMB2 sessions to allow
retransmit on any active bound connection. NFSv4.1 (and later) have
a similar capability.

NFSv2 and NFSv3, however, do not, and I've already stated my concerns
about pushing them too far. I agree with your sentiment, but for these
protocols, please bear in mind the risks.

Tom.

> 
> 
>>
>>
>>> JMHO.
>>>
>>> Tom.
>>>
>>>
>>>>>>> If maxconn is specified with a datagram transport, does the mount
>>>>>>> operation fail, or is the setting is ignored?
>>>>>>
>>>>>> Perhaps we can add a warning on the mount command saying that option
>>>>>> is ignored but succeed the mount.
>>>>>>
>>>>>>> If maxconn is a hint, when does the client open additional
>>>>>>> connections?
>>>>>>>
>>>>>>> IMO documentation should be clear that this setting is not for the
>>>>>>> purpose of multipathing/trunking (using multiple NICs on the client
>>>>>>> or server). The client has to do trunking detection/discovery in that
>>>>>>> case, and nconnect doesn't add that logic. This is strictly for
>>>>>>> enabling multiple connections between one client-server IP address
>>>>>>> pair.
>>>>>>
>>>>>> I agree this should be as that last statement says multiple connection
>>>>>> to the same IP and in my option this shouldn't be a hint.
>>>>>>
>>>>>>> Do we need to state explicitly that all transport connections for a
>>>>>>> mount (or client-server pair) are the same connection type (i.e., all
>>>>>>> TCP or all RDMA, never a mix)?
>>>>>>
>>>>>> That might be an interesting future option but I think for now, we can
>>>>>> clearly say it's a TCP only option in documentation which can always
>>>>>> be changed if extension to that functionality will be implemented.
>>>>>
>>>>> Is there a reason you feel RDMA shouldn't be included? I've tried
>>>>> nconnect with my RDMA rig, and didn't see any problem with it.
>>>> No reason, I should have said "a single type of a connection only
>>>> option" not a mix. Of course with RDMA even with a single connection
>>>> we can achieve almost max bandwidth so having using nconnect seems
>>>> unnecessary.
>>>>>>>> Then further down the track, we might change the actual number of
>>>>>>>> connections automatically if a way can be found to do that without cost.
>>>>>>>
>>>>>>> Fair enough.
>>>>>>>
>>>>>>>
>>>>>>>> Do you have any objections apart from the nconnect= mount option?
>>>>>>>
>>>>>>> Well I realize my last e-mail sounded a little negative, but I'm
>>>>>>> actually in favor of adding the ability to open multiple connections
>>>>>>> per client-server pair. I just want to be careful about making this
>>>>>>> a feature that has as few downsides as possible right from the start.
>>>>>>> I'll try to be more helpful in my responses.
>>>>>>>
>>>>>>> Remaining implementation issues that IMO need to be sorted:
>>>>>>
>>>>>> I'm curious are you saying all this need to be resolved before we
>>>>>> consider including this functionality? These are excellent questions
>>>>>> but I think they imply some complex enhancements (like ability to do
>>>>>> different schedulers and not only round robin) that are "enhancement"
>>>>>> and not requirements.
>>>>>>
>>>>>>> • We want to take care that the client can recover network resources
>>>>>>> that have gone idle. Can we reuse the auto-close logic to close extra
>>>>>>> connections?
>>>>>> Since we are using round-robin scheduler then can we consider any
>>>>>> resources going idle?
>>>>>
>>>>> Again, I was thinking of nconnect as a hint here, not as a fixed
>>>>> number of connections.
>>>>>
>>>>>
>>>>>> It's hard to know the future, we might set a
>>>>>> timer after which we can say that a connection has been idle for long
>>>>>> enough time and we close it and as soon as that happens the traffic is
>>>>>> going to be generated again and we'll have to pay the penalty of
>>>>>> establishing a new connection before sending traffic.
>>>>>>
>>>>>>> • How will the client schedule requests on multiple connections?
>>>>>>> Should we enable the use of different schedulers?
>>>>>> That's an interesting idea but I don't think it shouldn't stop the
>>>>>> round robin solution from going thru.
>>>>>>
>>>>>>> • How will retransmits be handled?
>>>>>>> • How will the client recover from broken connections? Today's clients
>>>>>>> use disconnect to determine when to retransmit, thus there might be
>>>>>>> some unwanted interactions here that result in mount hangs.
>>>>>>> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
>>>>>>> client support in place for this already?
>>>>>>> • Are there any concerns about how the Linux server DRC will behave in
>>>>>>> multi-connection scenarios?
>>>>>>
>>>>>> I think we've talked about retransmission question. Retransmission are
>>>>>> handled by existing logic and are done by the same transport (ie
>>>>>> connection).
>>>>>
>>>>> Given the proposition that nconnect will be a hint (eventually) in
>>>>> the form of a dynamically managed set of connections, I think we need
>>>>> to answer some of these questions again. The answers could be "not
>>>>> yet implemented" or "no way jose".
>>>>>
>>>>> It would be helpful if the answers were all in one place (eg a design
>>>>> document or FAQ).
>>>>>
>>>>>
>>>>>>> None of these seem like a deal breaker. And possibly several of these
>>>>>>> are already decided, but just need to be published/documented.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Chuck Lever
>>>>>
>>>>> --
>>>>> Chuck Lever
>>
>> --
>> Chuck Lever
>>
>>
>>
> 
> 

  reply	other threads:[~2019-06-11 21:35 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-30  0:41 [PATCH 0/9] Multiple network connections for a single NFS mount NeilBrown
2019-05-30  0:41 ` [PATCH 2/9] SUNRPC: Allow creation of RPC clients with multiple connections NeilBrown
2019-05-30  0:41 ` [PATCH 9/9] NFS: Allow multiple connections to a NFSv2 or NFSv3 server NeilBrown
2019-05-30  0:41 ` [PATCH 4/9] SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts NeilBrown
2019-05-30  0:41 ` [PATCH 5/9] SUNRPC: add links for all client xprts to debugfs NeilBrown
2019-05-30  0:41 ` [PATCH 3/9] NFS: send state management on a single connection NeilBrown
2019-07-23 18:11   ` Schumaker, Anna
2019-07-23 22:54     ` NeilBrown
2019-07-31  2:05     ` [PATCH] NFS: add flags arg to nfs4_call_sync_sequence() NeilBrown
2019-05-30  0:41 ` [PATCH 8/9] pNFS: Allow multiple connections to the DS NeilBrown
2019-05-30  0:41 ` [PATCH 1/9] SUNRPC: Add basic load balancing to the transport switch NeilBrown
2019-05-30  0:41 ` [PATCH 7/9] NFSv4: Allow multiple connections to NFSv4.x servers NeilBrown
2019-05-30  0:41 ` [PATCH 6/9] NFS: Add a mount option to specify number of TCP connections to use NeilBrown
2019-05-30 17:05 ` [PATCH 0/9] Multiple network connections for a single NFS mount Tom Talpey
2019-05-30 17:20   ` Olga Kornievskaia
2019-05-30 17:41     ` Tom Talpey
2019-05-30 18:41       ` Olga Kornievskaia
2019-05-31  1:45         ` Tom Talpey
2019-05-30 22:38       ` NeilBrown
2019-05-31  1:48         ` Tom Talpey
2019-05-31  2:31           ` NeilBrown
2019-05-31 12:39             ` Tom Talpey
2019-05-30 23:53     ` Rick Macklem
2019-05-31  0:15       ` J. Bruce Fields
2019-05-31  1:01       ` NeilBrown
2019-05-31  2:20         ` Rick Macklem
2019-05-31 12:36           ` Tom Talpey
2019-05-31 13:33             ` Trond Myklebust
2019-05-30 17:56 ` Chuck Lever
2019-05-30 18:59   ` Olga Kornievskaia
2019-05-30 22:56   ` NeilBrown
2019-05-31 13:46     ` Chuck Lever
2019-05-31 15:38       ` J. Bruce Fields
2019-06-11  1:09       ` NeilBrown
2019-06-11 14:51         ` Chuck Lever
2019-06-11 15:05           ` Tom Talpey
2019-06-11 15:20           ` Trond Myklebust
2019-06-11 15:35             ` Chuck Lever
2019-06-11 16:41               ` Trond Myklebust
2019-06-11 17:32                 ` Chuck Lever
2019-06-11 17:44                   ` Trond Myklebust
2019-06-12 12:34                     ` Steve Dickson
2019-06-12 12:47                       ` Trond Myklebust
2019-06-12 13:10                         ` Trond Myklebust
2019-06-11 15:34           ` Olga Kornievskaia
2019-06-11 17:46             ` Chuck Lever
2019-06-11 19:13               ` Olga Kornievskaia
2019-06-11 20:02                 ` Tom Talpey
2019-06-11 20:09                   ` Chuck Lever
2019-06-11 21:10                     ` Olga Kornievskaia
2019-06-11 21:35                       ` Tom Talpey [this message]
2019-06-11 22:55                         ` NeilBrown
2019-06-12 12:55                           ` Tom Talpey
2019-06-11 23:02                       ` NeilBrown
2019-06-11 23:21                   ` NeilBrown
2019-06-12 12:52                     ` Tom Talpey
2019-06-11 23:42               ` NeilBrown
2019-06-12 12:39                 ` Steve Dickson
2019-06-12 17:36                 ` Chuck Lever
2019-06-12 23:03                   ` NeilBrown
2019-06-13 16:13                     ` Chuck Lever
2019-06-12  1:49           ` NeilBrown
2019-06-12 18:32             ` Chuck Lever
2019-06-12 23:37               ` NeilBrown
2019-06-13 16:27                 ` Chuck Lever
2019-05-31  0:24 ` J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e33a22ef-b335-36e1-ea7f-2d2b5b2ed390@talpey.com \
    --to=tom@talpey.com \
    --cc=Anna.Schumaker@netapp.com \
    --cc=aglo@umich.edu \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.com \
    --cc=trondmy@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.