From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1759742Ab3B1Rp5 (ORCPT );
	Thu, 28 Feb 2013 12:45:57 -0500
Received: from mail-pb0-f41.google.com ([209.85.160.41]:52835 "EHLO
	mail-pb0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753490Ab3B1Rpy (ORCPT );
	Thu, 28 Feb 2013 12:45:54 -0500
MIME-Version: 1.0
In-Reply-To: <512F948A.9060404@canonical.com>
References: <1361831310-24260-1-git-send-email-chiluk@canonical.com>
	<512DE8A6.9030000@samba.org>
	<20130227083419.0af9deaf@corrin.poochiereds.net>
	<512E8787.6070709@canonical.com>
	<20130228072637.3b71a4f7@corrin.poochiereds.net>
	<20130228084704.7f267119@corrin.poochiereds.net>
	<512F948A.9060404@canonical.com>
Date: Thu, 28 Feb 2013 11:45:53 -0600
Message-ID: 
Subject: Re: [PATCH] CIFS: Decrease reconnection delay when switching nics
From: Steve French 
To: Dave Chiluk 
Cc: Jeff Layton , "Stefan (metze) Metzmacher" ,
	Steve French , linux-cifs@vger.kernel.org,
	samba-technical@lists.samba.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Feb 28, 2013 at 11:31 AM, Dave Chiluk wrote:
> On 02/28/2013 10:47 AM, Jeff Layton wrote:
>> On Thu, 28 Feb 2013 10:04:36 -0600
>> Steve French wrote:
>>
>>> On Thu, Feb 28, 2013 at 9:26 AM, Jeff Layton wrote:
>>>> On Wed, 27 Feb 2013 16:24:07 -0600
>>>> Dave Chiluk wrote:
>>>>
>>>>> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>>>>>> On Wed, 27 Feb 2013 12:06:14 +0100
>>>>>> "Stefan (metze) Metzmacher" wrote:
>>>>>>
>>>>>>> Hi Dave,
>>>>>>>
>>>>>>>> When messages are currently in queue awaiting a response, decrease the
>>>>>>>> amount of time before attempting cifs_reconnect to SMB_MAX_RTT = 10
>>>>>>>> seconds. The wait time before attempting to reconnect is currently
>>>>>>>> 2*SMB_ECHO_INTERVAL (120 seconds) since the last response was received.
>>>>>>>> This does not take into account the fact that messages waiting for a
>>>>>>>> response should be serviced within a reasonable round trip time.
>>>>>>>
>>>>>>> Wouldn't that mean that the client will disconnect a good connection
>>>>>>> if the server doesn't respond within 10 seconds?
>>>>>>> Reads and Writes can take longer than 10 seconds...
>>>>>>>
>>>>>>
>>>>>> Where does this magic value of 10s come from? Note that a slow server
>>>>>> can take *minutes* to respond to writes that are long past the EOF.
>>>>> It comes from the desire to decrease the reconnection delay to something
>>>>> better than a random number between 60 and 120 seconds. I am not
>>>>> committed to this number, and it is open for discussion. Additionally,
>>>>> if you look closely at the logic, it's not 10 seconds per request;
>>>>> rather, when requests have been in flight for more than 10 seconds,
>>>>> make sure we've heard from the server in the last 10 seconds.
>>>>>
>>>>> Can you explain more fully your use case of writes that are long past
>>>>> the EOF? Perhaps with a test case or script that I can test? As far as
>>>>> I know, writes long past EOF will just result in a sparse file and
>>>>> return in a reasonable round trip time (that's at least what I'm seeing
>>>>> with my testing). dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
>>>>> seek=100000 starts receiving responses from the server in about .05
>>>>> seconds, with subsequent responses following at roughly .002-.01 second
>>>>> intervals. This is well within my 10 second value.
>>>>> Even adding the latency of AT&T's 2G cell network brings it up to
>>>>> only 1s, still 10x less than my 10 second value.
>>>>>
>>>>> The new logic goes like this:
>>>>> if (we've been expecting a response from the server (in_flight), and
>>>>>     a message has been in_flight for more than 10 seconds, and
>>>>>     we haven't had any other contact from the server in that time)
>>>>>         reconnect
>>>>>
>>>>
>>>> That will break writes long past the EOF. Note too that reconnects on
>>>> CIFS are horrifically expensive and problematic. Much of the state on a
>>>> CIFS mount is tied to the connection. When that drops, open files are
>>>> closed and things like locks are dropped. SMB1 has no real mechanism
>>>> for state recovery, so that can really be a problem.
>>>>
>>>>> On a side note, I discovered a small race condition in the previous
>>>>> logic while working on this, which my new patch also fixes:
>>>>> 1s      request
>>>>> 2s      response
>>>>> 61.995  echo job pops
>>>>> 121.995 echo job pops and sends echo
>>>>> 122     server_unresponsive called; finds no response and attempts
>>>>>         to reconnect
>>>>> 122.95  response to echo received
>>>>>
>>>>
>>>> Sure, here's a reproducer. Do this against a Windows server, preferably
>>>> one exporting NTFS on relatively slow storage. Make sure that
>>>> "testfile" doesn't exist first:
>>>>
>>>> $ dd if=/dev/zero of=/path/to/cifs/share/testfile bs=1M count=1 seek=3192
>>>>
>>>> NTFS doesn't support sparse files, so the OS has to zero-fill up to the
>>>> point where you're writing. That can take a looooong time on slow
>>>> storage (minutes even). What we do now is periodically send an SMB echo
>>>> to make sure the server is alive rather than trying to time out a
>>>> particular call.
>>>
>>> Writing past end of file in Windows can be very slow, but note that it
>>> is possible for a Windows application to mark a file as sparse on an
>>> NTFS partition. Quoting from
>>> http://msdn.microsoft.com/en-us/library/windows/desktop/aa365566%28v=vs.85%29.aspx
>>>
>>> Windows NTFS does support sparse files (and we could even send it over
>>> cifs if we want), but it has to be explicitly set by the app on the
>>> file:
>>>
>>> "To determine whether a file system supports sparse files, call the
>>> GetVolumeInformation function and examine the
>>> FILE_SUPPORTS_SPARSE_FILES bit flag returned through the
>>> lpFileSystemFlags parameter.
>>>
>>> Most applications are not aware of sparse files and will not create
>>> sparse files. The fact that an application is reading a sparse file is
>>> transparent to the application. An application that is aware of
>>> sparse files should determine whether its data set is suitable to be
>>> kept in a sparse file. After that determination is made, the
>>> application must explicitly declare a file as sparse, using the
>>> FSCTL_SET_SPARSE control code."
>>>
>>
>> That's interesting. I didn't know about the fsctl.
>>
>> It doesn't really help us, though. Not all servers support passthrough
>> infolevels, and there are other filesystems (e.g. FAT) that don't
>> support sparse files at all.
>>
>> In any case, the upshot of all of this is that we simply can't assume
>> that we'll get the response to a particular call in any given amount of
>> time, so we have to periodically check that the server is still
>> responding via echoes before giving up on it completely.
>>
>
> I just verified this by running the dd testcase against a Windows 7
> server. I'm going to rewrite my patch to optimise the echo logic as
> Jeff suggested earlier.
> The only difference is that I think we should still send regular echoes
> when nothing else is happening, so that the connection can be rebuilt
> while nothing urgent is going on.
>
> It still makes more sense to me that we should be checking the status
> of the TCP socket and its underlying NIC, but I'm still not completely
> clear on how that could be accomplished. Any pointers in that regard
> would be appreciated.

It is also worth checking whether the witness protocol would help us
(even in a non-clustered environment), because it was designed to allow
(at least for SMB3 mounts) a client to tell when a server is up or down.

--
Thanks,

Steve
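
For reference, a minimal userspace sketch of the two unresponsiveness
checks debated in the thread. This is not the actual fs/cifs code; the
struct fields, helper names, and constant values below are illustrative
placeholders. It contrasts the existing behaviour (give up only after
two echo intervals of total silence) with the per-request bound Dave
proposed, which Jeff objects to because a single legitimate call (e.g. a
non-sparse write far past EOF on NTFS) can take minutes.

#include <stdbool.h>
#include <time.h>

/* Illustrative placeholders, not the real fs/cifs definitions. */
#define SMB_ECHO_INTERVAL 60   /* seconds between keep-alive echoes */
#define SMB_MAX_RTT       10   /* proposed per-request response bound */

struct tcp_server_info {
	time_t last_response;   /* when we last heard anything from the server */
	time_t oldest_request;  /* send time of the oldest unanswered request */
	int    in_flight;       /* number of requests awaiting a response */
};

/* Current behaviour discussed above: only treat the server as dead when
 * it has been silent for two echo intervals (roughly 120 seconds). */
static bool server_unresponsive_current(const struct tcp_server_info *s,
					time_t now)
{
	return now - s->last_response > 2 * SMB_ECHO_INTERVAL;
}

/* Dave's proposal: if a request has been outstanding for more than
 * SMB_MAX_RTT seconds and the server has been silent for at least that
 * long, reconnect.  The objection: slow but healthy calls would trigger
 * an expensive, state-destroying reconnect. */
static bool server_unresponsive_proposed(const struct tcp_server_info *s,
					 time_t now)
{
	return s->in_flight > 0 &&
	       now - s->oldest_request > SMB_MAX_RTT &&
	       now - s->last_response  > SMB_MAX_RTT;
}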
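
And for the MSDN passage Steve quotes, a short Win32 sketch (under the
assumption of a hypothetical \\server\share path, with minimal error
handling) of how an application would check FILE_SUPPORTS_SPARSE_FILES
and opt a file into sparse semantics with FSCTL_SET_SPARSE, avoiding the
zero-fill that makes Jeff's dd reproducer so slow.

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
	DWORD flags = 0, bytes = 0;

	/* 1. Check whether the volume supports sparse files at all. */
	if (!GetVolumeInformationA("\\\\server\\share\\", NULL, 0, NULL, NULL,
				   &flags, NULL, 0) ||
	    !(flags & FILE_SUPPORTS_SPARSE_FILES)) {
		fprintf(stderr, "volume does not support sparse files\n");
		return 1;
	}

	/* 2. Explicitly mark the file as sparse; without this, NTFS
	 *    zero-fills up to the offset of a far-past-EOF write. */
	HANDLE h = CreateFileA("\\\\server\\share\\testfile", GENERIC_WRITE, 0,
			       NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
	if (h == INVALID_HANDLE_VALUE)
		return 1;

	if (!DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0,
			     &bytes, NULL))
		fprintf(stderr, "FSCTL_SET_SPARSE failed\n");

	CloseHandle(h);
	return 0;
}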