From mboxrd@z Thu Jan 1 00:00:00 1970
From: jimc@math.ucla.edu (Jim Carter)
Subject: Re: clients suddenly start hanging (was: (no subject))
Date: Wed, 7 May 2008 21:52:34 -0700 (PDT)
Message-ID: <20080508045235.13C9EF8ED9@serval.math.ucla.edu>
References: <20080423185018.122C53C3B1@xena.cft.ca.us>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: (Ian Kent "Mon, 28 Apr 2008 14:26:34 +0800")
Sender: autofs-bounces@linux.kernel.org
Errors-To: autofs-bounces@linux.kernel.org
To: autofs@linux.kernel.org
Cc: Ian Kent

On Mon, 28 Apr 2008 14:26:34 +0800 Ian Kent wrote:
> Jeff Moyer has identified a race in due to an execution order dependency
> in the autofs4 function root.c:try_to_fill_dentry().
--snip--
> After the daemon finishes the mount, it calls back into the kernel
> to release the waiters. When this happens, P1 is woken up and goes
> about clearing the DCACHE_AUTOFS_PENDING flag, but it does this in
> D1! So, given that P1 in our case is a program that will immediately
> try to access a file under /mount/submount/foo, we end up finding the
> dentry D2 which still has the pending flag set, and we set out to
> wait for a mount *again*!

I applied the two patches (redo-lookup-in-ttfd and correct-return-in-ttfd)
and restarted/reloaded the resulting module, but unfortunately this did not
fix the hanging client processes in my submount case.

I've improved my test program and noticed some behaviors that were not
obvious before.

Expected expiration is not happening. Some but usually not all (once it
*was* all) of the filesystems stay mounted for 900 secs, much longer than
the default expiration of 300 secs. But eventually many of them do get
expired. (There are no competing processes that might "cd" into an
automounted directory and prevent expiry, except for the filesystem
containing my home directory.)
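
For anyone who wants to reproduce this kind of check, here is a minimal
sketch of a probe along these lines. It is not my actual test program; the
function names is_mounted() and probe_dir() are made up for illustration,
and it assumes Linux with Python 3. It looks for a mount point in
/proc/mounts and then lists the directory in a forked child, reporting
"hung" if the child has not finished within a timeout (25 secs by default).

```python
import errno
import os
import signal
import time


def is_mounted(path, mounts_file="/proc/mounts"):
    """Return True if path is listed as a mount point in mounts_file.

    Simplification: octal escapes such as \\040 (space) used in
    /proc/mounts are not decoded.
    """
    target = os.path.normpath(path)
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[1] == target:
                return True
    return False


def probe_dir(path, timeout=25.0, poll=0.1):
    """List path in a child process.

    Returns "ok", an errno name such as "ENOENT", "signal" if the child
    was killed by a signal, or "hung" if it did not finish within
    timeout seconds.
    """
    pid = os.fork()
    if pid == 0:
        # Child: report the result through the exit status only.
        code = 255
        try:
            os.listdir(path)
            code = 0
        except OSError as e:
            code = e.errno or 255
        finally:
            os._exit(code)

    # Parent: poll without blocking so a stuck child cannot stall us.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            if not os.WIFEXITED(status):
                return "signal"
            code = os.WEXITSTATUS(status)
            if code == 0:
                return "ok"
            return errno.errorcode.get(code, "errno %d" % code)
        time.sleep(poll)

    # A child blocked in uninterruptible sleep may ignore this kill.
    os.kill(pid, signal.SIGKILL)
    return "hung"
```

Calling is_mounted() just before probe_dir() on each automounted directory,
and logging both results with a timestamp, is enough to show whether a
filesystem outlived its expected expiry and whether the next access
succeeded, failed, or hung.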

In one case a set of 3 filesystems from the same machine (submount) stayed
mounted for 103 minutes straight (never expired as far as I can see), being
tested with successful access every 900 secs. Then all three became absent
from /proc/mounts, and were tested. One of them returned ENOENT from
opendir(); the other two test processes hung, likely in opendir().

In another case a machine exports only one filesystem, and it was accessed
and expired normally for about three repetitions; then it stayed mounted
for 900 secs. It was accessed successfully, but then after only 266 secs it
disappeared from /proc/mounts, was tested, and got ENOENT. 900 secs later
it was re-tested (I assume it was not mounted), and the test process hung.

I have the test output, including backtraces with symbols, taken
immediately after each of the three hangs (actually, 25 secs after the
process started, which is the criterion I use for hanging). But it's 5627
lines or 282 Kb, and it's pretty much the same as the one I sent in earlier
except for more diagnostic output. How about I send it direct, if you want
it, rather than filling up everyone's mailbox? Or would it be better to
have the output in the mail archive even though it's bloated?

James F. Carter          Voice 310 825 2897    FAX 310 206 6673
UCLA-Mathnet;  6115 MSA; 520 Portola Plaza; Los Angeles, CA, USA 90095-1555
Email: jimc@math.ucla.edu  http://www.math.ucla.edu/~jimc (q.v. for PGP key)