From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from zeniv.linux.org.uk ([195.92.253.2]:55981 "EHLO ZenIV.linux.org.uk"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932757AbcBHXfj
	(ORCPT ); Mon, 8 Feb 2016 18:35:39 -0500
Date: Mon, 8 Feb 2016 23:35:35 +0000
From: Al Viro 
To: Mike Marshall 
Cc: Linus Torvalds , linux-fsdevel , Stephen Rothwell 
Subject: Re: Orangefs ABI documentation
Message-ID: <20160208233535.GC17997@ZenIV.linux.org.uk>
References: <20160124001615.GT17997@ZenIV.linux.org.uk>
	<20160124040529.GX17997@ZenIV.linux.org.uk>
	<20160130173413.GE17997@ZenIV.linux.org.uk>
	<20160130182731.GF17997@ZenIV.linux.org.uk>
	<20160206194210.GX17997@ZenIV.linux.org.uk>
	<20160207013835.GY17997@ZenIV.linux.org.uk>
	<20160207035331.GZ17997@ZenIV.linux.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: 

On Mon, Feb 08, 2016 at 05:26:53PM -0500, Mike Marshall wrote:
> My impression is that dbench is more likely to fail ungracefully
> if I signal the client-core to abort during the execute phase, and
> more likely to complete normally if I signal the client-core to abort
> during warmup (or cleanup, which removes the directory tree
> built during warmup).
>
> I'll do more tests tomorrow with more debug turned on, and see if
> I can get some idea of what makes dbench so ill... the most important
> thing is that the kernel doesn't crash, but it would be gravy if user
> processes could better withstand a client-core recycle.
>
> Here's the bufmap debug output, I didn't want to send 700k
> of mostly "orangefs_bufmap_copy_from_iovec" to the list:
>
> http://myweb.clemson.edu/~hubcap/out

From the look of it, quite a few are of the "getattr after create has
failed due to restart" variety - AFAICS, that has nothing to do with
slot-related issues, but readv/writev failures with -EINTR *are*
slot-related (in addition to the problem with EINTR handling in callers
of wait_for_direct_io() mentioned upthread)...

> grepping for "finalize" in all that noise is a good way to see
> where client-core restarts happened.  I ran dbench numerous
> times, and managed to signal the client-core to restart during
> the same run several times...

FWIW, restart treatment is definitely still not right - we *do* want to
wait for the bufmap to run down before trying to claim a slot on restart.
The problems with the current code are:
	* more than one process holding slots during the daemon restart =>
it is possible for the first one to get through dropping its slot and into
requesting a new one before the second one manages to drop its slot.
The restarted daemon won't be able to insert its bufmap in that case.
	* one process holding a slot during restart => very likely EIO,
since it ends up dropping the last reference to the old bufmap and finds
a NULL __orangefs_bufmap if it gets there before the new daemon.
	* lack of a wait for daemon restart anyplace reachable (the one in
service_operation() can be reached only if we are about to do a
double-free, already having the bufmap refcounts buggered; with that crap
fixed, you are not hitting that logic anymore).

I think I have a solution for the slot side of that, but I'm still not
happy with the use of low-level waitqueue primitives in that sucker, which
is almost always a Very Bad Sign(tm).
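[The rundown wait described above can be modeled in user space. The
following is a minimal, hypothetical sketch - bufmap_model, slot_get,
slot_put and daemon_install_new_bufmap are invented names, not anything in
the kernel sources - of the idea that slot holders pin the old bufmap with
a reference and a restarted daemon blocks until that refcount drains
before installing its new bufmap, with POSIX condition variables standing
in for kernel waitqueues:]

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space model of "wait for the old bufmap to run down
 * before the restarted daemon installs a new one".  None of these names
 * exist in the OrangeFS sources. */
struct bufmap_model {
	pthread_mutex_t lock;
	pthread_cond_t drained;
	int refs;       /* slots still held against the old bufmap */
	int generation; /* bumped each time a new bufmap is installed */
};

static void slot_get(struct bufmap_model *bm)
{
	pthread_mutex_lock(&bm->lock);
	bm->refs++;
	pthread_mutex_unlock(&bm->lock);
}

static void slot_put(struct bufmap_model *bm)
{
	pthread_mutex_lock(&bm->lock);
	if (--bm->refs == 0)
		pthread_cond_broadcast(&bm->drained); /* last holder wakes the daemon */
	pthread_mutex_unlock(&bm->lock);
}

/* Restarted daemon: refuse to install the new bufmap until every holder
 * has dropped its slot, so no process can race its slot drop against the
 * install. */
static void daemon_install_new_bufmap(struct bufmap_model *bm)
{
	pthread_mutex_lock(&bm->lock);
	while (bm->refs > 0)
		pthread_cond_wait(&bm->drained, &bm->lock);
	bm->generation++;
	pthread_mutex_unlock(&bm->lock);
}
```

[Serializing the install behind the drain would rule out both failure
modes above - the racing re-claim and the NULL __orangefs_bufmap EIO - at
the cost of the daemon blocking until the last holder lets go.]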
I'm not sure what to do with operations that go into restart directly in
service_operation() - it looks like it ought to wait for op_timeout_secs,
and if the damn thing isn't back *and* hasn't managed to service the
operation by that time, we are screwed.

Note that this
        if (op_state_purged(op)) {
                ret = (op->attempts < ORANGEFS_PURGE_RETRY_COUNT) ?
                         -EAGAIN :
                         -EIO;
                gossip_debug(GOSSIP_WAIT_DEBUG,
                             "*** %s:"
                             " operation purged (tag "
                             "%llu, %p, att %d)\n",
                             __func__,
                             llu(op->tag),
                             op,
                             op->attempts);
                orangefs_clean_up_interrupted_operation(op);
                break;
        }
is hit only if the daemon got stopped after we'd submitted the operation;
if it had been stopped before entering service_operation() and the new one
hadn't managed to catch up with requests in that interval, no attempts to
resubmit are made.  And AFAICS
        if (is_daemon_in_service() < 0) {
                /*
                 * By incrementing the per-operation attempt counter, we
                 * directly go into the timeout logic while waiting for
                 * the matching downcall to be read
                 */
                gossip_debug(GOSSIP_WAIT_DEBUG,
                             "%s:client core is NOT in service(%d).\n",
                             __func__,
                             is_daemon_in_service());
                op->attempts++;
        }
is inherently racy.  Suppose the daemon gets stopped during the window
between that check and the moment the request is added to the queue.  If
it had been stopped past that window, the op would've been purged; if it
had been stopped before the check, we get ->attempts bumped; but if it
happens _within_ that window, in wait_for_matching_downcall() you'll be
getting to
                /*
                 * if this was our first attempt and client-core
                 * has not purged our operation, we are happy to
                 * simply wait
                 */
                if (op->attempts == 0 && !op_state_purged(op)) {
                        spin_unlock(&op->lock);
                        schedule();
with nothing to wake you up until the new daemon gets started.  Sure,
it's an interruptible sleep, so you can always kill -9 the sucker, but it
still looks wrong.
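[The window described above is a classic check-then-act race; one way to
close it would be to make the daemon-alive check and the enqueue a single
step under one lock.  The sketch below is a hypothetical user-space model
of that shape - queue_model, op_model and submit_op are invented names,
with a plain flag standing in for is_daemon_in_service() - not the kernel
code:]

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical model of an op and its submission path. */
struct op_model {
	int attempts; /* bumped when submitted against a stopped daemon */
	int queued;   /* on the request queue, visible to a later purge */
};

struct queue_model {
	pthread_mutex_t lock; /* protects daemon_running AND the queue */
	int daemon_running;
};

/* Check and enqueue atomically: either the op sees the daemon down and
 * bumps ->attempts (so the timeout logic kicks in), or it is queued while
 * the daemon is provably up - a purge after that point takes the same
 * lock and finds the op on the queue, so no op can fall into the window
 * between the check and the enqueue. */
static void submit_op(struct queue_model *q, struct op_model *op)
{
	pthread_mutex_lock(&q->lock);
	if (!q->daemon_running)
		op->attempts++; /* go straight into the timeout handling */
	else
		op->queued = 1; /* purge will see and mark this op */
	pthread_mutex_unlock(&q->lock);
}
```

[With the check and the enqueue under one lock, the "attempts == 0 and not
purged" sleep above can no longer be entered by an op that raced a daemon
stop, since every op either carries a bumped ->attempts or is findable by
the purge.]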