From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from zeniv.linux.org.uk ([195.92.253.2]:55981 "EHLO ZenIV.linux.org.uk"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932757AbcBHXfj
	(ORCPT ); Mon, 8 Feb 2016 18:35:39 -0500
Date: Mon, 8 Feb 2016 23:35:35 +0000
From: Al Viro 
To: Mike Marshall 
Cc: Linus Torvalds , linux-fsdevel , Stephen Rothwell 
Subject: Re: Orangefs ABI documentation
Message-ID: <20160208233535.GC17997@ZenIV.linux.org.uk>
References: <20160124001615.GT17997@ZenIV.linux.org.uk>
	<20160124040529.GX17997@ZenIV.linux.org.uk>
	<20160130173413.GE17997@ZenIV.linux.org.uk>
	<20160130182731.GF17997@ZenIV.linux.org.uk>
	<20160206194210.GX17997@ZenIV.linux.org.uk>
	<20160207013835.GY17997@ZenIV.linux.org.uk>
	<20160207035331.GZ17997@ZenIV.linux.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: 

On Mon, Feb 08, 2016 at 05:26:53PM -0500, Mike Marshall wrote:
> My impression is that dbench is more likely to fail ungracefully
> if I signal the client-core to abort during the execute phase, and
> more likely to complete normally if I signal the client-core to abort
> during warmup (or cleanup, which removes the directory tree
> built during warmup).
>
> I'll do more tests tomorrow with more debug turned on, and see if
> I can get some idea of what makes dbench so ill... the most important
> thing is that the kernel doesn't crash, but it would be gravy if user
> processes could better withstand a client-core recycle.
>
> Here's the bufmap debug output, I didn't want to send 700k
> of mostly "orangefs_bufmap_copy_from_iovec" to the list:
>
> http://myweb.clemson.edu/~hubcap/out

From the look of it, quite a few are of the "getattr after create has
failed due to restart" variety - AFAICS, that has nothing to do with
slot-related issues, but readv/writev failures with -EINTR *are*
slot-related (in addition to the problem with EINTR handling in callers
of wait_for_direct_io() mentioned upthread)...

> grepping for "finalize" in all that noise is a good way to see
> where client-core restarts happened.  I ran dbench numerous
> times, and managed to signal the client-core to restart during
> the same run several times...

FWIW, restart treatment is definitely still not right - we *do* want to
wait for the bufmap to run down before trying to claim a slot on restart.
The problems with the current code are:
	* more than one process holding slots during the daemon restart =>
it is possible for the first one to get through dropping its slot and into
requesting a new one before the second one manages to drop its slot.
The restarted daemon won't be able to insert its bufmap in that case.
	* one process holding a slot during restart => very likely EIO,
since it ends up dropping the last reference to the old bufmap and finds
a NULL __orangefs_bufmap if it gets there before the new daemon.
	* lack of a wait for daemon restart anyplace reachable (the one in
service_operation() can be reached only if we are about to do a
double-free, already having the bufmap refcounts buggered; with that crap
fixed, you are not hitting that logic anymore).

I think I have a solution for the slot side of that, but I'm still not
happy with the use of low-level waitqueue primitives in that sucker, which
is almost always a Very Bad Sign(tm).
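[The rundown wait described above can be modeled in user space. The
following is a minimal, hypothetical sketch - bufmap_model, slot_get,
slot_put and daemon_install_new_bufmap are invented names, not anything in
the kernel sources - of the idea that slot holders pin the old bufmap with
a reference and a restarted daemon blocks until that refcount drains
before installing its new bufmap, with POSIX condition variables standing
in for kernel waitqueues:]

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space model of "wait for the old bufmap to run down
 * before the restarted daemon installs a new one".  None of these names
 * exist in the OrangeFS sources. */
struct bufmap_model {
	pthread_mutex_t lock;
	pthread_cond_t drained;
	int refs;       /* slots still held against the old bufmap */
	int generation; /* bumped each time a new bufmap is installed */
};

static void slot_get(struct bufmap_model *bm)
{
	pthread_mutex_lock(&bm->lock);
	bm->refs++;
	pthread_mutex_unlock(&bm->lock);
}

static void slot_put(struct bufmap_model *bm)
{
	pthread_mutex_lock(&bm->lock);
	if (--bm->refs == 0)
		pthread_cond_broadcast(&bm->drained); /* last holder wakes the daemon */
	pthread_mutex_unlock(&bm->lock);
}

/* Restarted daemon: refuse to install the new bufmap until every holder
 * has dropped its slot, so no process can race its slot drop against the
 * install. */
static void daemon_install_new_bufmap(struct bufmap_model *bm)
{
	pthread_mutex_lock(&bm->lock);
	while (bm->refs > 0)
		pthread_cond_wait(&bm->drained, &bm->lock);
	bm->generation++;
	pthread_mutex_unlock(&bm->lock);
}
```

[Serializing the install behind the drain would rule out both failure
modes above - the racing re-claim and the NULL __orangefs_bufmap EIO - at
the cost of the daemon blocking until the last holder lets go.]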
I'm not sure what to do with operations that go into restart directly in
service_operation() - it looks like it ought to wait for op_timeout_secs,
and if the damn thing isn't back *and* hasn't managed to service the
operation by that time, we are screwed.

Note that this
        if (op_state_purged(op)) {
                ret = (op->attempts < ORANGEFS_PURGE_RETRY_COUNT) ?
                         -EAGAIN :
                         -EIO;
                gossip_debug(GOSSIP_WAIT_DEBUG,
                             "*** %s:"
                             " operation purged (tag "
                             "%llu, %p, att %d)\n",
                             __func__,
                             llu(op->tag),
                             op,
                             op->attempts);
                orangefs_clean_up_interrupted_operation(op);
                break;
        }
is hit only if the daemon got stopped after we'd submitted the operation;
if it had been stopped before entering service_operation() and the new one
hadn't managed to catch up with requests in that interval, no attempts to
resubmit are made.  And AFAICS
        if (is_daemon_in_service() < 0) {
                /*
                 * By incrementing the per-operation attempt counter, we
                 * directly go into the timeout logic while waiting for
                 * the matching downcall to be read
                 */
                gossip_debug(GOSSIP_WAIT_DEBUG,
                             "%s:client core is NOT in service(%d).\n",
                             __func__,
                             is_daemon_in_service());
                op->attempts++;
        }
is inherently racy.  Suppose the daemon gets stopped during the window
between that check and the moment the request is added to the queue.  If
it had been stopped past that window, the op would've been purged; if it
had been stopped before the check, we get ->attempts bumped; but if it
happens _within_ that window, in wait_for_matching_downcall() you'll be
getting to
                /*
                 * if this was our first attempt and client-core
                 * has not purged our operation, we are happy to
                 * simply wait
                 */
                if (op->attempts == 0 && !op_state_purged(op)) {
                        spin_unlock(&op->lock);
                        schedule();
with nothing to wake you up until the new daemon gets started.  Sure,
it's an interruptible sleep, so you can always kill -9 the sucker, but it
still looks wrong.
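[The window described above is a classic check-then-act race; one way to
close it would be to make the daemon-alive check and the enqueue a single
step under one lock.  The sketch below is a hypothetical user-space model
of that shape - queue_model, op_model and submit_op are invented names,
with a plain flag standing in for is_daemon_in_service() - not the kernel
code:]

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical model of an op and its submission path. */
struct op_model {
	int attempts; /* bumped when submitted against a stopped daemon */
	int queued;   /* on the request queue, visible to a later purge */
};

struct queue_model {
	pthread_mutex_t lock; /* protects daemon_running AND the queue */
	int daemon_running;
};

/* Check and enqueue atomically: either the op sees the daemon down and
 * bumps ->attempts (so the timeout logic kicks in), or it is queued while
 * the daemon is provably up - a purge after that point takes the same
 * lock and finds the op on the queue, so no op can fall into the window
 * between the check and the enqueue. */
static void submit_op(struct queue_model *q, struct op_model *op)
{
	pthread_mutex_lock(&q->lock);
	if (!q->daemon_running)
		op->attempts++; /* go straight into the timeout handling */
	else
		op->queued = 1; /* purge will see and mark this op */
	pthread_mutex_unlock(&q->lock);
}
```

[With the check and the enqueue under one lock, the "attempts == 0 and not
purged" sleep above can no longer be entered by an op that raced a daemon
stop, since every op either carries a bumped ->attempts or is findable by
the purge.]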