Date: Tue, 9 Feb 2016 03:32:04 +0000
From: Al Viro
To: Mike Marshall
Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell
Subject: Re: Orangefs ABI documentation
Message-ID: <20160209033203.GE17997@ZenIV.linux.org.uk>
References: <20160124040529.GX17997@ZenIV.linux.org.uk>
 <20160130173413.GE17997@ZenIV.linux.org.uk>
 <20160130182731.GF17997@ZenIV.linux.org.uk>
 <20160206194210.GX17997@ZenIV.linux.org.uk>
 <20160207013835.GY17997@ZenIV.linux.org.uk>
 <20160207035331.GZ17997@ZenIV.linux.org.uk>
 <20160208233535.GC17997@ZenIV.linux.org.uk>
In-Reply-To: <20160208233535.GC17997@ZenIV.linux.org.uk>

On Mon, Feb 08, 2016 at 11:35:35PM +0000, Al Viro wrote:
> And AFAICS
> 	if (is_daemon_in_service() < 0) {
> 		/*
> 		 * By incrementing the per-operation attempt counter, we
> 		 * directly go into the timeout logic while waiting for
> 		 * the matching downcall to be read
> 		 */
> 		gossip_debug(GOSSIP_WAIT_DEBUG,
> 			     "%s:client core is NOT in service(%d).\n",
> 			     __func__,
> 			     is_daemon_in_service());
> 		op->attempts++;
> 	}
> is inherently racy.  Suppose the daemon gets stopped during the window
> between that check and the moment the request is added to the queue.
> If it had been stopped past that window, the op would've been purged;
> if it had been stopped before the check, we get ->attempts bumped, but
> if it happens _within_ that window, in wait_for_matching_downcall()
> you'll be getting to
> 		/*
> 		 * if this was our first attempt and client-core
> 		 * has not purged our operation, we are happy to
> 		 * simply wait
> 		 */
> 		if (op->attempts == 0 && !op_state_purged(op)) {
> 			spin_unlock(&op->lock);
> 			schedule();
> with nothing to wake you up until the new daemon gets started.  Sure,
> it's an interruptible sleep, so you can always kill -9 the sucker, but
> it still looks wrong.

BTW, what is wait_for_matching_downcall() trying to do if the purge
happens just as it's being entered?  We put ourselves on op->waitq.
Then we check op_state_serviced() - OK, it hadn't been.  We check for
signals; normally there won't be any.  Then we see that the state is
purged and proceed to do schedule_timeout(op_timeout_secs * HZ).  Which
will be a plain and simple "sleep for 10 seconds", since the wakeup on
op->waitq happened on purge, before we'd done prepare_to_wait().  That
will be followed by returning -ETIMEDOUT and failing service_operation()
with no attempt to retry.  OTOH, if the purge happens a few cycles
later, after we'd entered the loop and done prepare_to_wait(), we'll
end up in the retry logic.  Looks bogus...

FWIW, what all these bugs show is that manual use of wait queues is
very easy to get wrong.  If nothing else, I'd be tempted to replace
op->waitq with another struct completion and kill those loops in
wait_for_..._downcall() entirely - wait_for_completion_interruptible_timeout()
*ONCE*, and then look at the state.  With complete() done in
op_set_state_{purged,serviced}...  Objections?
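
For concreteness, a rough sketch of what that could look like.  This is
a minimal sketch only: the op->done field, the struct name and the exact
error returns are placeholders, not taken from the current orangefs code.

	/*
	 * Sketch only.  Assumes a new field in the op structure:
	 *	struct completion done;
	 * initialized when the op is set up, with complete(&op->done)
	 * called from the purged/serviced state transitions
	 * (op_set_state_{purged,serviced}).
	 */
	static int wait_for_matching_downcall(struct orangefs_kernel_op_s *op)
	{
		long n;

		/* one interruptible, timed wait instead of the hand-rolled loop */
		n = wait_for_completion_interruptible_timeout(&op->done,
						op_timeout_secs * HZ);

		spin_lock(&op->lock);
		if (op_state_serviced(op)) {
			spin_unlock(&op->lock);
			return 0;
		}
		spin_unlock(&op->lock);

		if (n < 0)
			return -EINTR;		/* interrupted by a signal */
		if (op_state_purged(op))
			return -EAGAIN;		/* daemon died; caller decides on retry */
		return -ETIMEDOUT;		/* nothing happened within the timeout */
	}

With something like that, service_operation() can tell timeout, signal
and purge apart just by looking at the return value and the op state,
and there is no window where a wakeup can be lost before
prepare_to_wait() the way it can be now.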