From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from zeniv.linux.org.uk ([195.92.253.2]:49929 "EHLO
	ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750801AbcBKAof (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 10 Feb 2016 19:44:35 -0500
Date: Thu, 11 Feb 2016 00:44:32 +0000
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Mike Marshall <hubcap@omnibond.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>
Subject: Re: Orangefs ABI documentation
Message-ID: <20160211004432.GM17997@ZenIV.linux.org.uk>
References: <20160207035331.GZ17997@ZenIV.linux.org.uk>
 <CAOg9mSR1-7xOZCC=8dbx+zfHDep4NyUZm0e5XBDQzpCFaCNH7Q@mail.gmail.com>
 <20160208233535.GC17997@ZenIV.linux.org.uk>
 <20160209033203.GE17997@ZenIV.linux.org.uk>
 <CAOg9mSTFivsJDL-Ppruivd8gp_iThdq6N6+bWg0TfLXpV=rs8g@mail.gmail.com>
 <20160209174049.GG17997@ZenIV.linux.org.uk>
 <CAOg9mSTzE7KmrVC1zWSgC+vo20HfKLrsM3VPkxnYLN9roi+ZOw@mail.gmail.com>
 <20160209221623.GI17997@ZenIV.linux.org.uk>
 <20160209224050.GJ17997@ZenIV.linux.org.uk>
 <20160209231328.GK17997@ZenIV.linux.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160209231328.GK17997@ZenIV.linux.org.uk>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Tue, Feb 09, 2016 at 11:13:28PM +0000, Al Viro wrote:
> On Tue, Feb 09, 2016 at 10:40:50PM +0000, Al Viro wrote:
> 
> > And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
> > vulnerable to the same race.  2.8.1 isn't - it ignores signals on the
> > cancel, but that means waiting for cancel to be processed (or timed out)
> > on any interrupted read() before we return to userland.  We can return
> > to that behaviour, of course, but I suspect that offloading it to something
> > async (along with freeing the slot used by original operation) would be
> > better from QoI point of view.
> 
> That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
> during the spring of 2012).  AFAICS, all versions starting with 2.8.6 are
> vulnerable...

Matter of fact, older versions are _also_ broken, but it's much harder
to trigger; you need the daemon to stall long enough for read() to time
out, then everything as before.  After 2.8.5 all it takes on the read()
side is a SIGKILL to read() caller just after the daemon has picked your
request...

TBH, it's tempting to kill the "wait for cancel to be finished" logics
completely, mark cancel requests with a flag and stash the slot number
into them.  Then, if flag is present, let set_op_state_purged() and
set_op_state_serviced() callers release the slot and drop the request.
And have wait_for_direct_io() treat the "need to cancel" case as "fire
and forget" - no waiting for anything, no releasing the slots, just free
the original op and bugger off.

The set_op_state_purged() part is where it gets delicate - we want to remove
the sucker from list/hash before dropping it.  AFAICS, it's doable - there's
nothing in the "release slot and drop the request" that would not be allowed
under the list/hash spinlocks...

In principle, readdir problem could also be handled in a similar way, but
there we'd need to have the interrupted readdir itself marked with that
new flag and left in in-progress hash, so that when the response finally
arrives it would be found and slot freeing would be handled ;-/  Doable,
but not fun...

If there is (or at least supposed to be) something that prevents completions
of readdir requests (on unrelated directories, by different processes, etc.)
out of order, PLEASE SAY SO.  I would really prefer not to have to fight
the readdir side of that mess; cancels are already bad enough ;-/