From mboxrd@z Thu Jan  1 00:00:00 1970
From: Al Viro <viro@ZenIV.linux.org.uk>
Subject: Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks
 introduced by sb mutex merge"
Date: Thu, 4 Jun 2015 23:22:43 +0100
Message-ID: <20150604222243.GA7337__13051.8858707093$1433456666$gmane$org@ZenIV.linux.org.uk>
References: <1432754131-27425-1-git-send-email-fabf@skynet.be>
	<20150527145735.e3d1913bc66426038d53be32@linux-foundation.org>
	<20150604050123.GL7232@ZenIV.linux.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
Content-Disposition: inline
In-Reply-To: <20150604050123.GL7232@ZenIV.linux.org.uk>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Evgeniy Dushistov <dushistov@mail.ru>, Ian Campbell <ian.campbell@citrix.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, xen-devel <xen-devel@lists.xen.org>, Fabian Frederick <fabf@skynet.be>, linux-fsdevel@vger.kernel.org, Jan Kara <jack@suse.cz>, Alexey Khoroshilov <khoroshilov@ispras.ru>, Roger Pau Monne <roger.pau@citrix.com>
List-Id: xen-devel@lists.xenproject.org

On Thu, Jun 04, 2015 at 06:01:23AM +0100, Al Viro wrote:

> So we need
> 	* per-page exclusion for reallocation time (normal page locks are
> doing that)
> 	* per-fs exclusion for block and fragment allocations (->s_lock?)
> 	* per-fs exclusion for inode allocations (->s_lock?)
> 	* per-inode exclusion for mapping changes (a-la ext2 truncate_mutex)
> 	* per-directory exclusion for contents access (->i_mutex gives that)
> 
> Looks like we ought to add ->truncate_mutex and shove lock_ufs() calls
> all way down into balloc.c (and ialloc.c for inode allocations)...

After looking through that code, that appears to be feasible (and would
simplify the hell out of truncate side of things).  However, there's
a problem with the way we do ->write_begin() and ->writepage() there - it's
quite suboptimal for short files.  Suppose we start with an empty file on
e.g. a filesystem with 1Kb fragments and 4Kb blocks.  We grab page 0 and
call ->write_begin() on it, offset range 0 to 4095.  OK, that'll be
block_write_begin(), with ufs_getfrag_block as callback.  And it'll proceed
to call it 4 times, one for each kilobyte.  In ascending order.  If we are
lucky, it'll allocate the first fragment of an otherwise empty block and
then extend it 3 times until we get the full block.  If we are *not* that
lucky, and the first fragment is grabbed in already partially used block,
we'll end up doing reallocations (and copying, while we are at it, hopefully
without bogus uptodate flags set anywhere in process).  The same if somebody
else grabs a fragment in the middle of block we are growing into.

A part of that is the lack of prealloc in fs/ufs; that would certainly
improve things, but I really wonder if we are doing the handling of
partial blocks in the wrong place.  As it is, that happens fairly deep
in call chain - ufs_write_begin() -> block_write_begin() ->
__block_write_begin() -> ufs_getfrag_block() -> ufs_inode_getfrag() ->
ufs_new_fragments() -> ufs_change_blocknr()) and we don't notice
the partial blocks until we get to ufs_new_fragments(); in reality,
we can tell if there's a partial block from the very beginning, just by
looking at inode size and checking if the direct pointer in question
is not zero.

Do we really need it done that deep in chain?  Note that there's another
long-standing problem - we map the things fragment-by-fragment, which is
seriously redundant; within the same logical block we have either
	* hole - no fragments present, or
	* partial block at the end of short file - known number of adjacent
fragments present, or
	* full block - all fragments are present and adjacent
Trying to map fragments one by one is completely pointless.  Which makes
the use fs/buffer.c helpers dubious.  Besides, we pay for doing it that
deep by having to pass the page all way down into block allocator.

AFAICS, writepage cares about the partial blocks only when EOF is right
inside the page - pages past EOF shouldn't be faulted in and truncate_setsize()
should've killed everything off before lowering ->i_size.  And since truncate
and ->write_begin are serialized on ->i_mutex, we are down to
	* write_iter starting past the EOF should start with dummy extending
truncate (and truncate back to original size if nothing actually gets
written)
	* extending truncate() of a file with partial block should grab the
page containing EOF and expand it (with possible reallocation).
	* truncate() shrinking a file to something with partial block should
do nothing special, other than releasing only part of fragments for that
block.
	* write_begin and writepage should check if they are dealing with
a page covering a partial block and start with extending it - they know how
far to go and they are serialized by page lock.

Does anybody see holes in that?  This way we can handle the partial
blocks higher in the call chain, with a lot less PITA...  And with that
done, we can have ufs_block_getfrag() just check if the previous bh in
the page falls into the same block and is already mapped and map ours to
the next fragment if that's the case - allocation would either happen
for entire block or page, whichever is smaller, or be already done by caller
(UFS blocks can be bigger than a page).