From mboxrd@z Thu Jan 1 00:00:00 1970 From: Al Viro Subject: Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" Date: Thu, 4 Jun 2015 23:22:43 +0100 Message-ID: <20150604222243.GA7337__13051.8858707093$1433456666$gmane$org@ZenIV.linux.org.uk> References: <1432754131-27425-1-git-send-email-fabf@skynet.be> <20150527145735.e3d1913bc66426038d53be32@linux-foundation.org> <20150604050123.GL7232@ZenIV.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Content-Disposition: inline In-Reply-To: <20150604050123.GL7232@ZenIV.linux.org.uk> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Andrew Morton Cc: Evgeniy Dushistov , Ian Campbell , Ian Jackson , xen-devel , Fabian Frederick , linux-fsdevel@vger.kernel.org, Jan Kara , Alexey Khoroshilov , Roger Pau Monne List-Id: xen-devel@lists.xenproject.org On Thu, Jun 04, 2015 at 06:01:23AM +0100, Al Viro wrote: > So we need > * per-page exclusion for reallocation time (normal page locks are > doing that) > * per-fs exclusion for block and fragment allocations (->s_lock?) > * per-fs exclusion for inode allocations (->s_lock?) > * per-inode exclusion for mapping changes (a-la ext2 truncate_mutex) > * per-directory exclusion for contents access (->i_mutex gives that) > > Looks like we ought to add ->truncate_mutex and shove lock_ufs() calls > all way down into balloc.c (and ialloc.c for inode allocations)... After looking through that code, that appears to be feasible (and would simplify the hell out of truncate side of things). However, there's a problem with the way we do ->write_begin() and ->writepage() there - it's quite suboptimal for short files. Suppose we start with an empty file on e.g. a filesystem with 1Kb fragments and 4Kb blocks. We grab page 0 and call ->write_begin() on it, offset range 0 to 4095. OK, that'll be block_write_begin(), with ufs_getfrag_block as callback. And it'll proceed to call it 4 times, one for each kilobyte. In ascending order. If we are lucky, it'll allocate the first fragment of an otherwise empty block and then extend it 3 times until we get the full block. If we are *not* that lucky, and the first fragment is grabbed in already partially used block, we'll end up doing reallocations (and copying, while we are at it, hopefully without bogus uptodate flags set anywhere in process). The same if somebody else grabs a fragment in the middle of block we are growing into. A part of that is the lack of prealloc in fs/ufs; that would certainly improve things, but I really wonder if we are doing the handling of partial blocks in the wrong place. As it is, that happens fairly deep in call chain - ufs_write_begin() -> block_write_begin() -> __block_write_begin() -> ufs_getfrag_block() -> ufs_inode_getfrag() -> ufs_new_fragments() -> ufs_change_blocknr()) and we don't notice the partial blocks until we get to ufs_new_fragments(); in reality, we can tell if there's a partial block from the very beginning, just by looking at inode size and checking if the direct pointer in question is not zero. Do we really need it done that deep in chain? Note that there's another long-standing problem - we map the things fragment-by-fragment, which is seriously redundant; within the same logical block we have either * hole - no fragments present, or * partial block at the end of short file - known number of adjacent fragments present, or * full block - all fragments are present and adjacent Trying to map fragments one by one is completely pointless. Which makes the use fs/buffer.c helpers dubious. Besides, we pay for doing it that deep by having to pass the page all way down into block allocator. AFAICS, writepage cares about the partial blocks only when EOF is right inside the page - pages past EOF shouldn't be faulted in and truncate_setsize() should've killed everything off before lowering ->i_size. And since truncate and ->write_begin are serialized on ->i_mutex, we are down to * write_iter starting past the EOF should start with dummy extending truncate (and truncate back to original size if nothing actually gets written) * extending truncate() of a file with partial block should grab the page containing EOF and expand it (with possible reallocation). * truncate() shrinking a file to something with partial block should do nothing special, other than releasing only part of fragments for that block. * write_begin and writepage should check if they are dealing with a page covering a partial block and start with extending it - they know how far to go and they are serialized by page lock. Does anybody see holes in that? This way we can handle the partial blocks higher in the call chain, with a lot less PITA... And with that done, we can have ufs_block_getfrag() just check if the previous bh in the page falls into the same block and is already mapped and map ours to the next fragment if that's the case - allocation would either happen for entire block or page, whichever is smaller, or be already done by caller (UFS blocks can be bigger than a page).