[Ocfs2-devel] [PATCH] Revert iput deferring code in ocfs2_drop_dentry_lock

From: Goldwyn Rodrigues <rgoldwyn@suse.de>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] [PATCH] Revert iput deferring code in ocfs2_drop_dentry_lock
Date: Thu, 16 Jan 2014 07:35:58 -0600	[thread overview]
Message-ID: <52D7E03E.8090702@suse.de> (raw)
In-Reply-To: <20140116024743.GA25833@quack.suse.cz>

On 01/15/2014 08:47 PM, Jan Kara wrote:
> On Wed 15-01-14 17:17:55, Goldwyn Rodrigues wrote:
>> On 01/15/2014 09:53 AM, Jan Kara wrote:
>>> On Wed 15-01-14 09:32:39, Goldwyn Rodrigues wrote:
>>>> On 01/15/2014 07:33 AM, Jan Kara wrote:
>>>>> On Wed 15-01-14 06:51:51, Goldwyn Rodrigues wrote:
>>>>>> The following patches are reverted in this patch because these
>>>>>> patches caused regression in the unlink() calls.
>>>>>>
>>>>>> ea455f8ab68338ba69f5d3362b342c115bea8e13 - ocfs2: Push out dropping
>>>>>> of dentry lock to ocfs2_wq
>>>>>> f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a - ocfs2: Fix deadlock on umount
>>>>>> 5fd131893793567c361ae64cbeb28a2a753bbe35 - ocfs2: Don't oops in
>>>>>> ocfs2_kill_sb on a failed mount
>>>>>>
>>>>>> The regression is caused because these patches delay the iput() in case
>>>>>> of dentry unlocks. This also delays the unlocking of the open lockres.
>>>>>> The open lockresource is required to test if the inode can be wiped from
>>>>>> disk on not. When the deleting node does not get the open lock, it marks
>>>>>> it as orphan (even though it is not in use by another node/process)
>>>>>> and causes a journal checkpoint. This delays operations following the
>>>>>> inode eviction. This also moves the inode to the orphaned inode
>>>>>> which further causes more I/O and a lot of unneccessary orphans.
>>>>>>
>>>>>> As for the fix, I tried reproducing the problem by using quotas and enabling
>>>>>> lockdep, but the issue could not be reproduced.
>>>>>    So I was thinking about this for a while trying to remember... What it
>>>>> really boils down to is whether it is OK to call ocfs2_delete_inode() from
>>>>> DLM downconvert thread. Bear in mind that ocfs2_delete_inode() takes all
>>>>> kinds of interesting cluster locks meaning that the downconvert thread can
>>>>> block waiting to acquire some cluster lock. That seems like asking for
>>>>> trouble to me (i.e., what if the node holding the lock we need from
>>>>> downconvert thread needs a lock from us first?) but maybe it is OK these
>>>>> days.
>>>>>
>>>>
>>>> The only lock it tries to take is the "inode lock" resource, which
>>>> seems to be fine to take.
>>>    Why is it obviously fine? Some other node might be holding the inode lock
>>> and still write to the inode. If that needs a cluster lock for block
>>> allocation and that allocation cluster lock is currently hold by our node,
>>> we are screwed. Aren't we?
>>>
>>
>> No. If another thread is using the allocation cluster lock implies
>> we already have the inode lock. Cluster locks are also taken in a
>> strict order in order to avoid ABBA locking.
>    I agree with what you wrote above but it's not exactly what I'm talking
> about. What seems to be possible is:
> NODE 1					NODE2
> holds dentry lock for 'foo'
> holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
> 					dquot_initialize(bar)
> 					  ocfs2_dquot_acquire()
> 					    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
> 					    ...
> downconvert thread (triggered from another
> node or a different process from NODE2)
>    ocfs2_dentry_post_unlock()
>      ...
>      iput(foo)
>        ocfs2_evict_inode(foo)
>          ocfs2_clear_inode(foo)
>            dquot_drop(inode)
> 	    ...
>              ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
>               - blocks
> 					    finds we need more space in
> 					    quota file
> 					    ...
> 					    ocfs2_extend_no_holes()
> 					      ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
> 						- deadlocks waiting for
> 						  downconvert thread
>

Thanks. This explains the situation much better. Whats stopping NODE1 
from releasing the GLOBAL_BITMAP_SYSTEM_INODE? (to break from hold and 
wait condition)

> However the positive part of this is that the race I was originally afraid
> of - when ocfs2_delete_inode() would be triggered from
> ocfs2_dentry_post_unlock() - isn't possible because ocfs2 seems to keep
> invariant that node can cache dentry for 'foo' without holding inode lock
> for 'foo' only if foo->i_nlink > 0. Thus iput() from downconvert thread
> should never go into ocfs2_delete_inode() (however it would be nice to
> assert this in ocfs2_dentry_post_unlock() for a peace of mind).

No, ocfs2_dentry_convert_worker sets OCFS2_INODE_MAYBE_ORPHANED. So, 
ocfs2_delete_inode() is called even if i_nlink > 0.

>
> Since the deadlock seems to be quota specific, it should be possible
> postpone just the quota processing for the workqueue. It isn't completely
> trivial because we still have to cleanup inode quota pointers but it should
> be doable. I'll try to have a look at this tomorrow.

Thanks! We need to work out a different way in order to fix this so that 
open locks are not delayed and does not hurt unlink performance.

-- 
Goldwyn