From: Filipe Manana <fdmanana@kernel.org>
To: Liu Bo <bo.li.liu@oracle.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix hang during inode eviction due to concurrent readahead
Date: Wed, 10 Jun 2015 11:44:17 +0100
Message-ID: <CAL3q7H4Pxx7PWyJZFjYs4hVxep7Df_V=8cgPb9R_D7vmoNiB7w@mail.gmail.com>
In-Reply-To: <20150610104129.GD3561@localhost.localdomain>

On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Tue, May 26, 2015 at 12:55:42AM +0100, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> Zygo Blaxell and other users have reported occasional hangs while an
>> inode is being evicted, leading to traces like the following:
>>
>> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
>> [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
>> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
>> [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
>> [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
>> [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
>> [ 5281.981396] Call Trace:
>> [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
>> [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
>> [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
>> [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
>> [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
>> [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
>> [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
>> [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
>> [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
>> [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
>> [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
>> [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
>> [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
>> [ 5282.001299] 1 lock held by rm/20488:
>> [ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
>>
>> This happens when readahead, which calls readpages(), runs right
>> before the inode eviction handler is invoked. The sequence of events
>> is essentially:
>>
>> 1) readpages() is called while a reference on the inode is held, so
>>    eviction can not be triggered before readpages() returns. It also
>>    locks one or more ranges in the inode's io_tree (which is done at
>>    extent_io.c:__do_contiguous_readpages());
>>
>> 2) readpages() submits several read bios, all with an end io callback
>>    that runs extent_io.c:end_bio_extent_readpage() and that is executed
>>    by another task when a bio finishes, corresponding to a work queue
>>    (fs_info->end_io_workers) worker kthread. This callback unlocks
>>    the ranges in the inode's io_tree that were previously locked in
>>    step 1;
>>
>> 3) readpages() returns, the reference on the inode is dropped;
>>
>> 4) One or more of the read bios previously submitted are still not
>>    complete (their end io callback was not yet invoked or has not
>>    yet finished execution);
>>
>> 5) Inode eviction is triggered (through an unlink call for example).
>>    The inode reference count was not incremented before submitting
>>    the read bios, so this is possible;
>>
>> 6) The eviction handler starts executing and enters the loop that
>>    iterates over all extent states in the inode's io_tree;
>
> One question here:
>  btrfs_evict_inode()->
>         evict_inode_truncate_pages()->
>                 truncate_inode_pages_final()
>
> truncate_inode_pages_final() finds all the pages in the mapping and
> calls lock_page() on each one; if a readpage is running in another
> task, it should block on lock_page() until end_bio_extent_readpage()
> does an unlock_page().
>
> How does the eviction handler bypass this and enter the
> 'extent_state' loop?
>
> Am I missing something?

Yes, you are missing something:
end_bio_extent_readpage() unlocks the pages first and only afterwards
unlocks the ranges in the inode's io tree.
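
The page is unlocked from inside the bio's end io callback before the
range in the io_tree is. A simplified sketch of that callback (details
and error handling elided, names as in extent_io.c around that time):

    static void end_bio_extent_readpage(struct bio *bio, int err)
    {
            ...
            bio_for_each_segment_all(bvec, bio, i) {
                    struct page *page = bvec->bv_page;
                    ...
                    unlock_page(page);
                    /* truncate_inode_pages_final() can make
                     * progress from this point on. */
            }
            ...
            /* Only here is the corresponding range in the
             * inode's io_tree unlocked. */
            endio_readpage_release_extent(tree, extent_start,
                                          extent_len, uptodate);
    }

So truncate_inode_pages_final() can find every page already unlocked
while one or more ranges in the io_tree are still locked, and that is
the window in which the eviction handler reaches the extent_state
loop.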

thanks


>
> Thanks,
>
> -liubo
>>
>> 7) The loop picks one extent state record and uses its ->start and
>>    ->end fields, after releasing the inode's io_tree spinlock, to
>>    call lock_extent_bits() and clear_extent_bit(). The call to lock
>>    the range [state->start, state->end] blocks because the whole
>>    range or a part of it was locked by the previous call to
>>    readpages() and the corresponding end io callback, which unlocks
>>    the range, has not yet been executed;
>>
>> 8) The end io callback for the read bio is executed and unlocks the
>>    range [state->start, state->end] (or a superset of that range).
>>    And at clear_extent_bit() the extent_state record state is used
>>    as the second argument to split_state(), which sets state->start to
>>    a larger value;
>>
>> 9) The task executing the eviction handler is woken up by the task
>>    executing the bio's end io callback (through clear_state_bit) and
>>    the eviction handler locks the range
>>    [old value for state->start, state->end]. Shortly after, when
>>    calling clear_extent_bit(), it unlocks the range
>>    [new value for state->start, state->end], so it ends up unlocking
>>    only part of the range that it locked, leaving an extent state
>>    record in the io_tree that represents the unlocked subrange;
>>
>> 10) The eviction handler loop, in its next iteration, gets the
>>     extent_state record for the subrange that it did not unlock in the
>>     previous step and then tries to lock it, resulting in a hang.
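
To make steps 7) to 10) concrete, with hypothetical offsets purely for
illustration: suppose the extent_state record spans [0, 262143] and
readahead still holds [131072, 262143] locked:

    eviction handler                     bio end io callback
    ----------------                     -------------------
    lock_extent_bits(io_tree,
        state->start /* 0 */,
        state->end /* 262143 */, ...)
      -> blocks on [131072, 262143]
                                         unlocks [131072, 262143];
                                         inside clear_extent_bit(),
                                         split_state() sets
                                         state->start = 131072
    wakes up holding [0, 262143]
    clear_extent_bit(io_tree,
        state->start /* now 131072 */,
        state->end /* 262143 */, ...)
      -> unlocks only [131072, 262143];
         [0, 131071] remains locked

The next loop iteration finds the extent_state record left behind for
[0, 131071] and blocks forever trying to lock it.
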
>>
>> So fix this by not using the ->start and ->end fields of an existing
>> extent_state record. This is a simple solution, and an alternative
>> could be to bump the inode's reference count before submitting each
>> read bio and having it dropped in the bio's end io callback. But that
>> would be a more invasive/complex change and would not protect against
>> other code paths that access the io_tree without holding a reference
>> on the inode. Something to consider in the future.
>>
>> Many thanks to Zygo Blaxell for reporting the issue on the mailing
>> list, providing a set of scripts to trigger it, and testing this fix.
>>
>> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
>> Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/inode.c | 26 +++++++++++++++++++++-----
>>  1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 0020b56..ef05800 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
>>       }
>>       write_unlock(&map_tree->lock);
>>
>> +     /*
>> +      * Keep looping until we have no more ranges in the io tree.
>> +      * We can have ongoing bios started by readpages (called from readahead)
>> +      * that didn't get their end io callbacks called yet or they are still
>> +      * in progress (extent_io.c:end_bio_extent_readpage()). This means some
>> +      * ranges can still be locked and eviction started because before
>> +      * submitting those bios, which are executed by a separate task (work
>> +      * queue kthread), inode references (inode->i_count) were not taken
>> +      * (which would be dropped in the end io callback of each bio).
>> +      * Therefore here we effectively end up waiting for those bios and
>> +      * anyone else holding locked ranges without having bumped the inode's
>> +      * reference count - if we don't do it, when they access the inode's
>> +      * io_tree to unlock a range, it may be too late, leading to a
>> +      * use-after-free issue.
>> +      */
>>       spin_lock(&io_tree->lock);
>>       while (!RB_EMPTY_ROOT(&io_tree->state)) {
>>               struct extent_state *state;
>>               struct extent_state *cached_state = NULL;
>> +             u64 start;
>> +             u64 end;
>>
>>               node = rb_first(&io_tree->state);
>>               state = rb_entry(node, struct extent_state, rb_node);
>> -             atomic_inc(&state->refs);
>> +             start = state->start;
>> +             end = state->end;
>>               spin_unlock(&io_tree->lock);
>>
>> -             lock_extent_bits(io_tree, state->start, state->end,
>> -                              0, &cached_state);
>> -             clear_extent_bit(io_tree, state->start, state->end,
>> +             lock_extent_bits(io_tree, start, end, 0, &cached_state);
>> +             clear_extent_bit(io_tree, start, end,
>>                                EXTENT_LOCKED | EXTENT_DIRTY |
>>                                EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
>>                                EXTENT_DEFRAG, 1, 1,
>>                                &cached_state, GFP_NOFS);
>> -             free_extent_state(state);
>>
>>               cond_resched();
>>               spin_lock(&io_tree->lock);
>> --
>> 2.1.3
>>
