* [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error @ 2017-03-06 2:55 Qu Wenruo 2017-03-06 2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Qu Wenruo @ 2017-03-06 2:55 UTC (permalink / raw) To: fdmanana, linux-btrfs [BUG] When btrfs_reloc_clone_csum() reports error, it can underflow metadata and leads to kernel assertion on outstanding extents in run_delalloc_nocow() and cow_file_range(). BTRFS info (device vdb5): relocating block group 12582912 flags data BTRFS info (device vdb5): found 1 extents assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 Currently, due to another bug blocking ordered extents, the bug is only reproducible under certain block group layout and using error injection. a) Create one data block group with one 4K extent in it. To avoid the bug that hangs btrfs due to ordered extent which never finishes b) Make btrfs_reloc_clone_csum() always fail c) Relocate that block group [CAUSE] run_delalloc_nocow() and cow_file_range() handles error from btrfs_reloc_clone_csum() wrongly: (The ascii chart shows a more generic case of this bug other than the bug mentioned above) |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- cleanup range --------------->| |<----------- ----------->| \/ btrfs_finish_ordered_io() range So error handler, which calls extent_clear_unlock_delalloc() with EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() will both cover OE n, and free its metadata, causing metadata under flow. [Fix] The fix is to ensure after calling btrfs_add_ordered_extent(), we only call error handler after increasing the iteration offset, so that cleanup range won't cover any created ordered extent. |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- ----------->|<---------- cleanup range --------->| \/ btrfs_finish_ordered_io() range Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> --- changelog: v6: New, split from v5 patch, as this is a separate bug. --- fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 39 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b2bc07aad1ae..1d83d504f2e5 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, BTRFS_DATA_RELOC_TREE_OBJECTID) { ret = btrfs_reloc_clone_csums(inode, start, cur_alloc_size); + /* + * Only drop cache here, and process as normal. + * + * We must not allow extent_clear_unlock_delalloc() + * at out_unlock label to free meta of this ordered + * extent, as its meta should be freed by + * btrfs_finish_ordered_io(). + * + * So we must continue until @start is increased to + * skip current ordered extent. + */ if (ret) - goto out_drop_extent_cache; + btrfs_drop_extent_cache(BTRFS_I(inode), start, + start + ram_size - 1, 0); } btrfs_dec_block_group_reservations(fs_info, ins.objectid); - if (disk_num_bytes < cur_alloc_size) - break; - /* we're not doing compressed IO, don't unlock the first * page (which the caller expects to stay locked), don't * clear any dirty bits and don't set any writeback bits @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, delalloc_end, locked_page, EXTENT_LOCKED | EXTENT_DELALLOC, op); - disk_num_bytes -= cur_alloc_size; + if (disk_num_bytes > cur_alloc_size) + disk_num_bytes = 0; + else + disk_num_bytes -= cur_alloc_size; num_bytes -= cur_alloc_size; alloc_hint = ins.objectid + ins.offset; start += cur_alloc_size; + + /* + * btrfs_reloc_clone_csums() error, since start is increased + * extent_clear_unlock_delalloc() at out_unlock label won't + * free metadata of current ordered extent, we're OK to exit. + */ + if (ret) + goto out_unlock; } out: return ret; @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, BUG_ON(ret); /* -ENOMEM */ if (root->root_key.objectid == - BTRFS_DATA_RELOC_TREE_OBJECTID) { + BTRFS_DATA_RELOC_TREE_OBJECTID) + /* + * Error handled later, as we must prevent + * extent_clear_unlock_delalloc() in error handler + * from freeing metadata of created ordered extent. + */ ret = btrfs_reloc_clone_csums(inode, cur_offset, num_bytes); - if (ret) { - if (!nolock && nocow) - btrfs_end_write_no_snapshoting(root); - goto error; - } - } extent_clear_unlock_delalloc(inode, cur_offset, cur_offset + num_bytes - 1, end, @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, if (!nolock && nocow) btrfs_end_write_no_snapshoting(root); cur_offset = extent_end; + + /* + * btrfs_reloc_clone_csums() error, now we're OK to call error + * handler, as metadata for created ordered extent will only + * be freed by btrfs_finish_ordered_io(). + */ + if (ret) + goto error; if (cur_offset > end) break; } -- 2.12.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang 2017-03-06 2:55 [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo @ 2017-03-06 2:55 ` Qu Wenruo 2017-03-06 23:18 ` Filipe Manana 2017-03-07 22:11 ` Liu Bo 2017-03-06 23:19 ` [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana 2017-03-07 20:49 ` Liu Bo 2 siblings, 2 replies; 16+ messages in thread From: Qu Wenruo @ 2017-03-06 2:55 UTC (permalink / raw) To: fdmanana, linux-btrfs; +Cc: Filipe Manana [BUG] If run_delalloc_range() returns error and there is already some ordered extents created, btrfs will be hanged with the following backtrace: Call Trace: __schedule+0x2d4/0xae0 schedule+0x3d/0x90 btrfs_start_ordered_extent+0x160/0x200 [btrfs] ? wake_atomic_t_function+0x60/0x60 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 [CAUSE] |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<>| |<---------- cleanup range --------->| || \_=> First page handled by end_extent_writepage() in __extent_writepage() The problem is caused by error handler of run_delalloc_range(), which doesn't handle any created ordered extents, leaving them waiting on btrfs_finish_ordered_io() to finish. However after run_delalloc_range() returns error, __extent_writepage() won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered except the first page, and btrfs_finish_ordered_io() won't be triggered for created ordered extents either. So OE 2~n will hang forever, and if OE 1 is larger than one page, it will also hang. [FIX] Introduce btrfs_cleanup_ordered_extents() function to cleanup created ordered extents and finish them manually. The function is based on existing btrfs_endio_direct_write_update_ordered() function, and modify it to act just like btrfs_writepage_endio_hook() but handles specified range other than one page. After fix, delalloc error will be handled like: |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<>|<-------- ----------->|<------ old error handler --------->| || || || \_=> Cleaned up by cleanup_ordered_extents() \_=> First page handled by end_extent_writepage() in __extent_writepage() Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> --- changelog: v2: Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing outstanding extents, which is already done by extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit v3: Skip first page to avoid underflow ordered->bytes_left. Fix range passed in cow_file_range() which doesn't cover the whole extent. Expend extent_clear_unlock_delalloc() range to allow them to handle metadata release. v4: Don't use extra bit to skip metadata freeing for ordered extent, but only handle btrfs_reloc_clone_csums() error just before processing next extent. This makes error handle much easier for run_delalloc_nocow(). v5: Variant gramma and comment fixes suggested by Filipe Manana Enhanced commit message to focus on the generic error handler bug, pointed out by Filipe Manana Adding missing wq/func user in __endio_write_update_ordered(), pointed out by Filipe Manana. Enhanced commit message with ascii art to explain the bug more easily. Fix a bug which can leads to corrupted extent allocation, exposed by Filipe Manana. v6: Split the metadata underflow fix to another patch. Use better comment and btrfs_cleanup_ordered_extent() implementation from Filipe Manana. --- fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 47 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 1d83d504f2e5..9b03eb9b27d0 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len, u64 ram_bytes, int compress_type, int type); +static void __endio_write_update_ordered(struct inode *inode, + u64 offset, u64 bytes, + bool uptodate); + +/* + * Cleanup all submitted ordered extents in specified range to handle errors + * from the fill_dellaloc() callback. + * + * NOTE: caller must ensure that when an error happens, it can not call + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata + * to be released, which we want to happen only when finishing the ordered + * extent (btrfs_finish_ordered_io()). Also note that the caller of the + * fill_dealloc() callback already does proper cleanup for the first page of + * the range, that is, it invokes the callback writepage_end_io_hook() for the + * range of the first page. + */ +static inline void btrfs_cleanup_ordered_extents(struct inode *inode, + u64 offset, u64 bytes) +{ + return __endio_write_update_ordered(inode, offset + PAGE_SIZE, + bytes - PAGE_SIZE, false); +} + static int btrfs_dirty_inode(struct inode *inode); #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, ret = cow_file_range_async(inode, locked_page, start, end, page_started, nr_written); } + if (ret) + btrfs_cleanup_ordered_extents(inode, start, end - start + 1); return ret; } @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio *bio) bio_put(bio); } -static void btrfs_endio_direct_write_update_ordered(struct inode *inode, - const u64 offset, - const u64 bytes, - const int uptodate) +static void __endio_write_update_ordered(struct inode *inode, + u64 offset, u64 bytes, + bool uptodate) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_ordered_extent *ordered = NULL; + struct btrfs_workqueue *wq; + btrfs_work_func_t func; u64 ordered_offset = offset; u64 ordered_bytes = bytes; int ret; + if (btrfs_is_free_space_inode(BTRFS_I(inode))) { + wq = fs_info->endio_freespace_worker; + func = btrfs_freespace_write_helper; + } else { + wq = fs_info->endio_write_workers; + func = btrfs_endio_write_helper; + } + again: ret = btrfs_dec_test_first_ordered_pending(inode, &ordered, &ordered_offset, @@ -8161,9 +8196,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode, if (!ret) goto out_test; - btrfs_init_work(&ordered->work, btrfs_endio_write_helper, - finish_ordered_fn, NULL, NULL); - btrfs_queue_work(fs_info->endio_write_workers, &ordered->work); + btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL); + btrfs_queue_work(wq, &ordered->work); out_test: /* * our bio might span multiple ordered extents. If we haven't @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio *bio) struct btrfs_dio_private *dip = bio->bi_private; struct bio *dio_bio = dip->dio_bio; - btrfs_endio_direct_write_update_ordered(dip->inode, - dip->logical_offset, - dip->bytes, - !bio->bi_error); + __endio_write_update_ordered(dip->inode, dip->logical_offset, + dip->bytes, !bio->bi_error); kfree(dip); @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, io_bio = NULL; } else { if (write) - btrfs_endio_direct_write_update_ordered(inode, + __endio_write_update_ordered(inode, file_offset, dio_bio->bi_iter.bi_size, - 0); + false); else unlock_extent(&BTRFS_I(inode)->io_tree, file_offset, file_offset + dio_bio->bi_iter.bi_size - 1); @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) */ if (dio_data.unsubmitted_oe_range_start < dio_data.unsubmitted_oe_range_end) - btrfs_endio_direct_write_update_ordered(inode, + __endio_write_update_ordered(inode, dio_data.unsubmitted_oe_range_start, dio_data.unsubmitted_oe_range_end - dio_data.unsubmitted_oe_range_start, - 0); + false); } else if (ret >= 0 && (size_t)ret < count) btrfs_delalloc_release_space(inode, offset, count - (size_t)ret); -- 2.12.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang 2017-03-06 2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo @ 2017-03-06 23:18 ` Filipe Manana 2017-03-07 22:11 ` Liu Bo 1 sibling, 0 replies; 16+ messages in thread From: Filipe Manana @ 2017-03-06 23:18 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs, Filipe Manana On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > [BUG] > If run_delalloc_range() returns error and there is already some ordered > extents created, btrfs will be hanged with the following backtrace: > > Call Trace: > __schedule+0x2d4/0xae0 > schedule+0x3d/0x90 > btrfs_start_ordered_extent+0x160/0x200 [btrfs] > ? wake_atomic_t_function+0x60/0x60 > btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] > btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] > btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] > process_one_work+0x2af/0x720 > ? process_one_work+0x22b/0x720 > worker_thread+0x4b/0x4f0 > kthread+0x10f/0x150 > ? process_one_work+0x720/0x720 > ? kthread_create_on_node+0x40/0x40 > ret_from_fork+0x2e/0x40 > > [CAUSE] > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<>| |<---------- cleanup range --------->| > || > \_=> First page handled by end_extent_writepage() in __extent_writepage() > > The problem is caused by error handler of run_delalloc_range(), which > doesn't handle any created ordered extents, leaving them waiting on > btrfs_finish_ordered_io() to finish. > > However after run_delalloc_range() returns error, __extent_writepage() > won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered > except the first page, and btrfs_finish_ordered_io() won't be triggered > for created ordered extents either. > > So OE 2~n will hang forever, and if OE 1 is larger than one page, it > will also hang. > > [FIX] > Introduce btrfs_cleanup_ordered_extents() function to cleanup created > ordered extents and finish them manually. > > The function is based on existing > btrfs_endio_direct_write_update_ordered() function, and modify it to > act just like btrfs_writepage_endio_hook() but handles specified range > other than one page. > > After fix, delalloc error will be handled like: > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<>|<-------- ----------->|<------ old error handler --------->| > || || > || \_=> Cleaned up by cleanup_ordered_extents() > \_=> First page handled by end_extent_writepage() in __extent_writepage() > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> thanks > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > changelog: > v2: > Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing > outstanding extents, which is already done by > extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit > v3: > Skip first page to avoid underflow ordered->bytes_left. > Fix range passed in cow_file_range() which doesn't cover the whole > extent. > Expend extent_clear_unlock_delalloc() range to allow them to handle > metadata release. > v4: > Don't use extra bit to skip metadata freeing for ordered extent, > but only handle btrfs_reloc_clone_csums() error just before processing > next extent. > This makes error handle much easier for run_delalloc_nocow(). > v5: > Variant gramma and comment fixes suggested by Filipe Manana > Enhanced commit message to focus on the generic error handler bug, > pointed out by Filipe Manana > Adding missing wq/func user in __endio_write_update_ordered(), pointed > out by Filipe Manana. > Enhanced commit message with ascii art to explain the bug more easily. > Fix a bug which can leads to corrupted extent allocation, exposed by > Filipe Manana. > v6: > Split the metadata underflow fix to another patch. > Use better comment and btrfs_cleanup_ordered_extent() implementation > from Filipe Manana. > --- > fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 47 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 1d83d504f2e5..9b03eb9b27d0 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len, > u64 ram_bytes, int compress_type, > int type); > > +static void __endio_write_update_ordered(struct inode *inode, > + u64 offset, u64 bytes, > + bool uptodate); > + > +/* > + * Cleanup all submitted ordered extents in specified range to handle errors > + * from the fill_dellaloc() callback. > + * > + * NOTE: caller must ensure that when an error happens, it can not call > + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING > + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata > + * to be released, which we want to happen only when finishing the ordered > + * extent (btrfs_finish_ordered_io()). Also note that the caller of the > + * fill_dealloc() callback already does proper cleanup for the first page of > + * the range, that is, it invokes the callback writepage_end_io_hook() for the > + * range of the first page. > + */ > +static inline void btrfs_cleanup_ordered_extents(struct inode *inode, > + u64 offset, u64 bytes) > +{ > + return __endio_write_update_ordered(inode, offset + PAGE_SIZE, > + bytes - PAGE_SIZE, false); > +} > + > static int btrfs_dirty_inode(struct inode *inode); > > #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS > @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, > ret = cow_file_range_async(inode, locked_page, start, end, > page_started, nr_written); > } > + if (ret) > + btrfs_cleanup_ordered_extents(inode, start, end - start + 1); > return ret; > } > > @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio *bio) > bio_put(bio); > } > > -static void btrfs_endio_direct_write_update_ordered(struct inode *inode, > - const u64 offset, > - const u64 bytes, > - const int uptodate) > +static void __endio_write_update_ordered(struct inode *inode, > + u64 offset, u64 bytes, > + bool uptodate) > { > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > struct btrfs_ordered_extent *ordered = NULL; > + struct btrfs_workqueue *wq; > + btrfs_work_func_t func; > u64 ordered_offset = offset; > u64 ordered_bytes = bytes; > int ret; > > + if (btrfs_is_free_space_inode(BTRFS_I(inode))) { > + wq = fs_info->endio_freespace_worker; > + func = btrfs_freespace_write_helper; > + } else { > + wq = fs_info->endio_write_workers; > + func = btrfs_endio_write_helper; > + } > + > again: > ret = btrfs_dec_test_first_ordered_pending(inode, &ordered, > &ordered_offset, > @@ -8161,9 +8196,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode, > if (!ret) > goto out_test; > > - btrfs_init_work(&ordered->work, btrfs_endio_write_helper, > - finish_ordered_fn, NULL, NULL); > - btrfs_queue_work(fs_info->endio_write_workers, &ordered->work); > + btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL); > + btrfs_queue_work(wq, &ordered->work); > out_test: > /* > * our bio might span multiple ordered extents. If we haven't > @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio *bio) > struct btrfs_dio_private *dip = bio->bi_private; > struct bio *dio_bio = dip->dio_bio; > > - btrfs_endio_direct_write_update_ordered(dip->inode, > - dip->logical_offset, > - dip->bytes, > - !bio->bi_error); > + __endio_write_update_ordered(dip->inode, dip->logical_offset, > + dip->bytes, !bio->bi_error); > > kfree(dip); > > @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, > io_bio = NULL; > } else { > if (write) > - btrfs_endio_direct_write_update_ordered(inode, > + __endio_write_update_ordered(inode, > file_offset, > dio_bio->bi_iter.bi_size, > - 0); > + false); > else > unlock_extent(&BTRFS_I(inode)->io_tree, file_offset, > file_offset + dio_bio->bi_iter.bi_size - 1); > @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > */ > if (dio_data.unsubmitted_oe_range_start < > dio_data.unsubmitted_oe_range_end) > - btrfs_endio_direct_write_update_ordered(inode, > + __endio_write_update_ordered(inode, > dio_data.unsubmitted_oe_range_start, > dio_data.unsubmitted_oe_range_end - > dio_data.unsubmitted_oe_range_start, > - 0); > + false); > } else if (ret >= 0 && (size_t)ret < count) > btrfs_delalloc_release_space(inode, offset, > count - (size_t)ret); > -- > 2.12.0 > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang 2017-03-06 2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo 2017-03-06 23:18 ` Filipe Manana @ 2017-03-07 22:11 ` Liu Bo 2017-03-08 0:18 ` Qu Wenruo 1 sibling, 1 reply; 16+ messages in thread From: Liu Bo @ 2017-03-07 22:11 UTC (permalink / raw) To: Qu Wenruo; +Cc: fdmanana, linux-btrfs, Filipe Manana On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote: > [BUG] > If run_delalloc_range() returns error and there is already some ordered > extents created, btrfs will be hanged with the following backtrace: > > Call Trace: > __schedule+0x2d4/0xae0 > schedule+0x3d/0x90 > btrfs_start_ordered_extent+0x160/0x200 [btrfs] > ? wake_atomic_t_function+0x60/0x60 > btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] > btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] > btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] > process_one_work+0x2af/0x720 > ? process_one_work+0x22b/0x720 > worker_thread+0x4b/0x4f0 > kthread+0x10f/0x150 > ? process_one_work+0x720/0x720 > ? kthread_create_on_node+0x40/0x40 > ret_from_fork+0x2e/0x40 > > [CAUSE] > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<>| |<---------- cleanup range --------->| > || > \_=> First page handled by end_extent_writepage() in __extent_writepage() > > The problem is caused by error handler of run_delalloc_range(), which > doesn't handle any created ordered extents, leaving them waiting on > btrfs_finish_ordered_io() to finish. > > However after run_delalloc_range() returns error, __extent_writepage() > won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered > except the first page, and btrfs_finish_ordered_io() won't be triggered > for created ordered extents either. > > So OE 2~n will hang forever, and if OE 1 is larger than one page, it > will also hang. > > [FIX] > Introduce btrfs_cleanup_ordered_extents() function to cleanup created > ordered extents and finish them manually. > > The function is based on existing > btrfs_endio_direct_write_update_ordered() function, and modify it to > act just like btrfs_writepage_endio_hook() but handles specified range > other than one page. > > After fix, delalloc error will be handled like: > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<>|<-------- ----------->|<------ old error handler --------->| > || || > || \_=> Cleaned up by cleanup_ordered_extents() > \_=> First page handled by end_extent_writepage() in __extent_writepage() > Looks good, some nitpicks below. > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > changelog: > v2: > Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing > outstanding extents, which is already done by > extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit > v3: > Skip first page to avoid underflow ordered->bytes_left. > Fix range passed in cow_file_range() which doesn't cover the whole > extent. > Expend extent_clear_unlock_delalloc() range to allow them to handle > metadata release. > v4: > Don't use extra bit to skip metadata freeing for ordered extent, > but only handle btrfs_reloc_clone_csums() error just before processing > next extent. > This makes error handle much easier for run_delalloc_nocow(). > v5: > Variant gramma and comment fixes suggested by Filipe Manana > Enhanced commit message to focus on the generic error handler bug, > pointed out by Filipe Manana > Adding missing wq/func user in __endio_write_update_ordered(), pointed > out by Filipe Manana. > Enhanced commit message with ascii art to explain the bug more easily. > Fix a bug which can leads to corrupted extent allocation, exposed by > Filipe Manana. > v6: > Split the metadata underflow fix to another patch. > Use better comment and btrfs_cleanup_ordered_extent() implementation > from Filipe Manana. > --- > fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 47 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 1d83d504f2e5..9b03eb9b27d0 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len, > u64 ram_bytes, int compress_type, > int type); > > +static void __endio_write_update_ordered(struct inode *inode, > + u64 offset, u64 bytes, > + bool uptodate); > + > +/* > + * Cleanup all submitted ordered extents in specified range to handle errors > + * from the fill_dellaloc() callback. > + * > + * NOTE: caller must ensure that when an error happens, it can not call > + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING > + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata > + * to be released, which we want to happen only when finishing the ordered > + * extent (btrfs_finish_ordered_io()). Also note that the caller of the > + * fill_dealloc() callback already does proper cleanup for the first page of fill_delalloc() > + * the range, that is, it invokes the callback writepage_end_io_hook() for the > + * range of the first page. > + */ > +static inline void btrfs_cleanup_ordered_extents(struct inode *inode, > + u64 offset, u64 bytes) > +{ > + return __endio_write_update_ordered(inode, offset + PAGE_SIZE, > + bytes - PAGE_SIZE, false); > +} > + > static int btrfs_dirty_inode(struct inode *inode); > > #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS > @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, > ret = cow_file_range_async(inode, locked_page, start, end, > page_started, nr_written); > } > + if (ret) > + btrfs_cleanup_ordered_extents(inode, start, end - start + 1); > return ret; > } > > @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio *bio) > bio_put(bio); > } > > -static void btrfs_endio_direct_write_update_ordered(struct inode *inode, > - const u64 offset, > - const u64 bytes, > - const int uptodate) > +static void __endio_write_update_ordered(struct inode *inode, > + u64 offset, u64 bytes, > + bool uptodate) Not serious, but why const was killed? With that fixed, you can have Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Thanks, -liubo > { > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > struct btrfs_ordered_extent *ordered = NULL; > + struct btrfs_workqueue *wq; > + btrfs_work_func_t func; > u64 ordered_offset = offset; > u64 ordered_bytes = bytes; > int ret; > > + if (btrfs_is_free_space_inode(BTRFS_I(inode))) { > + wq = fs_info->endio_freespace_worker; > + func = btrfs_freespace_write_helper; > + } else { > + wq = fs_info->endio_write_workers; > + func = btrfs_endio_write_helper; > + } > + > again: > ret = btrfs_dec_test_first_ordered_pending(inode, &ordered, > &ordered_offset, > @@ -8161,9 +8196,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode, > if (!ret) > goto out_test; > > - btrfs_init_work(&ordered->work, btrfs_endio_write_helper, > - finish_ordered_fn, NULL, NULL); > - btrfs_queue_work(fs_info->endio_write_workers, &ordered->work); > + btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL); > + btrfs_queue_work(wq, &ordered->work); > out_test: > /* > * our bio might span multiple ordered extents. If we haven't > @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio *bio) > struct btrfs_dio_private *dip = bio->bi_private; > struct bio *dio_bio = dip->dio_bio; > > - btrfs_endio_direct_write_update_ordered(dip->inode, > - dip->logical_offset, > - dip->bytes, > - !bio->bi_error); > + __endio_write_update_ordered(dip->inode, dip->logical_offset, > + dip->bytes, !bio->bi_error); > > kfree(dip); > > @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, > io_bio = NULL; > } else { > if (write) > - btrfs_endio_direct_write_update_ordered(inode, > + __endio_write_update_ordered(inode, > file_offset, > dio_bio->bi_iter.bi_size, > - 0); > + false); > else > unlock_extent(&BTRFS_I(inode)->io_tree, file_offset, > file_offset + dio_bio->bi_iter.bi_size - 1); > @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > */ > if (dio_data.unsubmitted_oe_range_start < > dio_data.unsubmitted_oe_range_end) > - btrfs_endio_direct_write_update_ordered(inode, > + __endio_write_update_ordered(inode, > dio_data.unsubmitted_oe_range_start, > dio_data.unsubmitted_oe_range_end - > dio_data.unsubmitted_oe_range_start, > - 0); > + false); > } else if (ret >= 0 && (size_t)ret < count) > btrfs_delalloc_release_space(inode, offset, > count - (size_t)ret); > -- > 2.12.0 > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang 2017-03-07 22:11 ` Liu Bo @ 2017-03-08 0:18 ` Qu Wenruo 2017-03-08 0:21 ` Filipe Manana 0 siblings, 1 reply; 16+ messages in thread From: Qu Wenruo @ 2017-03-08 0:18 UTC (permalink / raw) To: bo.li.liu; +Cc: fdmanana, linux-btrfs, Filipe Manana At 03/08/2017 06:11 AM, Liu Bo wrote: > On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote: >> [BUG] >> If run_delalloc_range() returns error and there is already some ordered >> extents created, btrfs will be hanged with the following backtrace: >> >> Call Trace: >> __schedule+0x2d4/0xae0 >> schedule+0x3d/0x90 >> btrfs_start_ordered_extent+0x160/0x200 [btrfs] >> ? wake_atomic_t_function+0x60/0x60 >> btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] >> btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] >> btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] >> process_one_work+0x2af/0x720 >> ? process_one_work+0x22b/0x720 >> worker_thread+0x4b/0x4f0 >> kthread+0x10f/0x150 >> ? process_one_work+0x720/0x720 >> ? kthread_create_on_node+0x40/0x40 >> ret_from_fork+0x2e/0x40 >> >> [CAUSE] >> >> |<------------------ delalloc range --------------------------->| >> | OE 1 | OE 2 | ... | OE n | >> |<>| |<---------- cleanup range --------->| >> || >> \_=> First page handled by end_extent_writepage() in __extent_writepage() >> >> The problem is caused by error handler of run_delalloc_range(), which >> doesn't handle any created ordered extents, leaving them waiting on >> btrfs_finish_ordered_io() to finish. >> >> However after run_delalloc_range() returns error, __extent_writepage() >> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered >> except the first page, and btrfs_finish_ordered_io() won't be triggered >> for created ordered extents either. >> >> So OE 2~n will hang forever, and if OE 1 is larger than one page, it >> will also hang. >> >> [FIX] >> Introduce btrfs_cleanup_ordered_extents() function to cleanup created >> ordered extents and finish them manually. >> >> The function is based on existing >> btrfs_endio_direct_write_update_ordered() function, and modify it to >> act just like btrfs_writepage_endio_hook() but handles specified range >> other than one page. >> >> After fix, delalloc error will be handled like: >> >> |<------------------ delalloc range --------------------------->| >> | OE 1 | OE 2 | ... | OE n | >> |<>|<-------- ----------->|<------ old error handler --------->| >> || || >> || \_=> Cleaned up by cleanup_ordered_extents() >> \_=> First page handled by end_extent_writepage() in __extent_writepage() >> > > Looks good, some nitpicks below. > >> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >> Signed-off-by: Filipe Manana <fdmanana@suse.com> >> --- >> changelog: >> v2: >> Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing >> outstanding extents, which is already done by >> extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit >> v3: >> Skip first page to avoid underflow ordered->bytes_left. >> Fix range passed in cow_file_range() which doesn't cover the whole >> extent. >> Expend extent_clear_unlock_delalloc() range to allow them to handle >> metadata release. >> v4: >> Don't use extra bit to skip metadata freeing for ordered extent, >> but only handle btrfs_reloc_clone_csums() error just before processing >> next extent. >> This makes error handle much easier for run_delalloc_nocow(). >> v5: >> Variant gramma and comment fixes suggested by Filipe Manana >> Enhanced commit message to focus on the generic error handler bug, >> pointed out by Filipe Manana >> Adding missing wq/func user in __endio_write_update_ordered(), pointed >> out by Filipe Manana. >> Enhanced commit message with ascii art to explain the bug more easily. >> Fix a bug which can leads to corrupted extent allocation, exposed by >> Filipe Manana. >> v6: >> Split the metadata underflow fix to another patch. >> Use better comment and btrfs_cleanup_ordered_extent() implementation >> from Filipe Manana. >> --- >> fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++-------------- >> 1 file changed, 47 insertions(+), 15 deletions(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 1d83d504f2e5..9b03eb9b27d0 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len, >> u64 ram_bytes, int compress_type, >> int type); >> >> +static void __endio_write_update_ordered(struct inode *inode, >> + u64 offset, u64 bytes, >> + bool uptodate); >> + >> +/* >> + * Cleanup all submitted ordered extents in specified range to handle errors >> + * from the fill_dellaloc() callback. >> + * >> + * NOTE: caller must ensure that when an error happens, it can not call >> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING >> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata >> + * to be released, which we want to happen only when finishing the ordered >> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the >> + * fill_dealloc() callback already does proper cleanup for the first page of > > fill_delalloc() > >> + * the range, that is, it invokes the callback writepage_end_io_hook() for the >> + * range of the first page. >> + */ >> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode, >> + u64 offset, u64 bytes) >> +{ >> + return __endio_write_update_ordered(inode, offset + PAGE_SIZE, >> + bytes - PAGE_SIZE, false); >> +} >> + >> static int btrfs_dirty_inode(struct inode *inode); >> >> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS >> @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, >> ret = cow_file_range_async(inode, locked_page, start, end, >> page_started, nr_written); >> } >> + if (ret) >> + btrfs_cleanup_ordered_extents(inode, start, end - start + 1); >> return ret; >> } >> >> @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio *bio) >> bio_put(bio); >> } >> >> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode, >> - const u64 offset, >> - const u64 bytes, >> - const int uptodate) >> +static void __endio_write_update_ordered(struct inode *inode, >> + u64 offset, u64 bytes, >> + bool uptodate) > > Not serious, but why const was killed? Because const for @offset, @bytes has no meaning, u64/bool are passed by their value. Any modification to @offset/@bytes/@update has no effect on the passed values. Thanks, Qu > > With that fixed, you can have > > Reviewed-by: Liu Bo <bo.li.liu@oracle.com> > > Thanks, > > -liubo >> { >> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); >> struct btrfs_ordered_extent *ordered = NULL; >> + struct btrfs_workqueue *wq; >> + btrfs_work_func_t func; >> u64 ordered_offset = offset; >> u64 ordered_bytes = bytes; >> int ret; >> >> + if (btrfs_is_free_space_inode(BTRFS_I(inode))) { >> + wq = fs_info->endio_freespace_worker; >> + func = btrfs_freespace_write_helper; >> + } else { >> + wq = fs_info->endio_write_workers; >> + func = btrfs_endio_write_helper; >> + } >> + >> again: >> ret = btrfs_dec_test_first_ordered_pending(inode, &ordered, >> &ordered_offset, >> @@ -8161,9 +8196,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode, >> if (!ret) >> goto out_test; >> >> - btrfs_init_work(&ordered->work, btrfs_endio_write_helper, >> - finish_ordered_fn, NULL, NULL); >> - btrfs_queue_work(fs_info->endio_write_workers, &ordered->work); >> + btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL); >> + btrfs_queue_work(wq, &ordered->work); >> out_test: >> /* >> * our bio might span multiple ordered extents. If we haven't >> @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio *bio) >> struct btrfs_dio_private *dip = bio->bi_private; >> struct bio *dio_bio = dip->dio_bio; >> >> - btrfs_endio_direct_write_update_ordered(dip->inode, >> - dip->logical_offset, >> - dip->bytes, >> - !bio->bi_error); >> + __endio_write_update_ordered(dip->inode, dip->logical_offset, >> + dip->bytes, !bio->bi_error); >> >> kfree(dip); >> >> @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, >> io_bio = NULL; >> } else { >> if (write) >> - btrfs_endio_direct_write_update_ordered(inode, >> + __endio_write_update_ordered(inode, >> file_offset, >> dio_bio->bi_iter.bi_size, >> - 0); >> + false); >> else >> unlock_extent(&BTRFS_I(inode)->io_tree, file_offset, >> file_offset + dio_bio->bi_iter.bi_size - 1); >> @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) >> */ >> if (dio_data.unsubmitted_oe_range_start < >> dio_data.unsubmitted_oe_range_end) >> - btrfs_endio_direct_write_update_ordered(inode, >> + __endio_write_update_ordered(inode, >> dio_data.unsubmitted_oe_range_start, >> dio_data.unsubmitted_oe_range_end - >> dio_data.unsubmitted_oe_range_start, >> - 0); >> + false); >> } else if (ret >= 0 && (size_t)ret < count) >> btrfs_delalloc_release_space(inode, offset, >> count - (size_t)ret); >> -- >> 2.12.0 >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang 2017-03-08 0:18 ` Qu Wenruo @ 2017-03-08 0:21 ` Filipe Manana 2017-03-08 0:26 ` Qu Wenruo 0 siblings, 1 reply; 16+ messages in thread From: Filipe Manana @ 2017-03-08 0:21 UTC (permalink / raw) To: Qu Wenruo; +Cc: Liu Bo, linux-btrfs, Filipe Manana On Wed, Mar 8, 2017 at 12:18 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > > At 03/08/2017 06:11 AM, Liu Bo wrote: >> >> On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote: >>> >>> [BUG] >>> If run_delalloc_range() returns error and there is already some ordered >>> extents created, btrfs will be hanged with the following backtrace: >>> >>> Call Trace: >>> __schedule+0x2d4/0xae0 >>> schedule+0x3d/0x90 >>> btrfs_start_ordered_extent+0x160/0x200 [btrfs] >>> ? wake_atomic_t_function+0x60/0x60 >>> btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] >>> btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] >>> btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] >>> process_one_work+0x2af/0x720 >>> ? process_one_work+0x22b/0x720 >>> worker_thread+0x4b/0x4f0 >>> kthread+0x10f/0x150 >>> ? process_one_work+0x720/0x720 >>> ? kthread_create_on_node+0x40/0x40 >>> ret_from_fork+0x2e/0x40 >>> >>> [CAUSE] >>> >>> |<------------------ delalloc range --------------------------->| >>> | OE 1 | OE 2 | ... | OE n | >>> |<>| |<---------- cleanup range --------->| >>> || >>> \_=> First page handled by end_extent_writepage() in >>> __extent_writepage() >>> >>> The problem is caused by error handler of run_delalloc_range(), which >>> doesn't handle any created ordered extents, leaving them waiting on >>> btrfs_finish_ordered_io() to finish. >>> >>> However after run_delalloc_range() returns error, __extent_writepage() >>> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered >>> except the first page, and btrfs_finish_ordered_io() won't be triggered >>> for created ordered extents either. >>> >>> So OE 2~n will hang forever, and if OE 1 is larger than one page, it >>> will also hang. >>> >>> [FIX] >>> Introduce btrfs_cleanup_ordered_extents() function to cleanup created >>> ordered extents and finish them manually. >>> >>> The function is based on existing >>> btrfs_endio_direct_write_update_ordered() function, and modify it to >>> act just like btrfs_writepage_endio_hook() but handles specified range >>> other than one page. >>> >>> After fix, delalloc error will be handled like: >>> >>> |<------------------ delalloc range --------------------------->| >>> | OE 1 | OE 2 | ... | OE n | >>> |<>|<-------- ----------->|<------ old error handler --------->| >>> || || >>> || \_=> Cleaned up by cleanup_ordered_extents() >>> \_=> First page handled by end_extent_writepage() in >>> __extent_writepage() >>> >> >> Looks good, some nitpicks below. >> >>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>> Signed-off-by: Filipe Manana <fdmanana@suse.com> >>> --- >>> changelog: >>> v2: >>> Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing >>> outstanding extents, which is already done by >>> extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit >>> v3: >>> Skip first page to avoid underflow ordered->bytes_left. >>> Fix range passed in cow_file_range() which doesn't cover the whole >>> extent. >>> Expend extent_clear_unlock_delalloc() range to allow them to handle >>> metadata release. >>> v4: >>> Don't use extra bit to skip metadata freeing for ordered extent, >>> but only handle btrfs_reloc_clone_csums() error just before processing >>> next extent. >>> This makes error handle much easier for run_delalloc_nocow(). >>> v5: >>> Variant gramma and comment fixes suggested by Filipe Manana >>> Enhanced commit message to focus on the generic error handler bug, >>> pointed out by Filipe Manana >>> Adding missing wq/func user in __endio_write_update_ordered(), pointed >>> out by Filipe Manana. >>> Enhanced commit message with ascii art to explain the bug more easily. >>> Fix a bug which can leads to corrupted extent allocation, exposed by >>> Filipe Manana. >>> v6: >>> Split the metadata underflow fix to another patch. >>> Use better comment and btrfs_cleanup_ordered_extent() implementation >>> from Filipe Manana. >>> --- >>> fs/btrfs/inode.c | 62 >>> ++++++++++++++++++++++++++++++++++++++++++-------------- >>> 1 file changed, 47 insertions(+), 15 deletions(-) >>> >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>> index 1d83d504f2e5..9b03eb9b27d0 100644 >>> --- a/fs/btrfs/inode.c >>> +++ b/fs/btrfs/inode.c >>> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode >>> *inode, u64 start, u64 len, >>> u64 ram_bytes, int compress_type, >>> int type); >>> >>> +static void __endio_write_update_ordered(struct inode *inode, >>> + u64 offset, u64 bytes, >>> + bool uptodate); >>> + >>> +/* >>> + * Cleanup all submitted ordered extents in specified range to handle >>> errors >>> + * from the fill_dellaloc() callback. >>> + * >>> + * NOTE: caller must ensure that when an error happens, it can not call >>> + * extent_clear_unlock_delalloc() to clear both the bits >>> EXTENT_DO_ACCOUNTING >>> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved >>> metadata >>> + * to be released, which we want to happen only when finishing the >>> ordered >>> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the >>> + * fill_dealloc() callback already does proper cleanup for the first >>> page of >> >> >> fill_delalloc() >> >>> + * the range, that is, it invokes the callback writepage_end_io_hook() >>> for the >>> + * range of the first page. >>> + */ >>> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode, >>> + u64 offset, u64 bytes) >>> +{ >>> + return __endio_write_update_ordered(inode, offset + PAGE_SIZE, >>> + bytes - PAGE_SIZE, false); >>> +} >>> + >>> static int btrfs_dirty_inode(struct inode *inode); >>> >>> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS >>> @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, >>> struct page *locked_page, >>> ret = cow_file_range_async(inode, locked_page, start, >>> end, >>> page_started, nr_written); >>> } >>> + if (ret) >>> + btrfs_cleanup_ordered_extents(inode, start, end - start + >>> 1); >>> return ret; >>> } >>> >>> @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio >>> *bio) >>> bio_put(bio); >>> } >>> >>> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode, >>> - const u64 offset, >>> - const u64 bytes, >>> - const int uptodate) >>> +static void __endio_write_update_ordered(struct inode *inode, >>> + u64 offset, u64 bytes, >>> + bool uptodate) >> >> >> Not serious, but why const was killed? > > > Because const for @offset, @bytes has no meaning, u64/bool are passed by > their value. > > Any modification to @offset/@bytes/@update has no effect on the passed > values. But it helps detect logic errors inside the function. For example when you have local variables with names similar to the function parameters, it's easy to make a mistake and modify a parameter instead of a local variable - that's why I always use const for new functions (and not only because I like to type a lot). > > Thanks, > Qu > > >> >> With that fixed, you can have >> >> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> >> >> Thanks, >> >> -liubo >>> >>> { >>> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); >>> struct btrfs_ordered_extent *ordered = NULL; >>> + struct btrfs_workqueue *wq; >>> + btrfs_work_func_t func; >>> u64 ordered_offset = offset; >>> u64 ordered_bytes = bytes; >>> int ret; >>> >>> + if (btrfs_is_free_space_inode(BTRFS_I(inode))) { >>> + wq = fs_info->endio_freespace_worker; >>> + func = btrfs_freespace_write_helper; >>> + } else { >>> + wq = fs_info->endio_write_workers; >>> + func = btrfs_endio_write_helper; >>> + } >>> + >>> again: >>> ret = btrfs_dec_test_first_ordered_pending(inode, &ordered, >>> &ordered_offset, >>> @@ -8161,9 +8196,8 @@ static void >>> btrfs_endio_direct_write_update_ordered(struct inode *inode, >>> if (!ret) >>> goto out_test; >>> >>> - btrfs_init_work(&ordered->work, btrfs_endio_write_helper, >>> - finish_ordered_fn, NULL, NULL); >>> - btrfs_queue_work(fs_info->endio_write_workers, &ordered->work); >>> + btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, >>> NULL); >>> + btrfs_queue_work(wq, &ordered->work); >>> out_test: >>> /* >>> * our bio might span multiple ordered extents. If we haven't >>> @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio >>> *bio) >>> struct btrfs_dio_private *dip = bio->bi_private; >>> struct bio *dio_bio = dip->dio_bio; >>> >>> - btrfs_endio_direct_write_update_ordered(dip->inode, >>> - dip->logical_offset, >>> - dip->bytes, >>> - !bio->bi_error); >>> + __endio_write_update_ordered(dip->inode, dip->logical_offset, >>> + dip->bytes, !bio->bi_error); >>> >>> kfree(dip); >>> >>> @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio >>> *dio_bio, struct inode *inode, >>> io_bio = NULL; >>> } else { >>> if (write) >>> - btrfs_endio_direct_write_update_ordered(inode, >>> + __endio_write_update_ordered(inode, >>> file_offset, >>> dio_bio->bi_iter.bi_size, >>> - 0); >>> + false); >>> else >>> unlock_extent(&BTRFS_I(inode)->io_tree, >>> file_offset, >>> file_offset + dio_bio->bi_iter.bi_size - >>> 1); >>> @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb >>> *iocb, struct iov_iter *iter) >>> */ >>> if (dio_data.unsubmitted_oe_range_start < >>> dio_data.unsubmitted_oe_range_end) >>> - >>> btrfs_endio_direct_write_update_ordered(inode, >>> + __endio_write_update_ordered(inode, >>> >>> dio_data.unsubmitted_oe_range_start, >>> dio_data.unsubmitted_oe_range_end >>> - >>> >>> dio_data.unsubmitted_oe_range_start, >>> - 0); >>> + false); >>> } else if (ret >= 0 && (size_t)ret < count) >>> btrfs_delalloc_release_space(inode, offset, >>> count - >>> (size_t)ret); >>> -- >>> 2.12.0 >>> >>> >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang 2017-03-08 0:21 ` Filipe Manana @ 2017-03-08 0:26 ` Qu Wenruo 0 siblings, 0 replies; 16+ messages in thread From: Qu Wenruo @ 2017-03-08 0:26 UTC (permalink / raw) To: Filipe Manana; +Cc: Liu Bo, linux-btrfs, Filipe Manana At 03/08/2017 08:21 AM, Filipe Manana wrote: > On Wed, Mar 8, 2017 at 12:18 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> >> >> At 03/08/2017 06:11 AM, Liu Bo wrote: >>> >>> On Mon, Mar 06, 2017 at 10:55:47AM +0800, Qu Wenruo wrote: >>>> >>>> [BUG] >>>> If run_delalloc_range() returns error and there is already some ordered >>>> extents created, btrfs will be hanged with the following backtrace: >>>> >>>> Call Trace: >>>> __schedule+0x2d4/0xae0 >>>> schedule+0x3d/0x90 >>>> btrfs_start_ordered_extent+0x160/0x200 [btrfs] >>>> ? wake_atomic_t_function+0x60/0x60 >>>> btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] >>>> btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] >>>> btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] >>>> process_one_work+0x2af/0x720 >>>> ? process_one_work+0x22b/0x720 >>>> worker_thread+0x4b/0x4f0 >>>> kthread+0x10f/0x150 >>>> ? process_one_work+0x720/0x720 >>>> ? kthread_create_on_node+0x40/0x40 >>>> ret_from_fork+0x2e/0x40 >>>> >>>> [CAUSE] >>>> >>>> |<------------------ delalloc range --------------------------->| >>>> | OE 1 | OE 2 | ... | OE n | >>>> |<>| |<---------- cleanup range --------->| >>>> || >>>> \_=> First page handled by end_extent_writepage() in >>>> __extent_writepage() >>>> >>>> The problem is caused by error handler of run_delalloc_range(), which >>>> doesn't handle any created ordered extents, leaving them waiting on >>>> btrfs_finish_ordered_io() to finish. >>>> >>>> However after run_delalloc_range() returns error, __extent_writepage() >>>> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered >>>> except the first page, and btrfs_finish_ordered_io() won't be triggered >>>> for created ordered extents either. >>>> >>>> So OE 2~n will hang forever, and if OE 1 is larger than one page, it >>>> will also hang. >>>> >>>> [FIX] >>>> Introduce btrfs_cleanup_ordered_extents() function to cleanup created >>>> ordered extents and finish them manually. >>>> >>>> The function is based on existing >>>> btrfs_endio_direct_write_update_ordered() function, and modify it to >>>> act just like btrfs_writepage_endio_hook() but handles specified range >>>> other than one page. >>>> >>>> After fix, delalloc error will be handled like: >>>> >>>> |<------------------ delalloc range --------------------------->| >>>> | OE 1 | OE 2 | ... | OE n | >>>> |<>|<-------- ----------->|<------ old error handler --------->| >>>> || || >>>> || \_=> Cleaned up by cleanup_ordered_extents() >>>> \_=> First page handled by end_extent_writepage() in >>>> __extent_writepage() >>>> >>> >>> Looks good, some nitpicks below. >>> >>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>> Signed-off-by: Filipe Manana <fdmanana@suse.com> >>>> --- >>>> changelog: >>>> v2: >>>> Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing >>>> outstanding extents, which is already done by >>>> extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit >>>> v3: >>>> Skip first page to avoid underflow ordered->bytes_left. >>>> Fix range passed in cow_file_range() which doesn't cover the whole >>>> extent. >>>> Expend extent_clear_unlock_delalloc() range to allow them to handle >>>> metadata release. >>>> v4: >>>> Don't use extra bit to skip metadata freeing for ordered extent, >>>> but only handle btrfs_reloc_clone_csums() error just before processing >>>> next extent. >>>> This makes error handle much easier for run_delalloc_nocow(). >>>> v5: >>>> Variant gramma and comment fixes suggested by Filipe Manana >>>> Enhanced commit message to focus on the generic error handler bug, >>>> pointed out by Filipe Manana >>>> Adding missing wq/func user in __endio_write_update_ordered(), pointed >>>> out by Filipe Manana. >>>> Enhanced commit message with ascii art to explain the bug more easily. >>>> Fix a bug which can leads to corrupted extent allocation, exposed by >>>> Filipe Manana. >>>> v6: >>>> Split the metadata underflow fix to another patch. >>>> Use better comment and btrfs_cleanup_ordered_extent() implementation >>>> from Filipe Manana. >>>> --- >>>> fs/btrfs/inode.c | 62 >>>> ++++++++++++++++++++++++++++++++++++++++++-------------- >>>> 1 file changed, 47 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>>> index 1d83d504f2e5..9b03eb9b27d0 100644 >>>> --- a/fs/btrfs/inode.c >>>> +++ b/fs/btrfs/inode.c >>>> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode >>>> *inode, u64 start, u64 len, >>>> u64 ram_bytes, int compress_type, >>>> int type); >>>> >>>> +static void __endio_write_update_ordered(struct inode *inode, >>>> + u64 offset, u64 bytes, >>>> + bool uptodate); >>>> + >>>> +/* >>>> + * Cleanup all submitted ordered extents in specified range to handle >>>> errors >>>> + * from the fill_dellaloc() callback. >>>> + * >>>> + * NOTE: caller must ensure that when an error happens, it can not call >>>> + * extent_clear_unlock_delalloc() to clear both the bits >>>> EXTENT_DO_ACCOUNTING >>>> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved >>>> metadata >>>> + * to be released, which we want to happen only when finishing the >>>> ordered >>>> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the >>>> + * fill_dealloc() callback already does proper cleanup for the first >>>> page of >>> >>> >>> fill_delalloc() >>> >>>> + * the range, that is, it invokes the callback writepage_end_io_hook() >>>> for the >>>> + * range of the first page. >>>> + */ >>>> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode, >>>> + u64 offset, u64 bytes) >>>> +{ >>>> + return __endio_write_update_ordered(inode, offset + PAGE_SIZE, >>>> + bytes - PAGE_SIZE, false); >>>> +} >>>> + >>>> static int btrfs_dirty_inode(struct inode *inode); >>>> >>>> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS >>>> @@ -1536,6 +1560,8 @@ static int run_delalloc_range(struct inode *inode, >>>> struct page *locked_page, >>>> ret = cow_file_range_async(inode, locked_page, start, >>>> end, >>>> page_started, nr_written); >>>> } >>>> + if (ret) >>>> + btrfs_cleanup_ordered_extents(inode, start, end - start + >>>> 1); >>>> return ret; >>>> } >>>> >>>> @@ -8142,17 +8168,26 @@ static void btrfs_endio_direct_read(struct bio >>>> *bio) >>>> bio_put(bio); >>>> } >>>> >>>> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode, >>>> - const u64 offset, >>>> - const u64 bytes, >>>> - const int uptodate) >>>> +static void __endio_write_update_ordered(struct inode *inode, >>>> + u64 offset, u64 bytes, >>>> + bool uptodate) >>> >>> >>> Not serious, but why const was killed? >> >> >> Because const for @offset, @bytes has no meaning, u64/bool are passed by >> their value. >> >> Any modification to @offset/@bytes/@update has no effect on the passed >> values. > > But it helps detect logic errors inside the function. For example when > you have local variables with names similar to the function > parameters, it's easy to make a mistake and modify a parameter instead > of a local variable - that's why I always use const for new functions > (and not only because I like to type a lot). Yes, that makes sense. I was originally worried about killing the possibility to reuse @offset as iterator, just like what we do with @start in cow_file_range(). But it turns out that, the modification of @start in cow_file_range() is already a bad idea. So in next version const prefix will be added bad. Thanks, Qu > >> >> Thanks, >> Qu >> >> >>> >>> With that fixed, you can have >>> >>> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> >>> >>> Thanks, >>> >>> -liubo >>>> >>>> { >>>> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); >>>> struct btrfs_ordered_extent *ordered = NULL; >>>> + struct btrfs_workqueue *wq; >>>> + btrfs_work_func_t func; >>>> u64 ordered_offset = offset; >>>> u64 ordered_bytes = bytes; >>>> int ret; >>>> >>>> + if (btrfs_is_free_space_inode(BTRFS_I(inode))) { >>>> + wq = fs_info->endio_freespace_worker; >>>> + func = btrfs_freespace_write_helper; >>>> + } else { >>>> + wq = fs_info->endio_write_workers; >>>> + func = btrfs_endio_write_helper; >>>> + } >>>> + >>>> again: >>>> ret = btrfs_dec_test_first_ordered_pending(inode, &ordered, >>>> &ordered_offset, >>>> @@ -8161,9 +8196,8 @@ static void >>>> btrfs_endio_direct_write_update_ordered(struct inode *inode, >>>> if (!ret) >>>> goto out_test; >>>> >>>> - btrfs_init_work(&ordered->work, btrfs_endio_write_helper, >>>> - finish_ordered_fn, NULL, NULL); >>>> - btrfs_queue_work(fs_info->endio_write_workers, &ordered->work); >>>> + btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, >>>> NULL); >>>> + btrfs_queue_work(wq, &ordered->work); >>>> out_test: >>>> /* >>>> * our bio might span multiple ordered extents. If we haven't >>>> @@ -8181,10 +8215,8 @@ static void btrfs_endio_direct_write(struct bio >>>> *bio) >>>> struct btrfs_dio_private *dip = bio->bi_private; >>>> struct bio *dio_bio = dip->dio_bio; >>>> >>>> - btrfs_endio_direct_write_update_ordered(dip->inode, >>>> - dip->logical_offset, >>>> - dip->bytes, >>>> - !bio->bi_error); >>>> + __endio_write_update_ordered(dip->inode, dip->logical_offset, >>>> + dip->bytes, !bio->bi_error); >>>> >>>> kfree(dip); >>>> >>>> @@ -8545,10 +8577,10 @@ static void btrfs_submit_direct(struct bio >>>> *dio_bio, struct inode *inode, >>>> io_bio = NULL; >>>> } else { >>>> if (write) >>>> - btrfs_endio_direct_write_update_ordered(inode, >>>> + __endio_write_update_ordered(inode, >>>> file_offset, >>>> dio_bio->bi_iter.bi_size, >>>> - 0); >>>> + false); >>>> else >>>> unlock_extent(&BTRFS_I(inode)->io_tree, >>>> file_offset, >>>> file_offset + dio_bio->bi_iter.bi_size - >>>> 1); >>>> @@ -8683,11 +8715,11 @@ static ssize_t btrfs_direct_IO(struct kiocb >>>> *iocb, struct iov_iter *iter) >>>> */ >>>> if (dio_data.unsubmitted_oe_range_start < >>>> dio_data.unsubmitted_oe_range_end) >>>> - >>>> btrfs_endio_direct_write_update_ordered(inode, >>>> + __endio_write_update_ordered(inode, >>>> >>>> dio_data.unsubmitted_oe_range_start, >>>> dio_data.unsubmitted_oe_range_end >>>> - >>>> >>>> dio_data.unsubmitted_oe_range_start, >>>> - 0); >>>> + false); >>>> } else if (ret >= 0 && (size_t)ret < count) >>>> btrfs_delalloc_release_space(inode, offset, >>>> count - >>>> (size_t)ret); >>>> -- >>>> 2.12.0 >>>> >>>> >>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >>> >> >> > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-06 2:55 [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo 2017-03-06 2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo @ 2017-03-06 23:19 ` Filipe Manana 2017-03-07 17:32 ` David Sterba 2017-03-07 20:49 ` Liu Bo 2 siblings, 1 reply; 16+ messages in thread From: Filipe Manana @ 2017-03-06 23:19 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > [BUG] > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > and leads to kernel assertion on outstanding extents in > run_delalloc_nocow() and cow_file_range(). > > BTRFS info (device vdb5): relocating block group 12582912 flags data > BTRFS info (device vdb5): found 1 extents > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 > > Currently, due to another bug blocking ordered extents, the bug is only > reproducible under certain block group layout and using error injection. > > a) Create one data block group with one 4K extent in it. > To avoid the bug that hangs btrfs due to ordered extent which never > finishes > b) Make btrfs_reloc_clone_csum() always fail > c) Relocate that block group > > [CAUSE] > run_delalloc_nocow() and cow_file_range() handles error from > btrfs_reloc_clone_csum() wrongly: > > (The ascii chart shows a more generic case of this bug other than the > bug mentioned above) > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- cleanup range --------------->| > |<----------- ----------->| > \/ > btrfs_finish_ordered_io() range > > So error handler, which calls extent_clear_unlock_delalloc() with > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > will both cover OE n, and free its metadata, causing metadata under flow. > > [Fix] > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > call error handler after increasing the iteration offset, so that > cleanup range won't cover any created ordered extent. > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- ----------->|<---------- cleanup range --------->| > \/ > btrfs_finish_ordered_io() range > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> thanks > --- > changelog: > v6: > New, split from v5 patch, as this is a separate bug. > --- > fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 39 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index b2bc07aad1ae..1d83d504f2e5 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, > BTRFS_DATA_RELOC_TREE_OBJECTID) { > ret = btrfs_reloc_clone_csums(inode, start, > cur_alloc_size); > + /* > + * Only drop cache here, and process as normal. > + * > + * We must not allow extent_clear_unlock_delalloc() > + * at out_unlock label to free meta of this ordered > + * extent, as its meta should be freed by > + * btrfs_finish_ordered_io(). > + * > + * So we must continue until @start is increased to > + * skip current ordered extent. > + */ > if (ret) > - goto out_drop_extent_cache; > + btrfs_drop_extent_cache(BTRFS_I(inode), start, > + start + ram_size - 1, 0); > } > > btrfs_dec_block_group_reservations(fs_info, ins.objectid); > > - if (disk_num_bytes < cur_alloc_size) > - break; > - > /* we're not doing compressed IO, don't unlock the first > * page (which the caller expects to stay locked), don't > * clear any dirty bits and don't set any writeback bits > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, > delalloc_end, locked_page, > EXTENT_LOCKED | EXTENT_DELALLOC, > op); > - disk_num_bytes -= cur_alloc_size; > + if (disk_num_bytes > cur_alloc_size) > + disk_num_bytes = 0; > + else > + disk_num_bytes -= cur_alloc_size; > num_bytes -= cur_alloc_size; > alloc_hint = ins.objectid + ins.offset; > start += cur_alloc_size; > + > + /* > + * btrfs_reloc_clone_csums() error, since start is increased > + * extent_clear_unlock_delalloc() at out_unlock label won't > + * free metadata of current ordered extent, we're OK to exit. > + */ > + if (ret) > + goto out_unlock; > } > out: > return ret; > @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, > BUG_ON(ret); /* -ENOMEM */ > > if (root->root_key.objectid == > - BTRFS_DATA_RELOC_TREE_OBJECTID) { > + BTRFS_DATA_RELOC_TREE_OBJECTID) > + /* > + * Error handled later, as we must prevent > + * extent_clear_unlock_delalloc() in error handler > + * from freeing metadata of created ordered extent. > + */ > ret = btrfs_reloc_clone_csums(inode, cur_offset, > num_bytes); > - if (ret) { > - if (!nolock && nocow) > - btrfs_end_write_no_snapshoting(root); > - goto error; > - } > - } > > extent_clear_unlock_delalloc(inode, cur_offset, > cur_offset + num_bytes - 1, end, > @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, > if (!nolock && nocow) > btrfs_end_write_no_snapshoting(root); > cur_offset = extent_end; > + > + /* > + * btrfs_reloc_clone_csums() error, now we're OK to call error > + * handler, as metadata for created ordered extent will only > + * be freed by btrfs_finish_ordered_io(). > + */ > + if (ret) > + goto error; > if (cur_offset > end) > break; > } > -- > 2.12.0 > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-06 23:19 ` [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana @ 2017-03-07 17:32 ` David Sterba 2017-03-07 20:51 ` Liu Bo 0 siblings, 1 reply; 16+ messages in thread From: David Sterba @ 2017-03-07 17:32 UTC (permalink / raw) To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs On Mon, Mar 06, 2017 at 11:19:32PM +0000, Filipe Manana wrote: > On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > [BUG] > > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > > and leads to kernel assertion on outstanding extents in > > run_delalloc_nocow() and cow_file_range(). > > > > BTRFS info (device vdb5): relocating block group 12582912 flags data > > BTRFS info (device vdb5): found 1 extents > > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 > > > > Currently, due to another bug blocking ordered extents, the bug is only > > reproducible under certain block group layout and using error injection. > > > > a) Create one data block group with one 4K extent in it. > > To avoid the bug that hangs btrfs due to ordered extent which never > > finishes > > b) Make btrfs_reloc_clone_csum() always fail > > c) Relocate that block group > > > > [CAUSE] > > run_delalloc_nocow() and cow_file_range() handles error from > > btrfs_reloc_clone_csum() wrongly: > > > > (The ascii chart shows a more generic case of this bug other than the > > bug mentioned above) > > > > |<------------------ delalloc range --------------------------->| > > | OE 1 | OE 2 | ... | OE n | > > |<----------- cleanup range --------------->| > > |<----------- ----------->| > > \/ > > btrfs_finish_ordered_io() range > > > > So error handler, which calls extent_clear_unlock_delalloc() with > > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > > will both cover OE n, and free its metadata, causing metadata under flow. > > > > [Fix] > > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > > call error handler after increasing the iteration offset, so that > > cleanup range won't cover any created ordered extent. > > > > |<------------------ delalloc range --------------------------->| > > | OE 1 | OE 2 | ... | OE n | > > |<----------- ----------->|<---------- cleanup range --------->| > > \/ > > btrfs_finish_ordered_io() range > > > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > > Reviewed-by: Filipe Manana <fdmanana@suse.com> 1-2 added to for-next and queued for 4.11. Thanks. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-07 17:32 ` David Sterba @ 2017-03-07 20:51 ` Liu Bo 0 siblings, 0 replies; 16+ messages in thread From: Liu Bo @ 2017-03-07 20:51 UTC (permalink / raw) To: dsterba, Filipe Manana, Qu Wenruo, linux-btrfs On Tue, Mar 07, 2017 at 06:32:23PM +0100, David Sterba wrote: > On Mon, Mar 06, 2017 at 11:19:32PM +0000, Filipe Manana wrote: > > On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > > [BUG] > > > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > > > and leads to kernel assertion on outstanding extents in > > > run_delalloc_nocow() and cow_file_range(). > > > > > > BTRFS info (device vdb5): relocating block group 12582912 flags data > > > BTRFS info (device vdb5): found 1 extents > > > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 > > > > > > Currently, due to another bug blocking ordered extents, the bug is only > > > reproducible under certain block group layout and using error injection. > > > > > > a) Create one data block group with one 4K extent in it. > > > To avoid the bug that hangs btrfs due to ordered extent which never > > > finishes > > > b) Make btrfs_reloc_clone_csum() always fail > > > c) Relocate that block group > > > > > > [CAUSE] > > > run_delalloc_nocow() and cow_file_range() handles error from > > > btrfs_reloc_clone_csum() wrongly: > > > > > > (The ascii chart shows a more generic case of this bug other than the > > > bug mentioned above) > > > > > > |<------------------ delalloc range --------------------------->| > > > | OE 1 | OE 2 | ... | OE n | > > > |<----------- cleanup range --------------->| > > > |<----------- ----------->| > > > \/ > > > btrfs_finish_ordered_io() range > > > > > > So error handler, which calls extent_clear_unlock_delalloc() with > > > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > > > will both cover OE n, and free its metadata, causing metadata under flow. > > > > > > [Fix] > > > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > > > call error handler after increasing the iteration offset, so that > > > cleanup range won't cover any created ordered extent. > > > > > > |<------------------ delalloc range --------------------------->| > > > | OE 1 | OE 2 | ... | OE n | > > > |<----------- ----------->|<---------- cleanup range --------->| > > > \/ > > > btrfs_finish_ordered_io() range > > > > > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > > > > Reviewed-by: Filipe Manana <fdmanana@suse.com> > > 1-2 added to for-next and queued for 4.11. Thanks. I'm afraid that patch 1 has a problem, please take a look at the thread. Thanks, -liubo ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-06 2:55 [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo 2017-03-06 2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo 2017-03-06 23:19 ` [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana @ 2017-03-07 20:49 ` Liu Bo 2017-03-07 20:59 ` Liu Bo 2017-03-08 0:17 ` Qu Wenruo 2 siblings, 2 replies; 16+ messages in thread From: Liu Bo @ 2017-03-07 20:49 UTC (permalink / raw) To: Qu Wenruo; +Cc: fdmanana, linux-btrfs On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote: > [BUG] > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > and leads to kernel assertion on outstanding extents in > run_delalloc_nocow() and cow_file_range(). > > BTRFS info (device vdb5): relocating block group 12582912 flags data > BTRFS info (device vdb5): found 1 extents > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 > > Currently, due to another bug blocking ordered extents, the bug is only > reproducible under certain block group layout and using error injection. > > a) Create one data block group with one 4K extent in it. > To avoid the bug that hangs btrfs due to ordered extent which never > finishes > b) Make btrfs_reloc_clone_csum() always fail > c) Relocate that block group > > [CAUSE] > run_delalloc_nocow() and cow_file_range() handles error from > btrfs_reloc_clone_csum() wrongly: > > (The ascii chart shows a more generic case of this bug other than the > bug mentioned above) > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- cleanup range --------------->| > |<----------- ----------->| > \/ > btrfs_finish_ordered_io() range > > So error handler, which calls extent_clear_unlock_delalloc() with > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > will both cover OE n, and free its metadata, causing metadata under flow. > > [Fix] > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > call error handler after increasing the iteration offset, so that > cleanup range won't cover any created ordered extent. > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- ----------->|<---------- cleanup range --------->| > \/ > btrfs_finish_ordered_io() range > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > --- > changelog: > v6: > New, split from v5 patch, as this is a separate bug. > --- > fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 39 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index b2bc07aad1ae..1d83d504f2e5 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, > BTRFS_DATA_RELOC_TREE_OBJECTID) { > ret = btrfs_reloc_clone_csums(inode, start, > cur_alloc_size); > + /* > + * Only drop cache here, and process as normal. > + * > + * We must not allow extent_clear_unlock_delalloc() > + * at out_unlock label to free meta of this ordered > + * extent, as its meta should be freed by > + * btrfs_finish_ordered_io(). > + * > + * So we must continue until @start is increased to > + * skip current ordered extent. > + */ > if (ret) > - goto out_drop_extent_cache; > + btrfs_drop_extent_cache(BTRFS_I(inode), start, > + start + ram_size - 1, 0); > } > > btrfs_dec_block_group_reservations(fs_info, ins.objectid); > > - if (disk_num_bytes < cur_alloc_size) > - break; > - > /* we're not doing compressed IO, don't unlock the first > * page (which the caller expects to stay locked), don't > * clear any dirty bits and don't set any writeback bits > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, > delalloc_end, locked_page, > EXTENT_LOCKED | EXTENT_DELALLOC, > op); > - disk_num_bytes -= cur_alloc_size; > + if (disk_num_bytes > cur_alloc_size) > + disk_num_bytes = 0; > + else > + disk_num_bytes -= cur_alloc_size; I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size? Thanks, -liubo > num_bytes -= cur_alloc_size; > alloc_hint = ins.objectid + ins.offset; > start += cur_alloc_size; > + > + /* > + * btrfs_reloc_clone_csums() error, since start is increased > + * extent_clear_unlock_delalloc() at out_unlock label won't > + * free metadata of current ordered extent, we're OK to exit. > + */ > + if (ret) > + goto out_unlock; > } > out: > return ret; > @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, > BUG_ON(ret); /* -ENOMEM */ > > if (root->root_key.objectid == > - BTRFS_DATA_RELOC_TREE_OBJECTID) { > + BTRFS_DATA_RELOC_TREE_OBJECTID) > + /* > + * Error handled later, as we must prevent > + * extent_clear_unlock_delalloc() in error handler > + * from freeing metadata of created ordered extent. > + */ > ret = btrfs_reloc_clone_csums(inode, cur_offset, > num_bytes); > - if (ret) { > - if (!nolock && nocow) > - btrfs_end_write_no_snapshoting(root); > - goto error; > - } > - } > > extent_clear_unlock_delalloc(inode, cur_offset, > cur_offset + num_bytes - 1, end, > @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, > if (!nolock && nocow) > btrfs_end_write_no_snapshoting(root); > cur_offset = extent_end; > + > + /* > + * btrfs_reloc_clone_csums() error, now we're OK to call error > + * handler, as metadata for created ordered extent will only > + * be freed by btrfs_finish_ordered_io(). > + */ > + if (ret) > + goto error; > if (cur_offset > end) > break; > } > -- > 2.12.0 > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-07 20:49 ` Liu Bo @ 2017-03-07 20:59 ` Liu Bo 2017-03-08 0:17 ` Filipe Manana 2017-03-08 0:17 ` Qu Wenruo 1 sibling, 1 reply; 16+ messages in thread From: Liu Bo @ 2017-03-07 20:59 UTC (permalink / raw) To: Qu Wenruo; +Cc: fdmanana, linux-btrfs On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote: > On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote: > > [BUG] > > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > > and leads to kernel assertion on outstanding extents in > > run_delalloc_nocow() and cow_file_range(). > > > > BTRFS info (device vdb5): relocating block group 12582912 flags data > > BTRFS info (device vdb5): found 1 extents > > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 > > > > Currently, due to another bug blocking ordered extents, the bug is only > > reproducible under certain block group layout and using error injection. > > > > a) Create one data block group with one 4K extent in it. > > To avoid the bug that hangs btrfs due to ordered extent which never > > finishes > > b) Make btrfs_reloc_clone_csum() always fail > > c) Relocate that block group > > > > [CAUSE] > > run_delalloc_nocow() and cow_file_range() handles error from > > btrfs_reloc_clone_csum() wrongly: > > > > (The ascii chart shows a more generic case of this bug other than the > > bug mentioned above) > > > > |<------------------ delalloc range --------------------------->| > > | OE 1 | OE 2 | ... | OE n | > > |<----------- cleanup range --------------->| > > |<----------- ----------->| > > \/ > > btrfs_finish_ordered_io() range > > > > So error handler, which calls extent_clear_unlock_delalloc() with > > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > > will both cover OE n, and free its metadata, causing metadata under flow. > > > > [Fix] > > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > > call error handler after increasing the iteration offset, so that > > cleanup range won't cover any created ordered extent. > > > > |<------------------ delalloc range --------------------------->| > > | OE 1 | OE 2 | ... | OE n | > > |<----------- ----------->|<---------- cleanup range --------->| > > \/ > > btrfs_finish_ordered_io() range > > > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > > --- > > changelog: > > v6: > > New, split from v5 patch, as this is a separate bug. > > --- > > fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ > > 1 file changed, 39 insertions(+), 12 deletions(-) > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index b2bc07aad1ae..1d83d504f2e5 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, > > BTRFS_DATA_RELOC_TREE_OBJECTID) { > > ret = btrfs_reloc_clone_csums(inode, start, > > cur_alloc_size); > > + /* > > + * Only drop cache here, and process as normal. > > + * > > + * We must not allow extent_clear_unlock_delalloc() > > + * at out_unlock label to free meta of this ordered > > + * extent, as its meta should be freed by > > + * btrfs_finish_ordered_io(). > > + * > > + * So we must continue until @start is increased to > > + * skip current ordered extent. > > + */ > > if (ret) > > - goto out_drop_extent_cache; > > + btrfs_drop_extent_cache(BTRFS_I(inode), start, > > + start + ram_size - 1, 0); > > } > > > > btrfs_dec_block_group_reservations(fs_info, ins.objectid); > > > > - if (disk_num_bytes < cur_alloc_size) > > - break; > > - > > /* we're not doing compressed IO, don't unlock the first > > * page (which the caller expects to stay locked), don't > > * clear any dirty bits and don't set any writeback bits > > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, > > delalloc_end, locked_page, > > EXTENT_LOCKED | EXTENT_DELALLOC, > > op); > > - disk_num_bytes -= cur_alloc_size; > > + if (disk_num_bytes > cur_alloc_size) > > + disk_num_bytes = 0; > > + else > > + disk_num_bytes -= cur_alloc_size; > > I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size? I assume that you've run fstests against this patch, if so, I actually start worrying about that no fstests found this problem. Thanks, -liubo ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-07 20:59 ` Liu Bo @ 2017-03-08 0:17 ` Filipe Manana 2017-03-08 1:16 ` Liu Bo 0 siblings, 1 reply; 16+ messages in thread From: Filipe Manana @ 2017-03-08 0:17 UTC (permalink / raw) To: Liu Bo; +Cc: Qu Wenruo, linux-btrfs On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote: > On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote: >> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote: >> > [BUG] >> > When btrfs_reloc_clone_csum() reports error, it can underflow metadata >> > and leads to kernel assertion on outstanding extents in >> > run_delalloc_nocow() and cow_file_range(). >> > >> > BTRFS info (device vdb5): relocating block group 12582912 flags data >> > BTRFS info (device vdb5): found 1 extents >> > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 >> > >> > Currently, due to another bug blocking ordered extents, the bug is only >> > reproducible under certain block group layout and using error injection. >> > >> > a) Create one data block group with one 4K extent in it. >> > To avoid the bug that hangs btrfs due to ordered extent which never >> > finishes >> > b) Make btrfs_reloc_clone_csum() always fail >> > c) Relocate that block group >> > >> > [CAUSE] >> > run_delalloc_nocow() and cow_file_range() handles error from >> > btrfs_reloc_clone_csum() wrongly: >> > >> > (The ascii chart shows a more generic case of this bug other than the >> > bug mentioned above) >> > >> > |<------------------ delalloc range --------------------------->| >> > | OE 1 | OE 2 | ... | OE n | >> > |<----------- cleanup range --------------->| >> > |<----------- ----------->| >> > \/ >> > btrfs_finish_ordered_io() range >> > >> > So error handler, which calls extent_clear_unlock_delalloc() with >> > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() >> > will both cover OE n, and free its metadata, causing metadata under flow. >> > >> > [Fix] >> > The fix is to ensure after calling btrfs_add_ordered_extent(), we only >> > call error handler after increasing the iteration offset, so that >> > cleanup range won't cover any created ordered extent. >> > >> > |<------------------ delalloc range --------------------------->| >> > | OE 1 | OE 2 | ... | OE n | >> > |<----------- ----------->|<---------- cleanup range --------->| >> > \/ >> > btrfs_finish_ordered_io() range >> > >> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >> > --- >> > changelog: >> > v6: >> > New, split from v5 patch, as this is a separate bug. >> > --- >> > fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ >> > 1 file changed, 39 insertions(+), 12 deletions(-) >> > >> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> > index b2bc07aad1ae..1d83d504f2e5 100644 >> > --- a/fs/btrfs/inode.c >> > +++ b/fs/btrfs/inode.c >> > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, >> > BTRFS_DATA_RELOC_TREE_OBJECTID) { >> > ret = btrfs_reloc_clone_csums(inode, start, >> > cur_alloc_size); >> > + /* >> > + * Only drop cache here, and process as normal. >> > + * >> > + * We must not allow extent_clear_unlock_delalloc() >> > + * at out_unlock label to free meta of this ordered >> > + * extent, as its meta should be freed by >> > + * btrfs_finish_ordered_io(). >> > + * >> > + * So we must continue until @start is increased to >> > + * skip current ordered extent. >> > + */ >> > if (ret) >> > - goto out_drop_extent_cache; >> > + btrfs_drop_extent_cache(BTRFS_I(inode), start, >> > + start + ram_size - 1, 0); >> > } >> > >> > btrfs_dec_block_group_reservations(fs_info, ins.objectid); >> > >> > - if (disk_num_bytes < cur_alloc_size) >> > - break; >> > - >> > /* we're not doing compressed IO, don't unlock the first >> > * page (which the caller expects to stay locked), don't >> > * clear any dirty bits and don't set any writeback bits >> > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, >> > delalloc_end, locked_page, >> > EXTENT_LOCKED | EXTENT_DELALLOC, >> > op); >> > - disk_num_bytes -= cur_alloc_size; >> > + if (disk_num_bytes > cur_alloc_size) >> > + disk_num_bytes = 0; >> > + else >> > + disk_num_bytes -= cur_alloc_size; >> >> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size? > > I assume that you've run fstests against this patch, if so, I actually > start worrying about that no fstests found this problem. The operator is definitely wrong, should be < instead of > (previous patch versions were correct). It's not a surprise fstests don't catch this - one would need a fs with no more free space to allocate new data chunks and existing data chunks would have to be fragmented to the point we need to allocate two or more smaller extents to satisfy the write, so whether fstests exercises this depends on the size of the test devices (mine are 100G for example) and the size of the data the tests create. Having a deterministic test for that for that should be possible and useful. > > Thanks, > > -liubo ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-08 0:17 ` Filipe Manana @ 2017-03-08 1:16 ` Liu Bo 2017-03-08 1:21 ` Qu Wenruo 0 siblings, 1 reply; 16+ messages in thread From: Liu Bo @ 2017-03-08 1:16 UTC (permalink / raw) To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs On Wed, Mar 08, 2017 at 12:17:16AM +0000, Filipe Manana wrote: > On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote: > > On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote: > >> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote: > >> > [BUG] > >> > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > >> > and leads to kernel assertion on outstanding extents in > >> > run_delalloc_nocow() and cow_file_range(). > >> > > >> > BTRFS info (device vdb5): relocating block group 12582912 flags data > >> > BTRFS info (device vdb5): found 1 extents > >> > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 > >> > > >> > Currently, due to another bug blocking ordered extents, the bug is only > >> > reproducible under certain block group layout and using error injection. > >> > > >> > a) Create one data block group with one 4K extent in it. > >> > To avoid the bug that hangs btrfs due to ordered extent which never > >> > finishes > >> > b) Make btrfs_reloc_clone_csum() always fail > >> > c) Relocate that block group > >> > > >> > [CAUSE] > >> > run_delalloc_nocow() and cow_file_range() handles error from > >> > btrfs_reloc_clone_csum() wrongly: > >> > > >> > (The ascii chart shows a more generic case of this bug other than the > >> > bug mentioned above) > >> > > >> > |<------------------ delalloc range --------------------------->| > >> > | OE 1 | OE 2 | ... | OE n | > >> > |<----------- cleanup range --------------->| > >> > |<----------- ----------->| > >> > \/ > >> > btrfs_finish_ordered_io() range > >> > > >> > So error handler, which calls extent_clear_unlock_delalloc() with > >> > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > >> > will both cover OE n, and free its metadata, causing metadata under flow. > >> > > >> > [Fix] > >> > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > >> > call error handler after increasing the iteration offset, so that > >> > cleanup range won't cover any created ordered extent. > >> > > >> > |<------------------ delalloc range --------------------------->| > >> > | OE 1 | OE 2 | ... | OE n | > >> > |<----------- ----------->|<---------- cleanup range --------->| > >> > \/ > >> > btrfs_finish_ordered_io() range > >> > > >> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > >> > --- > >> > changelog: > >> > v6: > >> > New, split from v5 patch, as this is a separate bug. > >> > --- > >> > fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ > >> > 1 file changed, 39 insertions(+), 12 deletions(-) > >> > > >> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > >> > index b2bc07aad1ae..1d83d504f2e5 100644 > >> > --- a/fs/btrfs/inode.c > >> > +++ b/fs/btrfs/inode.c > >> > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, > >> > BTRFS_DATA_RELOC_TREE_OBJECTID) { > >> > ret = btrfs_reloc_clone_csums(inode, start, > >> > cur_alloc_size); > >> > + /* > >> > + * Only drop cache here, and process as normal. > >> > + * > >> > + * We must not allow extent_clear_unlock_delalloc() > >> > + * at out_unlock label to free meta of this ordered > >> > + * extent, as its meta should be freed by > >> > + * btrfs_finish_ordered_io(). > >> > + * > >> > + * So we must continue until @start is increased to > >> > + * skip current ordered extent. > >> > + */ > >> > if (ret) > >> > - goto out_drop_extent_cache; > >> > + btrfs_drop_extent_cache(BTRFS_I(inode), start, > >> > + start + ram_size - 1, 0); > >> > } > >> > > >> > btrfs_dec_block_group_reservations(fs_info, ins.objectid); > >> > > >> > - if (disk_num_bytes < cur_alloc_size) > >> > - break; > >> > - > >> > /* we're not doing compressed IO, don't unlock the first > >> > * page (which the caller expects to stay locked), don't > >> > * clear any dirty bits and don't set any writeback bits > >> > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, > >> > delalloc_end, locked_page, > >> > EXTENT_LOCKED | EXTENT_DELALLOC, > >> > op); > >> > - disk_num_bytes -= cur_alloc_size; > >> > + if (disk_num_bytes > cur_alloc_size) > >> > + disk_num_bytes = 0; > >> > + else > >> > + disk_num_bytes -= cur_alloc_size; > >> > >> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size? > > > > I assume that you've run fstests against this patch, if so, I actually > > start worrying about that no fstests found this problem. > > The operator is definitely wrong, should be < instead of > (previous > patch versions were correct). > > It's not a surprise fstests don't catch this - one would need a fs > with no more free space to allocate new data chunks and existing data > chunks would have to be fragmented to the point we need to allocate > two or more smaller extents to satisfy the write, so whether fstests > exercises this depends on the size of the test devices (mine are 100G > for example) and the size of the data the tests create. Having a > deterministic test for that for that should be possible and useful. I see, I think that 'mount -o fragment=data' could help us get it. Thanks, -liubo ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-08 1:16 ` Liu Bo @ 2017-03-08 1:21 ` Qu Wenruo 0 siblings, 0 replies; 16+ messages in thread From: Qu Wenruo @ 2017-03-08 1:21 UTC (permalink / raw) To: bo.li.liu, Filipe Manana; +Cc: linux-btrfs At 03/08/2017 09:16 AM, Liu Bo wrote: > On Wed, Mar 08, 2017 at 12:17:16AM +0000, Filipe Manana wrote: >> On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote: >>> On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote: >>>> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote: >>>>> [BUG] >>>>> When btrfs_reloc_clone_csum() reports error, it can underflow metadata >>>>> and leads to kernel assertion on outstanding extents in >>>>> run_delalloc_nocow() and cow_file_range(). >>>>> >>>>> BTRFS info (device vdb5): relocating block group 12582912 flags data >>>>> BTRFS info (device vdb5): found 1 extents >>>>> assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 >>>>> >>>>> Currently, due to another bug blocking ordered extents, the bug is only >>>>> reproducible under certain block group layout and using error injection. >>>>> >>>>> a) Create one data block group with one 4K extent in it. >>>>> To avoid the bug that hangs btrfs due to ordered extent which never >>>>> finishes >>>>> b) Make btrfs_reloc_clone_csum() always fail >>>>> c) Relocate that block group >>>>> >>>>> [CAUSE] >>>>> run_delalloc_nocow() and cow_file_range() handles error from >>>>> btrfs_reloc_clone_csum() wrongly: >>>>> >>>>> (The ascii chart shows a more generic case of this bug other than the >>>>> bug mentioned above) >>>>> >>>>> |<------------------ delalloc range --------------------------->| >>>>> | OE 1 | OE 2 | ... | OE n | >>>>> |<----------- cleanup range --------------->| >>>>> |<----------- ----------->| >>>>> \/ >>>>> btrfs_finish_ordered_io() range >>>>> >>>>> So error handler, which calls extent_clear_unlock_delalloc() with >>>>> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() >>>>> will both cover OE n, and free its metadata, causing metadata under flow. >>>>> >>>>> [Fix] >>>>> The fix is to ensure after calling btrfs_add_ordered_extent(), we only >>>>> call error handler after increasing the iteration offset, so that >>>>> cleanup range won't cover any created ordered extent. >>>>> >>>>> |<------------------ delalloc range --------------------------->| >>>>> | OE 1 | OE 2 | ... | OE n | >>>>> |<----------- ----------->|<---------- cleanup range --------->| >>>>> \/ >>>>> btrfs_finish_ordered_io() range >>>>> >>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >>>>> --- >>>>> changelog: >>>>> v6: >>>>> New, split from v5 patch, as this is a separate bug. >>>>> --- >>>>> fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ >>>>> 1 file changed, 39 insertions(+), 12 deletions(-) >>>>> >>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>>>> index b2bc07aad1ae..1d83d504f2e5 100644 >>>>> --- a/fs/btrfs/inode.c >>>>> +++ b/fs/btrfs/inode.c >>>>> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, >>>>> BTRFS_DATA_RELOC_TREE_OBJECTID) { >>>>> ret = btrfs_reloc_clone_csums(inode, start, >>>>> cur_alloc_size); >>>>> + /* >>>>> + * Only drop cache here, and process as normal. >>>>> + * >>>>> + * We must not allow extent_clear_unlock_delalloc() >>>>> + * at out_unlock label to free meta of this ordered >>>>> + * extent, as its meta should be freed by >>>>> + * btrfs_finish_ordered_io(). >>>>> + * >>>>> + * So we must continue until @start is increased to >>>>> + * skip current ordered extent. >>>>> + */ >>>>> if (ret) >>>>> - goto out_drop_extent_cache; >>>>> + btrfs_drop_extent_cache(BTRFS_I(inode), start, >>>>> + start + ram_size - 1, 0); >>>>> } >>>>> >>>>> btrfs_dec_block_group_reservations(fs_info, ins.objectid); >>>>> >>>>> - if (disk_num_bytes < cur_alloc_size) >>>>> - break; >>>>> - >>>>> /* we're not doing compressed IO, don't unlock the first >>>>> * page (which the caller expects to stay locked), don't >>>>> * clear any dirty bits and don't set any writeback bits >>>>> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, >>>>> delalloc_end, locked_page, >>>>> EXTENT_LOCKED | EXTENT_DELALLOC, >>>>> op); >>>>> - disk_num_bytes -= cur_alloc_size; >>>>> + if (disk_num_bytes > cur_alloc_size) >>>>> + disk_num_bytes = 0; >>>>> + else >>>>> + disk_num_bytes -= cur_alloc_size; >>>> >>>> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size? >>> >>> I assume that you've run fstests against this patch, if so, I actually >>> start worrying about that no fstests found this problem. >> >> The operator is definitely wrong, should be < instead of > (previous >> patch versions were correct). >> >> It's not a surprise fstests don't catch this - one would need a fs >> with no more free space to allocate new data chunks and existing data >> chunks would have to be fragmented to the point we need to allocate >> two or more smaller extents to satisfy the write, so whether fstests >> exercises this depends on the size of the test devices (mine are 100G >> for example) and the size of the data the tests create. Having a >> deterministic test for that for that should be possible and useful. > > I see, I think that 'mount -o fragment=data' could help us get it. > > Thanks, > > -liubo > > I'll also submit a test case for this. By filling the fs with small files(sectorsize) with sequential number as file name. Then sync fs, remove the files with odd number, then do a large write. This one should trigger the problem without the use of the debug option, and can be a generic test. Thanks, Qu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error 2017-03-07 20:49 ` Liu Bo 2017-03-07 20:59 ` Liu Bo @ 2017-03-08 0:17 ` Qu Wenruo 1 sibling, 0 replies; 16+ messages in thread From: Qu Wenruo @ 2017-03-08 0:17 UTC (permalink / raw) To: bo.li.liu; +Cc: fdmanana, linux-btrfs At 03/08/2017 04:49 AM, Liu Bo wrote: > On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote: >> [BUG] >> When btrfs_reloc_clone_csum() reports error, it can underflow metadata >> and leads to kernel assertion on outstanding extents in >> run_delalloc_nocow() and cow_file_range(). >> >> BTRFS info (device vdb5): relocating block group 12582912 flags data >> BTRFS info (device vdb5): found 1 extents >> assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 >> >> Currently, due to another bug blocking ordered extents, the bug is only >> reproducible under certain block group layout and using error injection. >> >> a) Create one data block group with one 4K extent in it. >> To avoid the bug that hangs btrfs due to ordered extent which never >> finishes >> b) Make btrfs_reloc_clone_csum() always fail >> c) Relocate that block group >> >> [CAUSE] >> run_delalloc_nocow() and cow_file_range() handles error from >> btrfs_reloc_clone_csum() wrongly: >> >> (The ascii chart shows a more generic case of this bug other than the >> bug mentioned above) >> >> |<------------------ delalloc range --------------------------->| >> | OE 1 | OE 2 | ... | OE n | >> |<----------- cleanup range --------------->| >> |<----------- ----------->| >> \/ >> btrfs_finish_ordered_io() range >> >> So error handler, which calls extent_clear_unlock_delalloc() with >> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() >> will both cover OE n, and free its metadata, causing metadata under flow. >> >> [Fix] >> The fix is to ensure after calling btrfs_add_ordered_extent(), we only >> call error handler after increasing the iteration offset, so that >> cleanup range won't cover any created ordered extent. >> >> |<------------------ delalloc range --------------------------->| >> | OE 1 | OE 2 | ... | OE n | >> |<----------- ----------->|<---------- cleanup range --------->| >> \/ >> btrfs_finish_ordered_io() range >> >> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> >> --- >> changelog: >> v6: >> New, split from v5 patch, as this is a separate bug. >> --- >> fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ >> 1 file changed, 39 insertions(+), 12 deletions(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index b2bc07aad1ae..1d83d504f2e5 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, >> BTRFS_DATA_RELOC_TREE_OBJECTID) { >> ret = btrfs_reloc_clone_csums(inode, start, >> cur_alloc_size); >> + /* >> + * Only drop cache here, and process as normal. >> + * >> + * We must not allow extent_clear_unlock_delalloc() >> + * at out_unlock label to free meta of this ordered >> + * extent, as its meta should be freed by >> + * btrfs_finish_ordered_io(). >> + * >> + * So we must continue until @start is increased to >> + * skip current ordered extent. >> + */ >> if (ret) >> - goto out_drop_extent_cache; >> + btrfs_drop_extent_cache(BTRFS_I(inode), start, >> + start + ram_size - 1, 0); >> } >> >> btrfs_dec_block_group_reservations(fs_info, ins.objectid); >> >> - if (disk_num_bytes < cur_alloc_size) >> - break; >> - >> /* we're not doing compressed IO, don't unlock the first >> * page (which the caller expects to stay locked), don't >> * clear any dirty bits and don't set any writeback bits >> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, >> delalloc_end, locked_page, >> EXTENT_LOCKED | EXTENT_DELALLOC, >> op); >> - disk_num_bytes -= cur_alloc_size; >> + if (disk_num_bytes > cur_alloc_size) >> + disk_num_bytes = 0; >> + else >> + disk_num_bytes -= cur_alloc_size; > > I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size? Thanks for the catch, it should be disk_num_bytes < cur_alloc_size. I'll update it. Thanks, Qu > > Thanks, > > -liubo > >> num_bytes -= cur_alloc_size; >> alloc_hint = ins.objectid + ins.offset; >> start += cur_alloc_size; >> + >> + /* >> + * btrfs_reloc_clone_csums() error, since start is increased >> + * extent_clear_unlock_delalloc() at out_unlock label won't >> + * free metadata of current ordered extent, we're OK to exit. >> + */ >> + if (ret) >> + goto out_unlock; >> } >> out: >> return ret; >> @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, >> BUG_ON(ret); /* -ENOMEM */ >> >> if (root->root_key.objectid == >> - BTRFS_DATA_RELOC_TREE_OBJECTID) { >> + BTRFS_DATA_RELOC_TREE_OBJECTID) >> + /* >> + * Error handled later, as we must prevent >> + * extent_clear_unlock_delalloc() in error handler >> + * from freeing metadata of created ordered extent. >> + */ >> ret = btrfs_reloc_clone_csums(inode, cur_offset, >> num_bytes); >> - if (ret) { >> - if (!nolock && nocow) >> - btrfs_end_write_no_snapshoting(root); >> - goto error; >> - } >> - } >> >> extent_clear_unlock_delalloc(inode, cur_offset, >> cur_offset + num_bytes - 1, end, >> @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, >> if (!nolock && nocow) >> btrfs_end_write_no_snapshoting(root); >> cur_offset = extent_end; >> + >> + /* >> + * btrfs_reloc_clone_csums() error, now we're OK to call error >> + * handler, as metadata for created ordered extent will only >> + * be freed by btrfs_finish_ordered_io(). >> + */ >> + if (ret) >> + goto error; >> if (cur_offset > end) >> break; >> } >> -- >> 2.12.0 >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2017-03-08 1:23 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-03-06 2:55 [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo 2017-03-06 2:55 ` [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo 2017-03-06 23:18 ` Filipe Manana 2017-03-07 22:11 ` Liu Bo 2017-03-08 0:18 ` Qu Wenruo 2017-03-08 0:21 ` Filipe Manana 2017-03-08 0:26 ` Qu Wenruo 2017-03-06 23:19 ` [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana 2017-03-07 17:32 ` David Sterba 2017-03-07 20:51 ` Liu Bo 2017-03-07 20:49 ` Liu Bo 2017-03-07 20:59 ` Liu Bo 2017-03-08 0:17 ` Filipe Manana 2017-03-08 1:16 ` Liu Bo 2017-03-08 1:21 ` Qu Wenruo 2017-03-08 0:17 ` Qu Wenruo
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.