From: Nikolay Borisov <nborisov@suse.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, Qu Wenruo <wqu@suse.de>,
David Sterba <DSterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents
Date: Mon, 29 Oct 2018 14:21:46 +0200 [thread overview]
Message-ID: <94c6c66e-bdca-8637-2970-d20aa3cd383b@suse.com> (raw)
In-Reply-To: <2b939b6c-7606-13ea-bfa0-a709fb634ae5@suse.com>
On 29.10.18 г. 9:51 ч., Nikolay Borisov wrote:
>
>
> On 29.10.18 г. 7:53 ч., Qu Wenruo wrote:
>> [snip]
>>>> The cause sounds valid, however would you please explain more about how
>>>> such cleaning on unrelated delalloc range happens?
>>>
>>> So in my case the following happened - 2 block groups were created as
>>> delalloc ranges in the - 0-1m and 1m-128m. Their respective pages were
>>> dirtied, so when page 0 - 0-4k when into writepage_delalloc,
>>> find_lock_delalloc_range would return the range 0-1m. So the call to
>>> fill_delalloc instantiates OE 0-1m and writeback continues as expected.
>>>
>>> Now, when the 2nd page from range 0-1m is again set for writeback and
>>> find_lock_delalloc_range is called with delalloc_start == 4096 it will
>>> actually return the range 1m-128m.
>>
>> IMHO this looks strange and may need extra investigation.
>>
>> Normally I would expect it returns the range 0~1M or 4K~1M.
>
> It cannot return 4k-1m since the writeback for the first page has
> already dealt with this range. Also take a look in writepage_delalloc
> how find_lock_delalloc_range is called : for 'start' it's passed the
> page offset, calculated in __extent_writepage. And when
> find_delalloc_range is called it just searches for an extent which ends
> after passed start value. In find_delalloc_range first tree_search
> is called which returns the 1m-128m range, then we go in the while(1)
> loop on the first itertaion found is 0 so *end is populated with 128m ,
> found is set to 1 and *start is set to 1m.
>
> On the second iteration the check if (found && (state->start !=
> cur_start || (state->state & EXTENT_BOUNDARY)))
>
> is triggered since the next extent found will have EXTENT_BOUNDARY since
> it will be the next block group from relocation. EXTENT_BOUNDARY will be
> set from relocate_file_extent_cluster' main loop:
>
> if (nr < cluster->nr &&
>
> page_start + offset == cluster->boundary[nr]) {
>
> set_extent_bits(&BTRFS_I(inode)->io_tree,
>
> page_start, page_end,
>
> EXTENT_BOUNDARY);
>
> nr++;
>
> }
So it seems I was wrong w.r.t to sequence of events that result in the extra extent being returned.
Here is what I got after further investigation. First let's look at the relocation side:
btrfs-4018 [001] .... 186.783244: relocate_file_extent_cluster: Setting EXTENT_BOUNDARY for page: 0000000074cc47c4 page_offset: 0 end: 4095 <- first page of range 0-1m
btrfs-4018 [001] .... 186.783248: relocate_file_extent_cluster: Setting DELALLOC for page: 0000000074cc47c4 page_offset: 0 block group:1104150528
btrfs-4018 [001] .... 186.783286: relocate_file_extent_cluster: Setting DELALLOC for page: 000000004a28475a page_offset: 4096 block group:1104150528 <- 2nd page of range 0-1m
btrfs-4018 [001] .... 186.784855: relocate_file_extent_cluster: Setting EXTENT_BOUNDARY for page: 00000000f58f50dc page_offset: 1048576 end: 1052671 < - 1st page of range 1m-128m
Mind the addresses of the given pages, they are all predicated on btrfs_ino == 260, which is the ino for relocation inode.
So the assumption is that when writing back page 000000004a28475a we cannot really be processing range >1m since it will
be outside of the range for the page, but this is not the case. Now on the writeback side:
kworker/u12:1-68 [002] .... 188.100471: find_lock_delalloc_range: Processing delalloc range: 0 - 1048575 for page: 0000000074cc47c4 < - writeback for first page of range 0-1m happens
so we naturally instantiate this range.
kworker/u12:1-68 [002] ...1 188.106523: find_delalloc_range.constprop.25: 1213015261107184058: Got start: 1048576 end 1052671 <- This is output from find_delalloc_range when start (the offset of the passed page to find_lock_delalloc_range is 4096 i.e it's the 2nd page for range 0-1m). So we find 1m - 1m + 4k on the first iteration of the loop in find_delalloc_range, at this point *start = 1048576 and *end = 1052671 and cached_state = the extent _state representing this first 4k range of the larger 1m-128m range.
kworker/u12:1-68 [002] ...1 188.106526: find_delalloc_range.constprop.25: 1213015261107184058: Got start: 1052672 end 135266303 - We loop for the second time, this time we find the 1m+4k - 128m range and actually trigger the (total_bytes >= max_bytes) check
kworker/u12:1-68 [002] ...1 188.106526: find_delalloc_range.constprop.25: 1213015261107184058: returning from total_bytes >= max_bytes
kworker/u12:1-68 [002] .... 188.106528: find_delalloc_range.constprop.25: 1213015261107184058: Returning cached_state->start: 1048576 cached_state->end: 1052671 *start = 1048576 *end = 135266303 <--- this is what is returned from find_delalloc_range - cached_state and *start and *end differ, since *end was set to 128m right before we hit the total_bytes check.
kworker/u12:1-68 [002] .... 188.106529: find_lock_delalloc_range: Processing delalloc range: 1048576 - 135266303 for page: 000000004a28475a <--- after we return from find_delalloc_range this is what
fin_lock_delalloc_range is processing - the next delalloc range 1m-128m but the page passed is indeed the 2nd page for the previous range as evident from the address 000000004a28475a
kworker/u12:1-68 [002] .... 188.113870: run_delalloc_nocow: btrfs_reloc_clone_csums err for OE[1048576 134217728] page->offset: 4096 end: 8191 <--- here we just create the OE in run_delalloc_nocow.
The only thing which I'm now struggling to rationalize is why find_delalloc_range finds an extent_state in the range 1048576-1052671 and then finds another range 1052672 - 135266303, shouldn't in fact it
find just 1048576 - 135266303. Perhaps it's due to the search being performed in the io_extent tree and not the extent_tree.
Here is the data_reloc tree :
item 2 key (260 INODE_ITEM 0) itemoff 15951 itemsize 160
generation 1 transid 30 size 940572672 nbytes 940572672
block group 0 mode 100600 links 0 uid 0 gid 0 rdev 0
sequence 8 flags 0x18(NOCOMPRESS|PREALLOC)
atime 0.0 (1970-01-01 00:00:00)
ctime 1540814915.362521123 (2018-10-29 12:08:35)
mtime 0.0 (1970-01-01 00:00:00)
otime 0.0 (1970-01-01 00:00:00)
item 3 key (260 EXTENT_DATA 0) itemoff 15898 itemsize 53
generation 28 type 2 (prealloc)
prealloc data disk byte 3872391168 nr 1048576
prealloc data offset 0 nr 1048576
item 4 key (260 EXTENT_DATA 1048576) itemoff 15845 itemsize 53
generation 28 type 2 (prealloc)
prealloc data disk byte 3873439744 nr 134217728
prealloc data offset 0 nr 134217728
as you can see we have a 0- 1048576 extent and 1048576-134217728. Those correspond to the following data block group being relocated:
item 139 key (1104150528 EXTENT_ITEM 1048576) itemoff 11661 itemsize 53
refs 1 gen 7 flags DATA
extent data backref root FS_TREE objectid 257 offset 0 count 1
item 140 key (1104150528 BLOCK_GROUP_ITEM 1073741824) itemoff 11637 itemsize 24
block group used 940572672 chunk_objectid 256 flags DATA|RAID1
item 141 key (1105199104 EXTENT_ITEM 134217728) itemoff 11584 itemsize 53
refs 1 gen 9 flags DATA
extent data backref root FS_TREE objectid 258 offset 0 count 1
<snip>
next prev parent reply other threads:[~2018-10-29 12:21 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-10-26 11:41 [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents Nikolay Borisov
2018-10-26 11:53 ` Qu Wenruo
2018-10-26 12:02 ` Nikolay Borisov
2018-10-29 5:53 ` Qu Wenruo
2018-10-29 7:51 ` Nikolay Borisov
2018-10-29 12:21 ` Nikolay Borisov [this message]
2018-10-29 14:39 ` Nikolay Borisov
2018-11-20 19:00 ` Josef Bacik
2018-11-20 21:09 ` Nikolay Borisov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=94c6c66e-bdca-8637-2970-d20aa3cd383b@suse.com \
--to=nborisov@suse.com \
--cc=DSterba@suse.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
--cc=wqu@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.