Re: [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct

From: Nikolay Borisov <nborisov@suse.com>
To: Chris Mason <clm@fb.com>
Cc: Kernel Team <Kernel-team@fb.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>, Tejun Heo <tj@kernel.org>,
	David Sterba <dsterba@suse.com>, "jack@suse.cz" <jack@suse.cz>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct
Date: Fri, 26 Jul 2019 18:29:00 +0300	[thread overview]
Message-ID: <c2419d01-5c84-3fb4-189e-4db519d08796@suse.com> (raw)
In-Reply-To: <C459F9B6-8154-4099-BC62-C3DCCAF56CE2@fb.com>

On 11.07.19 г. 22:52 ч., Chris Mason wrote:
> On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:
> 
>> On 10.07.19 г. 22:28 ч., Tejun Heo wrote:
>>> From: Chris Mason <clm@fb.com>
>>>
>>> The btrfs writepages function collects a large range of pages flagged
>>> for delayed allocation, and then sends them down through the COW code
>>> for processing.  When compression is on, we allocate one async_cow
>>
>> nit: The code no longer uses async_cow to represent in-flight chunks 
>> but
>> the more aptly named async_chunk. Presumably this patchset predates
>> those changes.
> 
> Not by much, but yes.
> 
>>
>>>
>>> The end result is that cowA is completed and cleaned up before cowB 
>>> even
>>> starts processing.  This means we can free locked_page() and reuse it
>>> elsewhere.  If we get really lucky, it'll have the same page->index 
>>> in
>>> its new home as it did before.
>>>
>>> While we're processing cowB, we might decide we need to fall back to
>>> uncompressed IO, and so compress_file_range() will call
>>> __set_page_dirty_nobufers() on cowB->locked_page.
>>>
>>> Without cgroups in use, this creates as a phantom dirty page, which> 
>>> isn't great but isn't the end of the world.  With cgroups in use, we
>>
>> Having a phantom dirty page is not great but not terrible without
>> cgroups but apart from that, does it have any other implications?
> 
> Best case, it'll go through the writepage fixup worker and go through 
> the whole cow machinery again.  Worst case we go to this code more than 
> once:
> 
>                          /*
>                           * if page_started, cow_file_range inserted an
>                           * inline extent and took care of all the 
> unlocking
>                           * and IO for us.  Otherwise, we need to submit
>                           * all those pages down to the drive.
>                           */
>                          if (!page_started && !ret)
>                                  extent_write_locked_range(inode,
>                                                    async_extent->start,
>                                                    async_extent->start +
>                                                    async_extent->ram_size 
> - 1,
>                                                    WB_SYNC_ALL);
>                          else if (ret)
>                                  unlock_page(async_chunk->locked_page);
> 
> 
> That never happened in production as far as I can tell, but it seems 
> possible.
> 
>>
>>
>>> might crash in the accounting code because page->mapping->i_wb isn't
>>> set.
>>>
>>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference 
>>> at 00000000000000d0
>>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 
>>> 0000000000026115
>>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 
>>> 0000000000000090
>>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 
>>> 0000000000000000
>>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: 
>>> ffffffffffffffff
>>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 
>>> 0000000000000001
>>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) 
>>> knlGS:0000000000000000
>>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 
>>> 0000000000360ee0
>>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>> 0000000000000000
>>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>> 0000000000000400
>>> [ 8308.594469] Call Trace:
>>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>>> [ 8308.619108]  evict+0xc1/0x190
>>> [ 8308.620023]  do_unlinkat+0x176/0x280
>>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> The fix here is to make asyc_cow->locked_page NULL everywhere but the
>>> one async_cow struct that's allowed to do things to the locked page.
>>>
>>> Signed-off-by: Chris Mason <clm@fb.com>
>>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and 
>>> reads")
>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>  fs/btrfs/extent_io.c |  2 +-
>>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 5106008f5e28..a31574df06aa 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct 
>>> address_space *mapping,
>>>  			if (page_ops & PAGE_SET_PRIVATE2)
>>>  				SetPagePrivate2(pages[i]);
>>>
>>> -			if (pages[i] == locked_page) {
>>> +			if (locked_page && pages[i] == locked_page) {
>>
>> Why not make the check just if (locked_page) then clean it up, since 
>> if
>> __process_pages_contig is called from the owner of the page then it's
>> guaranteed that the page will fall within it's range.
> 
> I'm not convinced that every single caller of __process_pages_contig is 
> making sure to only send locked_page for ranges that correspond to the 
> locked_page.  I'm not sure exactly what you're asking for though, it 
> looks like it would require some larger changes to the flow of that 
> loop.

What I meant it is to simply factor out the code dealing with locked
page outside of the loop and still place it inside
__process_pages_contig. Also looking at the way locked_pages is passed
across different call chains I arrive at:

compress_file_range  <-- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

submit_compressed_extents <---- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range | run_delalloc_nocow
 cow_file_range <--- [when called from btrfs_run_delalloc_range we are
all fine and dandy because it will always iterates a range which belongs
to the page. So we can free the page and set it null for subsequent
passes of the loop.]

Looking run_delalloc_nocow I see the page is used 5
times - 2 of those, at the beginning and end of the function, are only
used during error cases. The other 2 times is if cow_start is different
than -1 , which happens if !nocow is true. I've yet to wrap my head
around run_delalloc_nocow but I think it should also be safe to pass
locked page just once.

cow_file_range_async <--- always called with the correct locked page, in
this case the function is called before any async chunks are going to be
submitted.
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range <--- this one is called with locked_page
belonging to the passed delalloc range.
 run_delalloc_nocow
  extent_clear_unlock_delalloc
   __process_pages_contig

writepage_delalloc <-- calls find_lock_delalloc_range only if we aren't
caalled from compress path and the start range always belongs to the page
 find_lock_delalloc_range <----  if the range is not delalloc it will
retry. But that function is also called with the correct page.
  lock_delalloc_pages <--- ignores range which belongs only to this page
    __unlock_for_delaloc <--- ignores range which belongs only to this page

<snip>