linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nikolay Borisov <nborisov@suse.com>
To: Chris Mason <clm@fb.com>
Cc: Kernel Team <Kernel-team@fb.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>, Tejun Heo <tj@kernel.org>,
	David Sterba <dsterba@suse.com>, "jack@suse.cz" <jack@suse.cz>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct
Date: Fri, 26 Jul 2019 18:29:00 +0300	[thread overview]
Message-ID: <c2419d01-5c84-3fb4-189e-4db519d08796@suse.com> (raw)
In-Reply-To: <C459F9B6-8154-4099-BC62-C3DCCAF56CE2@fb.com>



On 11.07.19 г. 22:52 ч., Chris Mason wrote:
> On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:
> 
>> On 10.07.19 г. 22:28 ч., Tejun Heo wrote:
>>> From: Chris Mason <clm@fb.com>
>>>
>>> The btrfs writepages function collects a large range of pages flagged
>>> for delayed allocation, and then sends them down through the COW code
>>> for processing.  When compression is on, we allocate one async_cow
>>
>> nit: The code no longer uses async_cow to represent in-flight chunks 
>> but
>> the more aptly named async_chunk. Presumably this patchset predates
>> those changes.
> 
> Not by much, but yes.
> 
>>
>>>
>>> The end result is that cowA is completed and cleaned up before cowB 
>>> even
>>> starts processing.  This means we can free locked_page() and reuse it
>>> elsewhere.  If we get really lucky, it'll have the same page->index 
>>> in
>>> its new home as it did before.
>>>
>>> While we're processing cowB, we might decide we need to fall back to
>>> uncompressed IO, and so compress_file_range() will call
>>> __set_page_dirty_nobufers() on cowB->locked_page.
>>>
>>> Without cgroups in use, this creates as a phantom dirty page, which> 
>>> isn't great but isn't the end of the world.  With cgroups in use, we
>>
>> Having a phantom dirty page is not great but not terrible without
>> cgroups but apart from that, does it have any other implications?
> 
> Best case, it'll go through the writepage fixup worker and go through 
> the whole cow machinery again.  Worst case we go to this code more than 
> once:
> 
>                          /*
>                           * if page_started, cow_file_range inserted an
>                           * inline extent and took care of all the 
> unlocking
>                           * and IO for us.  Otherwise, we need to submit
>                           * all those pages down to the drive.
>                           */
>                          if (!page_started && !ret)
>                                  extent_write_locked_range(inode,
>                                                    async_extent->start,
>                                                    async_extent->start +
>                                                    async_extent->ram_size 
> - 1,
>                                                    WB_SYNC_ALL);
>                          else if (ret)
>                                  unlock_page(async_chunk->locked_page);
> 
> 
> That never happened in production as far as I can tell, but it seems 
> possible.
> 
>>
>>
>>> might crash in the accounting code because page->mapping->i_wb isn't
>>> set.
>>>
>>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference 
>>> at 00000000000000d0
>>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 
>>> 0000000000026115
>>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 
>>> 0000000000000090
>>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 
>>> 0000000000000000
>>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: 
>>> ffffffffffffffff
>>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 
>>> 0000000000000001
>>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) 
>>> knlGS:0000000000000000
>>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 
>>> 0000000000360ee0
>>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>> 0000000000000000
>>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>> 0000000000000400
>>> [ 8308.594469] Call Trace:
>>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>>> [ 8308.619108]  evict+0xc1/0x190
>>> [ 8308.620023]  do_unlinkat+0x176/0x280
>>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> The fix here is to make asyc_cow->locked_page NULL everywhere but the
>>> one async_cow struct that's allowed to do things to the locked page.
>>>
>>> Signed-off-by: Chris Mason <clm@fb.com>
>>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and 
>>> reads")
>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>  fs/btrfs/extent_io.c |  2 +-
>>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 5106008f5e28..a31574df06aa 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct 
>>> address_space *mapping,
>>>  			if (page_ops & PAGE_SET_PRIVATE2)
>>>  				SetPagePrivate2(pages[i]);
>>>
>>> -			if (pages[i] == locked_page) {
>>> +			if (locked_page && pages[i] == locked_page) {
>>
>> Why not make the check just if (locked_page) then clean it up, since 
>> if
>> __process_pages_contig is called from the owner of the page then it's
>> guaranteed that the page will fall within it's range.
> 
> I'm not convinced that every single caller of __process_pages_contig is 
> making sure to only send locked_page for ranges that correspond to the 
> locked_page.  I'm not sure exactly what you're asking for though, it 
> looks like it would require some larger changes to the flow of that 
> loop.


What I meant it is to simply factor out the code dealing with locked
page outside of the loop and still place it inside
__process_pages_contig. Also looking at the way locked_pages is passed
across different call chains I arrive at:


compress_file_range  <-- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

submit_compressed_extents <---- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range | run_delalloc_nocow
 cow_file_range <--- [when called from btrfs_run_delalloc_range we are
all fine and dandy because it will always iterates a range which belongs
to the page. So we can free the page and set it null for subsequent
passes of the loop.]

Looking run_delalloc_nocow I see the page is used 5
times - 2 of those, at the beginning and end of the function, are only
used during error cases. The other 2 times is if cow_start is different
than -1 , which happens if !nocow is true. I've yet to wrap my head
around run_delalloc_nocow but I think it should also be safe to pass
locked page just once.

cow_file_range_async <--- always called with the correct locked page, in
this case the function is called before any async chunks are going to be
submitted.
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range <--- this one is called with locked_page
belonging to the passed delalloc range.
 run_delalloc_nocow
  extent_clear_unlock_delalloc
   __process_pages_contig


writepage_delalloc <-- calls find_lock_delalloc_range only if we aren't
caalled from compress path and the start range always belongs to the page
 find_lock_delalloc_range <----  if the range is not delalloc it will
retry. But that function is also called with the correct page.
  lock_delalloc_pages <--- ignores range which belongs only to this page
    __unlock_for_delaloc <--- ignores range which belongs only to this page




<snip>

  reply	other threads:[~2019-07-26 15:37 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-10 19:28 [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support Tejun Heo
2019-07-10 19:28 ` [PATCH 1/5] Btrfs: stop using btrfs_schedule_bio() Tejun Heo
2019-07-11 11:32   ` Nikolay Borisov
2019-07-10 19:28 ` [PATCH 2/5] Btrfs: delete the entire async bio submission framework Tejun Heo
2019-07-11 14:53   ` Nikolay Borisov
2019-07-10 19:28 ` [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct Tejun Heo
2019-07-11 16:00   ` Nikolay Borisov
2019-07-11 19:52     ` Chris Mason
2019-07-26 15:29       ` Nikolay Borisov [this message]
2019-07-10 19:28 ` [PATCH 4/5] Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios Tejun Heo
2019-07-10 19:28 ` [PATCH 5/5] Btrfs: extent_write_locked_range() should attach inode->i_wb Tejun Heo
2019-07-26 15:13 ` [PATCHSET v3 btrfs/for-next] btrfs: fix cgroup writeback support David Sterba
2019-09-05 11:59   ` David Sterba
2019-09-06 17:46     ` Tejun Heo
2019-10-02 14:15       ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c2419d01-5c84-3fb4-189e-4db519d08796@suse.com \
    --to=nborisov@suse.com \
    --cc=Kernel-team@fb.com \
    --cc=axboe@kernel.dk \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=jack@suse.cz \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).