From: Yafang Shao <laoar.shao@gmail.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH v3] xfs: avoid deadlock when trigger memory reclaim in ->writepages
Date: Tue, 16 Jun 2020 19:42:55 +0800
Message-ID: <CALOAHbA_yXDzzni7Pn5RUjSAyyGrniW9Aq1iJC4AsxuJ0Abgow@mail.gmail.com>
In-Reply-To: <20200616104806.GE9499@dhcp22.suse.cz>

On Tue, Jun 16, 2020 at 6:48 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 16-06-20 17:39:33, Yafang Shao wrote:
> > On Tue, Jun 16, 2020 at 5:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 16-06-20 17:05:25, Yafang Shao wrote:
> > > > On Tue, Jun 16, 2020 at 4:16 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Mon 15-06-20 07:56:21, Yafang Shao wrote:
> > > > > > Recently we hit an XFS deadlock on one of our servers running an old kernel.
> > > > > > The deadlock is caused by allocating memory in xfs_map_blocks() while
> > > > > > doing writeback on behalf of memory reclaim. Although this deadlock happened
> > > > > > on an old kernel, I think it could happen on upstream as well. The issue
> > > > > > only happened once and can't be reproduced, so I haven't tried to
> > > > > > reproduce it on an upstream kernel.
> > > > > >
> > > > > > Below is the call trace of this deadlock.
> > > > > > [480594.790087] INFO: task redis-server:16212 blocked for more than 120 seconds.
> > > > > > [480594.790087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > [480594.790088] redis-server    D ffffffff8168bd60     0 16212  14347 0x00000004
> > > > > > [480594.790090]  ffff880da128f070 0000000000000082 ffff880f94a2eeb0 ffff880da128ffd8
> > > > > > [480594.790092]  ffff880da128ffd8 ffff880da128ffd8 ffff880f94a2eeb0 ffff88103f9d6c40
> > > > > > [480594.790094]  0000000000000000 7fffffffffffffff ffff88207ffc0ee8 ffffffff8168bd60
> > > > > > [480594.790096] Call Trace:
> > > > > > [480594.790101]  [<ffffffff8168dce9>] schedule+0x29/0x70
> > > > > > [480594.790103]  [<ffffffff8168b749>] schedule_timeout+0x239/0x2c0
> > > > > > [480594.790111]  [<ffffffff8168d28e>] io_schedule_timeout+0xae/0x130
> > > > > > [480594.790114]  [<ffffffff8168d328>] io_schedule+0x18/0x20
> > > > > > [480594.790116]  [<ffffffff8168bd71>] bit_wait_io+0x11/0x50
> > > > > > [480594.790118]  [<ffffffff8168b895>] __wait_on_bit+0x65/0x90
> > > > > > [480594.790121]  [<ffffffff811814e1>] wait_on_page_bit+0x81/0xa0
> > > > > > [480594.790125]  [<ffffffff81196ad2>] shrink_page_list+0x6d2/0xaf0
> > > > > > [480594.790130]  [<ffffffff811975a3>] shrink_inactive_list+0x223/0x710
> > > > > > [480594.790135]  [<ffffffff81198225>] shrink_lruvec+0x3b5/0x810
> > > > > > [480594.790139]  [<ffffffff8119873a>] shrink_zone+0xba/0x1e0
> > > > > > [480594.790141]  [<ffffffff81198c20>] do_try_to_free_pages+0x100/0x510
> > > > > > [480594.790143]  [<ffffffff8119928d>] try_to_free_mem_cgroup_pages+0xdd/0x170
> > > > > > [480594.790145]  [<ffffffff811f32de>] mem_cgroup_reclaim+0x4e/0x120
> > > > > > [480594.790147]  [<ffffffff811f37cc>] __mem_cgroup_try_charge+0x41c/0x670
> > > > > > [480594.790153]  [<ffffffff811f5cb6>] __memcg_kmem_newpage_charge+0xf6/0x180
> > > > > > [480594.790157]  [<ffffffff8118c72d>] __alloc_pages_nodemask+0x22d/0x420
> > > > > > [480594.790162]  [<ffffffff811d0c7a>] alloc_pages_current+0xaa/0x170
> > > > > > [480594.790165]  [<ffffffff811db8fc>] new_slab+0x30c/0x320
> > > > > > [480594.790168]  [<ffffffff811dd17c>] ___slab_alloc+0x3ac/0x4f0
> > > > > > [480594.790204]  [<ffffffff81685656>] __slab_alloc+0x40/0x5c
> > > > > > [480594.790206]  [<ffffffff811dfc43>] kmem_cache_alloc+0x193/0x1e0
> > > > > > [480594.790233]  [<ffffffffa04fab67>] kmem_zone_alloc+0x97/0x130 [xfs]
> > > > > > [480594.790247]  [<ffffffffa04f90ba>] _xfs_trans_alloc+0x3a/0xa0 [xfs]
> > > > > > [480594.790261]  [<ffffffffa04f915c>] xfs_trans_alloc+0x3c/0x50 [xfs]
> > > > >
> > > > > Now, with more explanation from Dave, I have gone back to the original
> > > > > backtrace. I am not sure which kernel version you are using, but this path
> > > > > should have passed through xfs_trans_reserve, which sets PF_MEMALLOC_NOFS;
> > > > > that in turn should have made __alloc_pages_nodemask use __GFP_NOFS,
> > > > > and the memcg reclaim should never wait_on_page_writeback (presumably
> > > > > this is where the io_schedule is coming from).
> > > >
> > > > Hi Michal,
> > > >
> > > > The page is allocated before calling xfs_trans_reserve(), so
> > > > PF_MEMALLOC_NOFS hasn't been set yet.
> > > > See below,
> > > >
> > > > xfs_trans_alloc
> > > >     kmem_zone_zalloc() <<< GFP_NOFS isn't in effect yet, but this
> > > >                            allocation may trigger memory reclaim
> > > >     xfs_trans_reserve() <<< GFP_NOFS is set here (for kernels prior
> > > >                             to commit 9070733b4efac, it is PF_FSTRANS)
> > >
> > > You are right, I have misread the actual allocation side. 8683edb7755b8
> > > has added KM_NOFS and 73d30d48749f8 has removed it. I cannot really
> > > judge the correctness here.
> > >
> >
> > The history is complicated, but it doesn't matter.
> > Let's get back to the upstream kernel now. As I explained in the commit log,
> > xfs_vm_writepages
> >   -> iomap_writepages
> >      -> write_cache_pages
> >         -> lock_page <<<< This page is locked.
> >         -> writepages (which is iomap_do_writepage)
> >            -> xfs_map_blocks
> >               -> xfs_convert_blocks
> >                  -> xfs_bmapi_convert_delalloc
> >                     -> xfs_trans_alloc
> >                          -> kmem_zone_zalloc // It should allocate the page
> >                                              // with GFP_NOFS
> >
> > If GFP_NOFS isn't set in xfs_trans_alloc(), kmem_zone_zalloc() may
> > trigger memory reclaim, and the reclaim may then wait on the page locked
> > in write_cache_pages() ...
>
> This cannot happen because the memory reclaim backs off on locked pages.
> Have a look at trylock_page at the very beginning of the shrink_page_list
> loop. You are likely waiting on a different page which is not being
> flushed because of the FS ordering requirement or something like that.
>

Right, there must be multiple pages involved, some of which are already
under PG_writeback. While a new page from this set is being processed,
reclaim is triggered, and it then picks a page from the same set that is
already under PG_writeback and waits on it.
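
To make that concrete, below is a heavily simplified sketch (not the real
kernel code) of the shrink_page_list() behaviour you point at. The real
code has many more cases, and whether reclaim may wait on writeback here
depends on the reclaim context (kswapd vs. direct vs. memcg reclaim); I
have collapsed all of that into a single may_wait flag for illustration.

#include <linux/types.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Heavily simplified sketch of the shrink_page_list() flow discussed
 * above; NOT the real implementation.
 */
static void reclaim_pages_sketch(struct list_head *page_list, bool may_wait)
{
	struct page *page, *next;

	list_for_each_entry_safe(page, next, page_list, lru) {
		/*
		 * A locked page is skipped, so the page that is currently
		 * locked in write_cache_pages() is never touched here.
		 */
		if (!trylock_page(page))
			continue;

		if (PageWriteback(page)) {
			/*
			 * A page that is already under writeback may be
			 * waited on.  If it belongs to the same writeback
			 * batch as the task doing this reclaim, the wait
			 * can never finish: the page only completes once
			 * that task itself makes progress.
			 */
			if (may_wait)
				wait_on_page_writeback(page);
			unlock_page(page);
			continue;
		}

		/* ... otherwise try to unmap and free the page ... */
		unlock_page(page);
	}
}

So the page being waited on is not the one locked in write_cache_pages(),
but a sibling page that the same writeback batch has already marked
PG_writeback.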

> > That means the ->writepages should be set with GFP_NOFS to avoid this
> > recursive filesystem reclaim.
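
To illustrate what "set with GFP_NOFS" could look like on the upstream
kernel, here is a sketch using the scoped memalloc_nofs_save() /
memalloc_nofs_restore() API instead of touching individual allocation
sites. This is only an illustration of the mechanism, not the actual
patch, and example_write_cache_pages() is a made-up stand-in for the
real writeback work:

#include <linux/fs.h>
#include <linux/sched/mm.h>
#include <linux/writeback.h>

/* Hypothetical helper standing in for the real ->writepages body. */
int example_write_cache_pages(struct address_space *mapping,
			      struct writeback_control *wbc);

/*
 * Illustration only: scope PF_MEMALLOC_NOFS around the writeback work so
 * that every allocation on this path (including the transaction
 * allocation in xfs_trans_alloc()) behaves as if it were GFP_NOFS and
 * cannot recurse into filesystem reclaim.
 */
static int writepages_nofs_sketch(struct address_space *mapping,
				  struct writeback_control *wbc)
{
	unsigned int nofs_flag;
	int ret;

	nofs_flag = memalloc_nofs_save();
	ret = example_write_cache_pages(mapping, wbc);
	memalloc_nofs_restore(nofs_flag);

	return ret;
}

Where exactly such a scope should live (in ->writepages, or deeper in the
transaction allocation path) is of course the question being discussed
here; the sketch only shows the mechanism.
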
>
> --
> Michal Hocko
> SUSE Labs



-- 
Thanks
Yafang
