From: Shawn Bohrer <sbohrer@cloudflare.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ivan Babrou <ivan@cloudflare.com>, linux-xfs@vger.kernel.org
Subject: Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
Date: Thu, 29 Nov 2018 08:36:48 -0600
Message-ID: <20181129143647.GA8246@sbohrer-cf-dell>
In-Reply-To: <20181129021800.GQ6311@dastard>

Hi Dave,

I've got a few follow-up questions below based on your response.

On Thu, Nov 29, 2018 at 01:18:00PM +1100, Dave Chinner wrote:
> On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > The catalyst of our issue is terrible disks. It's not uncommon to see
> > the following stack in hung task detector:
> > 
> > Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for
> > more than 10 seconds.
> > Nov 15 21:55:13 21m21 kernel:       Tainted: G           O
> > 4.14.59-cloudflare-2018.7.5 #1
> > Nov 15 21:55:13 21m21 kernel: "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Nov 15 21:55:13 21m21 kernel: some-task     D11792 156314 156183 0x00000080
> > Nov 15 21:55:13 21m21 kernel: Call Trace:
> > Nov 15 21:55:13 21m21 kernel:  ? __schedule+0x21a/0x820
> > Nov 15 21:55:13 21m21 kernel:  schedule+0x28/0x80
> > Nov 15 21:55:13 21m21 kernel:  schedule_preempt_disabled+0xa/0x10
> > Nov 15 21:55:13 21m21 kernel:  __mutex_lock.isra.2+0x16a/0x490
> > Nov 15 21:55:13 21m21 kernel:  ? xfs_reclaim_inodes_ag+0x265/0x2d0
> > Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_ag+0x265/0x2d0
> > Nov 15 21:55:13 21m21 kernel:  ? kmem_cache_alloc+0x14d/0x1b0
> > Nov 15 21:55:13 21m21 kernel:  ? radix_tree_gang_lookup_tag+0xc4/0x130
> > Nov 15 21:55:13 21m21 kernel:  ? __list_lru_walk_one.isra.5+0x33/0x130
> > Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_nr+0x31/0x40
> > Nov 15 21:55:13 21m21 kernel:  super_cache_scan+0x156/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  shrink_slab.part.51+0x1d2/0x3a0
> > Nov 15 21:55:13 21m21 kernel:  shrink_node+0x113/0x2e0
> > Nov 15 21:55:13 21m21 kernel:  do_try_to_free_pages+0xb3/0x310
> > Nov 15 21:55:13 21m21 kernel:  try_to_free_pages+0xd2/0x190
> > Nov 15 21:55:13 21m21 kernel:  __alloc_pages_slowpath+0x3a3/0xdc0
> > Nov 15 21:55:13 21m21 kernel:  ? ip_output+0x5c/0xc0
> > Nov 15 21:55:13 21m21 kernel:  ? update_curr+0x141/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  __alloc_pages_nodemask+0x223/0x240
> > Nov 15 21:55:13 21m21 kernel:  skb_page_frag_refill+0x93/0xb0
> > Nov 15 21:55:13 21m21 kernel:  sk_page_frag_refill+0x19/0x80
> > Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg_locked+0x247/0xdc0
> > Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg+0x27/0x40
> > Nov 15 21:55:13 21m21 kernel:  sock_sendmsg+0x36/0x40
> > Nov 15 21:55:13 21m21 kernel:  sock_write_iter+0x84/0xd0
> > Nov 15 21:55:13 21m21 kernel:  __vfs_write+0xdd/0x140
> > Nov 15 21:55:13 21m21 kernel:  vfs_write+0xad/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  SyS_write+0x42/0x90
> > Nov 15 21:55:13 21m21 kernel:  do_syscall_64+0x60/0x110
> > Nov 15 21:55:13 21m21 kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > 
> > Here "some-task" is trying to send some bytes over the network and
> > it's stuck in direct reclaim.  Naturally, kswapd is not keeping up
> > with its duties.
> 
> That's not kswapd causing the problem here, that's direct reclaim.

Understood, the trace above is direct reclaim.  When this happens,
kswapd is also blocked, as shown below, and as you can imagine, many
other tasks get stuck in direct reclaim as well.

[Thu Nov 15 21:52:06 2018] INFO: task kswapd0:1061 blocked for more than 10 seconds.
[Thu Nov 15 21:52:06 2018]       Tainted: G           O    4.14.59-cloudflare-2018.7.5 #1
[Thu Nov 15 21:52:06 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 15 21:52:06 2018] kswapd0         D12848  1061      2 0x80000000
[Thu Nov 15 21:52:06 2018] Call Trace:
[Thu Nov 15 21:52:06 2018]  ? __schedule+0x21a/0x820
[Thu Nov 15 21:52:06 2018]  schedule+0x28/0x80
[Thu Nov 15 21:52:06 2018]  io_schedule+0x12/0x40
[Thu Nov 15 21:52:06 2018]  __xfs_iflock+0xe3/0x100
[Thu Nov 15 21:52:06 2018]  ? bit_waitqueue+0x30/0x30
[Thu Nov 15 21:52:06 2018]  xfs_reclaim_inode+0x141/0x300
[Thu Nov 15 21:52:06 2018]  xfs_reclaim_inodes_ag+0x19d/0x2d0
[Thu Nov 15 21:52:06 2018]  xfs_reclaim_inodes_nr+0x31/0x40
[Thu Nov 15 21:52:07 2018]  super_cache_scan+0x156/0x1a0
[Thu Nov 15 21:52:07 2018]  shrink_slab.part.51+0x1d2/0x3a0
[Thu Nov 15 21:52:07 2018]  shrink_node+0x113/0x2e0
[Thu Nov 15 21:52:07 2018]  kswapd+0x28a/0x6d0
[Thu Nov 15 21:52:07 2018]  ? mem_cgroup_shrink_node+0x150/0x150
[Thu Nov 15 21:52:07 2018]  kthread+0x119/0x130
[Thu Nov 15 21:52:07 2018]  ? kthread_create_on_node+0x40/0x40
[Thu Nov 15 21:52:07 2018]  ret_from_fork+0x35/0x40

> > One solution to this is to avoid direct reclaim by keeping more
> > free pages with vm.watermark_scale_factor, but I'd like to set that
> > aside and argue that we're going to hit direct reclaim at some
> > point anyway.
> 
> Right, but the problem is that the mm/ subsystem allows effectively
> unbound direct reclaim concurrency. At some point, having tens to
> hundreds of direct reclaimers all trying to write dirty inodes to
> disk causes catastrophic IO breakdown and everything grinds to a
> halt forever. We have to prevent that breakdown from occurring.
> 
> i.e. we have to throttle direct reclaim /somewhere/ before it
> reaches IO breakdown. The memory reclaim subsystem does not do it,
> so we have to do it in XFS itself. The problem here is that if we
> ignore direct reclaim (i.e. do nothing rather than block waiting on
> reclaim progress) then the mm/ reclaim algorithms will eventually
> think they aren't making progress and unleash the OOM killer.
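
For context for others reading along, my loose understanding of where
that throttling happens (sketched from memory, not verbatim source)
is the per-AG reclaim mutex in fs/xfs/xfs_icache.c; it matches the
__mutex_lock frame in the first trace:

/*
 * Loose sketch of the ~4.14 xfs_reclaim_inodes_ag() locking, from
 * memory -- simplified, not verbatim.  Each AG has a reclaim mutex.
 * Trylock callers skip AGs that another reclaimer already owns, but
 * when AGs were skipped and SYNC_WAIT is set the walk restarts in
 * blocking mode, so direct reclaimers queue up behind whichever
 * task is currently flushing inodes to disk.
 */
static int
xfs_reclaim_inodes_ag(struct xfs_mount *mp, int flags, int *nr_to_scan)
{
	struct xfs_perag *pag;	/* from xfs_perag_get_tag() in the real loop */
	int		trylock = flags & SYNC_TRYLOCK;
	int		skipped;

restart:
	skipped = 0;
	/* for each AG tagged with reclaimable inodes: */
	if (trylock) {
		if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
			skipped++;	/* busy; move to the next AG */
			goto next;
		}
	} else {
		mutex_lock(&pag->pag_ici_reclaim_lock);	/* sleeps here */
	}
	/* ... xfs_reclaim_inode() each tagged inode, issuing IO ... */
	mutex_unlock(&pag->pag_ici_reclaim_lock);
next:
	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
		trylock = 0;	/* second pass blocks on the mutexes */
		goto restart;
	}
	return 0;
}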

Here is my naive question: why does kswapd block?  Wouldn't it make
more sense for kswapd to kick off xfs_reclaim_inodes asynchronously
and then move on to other memory (perhaps page cache) that it can
free easily?
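
To make that concrete, here is a hypothetical variant (illustration
only, not a proposed or tested patch) of the shrinker entry point.
If I'm reading the 4.14 source right, xfs_reclaim_inodes_nr() already
kicks the background worker and pushes the AIL before scanning, but it
then scans with SYNC_TRYLOCK | SYNC_WAIT, and SYNC_WAIT is where
kswapd ends up sleeping:

/*
 * Hypothetical, untested illustration -- not a proposed patch.  The
 * real 4.14 xfs_reclaim_inodes_nr() does the two async kicks below
 * but then passes SYNC_TRYLOCK | SYNC_WAIT to the AG walk.  The
 * question is whether the kswapd/shrinker path could drop SYNC_WAIT
 * and lean on the background worker instead:
 */
int
xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan)
{
	/* kick the background inode reclaim worker */
	xfs_reclaim_work_queue(mp);
	/* push dirty metadata so flushing inodes become reclaimable */
	xfs_ail_push_all(mp->m_ail);

	/* scan without sleeping; skip inodes that are busy or under IO */
	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
}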

In my mind this might prevent us from ever reaching direct reclaim in
the first place.  And if we did get there, then yes, I can see that
you might need to synchronously block all tasks entering
xfs_reclaim_inodes from direct reclaim to prevent a thundering-herd
problem.

My other question is: why do the mm/ reclaim algorithms think they
need to force this metadata reclaim in the first place?  I think
Ivan's main point was that we have ~95GB of page cache, maybe 2-3GB of
total slab memory in use, and maybe 1GB of dirty pages.  Blocking the
world on disk I/O at that point seems insane when there is other,
quickly freeable memory available.  I assume the answer is the LRU?
Our page cache pages are newer or more frequently accessed than this
filesystem metadata?
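
My rough reading of the ~4.14 shrinker accounting in mm/vmscan.c
(paraphrased from memory, details may be off) is that slab caches are
scanned in the same proportion as the pagecache LRUs were scanned
that pass -- there is no cross-cache LRU comparing the age of page
cache pages against the age of cached inodes:

/*
 * Rough paraphrase of the ~4.14 do_shrink_slab() target math in
 * mm/vmscan.c -- simplified, may differ in your exact kernel.
 * The shrinker is asked to scan a slice of its freeable objects
 * proportional to the fraction of eligible LRU pages scanned.
 */
static unsigned long
shrink_target(unsigned long freeable,		/* objects the shrinker reports freeable */
	      unsigned long nr_scanned,		/* LRU pages scanned this pass */
	      unsigned long nr_eligible,	/* LRU pages eligible this pass */
	      int seeks)			/* recreate cost; DEFAULT_SEEKS == 2 */
{
	unsigned long long delta;

	delta = 4ULL * nr_scanned;
	delta /= seeks;
	delta *= freeable;
	delta /= nr_eligible + 1;
	return delta;	/* objects the shrinker will be asked to scan */
}

If that's roughly right, then with 95GB of page cache under scan the
XFS inode shrinker gets asked to scan a proportional slice of its
cache on every pass regardless of relative age, and whether that
blocks on IO is entirely up to the shrinker itself.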

Thanks,
Shawn
