* Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
@ 2018-11-29  0:36 Ivan Babrou
  2018-11-29  2:18 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Ivan Babrou @ 2018-11-29  0:36 UTC (permalink / raw)
  To: linux-xfs; +Cc: Shawn Bohrer

Hello,

We're experiencing some interesting issues with memory reclaim, both
kswapd and direct reclaim.

A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
Page cache is around 95GB and dirty pages hover around 50MB, rarely
jumping up to 1GB.

The catalyst of our issue is terrible disks. It's not uncommon to see
the following stack in hung task detector:

Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for
more than 10 seconds.
Nov 15 21:55:13 21m21 kernel:       Tainted: G           O
4.14.59-cloudflare-2018.7.5 #1
Nov 15 21:55:13 21m21 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 21:55:13 21m21 kernel: some-task     D11792 156314 156183 0x00000080
Nov 15 21:55:13 21m21 kernel: Call Trace:
Nov 15 21:55:13 21m21 kernel:  ? __schedule+0x21a/0x820
Nov 15 21:55:13 21m21 kernel:  schedule+0x28/0x80
Nov 15 21:55:13 21m21 kernel:  schedule_preempt_disabled+0xa/0x10
Nov 15 21:55:13 21m21 kernel:  __mutex_lock.isra.2+0x16a/0x490
Nov 15 21:55:13 21m21 kernel:  ? xfs_reclaim_inodes_ag+0x265/0x2d0
Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_ag+0x265/0x2d0
Nov 15 21:55:13 21m21 kernel:  ? kmem_cache_alloc+0x14d/0x1b0
Nov 15 21:55:13 21m21 kernel:  ? radix_tree_gang_lookup_tag+0xc4/0x130
Nov 15 21:55:13 21m21 kernel:  ? __list_lru_walk_one.isra.5+0x33/0x130
Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_nr+0x31/0x40
Nov 15 21:55:13 21m21 kernel:  super_cache_scan+0x156/0x1a0
Nov 15 21:55:13 21m21 kernel:  shrink_slab.part.51+0x1d2/0x3a0
Nov 15 21:55:13 21m21 kernel:  shrink_node+0x113/0x2e0
Nov 15 21:55:13 21m21 kernel:  do_try_to_free_pages+0xb3/0x310
Nov 15 21:55:13 21m21 kernel:  try_to_free_pages+0xd2/0x190
Nov 15 21:55:13 21m21 kernel:  __alloc_pages_slowpath+0x3a3/0xdc0
Nov 15 21:55:13 21m21 kernel:  ? ip_output+0x5c/0xc0
Nov 15 21:55:13 21m21 kernel:  ? update_curr+0x141/0x1a0
Nov 15 21:55:13 21m21 kernel:  __alloc_pages_nodemask+0x223/0x240
Nov 15 21:55:13 21m21 kernel:  skb_page_frag_refill+0x93/0xb0
Nov 15 21:55:13 21m21 kernel:  sk_page_frag_refill+0x19/0x80
Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg_locked+0x247/0xdc0
Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg+0x27/0x40
Nov 15 21:55:13 21m21 kernel:  sock_sendmsg+0x36/0x40
Nov 15 21:55:13 21m21 kernel:  sock_write_iter+0x84/0xd0
Nov 15 21:55:13 21m21 kernel:  __vfs_write+0xdd/0x140
Nov 15 21:55:13 21m21 kernel:  vfs_write+0xad/0x1a0
Nov 15 21:55:13 21m21 kernel:  SyS_write+0x42/0x90
Nov 15 21:55:13 21m21 kernel:  do_syscall_64+0x60/0x110
Nov 15 21:55:13 21m21 kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Here "some-task" is trying to send some bytes over network and it's
stuck in direct reclaim. Naturally, kswapd is not keeping up with its
duties.

It seems to me that our terrible disks sometimes take a pause to think
about the meaning of life for a few seconds. During that time the XFS
shrinker is stuck, which drives the whole system out of free memory
and in turn triggers direct reclaim.

One solution to this is to not go into direct reclaim by keeping more
free pages with vm.watermark_scale_factor,  but I'd like to discard
this and argue that we're going to hit direct reclaim at some point
anyway.
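
(For clarity, the knob I mean is the one below; the value is only an
illustration, not something we've settled on:

$ sudo sysctl -w vm.watermark_scale_factor=500

but as said, that only pushes direct reclaim further away, it doesn't
avoid it.)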

The solution I have in mind for this is not to try to write anything
to (disastrously terrible) storage in shrinkers. We have 95GB of page
cache readily available for reclaim and it seems a lot cheaper to grab
that.

That brings me to the first question around the memory subsystem: are
shrinkers supposed to flush any dirty data? My gut feeling is that
they should not do that, because there's already a writeback mechanism
with its own tunables for limits to take care of that. If a system runs
out of memory reclaimable without IO and dirty pages are under the limit,
it's totally fair to OOM somebody.
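
(The writeback tunables I have in mind are the usual dirty page knobs,
e.g.:

$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs

which already put a bound on how much dirty page cache can build up.)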

It's totally possible that I'm wrong about this feeling, but either
way I think docs need an update on this matter:

* https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt

  nr_cached_objects: called by the sb cache shrinking function for the
  filesystem to return the number of freeable cached objects it contains.

My second question is conditional on the first one: if filesystems are
supposed to flush dirty data in response to shrinkers, then how can I
stop this, given my knowledge about the combination of lots of available
page cache and terrible disks?

I've tried two things to address this problem ad-hoc.

1. Run the following systemtap script to trick shrinkers into thinking
that XFS has nothing to free:

probe kernel.function("xfs_fs_nr_cached_objects").return {
  $return = 0
}
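
For reference, modifying the return value needs stap's guru mode; I ran
it roughly like this, with the script saved under an arbitrary name:

$ sudo stap -g xfs-nr-cached-zero.stp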

That did the job and shrink_node latency dropped considerably, while
calls to xfs_fs_free_cached_objects disappeared.

2. Use vm.vfs_cache_pressure to do the same thing. This failed
miserably, because of the following code in super_cache_count:

  if (sb->s_op && sb->s_op->nr_cached_objects)
    total_objects = sb->s_op->nr_cached_objects(sb, sc);

  total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
  total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

  total_objects = vfs_pressure_ratio(total_objects);
  return total_objects;

XFS was doing its job cleaning up inodes with the background mechanisms
it has (m_reclaim_workqueue), but the kernel also stopped cleaning up
readily available inodes after XFS.
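
(For illustration, setting the knob to 0:

$ sudo sysctl -w vm.vfs_cache_pressure=0

doesn't only affect XFS: vfs_pressure_ratio() in the snippet above scales
the whole total by vfs_cache_pressure/100, so the entire superblock
shrinker count - dentry and inode LRUs included - goes to zero.)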

I'm not a kernel hacker and to be honest with you I don't even
understand all the nuances here. All I know is:

1. I have lots of page cache and terrible disks.
2. I want to reclaim page cache and never touch disks in response to
memory reclaim.
3. Direct reclaim will happen at some point, somebody will want a big
chunk of memory all at once.
4. I'm probably ok with reclaiming clean xfs inodes synchronously in
reclaim path.

This brings me to my final question: what should I do to avoid latency
in reclaim (direct or kswapd)?

To reiterate the importance of this issue: we see interactive
applications that do zero IO of their own stall for multiple seconds in
writes to non-blocking sockets and in page faults on newly allocated
memory, while 95GB of memory sits in page cache.


* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-29  0:36 Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() Ivan Babrou
@ 2018-11-29  2:18 ` Dave Chinner
  2018-11-29 14:36   ` Shawn Bohrer
  2018-11-29 22:22   ` Ivan Babrou
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2018-11-29  2:18 UTC (permalink / raw)
  To: Ivan Babrou; +Cc: linux-xfs, Shawn Bohrer

On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> Hello,
> 
> We're experiencing some interesting issues with memory reclaim, both
> kswapd and direct reclaim.
> 
> A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> Page cache is around 95GB and dirty pages hover around 50MB, rarely
> jumping up to 1GB.

What is your workload?

> The catalyst of our issue is terrible disks. It's not uncommon to see
> the following stack in hung task detector:
> 
> Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for
> more than 10 seconds.
> Nov 15 21:55:13 21m21 kernel:       Tainted: G           O
> 4.14.59-cloudflare-2018.7.5 #1
> Nov 15 21:55:13 21m21 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov 15 21:55:13 21m21 kernel: some-task     D11792 156314 156183 0x00000080
> Nov 15 21:55:13 21m21 kernel: Call Trace:
> Nov 15 21:55:13 21m21 kernel:  ? __schedule+0x21a/0x820
> Nov 15 21:55:13 21m21 kernel:  schedule+0x28/0x80
> Nov 15 21:55:13 21m21 kernel:  schedule_preempt_disabled+0xa/0x10
> Nov 15 21:55:13 21m21 kernel:  __mutex_lock.isra.2+0x16a/0x490
> Nov 15 21:55:13 21m21 kernel:  ? xfs_reclaim_inodes_ag+0x265/0x2d0
> Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_ag+0x265/0x2d0
> Nov 15 21:55:13 21m21 kernel:  ? kmem_cache_alloc+0x14d/0x1b0
> Nov 15 21:55:13 21m21 kernel:  ? radix_tree_gang_lookup_tag+0xc4/0x130
> Nov 15 21:55:13 21m21 kernel:  ? __list_lru_walk_one.isra.5+0x33/0x130
> Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_nr+0x31/0x40
> Nov 15 21:55:13 21m21 kernel:  super_cache_scan+0x156/0x1a0
> Nov 15 21:55:13 21m21 kernel:  shrink_slab.part.51+0x1d2/0x3a0
> Nov 15 21:55:13 21m21 kernel:  shrink_node+0x113/0x2e0
> Nov 15 21:55:13 21m21 kernel:  do_try_to_free_pages+0xb3/0x310
> Nov 15 21:55:13 21m21 kernel:  try_to_free_pages+0xd2/0x190
> Nov 15 21:55:13 21m21 kernel:  __alloc_pages_slowpath+0x3a3/0xdc0
> Nov 15 21:55:13 21m21 kernel:  ? ip_output+0x5c/0xc0
> Nov 15 21:55:13 21m21 kernel:  ? update_curr+0x141/0x1a0
> Nov 15 21:55:13 21m21 kernel:  __alloc_pages_nodemask+0x223/0x240
> Nov 15 21:55:13 21m21 kernel:  skb_page_frag_refill+0x93/0xb0
> Nov 15 21:55:13 21m21 kernel:  sk_page_frag_refill+0x19/0x80
> Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg_locked+0x247/0xdc0
> Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg+0x27/0x40
> Nov 15 21:55:13 21m21 kernel:  sock_sendmsg+0x36/0x40
> Nov 15 21:55:13 21m21 kernel:  sock_write_iter+0x84/0xd0
> Nov 15 21:55:13 21m21 kernel:  __vfs_write+0xdd/0x140
> Nov 15 21:55:13 21m21 kernel:  vfs_write+0xad/0x1a0
> Nov 15 21:55:13 21m21 kernel:  SyS_write+0x42/0x90
> Nov 15 21:55:13 21m21 kernel:  do_syscall_64+0x60/0x110
> Nov 15 21:55:13 21m21 kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> Here "some-task" is trying to send some bytes over network and it's
> stuck in direct reclaim. Naturally, kswapd is not keeping up with its
> duties.

That's not kswapd causing the problem here, that's direct reclaim.

> It seems to me that our terrible disks sometimes take a pause to think
> about the meaning of life for a few seconds.

You've probably got lots of dirty inodes backed up waiting for IO.

How big are xfs inode slab caches (slabinfo output, please)? How big are the logs
on the filesystems on the machine (xfs_info output, please)?

> During that time the XFS
> shrinker is stuck, which drives the whole system out of free memory
> and in turn triggers direct reclaim.

It happens the other way around. What you have here is direct
reclaim trying to get the reclaim lock for a given AG to be able to
reclaim inodes from it. Direct reclaim only gets stuck here when all
AGs are already being reclaimed from (i.e. already under direct
reclaim).

> One solution to this is to not go into direct reclaim by keeping more
> free pages with vm.watermark_scale_factor,  but I'd like to discard
> this and argue that we're going to hit direct reclaim at some point
> anyway.

Right, but the problem is that the mm/ subsystem allows effectively
unbound direct reclaim concurrency. At some point, having tens to
hundreds of direct reclaimers all trying to write dirty inodes to
disk causes catastrophic IO breakdown and everything grinds to a
halt forever. We have to prevent that breakdown from occurring.

i.e. we have to throttle direct reclaim to before it reaches IO
breakdown /somewhere/. The memory reclaim subsystem does not do it,
so we have to do it in XFS itself. The problem here is that if we
ignore direct reclaim (i.e do nothing rather than block waiting on
reclaim progress) then the mm/ reclaim algorithms will eventually
think they aren't making progress and unleash the OOM killer.

> The solution I have in mind for this is not to try to write anything
> to (disastrously terrible) storage in shrinkers. We have 95GB of page
> cache readily available for reclaim and it seems a lot cheaper to grab
> that.

The mm/ subsystem tries to keep a balance between page cache and
shrinker controlled caches. /proc/sys/vm/vfs_cache_pressure
controls the balance between page cache and inode/dentry cache
reclaim pressure.

> That brings me to the first question around the memory subsystem: are
> shrinkers supposed to flush any dirty data?

No shrinker does that, but if they wanted to they would be perfectly
within their right to write back data.

XFS is writing back /metadata/ here because direct reclaim is
putting excessive pressure on the inode cache and not allowing the
background cleaners to do their job.

> My gut feeling is that
> they should not do that, because there's already a writeback mechanism
> with its own tunables for limits to take care of that. If a system runs
> out of memory reclaimable without IO and dirty pages are under the limit,
> it's totally fair to OOM somebody.
> 
> It's totally possible that I'm wrong about this feeling, but either
> way I think docs need an update on this matter:
> 
> * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
> 
>   nr_cached_objects: called by the sb cache shrinking function for the
>   filesystem to return the number of freeable cached objects it contains.

You are assuming that "freeable" means "instantly freeable object",
not "unreferenced object that can be freed in the near future". We
use the latter definition in the shrinkers, not the former.

> My second question is conditional on the first one: if filesystems are
> supposed to flush dirty data in response to shrinkers, then how can I
> stop this, given my knowledge about the combination of lots of available
> page cache and terrible disks?

Filesystems have more caches than just the page cache.

> I've tried two things to address this problem ad-hoc.
> 
> 1. Run the following systemtap script to trick shrinkers into thinking
> that XFS has nothing to free:
> 
> probe kernel.function("xfs_fs_nr_cached_objects").return {
>   $return = 0
> }
>
> That did the job and shrink_node latency dropped considerably, while
> calls to xfs_fs_free_cached_objects disappeared.

Which effectively turned off direct reclaim for XFS inodes. See
above - this just means that when you have no easily reclaimable
page cache the system will OOM kill rather than wait for inodes to
be reclaimed. i.e. it looks good until everything suddenly goes
wrong and then everything dies a horrible death.

> 2. Use vm.vfs_cache_pressure to do the same thing. This failed
> miserably, because of the following code in super_cache_count:
> 
>   if (sb->s_op && sb->s_op->nr_cached_objects)
>     total_objects = sb->s_op->nr_cached_objects(sb, sc);
> 
>   total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
>   total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
> 
>   total_objects = vfs_pressure_ratio(total_objects);
>   return total_objects;
> 
> XFS was doing its job cleaning up inodes with the background mechanisms
> it has (m_reclaim_workqueue), but the kernel also stopped cleaning up
> readily available inodes after XFS.

You mean the kernel stopped handing readily available inodes to XFS
to free.

ie. The VFS has to release unreferenced inodes before they are
handed to XFS to clean up and free. IOWs, I suspect you biased
vm.vfs_cache_pressure to turn off inode reclaim to effectively do
the same thing as above. However, by stopping the feed of
unreferenced inodes to XFS you stopped inode reclaim altogether and
that caused other problems.

What I suspect you want to do is bias vfs_cache_pressure the other
way, so that it /aggressively/ feeds inode reclaim and so prevents
large numbers of inodes from building up in the cache. That way
reclaim won't have as many dirty inodes to block on when a sustained
memory shortage drives large amounts of direct reclaim into the
shrinkers...
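
i.e. something in the direction of (number plucked from thin air, you'll
need to experiment):

$ sudo sysctl -w vm.vfs_cache_pressure=200

so that the inode/dentry caches are scanned harder relative to the page
cache and unreferenced inodes are fed to inode reclaim sooner.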

> I'm not a kernel hacker and to be honest with you I don't even
> understand all the nuances here. All I know is:
> 
> 1. I have lots of page cache and terrible disks.
> 2. I want to reclaim page cache and never touch disks in response to
> memory reclaim.
> 3. Direct reclaim will happen at some point, somebody will want a big
> chunk of memory all at once.
> 4. I'm probably ok with reclaiming clean xfs inodes synchronously in
> reclaim path.
> 
> This brings me to my final question: what should I do to avoid latency
> in reclaim (direct or kswapd)?
> 
> To reiterate the importance of this issue: we see interactive
> applications that do zero IO of their own stall for multiple seconds in
> writes to non-blocking sockets and in page faults on newly allocated
> memory, while 95GB of memory sits in page cache.

It really just sounds like you are allowing too many dirty inodes to
build up in memory. You have a huge amount of memory but really slow
disks - that's just asking for data and metadata writeback latency
problems.

It sounds like you need to cause inode cleaning to happen
earlier/faster than it is happening now, because your disks are slow
and so you can't allow a big backlog to build up.  Reducing
/proc/sys/fs/xfs/xfssyncd_centisecs can get background metadata
writeback started earlier. As per Documentation/filesystems/xfs.txt:

  fs.xfs.xfssyncd_centisecs     (Min: 100  Default: 3000  Max: 720000)
        The interval at which the filesystem flushes metadata
        out to disk and runs internal cache cleanup routines.
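
e.g. something like (the value is only an illustration; it's in
hundredths of a second):

$ sudo sysctl -w fs.xfs.xfssyncd_centisecs=1000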

If this doesn't help, then I'd suggest that your config is simply
allowing too many dirty inodes to build up in memory.  With XFS, the
number of dirty inodes that can be held in memory at any point in
time is bound by the size of the XFS log in the filesystem. Hence if
you have hundreds of MB of journal space, then you can cache
hundreds of thousands of dirty inodes in memory and so can take tens
of minutes to write them back to slow disks. So if earlier writeback
doesn't help, I'd suggest that using smaller logs with your
filesystems is the only real option here....
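
That means remaking the filesystem with an explicit log size, along the
lines of the below; the size is purely an example, not a recommendation:

$ mkfs.xfs -f -l size=512m <device>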

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-29  2:18 ` Dave Chinner
@ 2018-11-29 14:36   ` Shawn Bohrer
  2018-11-29 21:20     ` Dave Chinner
  2018-11-29 22:22   ` Ivan Babrou
  1 sibling, 1 reply; 16+ messages in thread
From: Shawn Bohrer @ 2018-11-29 14:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ivan Babrou, linux-xfs

Hi Dave,

I've got a few follow up questions below based on your response about
this.

On Thu, Nov 29, 2018 at 01:18:00PM +1100, Dave Chinner wrote:
> On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > The catalyst of our issue is terrible disks. It's not uncommon to see
> > the following stack in hung task detector:
> > 
> > Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for
> > more than 10 seconds.
> > Nov 15 21:55:13 21m21 kernel:       Tainted: G           O
> > 4.14.59-cloudflare-2018.7.5 #1
> > Nov 15 21:55:13 21m21 kernel: "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Nov 15 21:55:13 21m21 kernel: some-task     D11792 156314 156183 0x00000080
> > Nov 15 21:55:13 21m21 kernel: Call Trace:
> > Nov 15 21:55:13 21m21 kernel:  ? __schedule+0x21a/0x820
> > Nov 15 21:55:13 21m21 kernel:  schedule+0x28/0x80
> > Nov 15 21:55:13 21m21 kernel:  schedule_preempt_disabled+0xa/0x10
> > Nov 15 21:55:13 21m21 kernel:  __mutex_lock.isra.2+0x16a/0x490
> > Nov 15 21:55:13 21m21 kernel:  ? xfs_reclaim_inodes_ag+0x265/0x2d0
> > Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_ag+0x265/0x2d0
> > Nov 15 21:55:13 21m21 kernel:  ? kmem_cache_alloc+0x14d/0x1b0
> > Nov 15 21:55:13 21m21 kernel:  ? radix_tree_gang_lookup_tag+0xc4/0x130
> > Nov 15 21:55:13 21m21 kernel:  ? __list_lru_walk_one.isra.5+0x33/0x130
> > Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_nr+0x31/0x40
> > Nov 15 21:55:13 21m21 kernel:  super_cache_scan+0x156/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  shrink_slab.part.51+0x1d2/0x3a0
> > Nov 15 21:55:13 21m21 kernel:  shrink_node+0x113/0x2e0
> > Nov 15 21:55:13 21m21 kernel:  do_try_to_free_pages+0xb3/0x310
> > Nov 15 21:55:13 21m21 kernel:  try_to_free_pages+0xd2/0x190
> > Nov 15 21:55:13 21m21 kernel:  __alloc_pages_slowpath+0x3a3/0xdc0
> > Nov 15 21:55:13 21m21 kernel:  ? ip_output+0x5c/0xc0
> > Nov 15 21:55:13 21m21 kernel:  ? update_curr+0x141/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  __alloc_pages_nodemask+0x223/0x240
> > Nov 15 21:55:13 21m21 kernel:  skb_page_frag_refill+0x93/0xb0
> > Nov 15 21:55:13 21m21 kernel:  sk_page_frag_refill+0x19/0x80
> > Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg_locked+0x247/0xdc0
> > Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg+0x27/0x40
> > Nov 15 21:55:13 21m21 kernel:  sock_sendmsg+0x36/0x40
> > Nov 15 21:55:13 21m21 kernel:  sock_write_iter+0x84/0xd0
> > Nov 15 21:55:13 21m21 kernel:  __vfs_write+0xdd/0x140
> > Nov 15 21:55:13 21m21 kernel:  vfs_write+0xad/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  SyS_write+0x42/0x90
> > Nov 15 21:55:13 21m21 kernel:  do_syscall_64+0x60/0x110
> > Nov 15 21:55:13 21m21 kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > 
> > Here "some-task" is trying to send some bytes over network and it's
> > stuck in direct reclaim. Naturally, kswapd is not keeping up with its
> > duties.
> 
> That's not kswapd causing the problem here, that's direct reclaim.

It is understood that the above is direct reclaim.  When this happens
kswapd is also blocked as below.  As I'm sure you can imagine many
other tasks get blocked in direct reclaim as well.

[Thu Nov 15 21:52:06 2018] INFO: task kswapd0:1061 blocked for more than 10 seconds.
[Thu Nov 15 21:52:06 2018]       Tainted: G           O    4.14.59-cloudflare-2018.7.5 #1
[Thu Nov 15 21:52:06 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 15 21:52:06 2018] kswapd0         D12848  1061      2 0x80000000
[Thu Nov 15 21:52:06 2018] Call Trace:
[Thu Nov 15 21:52:06 2018]  ? __schedule+0x21a/0x820
[Thu Nov 15 21:52:06 2018]  schedule+0x28/0x80
[Thu Nov 15 21:52:06 2018]  io_schedule+0x12/0x40
[Thu Nov 15 21:52:06 2018]  __xfs_iflock+0xe3/0x100
[Thu Nov 15 21:52:06 2018]  ? bit_waitqueue+0x30/0x30
[Thu Nov 15 21:52:06 2018]  xfs_reclaim_inode+0x141/0x300
[Thu Nov 15 21:52:06 2018]  xfs_reclaim_inodes_ag+0x19d/0x2d0
[Thu Nov 15 21:52:06 2018]  xfs_reclaim_inodes_nr+0x31/0x40
[Thu Nov 15 21:52:07 2018]  super_cache_scan+0x156/0x1a0
[Thu Nov 15 21:52:07 2018]  shrink_slab.part.51+0x1d2/0x3a0
[Thu Nov 15 21:52:07 2018]  shrink_node+0x113/0x2e0
[Thu Nov 15 21:52:07 2018]  kswapd+0x28a/0x6d0
[Thu Nov 15 21:52:07 2018]  ? mem_cgroup_shrink_node+0x150/0x150
[Thu Nov 15 21:52:07 2018]  kthread+0x119/0x130
[Thu Nov 15 21:52:07 2018]  ? kthread_create_on_node+0x40/0x40
[Thu Nov 15 21:52:07 2018]  ret_from_fork+0x35/0x40

> > One solution to this is to not go into direct reclaim by keeping more
> > free pages with vm.watermark_scale_factor,  but I'd like to discard
> > this and argue that we're going to hit direct reclaim at some point
> > anyway.
> 
> Right, but the problem is that the mm/ subsystem allows effectively
> unbound direct reclaim concurrency. At some point, having tens to
> hundreds of direct reclaimers all trying to write dirty inodes to
> disk causes catastrophic IO breakdown and everything grinds to a
> halt forever. We have to prevent that breakdown from occurring.
> 
> i.e. we have to throttle direct reclaim to before it reaches IO
> breakdown /somewhere/. The memory reclaim subsystem does not do it,
> so we have to do it in XFS itself. The problem here is that if we
> ignore direct reclaim (i.e do nothing rather than block waiting on
> reclaim progress) then the mm/ reclaim algorithms will eventually
> think they aren't making progress and unleash the OOM killer.

Here is my naive question.  Why does kswapd block?  Wouldn't it make
sense for kswapd to asynchronously start the xfs_reclaim_inodes
process and then continue looking for other pages (perhaps page cache)
that it can easily free?

In my mind this might prevent us from ever getting to the point of
direct reclaim.  And if we did get to that point then yes I can see
that you might need to synchronously block all tasks in direct reclaim
in xfs_reclaim_inodes to prevent the thundering herd problem.

My other question is why do the mm/ reclaim algorithms think that
they need to force this metadata reclaim?  I think Ivan's main
question was that we have 95GB of page cache, maybe 2-3GB of total slab
memory in use, and maybe 1GB of dirty pages.  Blocking the world for
any disk I/O at this point seems insane when there is other quickly
freeable memory.  I assume the answer is LRU?  Our page cache pages
are newer or more frequently accessed than this filesystem metadata?

Thanks,
Shawn


* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-29 14:36   ` Shawn Bohrer
@ 2018-11-29 21:20     ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2018-11-29 21:20 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: Ivan Babrou, linux-xfs

On Thu, Nov 29, 2018 at 08:36:48AM -0600, Shawn Bohrer wrote:
> Hi Dave,
> 
> I've got a few follow up questions below based on your response about
> this.
> 
> On Thu, Nov 29, 2018 at 01:18:00PM +1100, Dave Chinner wrote:
> > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > > The catalyst of our issue is terrible disks. It's not uncommon to see
> > > the following stack in hung task detector:
> > > 
> > > Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for
> > > more than 10 seconds.
> > > Nov 15 21:55:13 21m21 kernel:       Tainted: G           O
> > > 4.14.59-cloudflare-2018.7.5 #1
> > > Nov 15 21:55:13 21m21 kernel: "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > Nov 15 21:55:13 21m21 kernel: some-task     D11792 156314 156183 0x00000080
> > > Nov 15 21:55:13 21m21 kernel: Call Trace:
> > > Nov 15 21:55:13 21m21 kernel:  ? __schedule+0x21a/0x820
> > > Nov 15 21:55:13 21m21 kernel:  schedule+0x28/0x80
> > > Nov 15 21:55:13 21m21 kernel:  schedule_preempt_disabled+0xa/0x10
> > > Nov 15 21:55:13 21m21 kernel:  __mutex_lock.isra.2+0x16a/0x490
> > > Nov 15 21:55:13 21m21 kernel:  ? xfs_reclaim_inodes_ag+0x265/0x2d0
> > > Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_ag+0x265/0x2d0
> > > Nov 15 21:55:13 21m21 kernel:  ? kmem_cache_alloc+0x14d/0x1b0
> > > Nov 15 21:55:13 21m21 kernel:  ? radix_tree_gang_lookup_tag+0xc4/0x130
> > > Nov 15 21:55:13 21m21 kernel:  ? __list_lru_walk_one.isra.5+0x33/0x130
> > > Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_nr+0x31/0x40
> > > Nov 15 21:55:13 21m21 kernel:  super_cache_scan+0x156/0x1a0
> > > Nov 15 21:55:13 21m21 kernel:  shrink_slab.part.51+0x1d2/0x3a0
> > > Nov 15 21:55:13 21m21 kernel:  shrink_node+0x113/0x2e0
> > > Nov 15 21:55:13 21m21 kernel:  do_try_to_free_pages+0xb3/0x310
> > > Nov 15 21:55:13 21m21 kernel:  try_to_free_pages+0xd2/0x190
> > > Nov 15 21:55:13 21m21 kernel:  __alloc_pages_slowpath+0x3a3/0xdc0
> > > Nov 15 21:55:13 21m21 kernel:  ? ip_output+0x5c/0xc0
> > > Nov 15 21:55:13 21m21 kernel:  ? update_curr+0x141/0x1a0
> > > Nov 15 21:55:13 21m21 kernel:  __alloc_pages_nodemask+0x223/0x240
> > > Nov 15 21:55:13 21m21 kernel:  skb_page_frag_refill+0x93/0xb0
> > > Nov 15 21:55:13 21m21 kernel:  sk_page_frag_refill+0x19/0x80
> > > Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg_locked+0x247/0xdc0
> > > Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg+0x27/0x40
> > > Nov 15 21:55:13 21m21 kernel:  sock_sendmsg+0x36/0x40
> > > Nov 15 21:55:13 21m21 kernel:  sock_write_iter+0x84/0xd0
> > > Nov 15 21:55:13 21m21 kernel:  __vfs_write+0xdd/0x140
> > > Nov 15 21:55:13 21m21 kernel:  vfs_write+0xad/0x1a0
> > > Nov 15 21:55:13 21m21 kernel:  SyS_write+0x42/0x90
> > > Nov 15 21:55:13 21m21 kernel:  do_syscall_64+0x60/0x110
> > > Nov 15 21:55:13 21m21 kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > 
> > > Here "some-task" is trying to send some bytes over network and it's
> > > stuck in direct reclaim. Naturally, kswapd is not keeping up with its
> > > duties.
> > 
> > That's not kswapd causing the problem here, that's direct reclaim.
> 
> It is understood that the above is direct reclaim.  When this happens
> kswapd is also blocked as below.  As I'm sure you can imagine many
> other tasks get blocked in direct reclaim as well.

Kswapd is allowed to block waiting for IO - it runs in GFP_KERNEL
context and so can both issue writeback of dirty page cache pages
and wait for that IO to complete. With that reclaim context
(GFP_KERNEL) we can also issue and wait for IO from
the filesystem shrinkers.

Blocking kswapd is less than ideal; facebook hit this particular
"kswapd is blocked" issue too, but their proposed "don't block kswapd
during inode reclaim" patch caused an increase in OOM kills on my
low-memory test VMs as well as small, repeatable performance
regressions on my benchmark workloads, so it wasn't merged.

> > > One solution to this is to not go into direct reclaim by keeping more
> > > free pages with vm.watermark_scale_factor,  but I'd like to discard
> > > this and argue that we're going to hit direct reclaim at some point
> > > anyway.
> > 
> > Right, but the problem is that the mm/ subsystem allows effectively
> > unbound direct reclaim concurrency. At some point, having tens to
> > hundreds of direct reclaimers all trying to write dirty inodes to
> > disk causes catastrophic IO breakdown and everything grinds to a
> > halt forever. We have to prevent that breakdown from occurring.
> > 
> > i.e. we have to throttle direct reclaim to before it reaches IO
> > breakdown /somewhere/. The memory reclaim subsystem does not do it,
> > so we have to do it in XFS itself. The problem here is that if we
> > ignore direct reclaim (i.e do nothing rather than block waiting on
> > reclaim progress) then the mm/ reclaim algorithms will eventually
> > think they aren't making progress and unleash the OOM killer.
> 
> Here is my naive question.  Why does kswapd block?  Wouldn't it make
> sense for kswapd to asynchronously start the xfs_reclaim_inodes
> process and then continue looking for other pages (perhaps page cache)
> that it can easily free?

kswapd blocks because it is the only thing that (almost) guarantees
forwards progress in memory reclaim. If the memory is full of dirty
pages, it *must* block and wait for writeback to complete and clean
the pages it needs to reclaim for the waiting allocators. If it does
not block, we trigger the OOM killer prematurely. i.e. waiting a
short time for IO to complete avoids false positive transient ENOMEM
detection.

> In my mind this might prevent us from ever getting to the point of
> direct reclaim.

Direct reclaim happens first. kswapd is only kicked when direct
reclaim isn't making enough progress - kswapd happens in the
background. The problem we have is that the primary memory
reclaimer is the thundering herd of direct reclaim, and kswapd can
then get stuck behind the thundering herd.

> And if we did get to that point then yes I can see
> that you might need to synchronously block all tasks in direct reclaim
> in xfs_reclaim_inodes to prevent the thundering herd problem.

We already had to deal with the thundering herd before kswapd
starts working.

> My other question is why do the mm/ reclaim algorithms think that
> they need to force this metadata reclaim?  I think Ivan's main
> question was that we have 95GB of page cache, maybe 2-3GB of total slab
> memory in use, and maybe 1GB of dirty pages.

The kernel keeps a balance between the caches. If it scans 1% of
the page cache, it also needs to scan 1% of the other caches to keep
them in balance.

> Blocking the world for
> any disk I/O at this point seems insane when there is other quickly
> freeable memory.  I assume the answer is LRU?  Our page cache pages
> are newer or more frequently accessed than this filesystem metadata?

Not blocking for IO in the shrinkers is even worse. I made some mods
yesterday to make inode reclaim block less to do the same work
(more efficient IO dispatch == less waiting) and the only thing I
succeeded in doing was slowing down my benchmark workloads by 30-40%.

Why? Because blocking less in inode reclaim meant that memory
reclaim did more scanning, and so reclaimed /other caches faster/.
That meant it was trashing the working set in the xfs buffer cache
(which holds all the metadata buffers), and that caused a
substantial increase in metadata read IO on a /write-only workload/.
IOWs, metadata buffers were being reclaimed far too quickly because
I made inode cache reclaim faster.

All the caches must be balanced for the system to perform well
across a wide range of workloads. And it's an unfortunate fact that
the only way we can maintain that balance right now is via harsh
throttling of the inode cache. I'll keep trying to find a different
solution, but the reality is that the memory reclaim subsystem has
fairly major limitations in what we can and can't do to maintain
balance across related and/or dependent caches....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-29  2:18 ` Dave Chinner
  2018-11-29 14:36   ` Shawn Bohrer
@ 2018-11-29 22:22   ` Ivan Babrou
  2018-11-30  2:18     ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Ivan Babrou @ 2018-11-29 22:22 UTC (permalink / raw)
  To: david; +Cc: linux-xfs, Shawn Bohrer

On Wed, Nov 28, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > Hello,
> >
> > We're experiencing some interesting issues with memory reclaim, both
> > kswapd and direct reclaim.
> >
> > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> > Page cache is around 95GB and dirty pages hover around 50MB, rarely
> > jumping up to 1GB.
>
> What is your workload?

My test setup is an empty machine with 256GB of RAM booted from
network into memory with just systemd essentials running.

I create XFS on a 10TB drive (via LUKS), mount the drive and write
300GiB of randomness:

$ sudo mkfs.xfs /dev/mapper/luks-sda
$ sudo mount /dev/mapper/luks-sda /mnt
$ sudo dd if=/dev/urandom of=/mnt/300g.random bs=1M count=300K status=progress

Then I reboot and just mount the drive again to run my test workload:

$ dd if=/mnt/300g.random of=/dev/null bs=1M status=progress

After running it once and populating page cache I restart it to collect traces.

Here's xfs_info:

$ sudo xfs_info /mnt
meta-data=/dev/mapper/luks-sda   isize=512    agcount=10, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2441608704, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

And here's slabinfo:

$ sudo cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
slabdata <active_slabs> <num_slabs> <sharedavail>
kcopyd_job             0      0   3312    9    8 : tunables    0    0
  0 : slabdata      0      0      0
dm_uevent              0      0   2632   12    8 : tunables    0    0
  0 : slabdata      0      0      0
dax_cache             96    504    768   42    8 : tunables    0    0
  0 : slabdata     12     12      0
nf_conntrack_expect      0      0    224   36    2 : tunables    0
0    0 : slabdata      0      0      0
nf_conntrack           0      0    320   51    4 : tunables    0    0
  0 : slabdata      0      0      0
kvm_async_pf           0      0    136   30    1 : tunables    0    0
  0 : slabdata      0      0      0
kvm_vcpu               0      0  19392    1    8 : tunables    0    0
  0 : slabdata      0      0      0
kvm_mmu_page_header      0      0    168   48    2 : tunables    0
0    0 : slabdata      0      0      0
pte_list_desc          0      0     32  128    1 : tunables    0    0
  0 : slabdata      0      0      0
ip6-frags              0      0    208   39    2 : tunables    0    0
  0 : slabdata      0      0      0
RAWv6                168    168   1152   28    8 : tunables    0    0
  0 : slabdata      6      6      0
UDPv6                900    900   1280   25    8 : tunables    0    0
  0 : slabdata     36     36      0
tw_sock_TCPv6       1122   1122    248   33    2 : tunables    0    0
  0 : slabdata     34     34      0
request_sock_TCPv6      0      0    312   52    4 : tunables    0    0
   0 : slabdata      0      0      0
TCPv6                420    420   2240   14    8 : tunables    0    0
  0 : slabdata     30     30      0
cfq_io_cq           1156   1156    120   34    1 : tunables    0    0
  0 : slabdata     34     34      0
mqueue_inode_cache     34     34    960   34    8 : tunables    0    0
   0 : slabdata      1      1      0
xfs_dqtrx              0      0    528   31    4 : tunables    0    0
  0 : slabdata      0      0      0
xfs_rui_item           0      0    672   48    8 : tunables    0    0
  0 : slabdata      0      0      0
xfs_rud_item           0      0    152   53    2 : tunables    0    0
  0 : slabdata      0      0      0
xfs_ili              144    144    168   48    2 : tunables    0    0
  0 : slabdata      3      3      0
xfs_inode            170    170    960   34    8 : tunables    0    0
  0 : slabdata      5      5      0
xfs_efd_item           0      0    416   39    4 : tunables    0    0
  0 : slabdata      0      0      0
xfs_buf_item         132    132    248   33    2 : tunables    0    0
  0 : slabdata      4      4      0
xfs_da_state           0      0    480   34    4 : tunables    0    0
  0 : slabdata      0      0      0
xfs_btree_cur        420    420    232   35    2 : tunables    0    0
  0 : slabdata     12     12      0
xfs_log_ticket       308    308    184   44    2 : tunables    0    0
  0 : slabdata      7      7      0
bio-2                 51     51    320   51    4 : tunables    0    0
  0 : slabdata      1      1      0
nfs_direct_cache       0      0    336   48    4 : tunables    0    0
  0 : slabdata      0      0      0
nfs_read_data         36     36    896   36    8 : tunables    0    0
  0 : slabdata      1      1      0
nfs_inode_cache        0      0   1080   30    8 : tunables    0    0
  0 : slabdata      0      0      0
isofs_inode_cache      0      0    656   49    8 : tunables    0    0
  0 : slabdata      0      0      0
fat_inode_cache        0      0    744   44    8 : tunables    0    0
  0 : slabdata      0      0      0
fat_cache              0      0     40  102    1 : tunables    0    0
  0 : slabdata      0      0      0
jbd2_journal_head      0      0    120   34    1 : tunables    0    0
  0 : slabdata      0      0      0
jbd2_revoke_table_s      0      0     16  256    1 : tunables    0
0    0 : slabdata      0      0      0
ext4_inode_cache       0      0   1088   30    8 : tunables    0    0
  0 : slabdata      0      0      0
ext4_allocation_context      0      0    128   32    1 : tunables    0
   0    0 : slabdata      0      0      0
ext4_io_end            0      0     64   64    1 : tunables    0    0
  0 : slabdata      0      0      0
ext4_extent_status      0      0     40  102    1 : tunables    0    0
   0 : slabdata      0      0      0
mbcache                0      0     56   73    1 : tunables    0    0
  0 : slabdata      0      0      0
fscrypt_info           0      0     32  128    1 : tunables    0    0
  0 : slabdata      0      0      0
fscrypt_ctx            0      0     48   85    1 : tunables    0    0
  0 : slabdata      0      0      0
userfaultfd_ctx_cache      0      0    192   42    2 : tunables    0
 0    0 : slabdata      0      0      0
dio                    0      0    640   51    8 : tunables    0    0
  0 : slabdata      0      0      0
pid_namespace          0      0   2232   14    8 : tunables    0    0
  0 : slabdata      0      0      0
posix_timers_cache    884    884    240   34    2 : tunables    0    0
   0 : slabdata     26     26      0
rpc_inode_cache        0      0    704   46    8 : tunables    0    0
  0 : slabdata      0      0      0
ip4-frags            185    185    216   37    2 : tunables    0    0
  0 : slabdata      5      5      0
xfrm_dst_cache         0      0    448   36    4 : tunables    0    0
  0 : slabdata      0      0      0
RAW                 1564   1564    960   34    8 : tunables    0    0
  0 : slabdata     46     46      0
tw_sock_TCP          990    990    248   33    2 : tunables    0    0
  0 : slabdata     30     30      0
request_sock_TCP     520    520    312   52    4 : tunables    0    0
  0 : slabdata     10     10      0
TCP                  544    544   2048   16    8 : tunables    0    0
  0 : slabdata     34     34      0
hugetlbfs_inode_cache    104    104    624   52    8 : tunables    0
 0    0 : slabdata      2      2      0
dquot                  0      0    256   32    2 : tunables    0    0
  0 : slabdata      0      0      0
request_queue        195    195   2144   15    8 : tunables    0    0
  0 : slabdata     13     13      0
blkdev_ioc          1560   1560    104   39    1 : tunables    0    0
  0 : slabdata     40     40      0
user_namespace         0      0    440   37    4 : tunables    0    0
  0 : slabdata      0      0      0
dmaengine-unmap-256     15     15   2112   15    8 : tunables    0
0    0 : slabdata      1      1      0
dmaengine-unmap-128   1230   1230   1088   30    8 : tunables    0
0    0 : slabdata     41     41      0
sock_inode_cache    6808   6808    704   46    8 : tunables    0    0
  0 : slabdata    148    148      0
file_lock_cache     1600   1600    200   40    2 : tunables    0    0
  0 : slabdata     40     40      0
fsnotify_mark_connector   5610   5610     24  170    1 : tunables    0
   0    0 : slabdata     33     33      0
net_namespace          5      5   5888    5    8 : tunables    0    0
  0 : slabdata      1      1      0
taskstats           1813   1813    328   49    4 : tunables    0    0
  0 : slabdata     37     37      0
proc_inode_cache   13776  13776    680   48    8 : tunables    0    0
  0 : slabdata    287    287      0
sigqueue            2040   2040    160   51    2 : tunables    0    0
  0 : slabdata     40     40      0
bdev_cache           909   1287    832   39    8 : tunables    0    0
  0 : slabdata     33     33      0
shmem_inode_cache  48414  49864    712   46    8 : tunables    0    0
  0 : slabdata   1084   1084      0
kernfs_node_cache  54722  54780    136   30    1 : tunables    0    0
  0 : slabdata   1826   1826      0
mnt_cache           1890   1890    384   42    4 : tunables    0    0
  0 : slabdata     45     45      0
inode_cache        25175  26447    608   53    8 : tunables    0    0
  0 : slabdata    499    499      0
dentry            108066 117096    192   42    2 : tunables    0    0
  0 : slabdata   2788   2788      0
iint_cache             0      0     80   51    1 : tunables    0    0
  0 : slabdata      0      0      0
buffer_head          585    585    104   39    1 : tunables    0    0
  0 : slabdata     15     15      0
vm_area_struct     69560  69760    200   40    2 : tunables    0    0
  0 : slabdata   1744   1744      0
mm_struct           4920   4920   1088   30    8 : tunables    0    0
  0 : slabdata    164    164      0
files_cache         6624   6624    704   46    8 : tunables    0    0
  0 : slabdata    144    144      0
signal_cache        7040   7200   1024   32    8 : tunables    0    0
  0 : slabdata    225    225      0
sighand_cache       2670   2670   2112   15    8 : tunables    0    0
  0 : slabdata    178    178      0
task_struct         1431   1468   7744    4    8 : tunables    0    0
  0 : slabdata    367    367      0
cred_jar           20580  20580    192   42    2 : tunables    0    0
  0 : slabdata    490    490      0
Acpi-Operand       21504  21504     72   56    1 : tunables    0    0
  0 : slabdata    384    384      0
Acpi-Parse          2920   2920     56   73    1 : tunables    0    0
  0 : slabdata     40     40      0
Acpi-State          1836   1836     80   51    1 : tunables    0    0
  0 : slabdata     36     36      0
Acpi-Namespace     80882  82416     40  102    1 : tunables    0    0
  0 : slabdata    808    808      0
anon_vma_chain     95424  95744     64   64    1 : tunables    0    0
  0 : slabdata   1496   1496      0
anon_vma           47886  47886     88   46    1 : tunables    0    0
  0 : slabdata   1041   1041      0
pid                 7776   7776    128   32    1 : tunables    0    0
  0 : slabdata    243    243      0
numa_policy           62     62    264   31    2 : tunables    0    0
  0 : slabdata      2      2      0
trace_event_file    2208   2208     88   46    1 : tunables    0    0
  0 : slabdata     48     48      0
ftrace_event_field   4675   4675     48   85    1 : tunables    0    0
   0 : slabdata     55     55      0
radix_tree_node   1266533 1266720    584   28    4 : tunables    0
0    0 : slabdata  45240  45240      0
task_group           184    184    704   46    8 : tunables    0    0
  0 : slabdata      4      4      0
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   32    4 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   32    2 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   42    2 : tunables    0    0
  0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0
  0 : slabdata      0      0      0
kmalloc-8192        1292   1292   8192    4    8 : tunables    0    0
  0 : slabdata    323    323      0
kmalloc-4096        1056   1056   4096    8    8 : tunables    0    0
  0 : slabdata    132    132      0
kmalloc-2048        3246   3312   2048   16    8 : tunables    0    0
  0 : slabdata    207    207      0
kmalloc-1024        9056   9056   1024   32    8 : tunables    0    0
  0 : slabdata    283    283      0
kmalloc-512         7487   7488    512   32    4 : tunables    0    0
  0 : slabdata    234    234      0
kmalloc-256        17544  18528    256   32    2 : tunables    0    0
  0 : slabdata    579    579      0
kmalloc-192        10416  10416    192   42    2 : tunables    0    0
  0 : slabdata    248    248      0
kmalloc-128         6756   6816    128   32    1 : tunables    0    0
  0 : slabdata    213    213      0
kmalloc-96         48655  49308     96   42    1 : tunables    0    0
  0 : slabdata   1174   1174      0
kmalloc-64         23650  23808     64   64    1 : tunables    0    0
  0 : slabdata    372    372      0
kmalloc-32         34432  34432     32  128    1 : tunables    0    0
  0 : slabdata    269    269      0
kmalloc-16         27392  27392     16  256    1 : tunables    0    0
  0 : slabdata    107    107      0
kmalloc-8          37376  37376      8  512    1 : tunables    0    0
  0 : slabdata     73     73      0
kmem_cache_node     1725   1728     64   64    1 : tunables    0    0
  0 : slabdata     27     27      0
kmem_cache           966    966    384   42    4 : tunables    0    0
  0 : slabdata     23     23      0

To illustrate the issue I'm issuing the simplest command I can come up with:

$ date; sudo touch /mnt/empty
Thu Nov 29 20:09:35 UTC 2018

Now let's see what happens in the 10 seconds surrounding this event.
Please read through all the events to see the whole picture.

1. Memory reclaim is happening as it should (my working set is bigger than RAM):

$ sudo /usr/share/bcc/tools/funclatency -mTi 1 shrink_node
Tracing 1 functions for "shrink_node"... Hit Ctrl-C to end.

...

20:09:30
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 2        |**************************              |
         8 -> 15         : 1        |*************                           |
        16 -> 31         : 3        |****************************************|

20:09:31
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 2        |****************************************|
         8 -> 15         : 1        |********************                    |

20:09:32
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 3        |****************************************|

20:09:33
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 2        |**************************              |
         8 -> 15         : 1        |*************                           |
        16 -> 31         : 3        |****************************************|

20:09:34

20:09:35
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 2        |****************************************|
         8 -> 15         : 2        |****************************************|
        16 -> 31         : 2        |****************************************|

20:09:36
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1        |********************                    |
         8 -> 15         : 1        |********************                    |
        16 -> 31         : 2        |****************************************|
        32 -> 63         : 2        |****************************************|

20:09:37

20:09:38
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 2        |****************************************|
         8 -> 15         : 2        |****************************************|
        16 -> 31         : 2        |****************************************|

20:09:39

20:09:40
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 2        |****************************************|
         8 -> 15         : 2        |****************************************|
        16 -> 31         : 2        |****************************************|

2. XFS reclaim only kicks in when I create an empty file, silence otherwise:

$ sudo /usr/share/bcc/tools/funclatency -mTi 1 xfs_fs_free_cached_objects
Tracing 1 functions for "xfs_fs_free_cached_objects"... Hit Ctrl-C to end.

...

20:09:30

20:09:31

20:09:32

20:09:33

20:09:34

20:09:35

20:09:36

20:09:37
     msecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 2        |****************************************|

20:09:38

20:09:39

20:09:40

3. We can see that XFS inode is actually reclaimed from kswapd0:

$ sudo /usr/share/bcc/tools/stackcount -Ti 1 xfs_reclaim_inode
Tracing 1 functions for "xfs_reclaim_inode"... Hit Ctrl-C to end.

...

20:09:30

20:09:31

20:09:32

20:09:33

20:09:34

20:09:35

20:09:36
  xfs_reclaim_inode
  xfs_reclaim_inodes_ag
  xfs_reclaim_inodes_nr
  kretprobe_trampoline
    kswapd0 [1862]
    1


20:09:37

20:09:38

20:09:39

20:09:40

4. And we can also see that flush on disk is happening from the same kswapd0:

$ sudo /usr/share/bcc/tools/stackcount -Ti 1 xfs_iflush
Tracing 1 functions for "xfs_iflush"... Hit Ctrl-C to end.

...

20:09:30

20:09:31

20:09:32

20:09:33

20:09:34

20:09:35

20:09:36

20:09:37
  xfs_iflush
  xfs_reclaim_inode
  xfs_reclaim_inodes_ag
  xfs_reclaim_inodes_nr
  kretprobe_trampoline
    kswapd0 [1862]
    1


20:09:38

20:09:39

20:09:40

5. Back to XFS reclaim source locations, both kswapd0 and kswapd1 hit it:

ivan@36s336:~$ sudo /usr/share/bcc/tools/stackcount -Ti 1
xfs_fs_free_cached_objects
Tracing 1 functions for "xfs_fs_free_cached_objects"... Hit Ctrl-C to end.

...

20:09:30

20:09:31

20:09:32

20:09:33

20:09:34

20:09:35

20:09:36

20:09:37
  xfs_fs_free_cached_objects
  super_cache_scan
  shrink_slab.part.45
  shrink_node
  kretprobe_trampoline
    kswapd1 [1863]
    1

  xfs_fs_free_cached_objects
  super_cache_scan
  shrink_slab.part.45
  shrink_node
  kretprobe_trampoline
    kswapd0 [1862]
    1


20:09:38

20:09:39

20:09:40

6. And we can see that it was kswapd0 that flushed data to disk synchronously:

$ sudo /usr/share/bcc/tools/stackcount -Ti 1 xfs_buf_submit_wait
Tracing 1 functions for "xfs_buf_submit_wait"... Hit Ctrl-C to end.

...

20:09:30

20:09:31

20:09:32

20:09:33

20:09:34

20:09:35

20:09:36
  xfs_buf_submit_wait
  xfs_bwrite
  xfs_reclaim_inode
  xfs_reclaim_inodes_ag
  xfs_reclaim_inodes_nr
  kretprobe_trampoline
    kswapd0 [1862]
    1


20:09:37

20:09:38

20:09:39

20:09:40

7. Finally, we only see xfs_fs_nr_cached_objects return non-zero
around the same time:

$ sudo /usr/share/bcc/tools/trace -T 'r::xfs_fs_nr_cached_objects
(retval > 0) "cached = %d", retval'
TIME     PID     TID     COMM            FUNC             -
20:09:36 1862    1862    kswapd0         xfs_fs_nr_cached_objects cached = 1
20:09:36 1862    1862    kswapd0         xfs_fs_nr_cached_objects cached = 1
20:09:36 1863    1863    kswapd1         xfs_fs_nr_cached_objects cached = 1
20:09:36 1863    1863    kswapd1         xfs_fs_nr_cached_objects cached = 1

You can see that we have 249GB of page cache:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           251G        1.0G        1.1G        656M        249G        248G
Swap:            0B          0B          0B

And the number of dirty pages is zero:

$ cat /proc/vmstat
nr_free_pages 295648
nr_zone_inactive_anon 99463
nr_zone_active_anon 76344
nr_zone_inactive_file 56796555
nr_zone_active_file 8297372
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_page_table_pages 1578
nr_kernel_stack 9168
nr_bounce 0
nr_free_cma 0
numa_hit 99220256
numa_miss 52104657
numa_foreign 52104657
numa_interleave 135792
numa_local 99127058
numa_other 52197855
nr_inactive_anon 99463
nr_active_anon 76344
nr_inactive_file 56796555
nr_active_file 8297372
nr_unevictable 0
nr_slab_reclaimable 194565
nr_slab_unreclaimable 43128
nr_isolated_anon 0
nr_isolated_file 0
workingset_refault 69892763
workingset_activate 289419
workingset_nodereclaim 0
nr_anon_pages 7700
nr_mapped 5765
nr_file_pages 65262134
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 168119
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 0
nr_written 0
nr_dirty_threshold 13017651
nr_dirty_background_threshold 6500878
pgpgin 594234184
pgpgout 2360
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_dma32 348991
pgalloc_normal 151243219
pgalloc_movable 0
allocstall_dma 0
allocstall_dma32 0
allocstall_normal 0
allocstall_movable 0
pgskip_dma 0
pgskip_dma32 0
pgskip_normal 0
pgskip_movable 0
pgfree 151892594
pgactivate 8963375
pgdeactivate 742721
pglazyfree 0
pgfault 3796959
pgmajfault 0
pglazyfreed 0
pgrefill 742721
pgsteal_kswapd 83442196
pgsteal_direct 0
pgscan_kswapd 83442409
pgscan_direct 0
pgscan_direct_throttle 0
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 26710
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 0
kswapd_high_wmark_hit_quickly 0
pageoutrun 1636
pgrotated 0
drop_pagecache 0
drop_slab 0
oom_kill 0
numa_pte_updates 0
numa_huge_pte_updates 0
numa_hint_faults 0
numa_hint_faults_local 0
numa_pages_migrated 0
pgmigrate_success 0
pgmigrate_fail 0
compact_migrate_scanned 0
compact_free_scanned 0
compact_isolated 0
compact_stall 0
compact_fail 0
compact_success 0
compact_daemon_wake 0
compact_daemon_migrate_scanned 0
compact_daemon_free_scanned 0
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 18354
unevictable_pgs_scanned 0
unevictable_pgs_rescued 19851
unevictable_pgs_mlocked 19851
unevictable_pgs_munlocked 19851
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 0
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_file_alloc 0
thp_file_mapped 0
thp_split_page 0
thp_split_page_failed 0
thp_deferred_split_page 0
thp_split_pmd 0
thp_split_pud 0
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
thp_swpout 0
thp_swpout_fallback 0
balloon_inflate 0
balloon_deflate 0
balloon_migrate 0
swap_ra 0
swap_ra_hit 0

nr_dirty is zero all the time, except for when I touch an empty file.
There's just no other write IO on the system.
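
For reference, I was just polling the counters with something like:

$ watch -n1 'grep -E "^(nr_dirty|nr_writeback) " /proc/vmstat'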

The whole issue here is that synchronous IO from memory reclaim is
unbounded in time. For all I know we could have a 10TB drive that
runs at floppy-drive speeds or worse.

The following can easily happen (correct me if it can't for some reason):

1. kswapd gets stuck because of slow storage and memory is not getting reclaimed
2. memory allocation doesn't have any free pages and kicks in direct reclaim
3. direct reclaim is stuck behind kswapd

I'm not sure why you say direct reclaim happens first; allocstall is zero.
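
This is easy to confirm with the usual /proc/vmstat counters, e.g.:

$ grep -E '^(allocstall|pgscan_direct|pgsteal_direct)' /proc/vmstat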

> > The catalyst of our issue is terrible disks. It's not uncommon to see
> > the following stack in hung task detector:
> >
> > Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for
> > more than 10 seconds.
> > Nov 15 21:55:13 21m21 kernel:       Tainted: G           O
> > 4.14.59-cloudflare-2018.7.5 #1
> > Nov 15 21:55:13 21m21 kernel: "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Nov 15 21:55:13 21m21 kernel: some-task     D11792 156314 156183 0x00000080
> > Nov 15 21:55:13 21m21 kernel: Call Trace:
> > Nov 15 21:55:13 21m21 kernel:  ? __schedule+0x21a/0x820
> > Nov 15 21:55:13 21m21 kernel:  schedule+0x28/0x80
> > Nov 15 21:55:13 21m21 kernel:  schedule_preempt_disabled+0xa/0x10
> > Nov 15 21:55:13 21m21 kernel:  __mutex_lock.isra.2+0x16a/0x490
> > Nov 15 21:55:13 21m21 kernel:  ? xfs_reclaim_inodes_ag+0x265/0x2d0
> > Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_ag+0x265/0x2d0
> > Nov 15 21:55:13 21m21 kernel:  ? kmem_cache_alloc+0x14d/0x1b0
> > Nov 15 21:55:13 21m21 kernel:  ? radix_tree_gang_lookup_tag+0xc4/0x130
> > Nov 15 21:55:13 21m21 kernel:  ? __list_lru_walk_one.isra.5+0x33/0x130
> > Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_nr+0x31/0x40
> > Nov 15 21:55:13 21m21 kernel:  super_cache_scan+0x156/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  shrink_slab.part.51+0x1d2/0x3a0
> > Nov 15 21:55:13 21m21 kernel:  shrink_node+0x113/0x2e0
> > Nov 15 21:55:13 21m21 kernel:  do_try_to_free_pages+0xb3/0x310
> > Nov 15 21:55:13 21m21 kernel:  try_to_free_pages+0xd2/0x190
> > Nov 15 21:55:13 21m21 kernel:  __alloc_pages_slowpath+0x3a3/0xdc0
> > Nov 15 21:55:13 21m21 kernel:  ? ip_output+0x5c/0xc0
> > Nov 15 21:55:13 21m21 kernel:  ? update_curr+0x141/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  __alloc_pages_nodemask+0x223/0x240
> > Nov 15 21:55:13 21m21 kernel:  skb_page_frag_refill+0x93/0xb0
> > Nov 15 21:55:13 21m21 kernel:  sk_page_frag_refill+0x19/0x80
> > Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg_locked+0x247/0xdc0
> > Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg+0x27/0x40
> > Nov 15 21:55:13 21m21 kernel:  sock_sendmsg+0x36/0x40
> > Nov 15 21:55:13 21m21 kernel:  sock_write_iter+0x84/0xd0
> > Nov 15 21:55:13 21m21 kernel:  __vfs_write+0xdd/0x140
> > Nov 15 21:55:13 21m21 kernel:  vfs_write+0xad/0x1a0
> > Nov 15 21:55:13 21m21 kernel:  SyS_write+0x42/0x90
> > Nov 15 21:55:13 21m21 kernel:  do_syscall_64+0x60/0x110
> > Nov 15 21:55:13 21m21 kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> >
> > Here "some-task" is trying to send some bytes over network and it's
> > stuck in direct reclaim. Naturally, kswapd is not keeping up with its
> > duties.
>
> That's not kswapd causing the problem here, that's direct reclaim.

I think direct reclaim happens concurrently with kswapd, see my
workload example.

> > It seems to me that our terrible disks sometimes take a pause to think
> > about the meaning of life for a few seconds.
>
> You've probably got lots of dirty inodes backed up waiting for IO.

I don't have lots of dirty inodes waiting on IO.

> How big are xfs inode slab caches (slabinfo output, please)? How big are the logs
> on the filesystems on the machine (xfs_info output, please)?

See my workload example.

> > During that time XFS
> > shrinker is stuck, which drives the whole system out of free memory
> > and in turns triggers direct reclaim.
>
> It happens the other way around. What you have here is direct
> reclaim trying to get the reclaim lock for a given AG to be able to
> reclaim inodes from it. direct reclaim only gets stuck here when all
> AGs are already being reclaimed from (i.e. already under direct
> reclaim).
>
> > One solution to this is to not go into direct reclaim by keeping more
> > free pages with vm.watermark_scale_factor,  but I'd like to discard
> > this and argue that we're going to hit direct reclaim at some point
> > anyway.
>
> Right, but the problem is that the mm/ subsystem allows effectively
> unbound direct reclaim concurrency. At some point, having tens to
> hundreds of direct reclaimers all trying to write dirty inodes to
> disk causes catastrophic IO breakdown and everything grinds to a
> halt forever. We have to prevent that breakdown from occurring.
>
> i.e. we have to throttle direct reclaim to before it reaches IO
> breakdown /somewhere/. The memory reclaim subsystem does not do it,
> so we have to do it in XFS itself. The problem here is that if we
> ignore direct reclaim (i.e do nothing rather than block waiting on
> reclaim progress) then the mm/ reclaim algorithms will eventually
> think they aren't making progress and unleash the OOM killer.
>
> > The solution I have in mind for this is not to try to write anything
> > to (disastrously terrible) storage in shrinkers. We have 95GB of page
> > cache readily available for reclaim and it seems a lot cheaper to grab
> > that.
>
> The mm/ subsystem tries to keep a balance between page cache and
> shrinker controlled caches. /proc/sys/vm/vfs_cache_pressure
> controls the balance between page cache and inode/dentry cache
> reclaim pressure..
>
> > That brings me to the first question around memory subsystem: are
> > shrinkers supposed to flush any dirty data?
>
> No shrinker does that, but if they wanted to they would be perfectly
> within their right to write back data.
>
> XFS is writing back /metadata/ here because direct reclaim is
> putting excessive pressure on the inode cache and not allowing the
> background cleaners to do their job.

Metadata is still data; XFS still writes it to the same block device.

> > My gut feeling is that
> > they should not do that, because there's already writeback mechanism
> > with own tunables for limits to take care of that. If a system runs
> > out of memory reclaimable without IO and dirty pages are under limit,
> > it's totally fair to OOM somebody.
> >
> > It's totally possible that I'm wrong about this feeling, but either
> > way I think docs need an update on this matter:
> >
> > * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
> >
> >   nr_cached_objects: called by the sb cache shrinking function for the
> >   filesystem to return the number of freeable cached objects it contains.
>
> You are assuming that "freeable" means "instantly freeable object",
> not "unreferenced object that can be freed in the near future". We
> use the latter definition in the shrinkers, not the former.

I'm only assuming things because the documentation leaves room for
interpretation. I would love to see this worded in a way that's
crystal clear and mentions the possibility of IO.

> > My second question is conditional on the first one: if filesystems are
> > supposed to flush dirty data in response to shrinkers, then how can I
> > stop this, given my knowledge about combination of lots of available
> > page cache and terrible disks?
>
> Filesystems have more caches that just the page cache.
>
> > I've tried two things to address this problem ad-hoc.
> >
> > 1. Run the following systemtap script to trick shrinkers into thinking
> > that XFS has nothing to free:
> >
> > probe kernel.function("xfs_fs_nr_cached_objects").return {
> >   $return = 0
> > }
> >
> > That did the job and shrink_node latency dropped considerably, while
> > calls to xfs_fs_free_cached_objects disappeared.
>
> Which effectively turned off direct reclaim for XFS inodes. See
> above - this just means that when you have no easily reclaimable
> page cache the system will OOM kill rather than wait for inodes to
> be reclaimed. i.e. it looks good until everything suddenly goes
> wrong and then everything dies a horrible death.

We have hundreds of gigabytes of page cache, and dirty pages are not
allowed to go near that mark. There's a separate limit for dirty data.

What I want is a way to tell the kernel not to try to flush data to
disk in response to reclaim, because that's choosing a very real
horrible life over an imaginary horrible death. I can't possibly
create enough dirty inodes to cause the horrible death you describe.
That may be the case for low-memory machines, but not for high-memory
ones (which are arguably a commodity these days).

Can we have a sysctl toggle for no IO during reclaim?

> > 2. Use vm.vfs_cache_pressure to do the same thing. This failed
> > miserably, because of the following code in super_cache_count:
> >
> >   if (sb->s_op && sb->s_op->nr_cached_objects)
> >     total_objects = sb->s_op->nr_cached_objects(sb, sc);
> >
> >   total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
> >   total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
> >
> >   total_objects = vfs_pressure_ratio(total_objects);
> >   return total_objects;
> >
> > XFS was doing its job cleaning up inodes with the background mechanims
> > it has (m_reclaim_workqueue), but kernel also stopped cleaning up
> > readily available inodes after XFS.
>
> You mean the kernel stopped handing readily available inodes to XFS
> to free.
>
> ie. The VFS has to release unreferenced inodes before they are
> handed to XFS to clean up and free. IOWs, I suspect you biased
> vm.vfs_cache_pressure to turn off inode reclaim to effectively do
> the same thing as above. However, by stopping the feed of
> unreferenced inodes to XFS you stopped inode reclaim altogether and
> that caused other problems.
>
> What I suspect you want to do is bias vfs_cache_pressure the other
> way, so that it /agressively/ feeds inode reclaim and so prevent
> large numbers of inodes from building up in the cache. That way
> reclaim won't have as many dirty inodes to block on when a sustained
> memory shortage drives large amounts of direct reclaim into the
> shrinkers...

No, I absolutely do not want to do that, because it still means writes
to disk in response to reclaim. That's the opposite of what I want.

> > I'm not a kernel hacker and to be honest with you I don't even
> > understand all the nuances here. All I know is:
> >
> > 1. I have lots of page cache and terrible disks.
> > 2. I want to reclaim page cache and never touch disks in response to
> > memory reclaim.
> > 3. Direct reclaim will happen at some point, somebody will want a big
> > chunk of memory all at once.
> > 4. I'm probably ok with reclaiming clean xfs inodes synchronously in
> > reclaim path.
> >
> > This brings me to my final question: what should I do to avoid latency
> > in reclaim (direct or kswapd)?
> >
> > To reiterate the importance of this issue: we see interactive
> > applications with zero IO stall for multiple seconds in writes to
> > non-blocking sockets and page faults on newly allocated memory, while
> > 95GB of memory is in page cache.
>
> It really just sounds like you are allowing too many dirty inodes to
> build up in memory. You have a huge amount of memory but really slow
> disks - that's just asking for idata and metadata writeback latency
> problems.

I don't care as much about idata and metadata writeback latency.
Writes are asynchronous as long as the dirty page limit is not
breached, and I don't breach it. What I care about is the system
freezing when there's tons of memory to reclaim from page cache,
because reclaim writes to storage synchronously.

> It sounds like, you need to cause inode cleaning to happen
> earlier/faster than it is happening now, because your disks are slow
> and so you can't allow a big backlog to build up.  Reducing
> /proc/sys/fs/xfs/xfssyncd_centisecs can get background metadata
> writeback started earlier. As per Documentation/filesystems/xfs.txt:
>
>   fs.xfs.xfssyncd_centisecs     (Min: 100  Default: 3000  Max: 720000)
>         The interval at which the filesystem flushes metadata
>         out to disk and runs internal cache cleanup routines.
>
> If this doesn't help, then I'd suggest taht your config is simply
> allowing too many dirty inodes to build up in memory.  With XFS, the
> number of dirty inodes that can be held in memory at any point in
> time is bound by the size of the XFS log in the filesystem. Hence if
> you have hundreds of MB of journal space, then you can cache
> hundreds of thousands of dirty inodes in memory and so can take tens
> of minutes to write them back to slow disks. So if earlier writeback
> doesn't help, I'd suggest that using smaller logs with your
> filesystems is the only real option here....

You are right that our config allows too many dirty inodes to build
up in memory. Unfortunately, "too many" here means any number larger
than zero, and lowering fs.xfs.xfssyncd_centisecs to 600 made no
difference in the real production workload.
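
For completeness, that's just the standard sysctl knob, which I set
at runtime with something like:

$ sudo sysctl fs.xfs.xfssyncd_centisecs=600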

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-29 22:22   ` Ivan Babrou
@ 2018-11-30  2:18     ` Dave Chinner
  2018-11-30  3:31       ` Ivan Babrou
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-30  2:18 UTC (permalink / raw)
  To: Ivan Babrou; +Cc: linux-xfs, Shawn Bohrer

On Thu, Nov 29, 2018 at 02:22:53PM -0800, Ivan Babrou wrote:
> On Wed, Nov 28, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > > Hello,
> > >
> > > We're experiencing some interesting issues with memory reclaim, both
> > > kswapd and direct reclaim.
> > >
> > > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> > > Page cache is around 95GB and dirty pages hover around 50MB, rarely
> > > jumping up to 1GB.
> >
> > What is your workload?
> 
> My test setup is an empty machine with 256GB of RAM booted from
> network into memory with just systemd essentials running.

What is your root filesystem?

> I create XFS on a 10TB drive (via LUKS), mount the drive and write
> 300GiB of randomness:
> 
> $ sudo mkfs.xfs /dev/mapper/luks-sda
> $ sudo mount /dev/mapper/luks-sda /mnt
> $ sudo dd if=/dev/urandom of=/mnt/300g.random bs=1M count=300K status=progress
> 
> Then I reboot and just mount the drive again to run my test workload:
> 
> $ dd if=/mnt/300g.random of=/dev/null bs=1M status=progress
> 
> After running it once and populating page cache I restart it to collect traces.

This isn't your production workload that is demonstrating problems -
it's your interpretation of the problem based on how you think
everything should work.

I need to know what the workload is so I can reproduce and observe
the latency problems myself. I do have some clue about how this is
all supposed to work, and I have a bunch of workloads that are known
to trigger severe memory-reclaim based IO breakdowns if memory
reclaim doesn't balance and throttle appropriately.

> Here's xfs_info:
> 
> $ sudo xfs_info /mnt
> meta-data=/dev/mapper/luks-sda   isize=512    agcount=10, agsize=268435455 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2441608704, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=521728, version=2

You've got a maximally sized log (2GB), so there's basically no bound on
dirty metadata in the filesystem.

> $ sudo cat /proc/slabinfo
....
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab>
> xfs_ili              144    144    168   48    2 : tunables    0    0
> xfs_inode            170    170    960   34    8 : tunables    0    0
> xfs_efd_item           0      0    416   39    4 : tunables    0    0
> xfs_buf_item         132    132    248   33    2 : tunables    0    0
> xfs_da_state           0      0    480   34    4 : tunables    0    0
> xfs_btree_cur        420    420    232   35    2 : tunables    0    0
> xfs_log_ticket       308    308    184   44    2 : tunables    0    0

That doesn't add up to a single XFS filesystem with 2 inodes in it.
Where are the other 168 cached XFS inodes coming from? And I note
that 144 of them are currently or have been previously dirtied, too.

> The following can easily happen (correct me if it can't for some reason):
> 
> 1. kswapd gets stuck because of slow storage and memory is not getting reclaimed
> 2. memory allocation doesn't have any free pages and kicks in direct reclaim
> 3. direct reclaim is stuck behind kswapd
> 
> I'm not sure why you say direct reclaim happens first, allocstall is zero.

Because I thought we were talking about your production workload
that you pasted stack traces from showing direct reclaim blocking.
When you have a highly concurrent workload which has tens to
hundreds of processes all producing memory pressure, dirtying files
and page cache, etc, direct reclaim is almost always occurring.

i.e. your artificial test workload doesn't tell me anything about
the problems you are seeing on your production systems....

> > > My gut feeling is that
> > > they should not do that, because there's already writeback mechanism
> > > with own tunables for limits to take care of that. If a system runs
> > > out of memory reclaimable without IO and dirty pages are under limit,
> > > it's totally fair to OOM somebody.
> > >
> > > It's totally possible that I'm wrong about this feeling, but either
> > > way I think docs need an update on this matter:
> > >
> > > * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
> > >
> > >   nr_cached_objects: called by the sb cache shrinking function for the
> > >   filesystem to return the number of freeable cached objects it contains.
> >
> > You are assuming that "freeable" means "instantly freeable object",
> > not "unreferenced object that can be freed in the near future". We
> > use the latter definition in the shrinkers, not the former.
> 
> I'm only assuming things because documentation leaves room for
> interpretation. I would love to this worded in a way that's crystal
> clear and mentions possibility of IO.

Send a patch. I wrote that years ago when all the people reviewing
the changes understood what "freeable" meant in the shrinker context.

> > > My second question is conditional on the first one: if filesystems are
> > > supposed to flush dirty data in response to shrinkers, then how can I
> > > stop this, given my knowledge about combination of lots of available
> > > page cache and terrible disks?
> >
> > Filesystems have more caches that just the page cache.
> >
> > > I've tried two things to address this problem ad-hoc.
> > >
> > > 1. Run the following systemtap script to trick shrinkers into thinking
> > > that XFS has nothing to free:
> > >
> > > probe kernel.function("xfs_fs_nr_cached_objects").return {
> > >   $return = 0
> > > }
> > >
> > > That did the job and shrink_node latency dropped considerably, while
> > > calls to xfs_fs_free_cached_objects disappeared.
> >
> > Which effectively turned off direct reclaim for XFS inodes. See
> > above - this just means that when you have no easily reclaimable
> > page cache the system will OOM kill rather than wait for inodes to
> > be reclaimed. i.e. it looks good until everything suddenly goes
> > wrong and then everything dies a horrible death.
> 
> We have hundreds of gigabytes of page cache, dirty pages are not
> allowed to go near that mark. There's a separate limit for dirty data.

Well, yes, but we're not talking about dirty data here - I'm
talking about what happens when we turn off reclaim for a cache
that can grow without bound.  I can only say "this is a bad idea in
general because...." as I have to make the code work for lots of
different workloads.

So while it might be a solution that works for your specific workload -
which I know nothing about because you haven't described it to me -
it's not a solution we can use for the general case.

> What I want to have is a way to tell the kernel to not try to flush
> data to disk in response to reclaim, because that's choosing a very
> real horrible life over imaginary horrible death.  I can't possibly
> create enough dirty inodes to cause the horrible death you describe.

Sure you can. Just keep filling memory with dirty inodes until the
log runs out of space. With disks that are as slow as you say, the
system will take hours to recover log space and return to decent
steady state performance, if it ever manages to catch up at all.

And this demonstrates the fact that there can be many causes of the
symptoms you are describing.  But without a description of the
production workload that is demonstrating problems, I cannot
reproduce it, do any root cause analysis, or even validate that your
analysis is correct.

So, please, rather than tell me what you think the problem is and
how it should be fixed, first describe the workload that is causing
problems in enough detail that I can reproduce it myself.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-30  2:18     ` Dave Chinner
@ 2018-11-30  3:31       ` Ivan Babrou
  2018-11-30  6:49         ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Ivan Babrou @ 2018-11-30  3:31 UTC (permalink / raw)
  To: david; +Cc: linux-xfs, Shawn Bohrer

On Thu, Nov 29, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Nov 29, 2018 at 02:22:53PM -0800, Ivan Babrou wrote:
> > On Wed, Nov 28, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > > > Hello,
> > > >
> > > > We're experiencing some interesting issues with memory reclaim, both
> > > > kswapd and direct reclaim.
> > > >
> > > > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> > > > Page cache is around 95GB and dirty pages hover around 50MB, rarely
> > > > jumping up to 1GB.
> > >
> > > What is your workload?
> >
> > My test setup is an empty machine with 256GB of RAM booted from
> > network into memory with just systemd essentials running.
>
> What is your root filesystem?

It's whatever initramfs unpacks into; there is no storage involved:

rootfs / rootfs rw,size=131930052k,nr_inodes=32982513 0 0

> > I create XFS on a 10TB drive (via LUKS), mount the drive and write
> > 300GiB of randomness:
> >
> > $ sudo mkfs.xfs /dev/mapper/luks-sda
> > $ sudo mount /dev/mapper/luks-sda /mnt
> > $ sudo dd if=/dev/urandom of=/mnt/300g.random bs=1M count=300K status=progress
> >
> > Then I reboot and just mount the drive again to run my test workload:
> >
> > $ dd if=/mnt/300g.random of=/dev/null bs=1M status=progress
> >
> > After running it once and populating page cache I restart it to collect traces.
>
> This isn't your production workload that is demonstrating problems -
> it's your interpretation of the problem based on how you think
> everything should work.

This is the simplest workload that exhibits the issue. My production
workload is similar: serve most of the data from page cache,
occasionally write some data to disk.

To make it more real in terms of how I think things work, let's add a
bit more fun with the following systemtap script:

$ cat derp.stap
%{
/* guru */
#undef STP_OVERLOAD
#include <linux/delay.h>
%}

function mdelay(ms:long) %{
  /* guru */
  mdelay(STAP_ARG_ms);
%}

probe kernel.function("xfs_bwrite").return {
  mdelay(30000)
}

This simulates very slow disk writes by making xfs_bwrite take 30s
longer. We only ever write to one empty file, so there's no need to
be very selective.
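
Since the script embeds C, it has to run in systemtap's guru mode,
something like:

$ sudo stap -g derp.stap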

I load the script above, fire up dd to fill up page cache and run my
touch command. This is what happens in dmesg:

Nov 30 03:15:31 myhost kernel: INFO: task kswapd1:1863 blocked for
more than 2 seconds.
Nov 30 03:15:31 myhost kernel:       Tainted: G           O
4.14.74-cloudflare-2018.10.1 #1
Nov 30 03:15:31 myhost kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 03:15:31 myhost kernel: kswapd1         D13856  1863      2 0x80000000
Nov 30 03:15:31 myhost kernel: Call Trace:
Nov 30 03:15:31 myhost kernel:  ? __schedule+0x21e/0x7f0
Nov 30 03:15:31 myhost kernel:  schedule+0x28/0x80
Nov 30 03:15:31 myhost kernel:  schedule_preempt_disabled+0xa/0x10
Nov 30 03:15:31 myhost kernel:  __mutex_lock.isra.3+0x2a8/0x490
Nov 30 03:15:31 myhost kernel:  ? xfs_perag_get_tag+0x2d/0xc0
Nov 30 03:15:31 myhost kernel:  xfs_reclaim_inodes_ag+0x27d/0x2c0
Nov 30 03:15:31 myhost kernel:  ? free_hot_cold_page_list+0x3e/0xa0
Nov 30 03:15:31 myhost kernel:  ? shrink_page_list+0x42c/0xb30
Nov 30 03:15:31 myhost kernel:  ? radix_tree_gang_lookup_tag+0xc3/0x140
Nov 30 03:15:31 myhost kernel:  ? __list_lru_walk_one.isra.7+0x33/0x140
Nov 30 03:15:31 myhost kernel:  xfs_reclaim_inodes_nr+0x31/0x40
Nov 30 03:15:31 myhost kernel:  super_cache_scan+0x152/0x1a0
Nov 30 03:15:31 myhost kernel:  shrink_slab.part.45+0x1d3/0x380
Nov 30 03:15:31 myhost kernel:  shrink_node+0x116/0x300
Nov 30 03:15:31 myhost kernel:  kswapd+0x3af/0x6b0
Nov 30 03:15:31 myhost kernel:  ? mem_cgroup_shrink_node+0x150/0x150
Nov 30 03:15:31 myhost kernel:  kthread+0x119/0x130
Nov 30 03:15:31 myhost kernel:  ? kthread_create_on_node+0x40/0x40
Nov 30 03:15:31 myhost kernel:  ret_from_fork+0x35/0x40
Nov 30 03:15:31 myhost kernel: INFO: task dd:15663 blocked for more
than 2 seconds.
Nov 30 03:15:31 myhost kernel:       Tainted: G           O
4.14.74-cloudflare-2018.10.1 #1
Nov 30 03:15:31 myhost kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 03:15:31 myhost kernel: dd              D13952 15663  13243 0x00000000
Nov 30 03:15:31 myhost kernel: Call Trace:
Nov 30 03:15:31 myhost kernel:  ? __schedule+0x21e/0x7f0
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel:  schedule+0x28/0x80
Nov 30 03:15:31 myhost kernel:  schedule_timeout+0x268/0x340
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x40/0x70
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x40/0x70
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel:  wait_for_completion+0x97/0x100
Nov 30 03:15:31 myhost kernel:  ? wake_up_q+0x70/0x70
Nov 30 03:15:31 myhost kernel:  flush_work+0xfc/0x190
Nov 30 03:15:31 myhost kernel:  ? flush_workqueue_prep_pwqs+0x130/0x130
Nov 30 03:15:31 myhost kernel:  drain_all_pages+0x140/0x190
Nov 30 03:15:31 myhost kernel:  __alloc_pages_slowpath+0x41b/0xd50
Nov 30 03:15:31 myhost kernel:  ? blk_peek_request+0x62/0x280
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x40/0x70
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x40/0x70
Nov 30 03:15:31 myhost kernel:  __alloc_pages_nodemask+0x235/0x250
Nov 30 03:15:31 myhost kernel:  __do_page_cache_readahead+0xcd/0x210
Nov 30 03:15:31 myhost kernel:  ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel:  ondemand_readahead+0x1f9/0x2c0
Nov 30 03:15:31 myhost kernel:  generic_file_read_iter+0x6c1/0x9c0
Nov 30 03:15:31 myhost kernel:  ? page_cache_tree_insert+0x90/0x90
Nov 30 03:15:31 myhost kernel:  xfs_file_buffered_aio_read+0x47/0xe0
Nov 30 03:15:31 myhost kernel:  xfs_file_read_iter+0x5e/0xb0
Nov 30 03:15:31 myhost kernel:  __vfs_read+0xee/0x130
Nov 30 03:15:31 myhost kernel:  vfs_read+0x8a/0x140
Nov 30 03:15:31 myhost kernel:  SyS_read+0x3f/0xa0
Nov 30 03:15:31 myhost kernel:  do_syscall_64+0x60/0xf0
Nov 30 03:15:31 myhost kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Nov 30 03:15:31 myhost kernel: RIP: 0033:0x7f8d9264e6d0
Nov 30 03:15:31 myhost kernel: RSP: 002b:00007ffdf8842998 EFLAGS:
00000246 ORIG_RAX: 0000000000000000
Nov 30 03:15:31 myhost kernel: RAX: ffffffffffffffda RBX:
0000562ac05b2400 RCX: 00007f8d9264e6d0
Nov 30 03:15:31 myhost kernel: RDX: 0000000000100000 RSI:
00007f8d92a2a000 RDI: 0000000000000000
Nov 30 03:15:31 myhost kernel: RBP: 0000000000100000 R08:
0035153337c601da R09: 00007ffdf8888080
Nov 30 03:15:31 myhost kernel: R10: 0000000000a7aa68 R11:
0000000000000246 R12: 00007f8d92a2a000
Nov 30 03:15:31 myhost kernel: R13: 0000000000000000 R14:
0000000000000000 R15: 0000000000000000

The way I read this: kswapd got stuck behind one stalled IO, which
later triggered direct reclaim from dd, and that got stuck as well.

There were 250GB of page cache to reclaim, and yet one unlucky write
made everything sad. The stack trace is somewhat different, but it
illustrates the idea of one write blocking reclaim.

> I need to know what the workload is so I can reproduce and observe a
> the latency problems myself. I do have some clue abou thow this is
> all supposed to work, and I have abunch of workloads that are known
> to trigger severe memory-reclaim based IO breakdowns if memory
> reclaim doesn't balance and throttle appropriately.

I believe that you do have a clue and workloads. The problem is: those
workloads are not mine.

> > Here's xfs_info:
> >
> > $ sudo xfs_info /mnt
> > meta-data=/dev/mapper/luks-sda   isize=512    agcount=10, agsize=268435455 blks
> >          =                       sectsz=4096  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=2441608704, imaxpct=5
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal               bsize=4096   blocks=521728, version=2
>
> You've got a maximally sized log (2GB), so there's basically no bound on
> dirty metadata in the filesystem.

I didn't do anything special; mkfs.xfs picked all the values for me.
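
If a smaller log really is the answer, I assume it would have to be
requested explicitly at mkfs time, something along the lines of:

$ sudo mkfs.xfs -l size=128m /dev/mapper/luks-sda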

> > $ sudo cat /proc/slabinfo
> ....
> > slabinfo - version: 2.1
> > # name            <active_objs> <num_objs> <objsize> <objperslab>
> > xfs_ili              144    144    168   48    2 : tunables    0    0
> > xfs_inode            170    170    960   34    8 : tunables    0    0
> > xfs_efd_item           0      0    416   39    4 : tunables    0    0
> > xfs_buf_item         132    132    248   33    2 : tunables    0    0
> > xfs_da_state           0      0    480   34    4 : tunables    0    0
> > xfs_btree_cur        420    420    232   35    2 : tunables    0    0
> > xfs_log_ticket       308    308    184   44    2 : tunables    0    0
>
> That doesn't add up to a single XFS filesystem with 2 inodes in it.
> where are the other 168 cached XFS inodes coming from? And I note
> that 144 of them are currently or have been previously dirtied, too.

I don't know about this; my only guess is that some of my previous
tests with touch created those.

> > The following can easily happen (correct me if it can't for some reason):
> >
> > 1. kswapd gets stuck because of slow storage and memory is not getting reclaimed
> > 2. memory allocation doesn't have any free pages and kicks in direct reclaim
> > 3. direct reclaim is stuck behind kswapd
> >
> > I'm not sure why you say direct reclaim happens first, allocstall is zero.
>
> Because I thought we were talking about your production workload
> that you pasted stack traces from showing direct reclaim blocking.
> When you have a highly concurrent workload which has tens to
> hundreds of processes all producing memory pressure, dirtying files
> and page cache, etc, direct reclaim is almost always occurring.
>
> i.e. your artificial test workload doesn't tell me anything about
> the problems you are seeing on your production systems....
>
> > > > My gut feeling is that
> > > > they should not do that, because there's already writeback mechanism
> > > > with own tunables for limits to take care of that. If a system runs
> > > > out of memory reclaimable without IO and dirty pages are under limit,
> > > > it's totally fair to OOM somebody.
> > > >
> > > > It's totally possible that I'm wrong about this feeling, but either
> > > > way I think docs need an update on this matter:
> > > >
> > > > * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
> > > >
> > > >   nr_cached_objects: called by the sb cache shrinking function for the
> > > >   filesystem to return the number of freeable cached objects it contains.
> > >
> > > You are assuming that "freeable" means "instantly freeable object",
> > > not "unreferenced object that can be freed in the near future". We
> > > use the latter definition in the shrinkers, not the former.
> >
> > I'm only assuming things because documentation leaves room for
> > interpretation. I would love to this worded in a way that's crystal
> > clear and mentions possibility of IO.
>
> Send a patch. I wrote that years ago when all the people reviewing
> the changes understood what "freeable" meant in the shrinker context.

Would you be happy with this patch?

diff --git a/Documentation/filesystems/vfs.txt
b/Documentation/filesystems/vfs.txt
index 5fd325df59e2..297400842152 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -301,6 +301,7 @@ or bottom half).

   nr_cached_objects: called by the sb cache shrinking function for the
        filesystem to return the number of freeable cached objects it contains.
+       Freeable objects may include dirty ones requiring write IO to be freed.
        Optional.

   free_cache_objects: called by the sb cache shrinking function for the
@@ -319,6 +320,8 @@ or bottom half).
        implementations will cause holdoff problems due to large scan batch
        sizes.

+       Implementations may issue write IO to free objects.
+
 Whoever sets up the inode is responsible for filling in the "i_op" field. This
 is a pointer to a "struct inode_operations" which describes the methods that
 can be performed on individual inodes.


> > > > My second question is conditional on the first one: if filesystems are
> > > > supposed to flush dirty data in response to shrinkers, then how can I
> > > > stop this, given my knowledge about combination of lots of available
> > > > page cache and terrible disks?
> > >
> > > Filesystems have more caches that just the page cache.
> > >
> > > > I've tried two things to address this problem ad-hoc.
> > > >
> > > > 1. Run the following systemtap script to trick shrinkers into thinking
> > > > that XFS has nothing to free:
> > > >
> > > > probe kernel.function("xfs_fs_nr_cached_objects").return {
> > > >   $return = 0
> > > > }
> > > >
> > > > That did the job and shrink_node latency dropped considerably, while
> > > > calls to xfs_fs_free_cached_objects disappeared.
> > >
> > > Which effectively turned off direct reclaim for XFS inodes. See
> > > above - this just means that when you have no easily reclaimable
> > > page cache the system will OOM kill rather than wait for inodes to
> > > be reclaimed. i.e. it looks good until everything suddenly goes
> > > wrong and then everything dies a horrible death.
> >
> > We have hundreds of gigabytes of page cache, dirty pages are not
> > allowed to go near that mark. There's a separate limit for dirty data.
>
> Well, yes, but we're not talking about dirty data here - I'm
> talking about what happens when we turn off reclaim for a cache
> that can grow without bound.  I can only say "this is a bad idea in
> general because...." as I have to make the code work for lots of
> different workloads.
>
> So while it might be a solution work for your specific workload -
> which I know nothing about because you haven't described it to me -
> it's not a solution we can use for the general case.

Facebook is still running with their patch that does async reclaim, so
my workload is not that special.

> > What I want to have is a way to tell the kernel to not try to flush
> > data to disk in response to reclaim, because that's choosing a very
> > real horrible life over imaginary horrible death.  I can't possibly
> > create enough dirty inodes to cause the horrible death you describe.
>
> Sure you can. Just keep filling memory with dirty inodes until the
> log runs out of space. With disks that are as slow as you say, the
> system will take hours to recover log space and return to decent
> steady state performance, if it ever manages to catch up at all.

Slow is a broad definition. Disks may sometimes get stuck for some IO,
but overall they are very much capable of keeping up with background
writes.

I mentioned that we typically have ~50MB of dirty pages and very low
write demand.

Shouldn't dirty inodes also be accounted as dirty pages?
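
The closest proxy for dirty inodes I can see is the xfs_ili (inode
log item) slab, i.e. inodes that are or have been dirtied, which is
easy to watch with something like:

$ grep -E '^xfs_(ili|inode) ' /proc/slabinfo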

> And this deomnstrates the fact that there can be many causes of the
> symptoms you are describing.  But without a description of the
> production workload that is demonstrating problems, I cannot
> reproduce it, do any root cause analysis, or even validate that your
> analysis is correct.
>
> So, please, rather than tell me what you think the problem is and
> how it should be fixed, frist describe the workload that is causing
> problems in enough detail that I can reproduce it myself.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-30  3:31       ` Ivan Babrou
@ 2018-11-30  6:49         ` Dave Chinner
  2018-11-30  7:45           ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-30  6:49 UTC (permalink / raw)
  To: Ivan Babrou; +Cc: linux-xfs, Shawn Bohrer

On Thu, Nov 29, 2018 at 07:31:44PM -0800, Ivan Babrou wrote:
> On Thu, Nov 29, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Nov 29, 2018 at 02:22:53PM -0800, Ivan Babrou wrote:
> > > On Wed, Nov 28, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > > > > Hello,
> > > > >
> > > > > We're experiencing some interesting issues with memory reclaim, both
> > > > > kswapd and direct reclaim.
> > > > >
> > > > > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> > > > > Page cache is around 95GB and dirty pages hover around 50MB, rarely
> > > > > jumping up to 1GB.
> > > >
> > > > What is your workload?
> > >
> > > My test setup is an empty machine with 256GB of RAM booted from
> > > network into memory with just systemd essentials running.
> >
> > What is your root filesystem?
> 
> It's whatever initramfs unpacks into, there is no storage involved:
> 
> rootfs / rootfs rw,size=131930052k,nr_inodes=32982513 0 0
> 
> > > I create XFS on a 10TB drive (via LUKS), mount the drive and write
> > > 300GiB of randomness:
> > >
> > > $ sudo mkfs.xfs /dev/mapper/luks-sda
> > > $ sudo mount /dev/mapper/luks-sda /mnt
> > > $ sudo dd if=/dev/urandom of=/mnt/300g.random bs=1M count=300K status=progress
> > >
> > > Then I reboot and just mount the drive again to run my test workload:
> > >
> > > $ dd if=/mnt/300g.random of=/dev/null bs=1M status=progress
> > >
> > > After running it once and populating page cache I restart it to collect traces.
> >
> > This isn't your production workload that is demonstrating problems -
> > it's your interpretation of the problem based on how you think
> > everything should work.
> 
> This is the simplest workload that exhibits the issue. My production
> workload is similar: serve most of the data from page cache,
> occasionally write some data on disk.

That's not a description of the workload that is generating delays
in the order of tens of seconds....

> To make it more real in terms of how I think things work, let's add a
> bit more fun with the following systemtap script:

Stop it already. I know exactly what can happen in that path and
you're not helping me at all by trying to tell me what you think the
problem is over and over again. I don't care what you think is the
problem - I want to know how to reproduce the symptoms you are
seeing.

I need to know what load your systems are being placed under when
they demonstrate the worst case latencies. I need to know about
CPU load, memory pressure, concurrency, read rates, write
rates, the actual physical IO rates and patterns. It also helps to
know about the hardware, your storage setup and the type of physical
disks you are using.

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

[snip traces]

> The way I read this: kswapd got stuck behind one stalled io, that
> triggered later direct reclaim from dd, which also got stuck.

No, that's a stall waiting for the per-cpu free page magazines to
drain back into the free list. i.e. it's waiting on other CPUs to do
scheduled work that has absolutely nothing to do with kswapd and
the XFS shrinker waiting on IO to complete. Indeed:

/*
 * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
 *
 * When zone parameter is non-NULL, spill just the single zone's pages.
 *
 * Note that this can be extremely slow as the draining happens in a workqueue.
 */
void drain_all_pages(struct zone *zone)
....

> 250GB of page cache to reclaim and yet one unlucky write made
> everything sad. Stack trace is somewhat different, but it illustrates
> the idea of one write blocking reclaim.

No, they aren't related at all. There are a lot of places where
memory reclaim can block, and very few of them have anything to do
with shrinkers and filesystems.

> > the latency problems myself. I do have some clue abou thow this is
> > all supposed to work, and I have abunch of workloads that are known
> > to trigger severe memory-reclaim based IO breakdowns if memory
> > reclaim doesn't balance and throttle appropriately.
> 
> I believe that you do have a clue and workloads. The problem is: those
> workloads are not mine.

Yes, but you won't tell me what your workload is.

> > > I'm only assuming things because documentation leaves room for
> > > interpretation. I would love to this worded in a way that's crystal
> > > clear and mentions possibility of IO.
> >
> > Send a patch. I wrote that years ago when all the people reviewing
> > the changes understood what "freeable" meant in the shrinker context.
> 
> Would you be happy with this patch?

No objections here. Send it to -fsdevel with the appropriate
sign-off.

> > > > above - this just means that when you have no easily reclaimable
> > > > page cache the system will OOM kill rather than wait for inodes to
> > > > be reclaimed. i.e. it looks good until everything suddenly goes
> > > > wrong and then everything dies a horrible death.
> > >
> > > We have hundreds of gigabytes of page cache, dirty pages are not
> > > allowed to go near that mark. There's a separate limit for dirty data.
> >
> > Well, yes, but we're not talking about dirty data here - I'm
> > talking about what happens when we turn off reclaim for a cache
> > that can grow without bound.  I can only say "this is a bad idea in
> > general because...." as I have to make the code work for lots of
> > different workloads.
> >
> > So while it might be a solution work for your specific workload -
> > which I know nothing about because you haven't described it to me -
> > it's not a solution we can use for the general case.
> 
> Facebook is still running with their patch that does async reclaim, so
> my workload is not that special.

Your workload is (apparently) *very different* to the FB workload.
Their workload was that filesystem traversal was causing severe
fluctuations in inode cache pressure - the huge numbers of inodes
that were being cycled in and out of memory were causing problems
when kswapd occasionally blocked on a single inode out of millions.
They were chasing occasional long tail latencies on heavily loaded,
highly concurrent workloads, not delays measured in multiples of
tens of seconds. i.e. their latency spikes were 50-100ms instead of
a typical 5ms. That's very, very different to what you are reporting
and the workload you are implying is causing them.

And, really, I know exactly what the FB workload is because the FB
devs wrote a workload simulator to reproduce the problem offline. So
while I could reproduce the latency spikes they saw and could make
changes to get rid of them, it always came at the expense of
dramatic regressions on other memory pressure workloads I have.

IOWs, the change that FB made for their specific workload hid the
latency issue for their workload, but did not benefit any other
workloads. Maybe that change would also work for you, but it may not
and it may, in fact, make things worse. But I can't say, because I
don't know how to reproduce your problem so I can't test the FB
change against it.

Seriously: describe your workload in detail for me so I can write a
reproducer for it. Without that I cannot help you any further and I
am just wasting my time asking you to describe the workload over
and over again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-30  6:49         ` Dave Chinner
@ 2018-11-30  7:45           ` Dave Chinner
  2018-12-19 22:15             ` Ivan Babrou
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-30  7:45 UTC (permalink / raw)
  To: Ivan Babrou; +Cc: linux-xfs, Shawn Bohrer

On Fri, Nov 30, 2018 at 05:49:08PM +1100, Dave Chinner wrote:
> Seriously: describe your workload in detail for me so I can write a
> reproducer for it. Without that I cannot help you any further and I
> am just wasting my time asking you to describe the workload over
> and over again.

FWIW, here's the discussion that about the FB issue. Go read it,
the first few emails are pretty much the same as this thread so far.

https://www.spinics.net/lists/linux-xfs/msg01541.html

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-30  7:45           ` Dave Chinner
@ 2018-12-19 22:15             ` Ivan Babrou
  2018-12-21  4:00               ` Kenton Varda
  2019-01-02 10:34               ` Arkadiusz Miśkiewicz
  0 siblings, 2 replies; 16+ messages in thread
From: Ivan Babrou @ 2018-12-19 22:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Shawn Bohrer

We're sticking with the following patch that allows runtime switching
between XFS memory reclaim strategies:

* https://github.com/bobrik/linux/pull/2

There are some tests and graphs describing the issue and how it can be solved.

Let me know if you think this can be incorporated upstream, I'm fine if not.

On Thu, Nov 29, 2018 at 11:45 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Nov 30, 2018 at 05:49:08PM +1100, Dave Chinner wrote:
> > Seriously: describe your workload in detail for me so I can write a
> > reproducer for it. Without that I cannot help you any further and I
> > am just wasting my time asking you to describe the workload over
> > and over again.
>
> FWIW, here's the discussion that about the FB issue. Go read it,
> the first few emails are pretty much the same as this thread so far.
>
> https://www.spinics.net/lists/linux-xfs/msg01541.html
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-19 22:15             ` Ivan Babrou
@ 2018-12-21  4:00               ` Kenton Varda
  2018-12-25 23:47                 ` Dave Chinner
  2019-01-02 10:34               ` Arkadiusz Miśkiewicz
  1 sibling, 1 reply; 16+ messages in thread
From: Kenton Varda @ 2018-12-21  4:00 UTC (permalink / raw)
  To: Ivan Babrou; +Cc: Dave Chinner, linux-xfs, Shawn Bohrer

Hi all,

I'm a coworker of Ivan's and wanted to add some background here -- in
particular to answer Dave's question about our workload.

For the purpose of this discussion, we can describe our workload as a
giant, glorified HTTP caching proxy. (We receive HTTP requests in. We
check if we have a cached response. If so, we return it to the client,
otherwise we forward the request on to its "origin" server. When the
origin responds, if the response is cacheable, we save it, and either
way we return it to the client.)

Roughly speaking, each HTTP cache entry is stored as a file on disk.
Hence, we have a very large number of files with files being added and
removed frequently. We also rely heavily on page cache for
performance, rather than some more complicated database scheme.

The HTTP requests we serve almost always come from live end users
interacting with a web site. So, any kind of delay means someone is
sitting and waiting. When delays get up over 15 seconds, we start
hitting timeouts, meaning someone's web site doesn't load at all or
loads "broken". Also note that any particular machine may serve
thousands of requests per second, so blocking one machine may affect
thousands of users.

When XFS blocks direct reclaim, our service pretty much grinds to a
halt on that machine, because everything is trying to allocate memory
all the time. For example, as alluded by the subject of this thread,
writing to a socket allocates memory, and thus will block waiting for
XFS to write back inodes. What we find really frustrating is that we
almost always have over 100GB of clean page cache that could be
reclaimed immediately, without blocking, yet we end up waiting for the
much-smaller inode cache to be written back to disk.

We really can't accept random multi-second pauses. Our current plan is
to roll out the patch Ivan linked to. But, if you have any other
suggestions, we'd love to hear them. It would be great if we could
agree on an upstream solution, and maybe solve Facebook's problem too.

Hope that helps elucidate things.

Thanks,
-Kenton


On Wed, Dec 19, 2018 at 2:15 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> We're sticking with the following patch that allows runtime switching
> between XFS memory reclaim strategies:
>
> * https://github.com/bobrik/linux/pull/2
>
> There are some tests and graphs describing the issue and how it can be solved.
>
> Let me know if you think this can be incorporated upstream, I'm fine if not.
>
> On Thu, Nov 29, 2018 at 11:45 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Nov 30, 2018 at 05:49:08PM +1100, Dave Chinner wrote:
> > > Seriously: describe your workload in detail for me so I can write a
> > > reproducer for it. Without that I cannot help you any further and I
> > > am just wasting my time asking you to describe the workload over
> > > and over again.
> >
> > FWIW, here's the discussion that about the FB issue. Go read it,
> > the first few emails are pretty much the same as this thread so far.
> >
> > https://www.spinics.net/lists/linux-xfs/msg01541.html
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-21  4:00               ` Kenton Varda
@ 2018-12-25 23:47                 ` Dave Chinner
  2018-12-26  3:16                   ` Kenton Varda
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-12-25 23:47 UTC (permalink / raw)
  To: Kenton Varda; +Cc: Ivan Babrou, linux-xfs, Shawn Bohrer

Hi Kenton,

On Thu, Dec 20, 2018 at 08:00:21PM -0800, Kenton Varda wrote:
> When XFS blocks direct reclaim, our service pretty much grinds to a
> halt on that machine, because everything is trying to allocate memory
> all the time. For example, as alluded by the subject of this thread,
> writing to a socket allocates memory, and thus will block waiting for
> XFS to write back inodes. What we find really frustrating is that we
> almost always have over 100GB of clean page cache that could be
> reclaimed immediately, without blocking, yet we end up waiting for the
> much-smaller inode cache to be written back to disk.

Sure, it's frustrating. It frustrates me that I've been trying for
years to get memory reclaim behaviour changed so that we don't have
to do this, but we are still stuck with it.

But taking out your frustrations on the people who are trying to fix
the problems you are seeing isn't productive. We are only a small
team and we can't fix every problem that everyone reports
immediately. Some things take time to fix.

> We really can't accept random multi-second pauses.

And that's the problem here. For the vast majority of XFS users, the
alternative (i.e. not blocking reclaim) leads to substantially lower
performance and the high risk of premature OOM kills. It basically
moves the reclaim blocking problem to a different context (i.e. to
journal space allocation) and that has even worse latency and has
global filesystem scope rather than just the process doing reclaim.

IOWs, there are relatively few applications that have such a
significant dependency on memory reclaim having extremely low
latency, but there are a lot that are dependent on memory reclaim
being throttled harshly to keep the filesystem out of even worse
breakdown conditions.....

> Our current plan is
> to roll out the patch Ivan linked to.

Which is perfectly fine by me. I read the link, and it looks like it
works just fine in your environment. In contrast, I ran the same
patch on my performance benchmarks and saw a 15-30% degradation in
performance on my inode cache heavy and mixed inode/page cache heavy
memory pressure benchmarks. IOWs, that change still doesn't work for
XFS in general.

This is the beauty of open source software - you can make tweaks to
it to refine it for your specific workloads easily when such changes
aren't really suitable for the wider user base.  I encourage people
to make tweaks like this to optimise their systems for their workloads.

However, I also encourage people to then discuss with upstream the
problems that lead to needing such tweaks, so we are aware of the
issues and can work towards either incorporating them or modifying
the infrastructure to avoid them altogether. Further, there's no need to
be combative or get upset when upstream determines that the tweak
isn't generally applicable or is hiding something deeper that needs
fixing.

All it means is that the upstream developers know there is a deeper
underlying problem and want to fix that problem rather than just hide
it for a specific workload.  The fact that we know there's a problem
(and that it affects more than one workload situation) helps us
prioritise what we need to fix.

> But, if you have any other
> suggestions, we'd love to hear them. It would be great if we could
> agree on an upstream solution, and maybe solve Facebook's problem too.

I've already mentioned to Shawn the things we are working on to keep
the ball rolling on this, e.g. Darrick's pipelined background inode
inactivation work:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=85baea68d8e87803e6831bda7b5a3773cf0d8820

i.e. as I indicated above, inode reclaim can block on lots more
things than just inode writeback. We can do transactions in inode
reclaim, so we can block on log space and that might require
*thousands* of IO to be completed before it can make progress. The
async kswapd mod doesn't address this problem at all because
inactivation occurs in the prune_icache() VFS shrinker context (i.e.
->destroy_inode()) rather than the XFS inode cache context that the
FB and CF patches address.
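
For context, a simplified sketch of that VFS-side path, loosely
paraphrased from the 4.x-era fs/xfs/xfs_super.c (not a verbatim
excerpt):

static void
xfs_fs_destroy_inode(
	struct inode		*inode)
{
	struct xfs_inode	*ip = XFS_I(inode);

	/*
	 * Inactivation happens here, in the VFS prune_icache() shrinker
	 * context.  If the inode was unlinked or has blocks to free,
	 * this starts transactions and can block on journal space, so
	 * an async tweak to the XFS inode cache shrinker alone does not
	 * remove this source of blocking.
	 */
	xfs_inactive(ip);

	/* Only now is the inode handed to XFS's own reclaim machinery. */
	xfs_inode_set_reclaim_tag(ip);
}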

Also, the upcoming inode cache xarray conversion will provide us
with many more per-inode state tags in the xarray that will allow us
to track and execute multiple different post-VFS reclaim inode
states directly.  This will allow us to efficiently separate inodes
that need transactions and/or IO to be reclaimed from inodes that
can be reclaimed immediately, and it will allow efficient concurrent
async processing of inodes that need IO to be reclaimed.
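
Purely as an illustration of the tag-driven walking idea -- not the
planned XFS code; the mark names and helpers below are made up:

#include <linux/xarray.h>

#define RECLAIM_NO_IO		XA_MARK_0	/* reclaimable without IO or transactions */
#define RECLAIM_NEEDS_IO	XA_MARK_1	/* needs writeback or a transaction first */

/* Record what the inode at index 'ino' needs before it can be reclaimed. */
static void tag_for_reclaim(struct xarray *ici, unsigned long ino, bool needs_io)
{
	xa_set_mark(ici, ino, needs_io ? RECLAIM_NEEDS_IO : RECLAIM_NO_IO);
}

/* Walk only the entries that can be reclaimed immediately, skipping the rest. */
static unsigned long count_immediately_reclaimable(struct xarray *ici)
{
	unsigned long ino, n = 0;
	void *entry;

	xa_for_each_marked(ici, ino, entry, RECLAIM_NO_IO)
		n++;
	return n;
}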

IOWs, we're trying to solve *all* the blocking problems that we know
can occur in inode reclaim so that it all just works for
everyone without tweaks being necessary. Yes, this takes longer than
just addressing the specific symptom that is causing you problems,
but the reality is while fixing things properly takes time to get
right, everyone will benefit from it being fixed and not just one or
two very specific, latency sensitive workloads.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-25 23:47                 ` Dave Chinner
@ 2018-12-26  3:16                   ` Kenton Varda
  2018-12-29 19:05                     ` Darrick J. Wong
  2019-01-01 23:48                     ` Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread
From: Kenton Varda @ 2018-12-26  3:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ivan Babrou, linux-xfs, Shawn Bohrer

On Tue, Dec 25, 2018 at 3:47 PM Dave Chinner <david@fromorbit.com> wrote:
> But taking out your frustrations on the people who are trying to fix
> the problems you are seeing isn't productive. We are only a small
> team and we can't fix every problem that everyone reports
> immediately. Some things take time to fix.

I agree. My hope is that explaining our use case helps you make XFS
better, but you don't owe us anything. It's our problem to solve and
any help you give us is a favor.

> IOWs, there are relatively few applications that have such a
> significant dependency on memory reclaim having extremely low
> latency,

Hmm, I'm confused by this. Isn't low-latency memory allocation a
common requirement for any kind of interactive workload? I don't see
what's unique about our use case in this respect. Any desktop and most
web servers I would think have similar requirements.

I'm sure there's something about our use case that's unusual, but it
doesn't seem to me that requiring low-latency memory allocation is
unique.

Maybe the real thing that's odd about us is that we constantly create
and delete files at a high rate, and that means we have an excessive
number of dirty inodes to flush?
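
Something like this minimal churn loop approximates that pattern
(purely illustrative -- the file count, sizes and naming are made up,
not our actual workload):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Endlessly create, write and unlink small files so the inode cache
 * keeps filling with dirty/unlinked inodes while very little page
 * cache accumulates.  Run it in a directory on the XFS filesystem. */
int main(void)
{
	char path[64], buf[4096];

	memset(buf, 'x', sizeof(buf));
	for (unsigned long i = 0; ; i++) {
		snprintf(path, sizeof(path), "churn.%lu", i % 10000);
		int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		for (int j = 0; j < 16; j++)	/* ~64KB per file */
			if (write(fd, buf, sizeof(buf)) < 0)
				perror("write");
		close(fd);
		unlink(path);
	}
}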

> IOWs, we're trying to solve *all* the blocking problems that we know
> can occur in inode reclaim so that it all just works for
> everyone without tweaks being necessary. Yes, this takes longer than
> just addressing the specific symptom that is causing you problems,
> but the reality is while fixing things properly takes time to get
> right, everyone will benefit from it being fixed and not just one or
> two very specific, latency sensitive workloads.

Great, it's good to hear that this problem is expected to be fixed
eventually. We can patch our way around it in the meantime.

-Kenton

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-26  3:16                   ` Kenton Varda
@ 2018-12-29 19:05                     ` Darrick J. Wong
  2019-01-01 23:48                     ` Dave Chinner
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2018-12-29 19:05 UTC (permalink / raw)
  To: Kenton Varda; +Cc: Dave Chinner, Ivan Babrou, linux-xfs, Shawn Bohrer

On Tue, Dec 25, 2018 at 07:16:25PM -0800, Kenton Varda wrote:
> On Tue, Dec 25, 2018 at 3:47 PM Dave Chinner <david@fromorbit.com> wrote:
> > But taking out your frustrations on the people who are trying to fix
> > the problems you are seeing isn't productive. We are only a small
> > team and we can't fix every problem that everyone reports
> > immediately. Some things take time to fix.
> 
> I agree. My hope is that explaining our use case helps you make XFS
> better, but you don't owe us anything. It's our problem to solve and
> any help you give us is a favor.
> 
> > IOWs, there are relatively few applications that have such a
> > significant dependency on memory reclaim having extremely low
> > latency,
> 
> Hmm, I'm confused by this. Isn't low-latency memory allocation a
> common requirement for any kind of interactive workload? I don't see
> what's unique about our use case in this respect. Any desktop and most
> web servers I would think have similar requirements.
> 
> I'm sure there's something about our use case that's unusual, but it
> doesn't seem to me that requiring low-latency memory allocation is
> unique.
> 
> Maybe the real thing that's odd about us is that we constantly create
> and delete files at a high rate, and that means we have an excessive
> number of dirty inodes to flush?
> 
> > IOWs, we're trying to solve *all* the blocking problems that we know
> > can occur in inode reclaim so that it all just works for
> > everyone without tweaks being necessary. Yes, this takes longer than
> > just addressing the specific symptom that is causing you problems,
> > but the reality is while fixing things properly takes time to get
> > right, everyone will benefit from it being fixed and not just one or
> > two very specific, latency sensitive workloads.
> 
> Great, it's good to hear that this problem is expected to be fixed
> eventually. We can patch our way around it in the meantime.

FWIW I /was/ planning to patchbomb every feature that's sitting around
in my xfs development tree on NYE for everyone's enjoyment^Wreview. ;)

Concretely, those features are:

- Scrub fixes
- The easy(ier) parts of online repair
- Deferred inode inactivation (i.e. the thing you're talking about)
- The hard parts of online repair
- Hoisting inode operations to libxfs
- Metadata inode directory tree
- Reverse mapping for realtime devices

--D

> -Kenton

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-26  3:16                   ` Kenton Varda
  2018-12-29 19:05                     ` Darrick J. Wong
@ 2019-01-01 23:48                     ` Dave Chinner
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2019-01-01 23:48 UTC (permalink / raw)
  To: Kenton Varda; +Cc: Ivan Babrou, linux-xfs, Shawn Bohrer

On Tue, Dec 25, 2018 at 07:16:25PM -0800, Kenton Varda wrote:
> On Tue, Dec 25, 2018 at 3:47 PM Dave Chinner <david@fromorbit.com> wrote:
> > But taking out your frustrations on the people who are trying to fix
> > the problems you are seeing isn't productive. We are only a small
> > team and we can't fix every problem that everyone reports
> > immediately. Some things take time to fix.
> 
> I agree. My hope is that explaining our use case helps you make XFS
> better, but you don't owe us anything. It's our problem to solve and
> any help you give us is a favor.
> 
> > IOWs, there are relatively few applications that have such a
> > significant dependency on memory reclaim having extremely low
> > latency,
> 
> Hmm, I'm confused by this. Isn't low-latency memory allocation a
> common requirement for any kind of interactive workload?

"interactive" tends to mean "human does not see noticable
delays" which means acceptible latency for an operation is measured
in hundreds of milliseconds, not microseconds.

And it's relatively rare for interactive users to have heavily
overloaded IO subsystems such that a single IO takes more than a
couple of hundred milliseconds, let alone have enough memory demand
and concurrent memory reclaim IO that direct reclaim backs up for
seconds on it.

> I don't see
> what's unique about our use case in this respect. Any desktop and most
> web servers I would think have similar requirements.

Interactive latency deficiencies are almost always caused by "need
to get something off disk", not memory reclaim.  And even when it is
caused by "memory reclaim needs to write something", that tends to
mean the "get something from disk" latency is even higher and more
noticeable....

> I'm sure there's something about our use case that's unusual, but it
> doesn't seem to me that requiring low-latency memory allocation is
> unique.
>
> Maybe the real thing that's odd about us is that we constantly create
> and delete files at a high rate, and that means we have an excessive
> number of dirty inodes to flush?

Most likely. It is unusual to combine a huge amount of clean page
cache with an inode cache that really only has dirty reclaimable
inodes in it. It basically implies the common way inodes age out of
the in-memory cache is by being unlinked, not by having their page
cache fully reclaimed by memory pressure...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-19 22:15             ` Ivan Babrou
  2018-12-21  4:00               ` Kenton Varda
@ 2019-01-02 10:34               ` Arkadiusz Miśkiewicz
  1 sibling, 0 replies; 16+ messages in thread
From: Arkadiusz Miśkiewicz @ 2019-01-02 10:34 UTC (permalink / raw)
  Cc: linux-xfs

On 19/12/2018 23:15, Ivan Babrou wrote:
> We're sticking with the following patch that allows runtime switching
> between XFS memory reclaim strategies:
> 
> * https://github.com/bobrik/linux/pull/2
> 
> There are some tests and graphs describing the issue and how it can be solved.
> 
> Let me know if you think this can be incorporated upstream, I'm fine if not.

This hack and fs.xfs.memory_reclaim=0 solved my problem, where a backup
server (doing rsnapshot, so tons of rsync+hardlinks+rm+cp of millions of
files) was barely usable. Things like ssh and other network services were
stuck for 10-30 minutes doing reclaim.
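
For anyone trying the same out-of-tree patch: the knob is an ordinary
sysctl, so it can be flipped at runtime. A minimal sketch follows (this
assumes the patch is applied; the file does not exist on vanilla
kernels):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Equivalent to: sysctl fs.xfs.memory_reclaim=0 */
int main(void)
{
	int fd = open("/proc/sys/fs/xfs/memory_reclaim", O_WRONLY);

	if (fd < 0) {
		perror("open");	/* vanilla kernel: the knob is absent */
		return 1;
	}
	if (write(fd, "0\n", 2) != 2)
		perror("write");
	close(fd);
	return 0;
}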

It's a hack, but for now the server behaves much better than with a
vanilla kernel on this type of workload.

Thanks,

(resend)
-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )

end of thread, other threads:[~2019-01-02 10:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-29  0:36 Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() Ivan Babrou
2018-11-29  2:18 ` Dave Chinner
2018-11-29 14:36   ` Shawn Bohrer
2018-11-29 21:20     ` Dave Chinner
2018-11-29 22:22   ` Ivan Babrou
2018-11-30  2:18     ` Dave Chinner
2018-11-30  3:31       ` Ivan Babrou
2018-11-30  6:49         ` Dave Chinner
2018-11-30  7:45           ` Dave Chinner
2018-12-19 22:15             ` Ivan Babrou
2018-12-21  4:00               ` Kenton Varda
2018-12-25 23:47                 ` Dave Chinner
2018-12-26  3:16                   ` Kenton Varda
2018-12-29 19:05                     ` Darrick J. Wong
2019-01-01 23:48                     ` Dave Chinner
2019-01-02 10:34               ` Arkadiusz Miśkiewicz
