* Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
@ 2018-11-29 0:36 Ivan Babrou
2018-11-29 2:18 ` Dave Chinner
0 siblings, 1 reply; 16+ messages in thread

From: Ivan Babrou @ 2018-11-29 0:36 UTC (permalink / raw)
To: linux-xfs; +Cc: Shawn Bohrer

Hello,

We're experiencing some interesting issues with memory reclaim, both kswapd and direct reclaim.

A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems. Page cache is around 95GB and dirty pages hover around 50MB, rarely jumping up to 1GB.

The catalyst of our issue is terrible disks. It's not uncommon to see the following stack in the hung task detector:

Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for more than 10 seconds.
Nov 15 21:55:13 21m21 kernel: Tainted: G O 4.14.59-cloudflare-2018.7.5 #1
Nov 15 21:55:13 21m21 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 21:55:13 21m21 kernel: some-task D11792 156314 156183 0x00000080
Nov 15 21:55:13 21m21 kernel: Call Trace:
Nov 15 21:55:13 21m21 kernel: ? __schedule+0x21a/0x820
Nov 15 21:55:13 21m21 kernel: schedule+0x28/0x80
Nov 15 21:55:13 21m21 kernel: schedule_preempt_disabled+0xa/0x10
Nov 15 21:55:13 21m21 kernel: __mutex_lock.isra.2+0x16a/0x490
Nov 15 21:55:13 21m21 kernel: ? xfs_reclaim_inodes_ag+0x265/0x2d0
Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_ag+0x265/0x2d0
Nov 15 21:55:13 21m21 kernel: ? kmem_cache_alloc+0x14d/0x1b0
Nov 15 21:55:13 21m21 kernel: ? radix_tree_gang_lookup_tag+0xc4/0x130
Nov 15 21:55:13 21m21 kernel: ? __list_lru_walk_one.isra.5+0x33/0x130
Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_nr+0x31/0x40
Nov 15 21:55:13 21m21 kernel: super_cache_scan+0x156/0x1a0
Nov 15 21:55:13 21m21 kernel: shrink_slab.part.51+0x1d2/0x3a0
Nov 15 21:55:13 21m21 kernel: shrink_node+0x113/0x2e0
Nov 15 21:55:13 21m21 kernel: do_try_to_free_pages+0xb3/0x310
Nov 15 21:55:13 21m21 kernel: try_to_free_pages+0xd2/0x190
Nov 15 21:55:13 21m21 kernel: __alloc_pages_slowpath+0x3a3/0xdc0
Nov 15 21:55:13 21m21 kernel: ? ip_output+0x5c/0xc0
Nov 15 21:55:13 21m21 kernel: ? update_curr+0x141/0x1a0
Nov 15 21:55:13 21m21 kernel: __alloc_pages_nodemask+0x223/0x240
Nov 15 21:55:13 21m21 kernel: skb_page_frag_refill+0x93/0xb0
Nov 15 21:55:13 21m21 kernel: sk_page_frag_refill+0x19/0x80
Nov 15 21:55:13 21m21 kernel: tcp_sendmsg_locked+0x247/0xdc0
Nov 15 21:55:13 21m21 kernel: tcp_sendmsg+0x27/0x40
Nov 15 21:55:13 21m21 kernel: sock_sendmsg+0x36/0x40
Nov 15 21:55:13 21m21 kernel: sock_write_iter+0x84/0xd0
Nov 15 21:55:13 21m21 kernel: __vfs_write+0xdd/0x140
Nov 15 21:55:13 21m21 kernel: vfs_write+0xad/0x1a0
Nov 15 21:55:13 21m21 kernel: SyS_write+0x42/0x90
Nov 15 21:55:13 21m21 kernel: do_syscall_64+0x60/0x110
Nov 15 21:55:13 21m21 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Here "some-task" is trying to send some bytes over the network and it's stuck in direct reclaim. Naturally, kswapd is not keeping up with its duties.

It seems to me that our terrible disks sometimes take a pause to think about the meaning of life for a few seconds. During that time the XFS shrinker is stuck, which drives the whole system out of free memory and in turn triggers direct reclaim.

One solution to this is to not go into direct reclaim by keeping more free pages with vm.watermark_scale_factor, but I'd like to discard this and argue that we're going to hit direct reclaim at some point anyway. The solution I have in mind for this is not to try to write anything to (disastrously terrible) storage in shrinkers.
We have 95GB of page cache readily available for reclaim and it seems a lot cheaper to grab that.

That brings me to the first question around the memory subsystem: are shrinkers supposed to flush any dirty data? My gut feeling is that they should not do that, because there's already a writeback mechanism with its own tunables and limits to take care of that. If a system runs out of memory reclaimable without IO and dirty pages are under the limit, it's totally fair to OOM somebody.

It's totally possible that I'm wrong about this feeling, but either way I think the docs need an update on this matter:

* https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt

  nr_cached_objects: called by the sb cache shrinking function for the
  filesystem to return the number of freeable cached objects it contains.

My second question is conditional on the first one: if filesystems are supposed to flush dirty data in response to shrinkers, then how can I stop this, given what I know about our combination of lots of available page cache and terrible disks?

I've tried two things to address this problem ad-hoc.

1. Run the following systemtap script to trick shrinkers into thinking that XFS has nothing to free:

probe kernel.function("xfs_fs_nr_cached_objects").return {
    $return = 0
}

That did the job and shrink_node latency dropped considerably, while calls to xfs_fs_free_cached_objects disappeared.

2. Use vm.vfs_cache_pressure to do the same thing. This failed miserably, because of the following code in super_cache_count:

if (sb->s_op && sb->s_op->nr_cached_objects)
    total_objects = sb->s_op->nr_cached_objects(sb, sc);

total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

total_objects = vfs_pressure_ratio(total_objects);
return total_objects;

XFS was doing its job cleaning up inodes with the background mechanisms it has (m_reclaim_workqueue), but the kernel also stopped cleaning up readily available inodes after XFS.

I'm not a kernel hacker and to be honest with you I don't even understand all the nuances here. All I know is:

1. I have lots of page cache and terrible disks.
2. I want to reclaim page cache and never touch disks in response to memory reclaim.
3. Direct reclaim will happen at some point, somebody will want a big chunk of memory all at once.
4. I'm probably ok with reclaiming clean xfs inodes synchronously in the reclaim path.

This brings me to my final question: what should I do to avoid latency in reclaim (direct or kswapd)?

To reiterate the importance of this issue: we see interactive applications that issue zero IO stall for multiple seconds in writes to non-blocking sockets and in page faults on newly allocated memory, while 95GB of memory sits in page cache.

^ permalink raw reply [flat|nested] 16+ messages in thread
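A practical note on the SystemTap script quoted above: overwriting a probed function's return value requires guru mode, so the script has to be run with stap -g. A minimal sketch, assuming the probe is saved as xfs-nr-cached.stp (a file name invented here for illustration):

    $ cat xfs-nr-cached.stp
    probe kernel.function("xfs_fs_nr_cached_objects").return {
        $return = 0    # hide XFS's cached inodes from the shrinker
    }
    $ sudo stap -g xfs-nr-cached.stp

As the replies below make clear, this only hides XFS's reclaimable inodes from the shrinker; it does not make the underlying reclaim work go away.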
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() 2018-11-29 0:36 Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() Ivan Babrou @ 2018-11-29 2:18 ` Dave Chinner 2018-11-29 14:36 ` Shawn Bohrer 2018-11-29 22:22 ` Ivan Babrou 0 siblings, 2 replies; 16+ messages in thread From: Dave Chinner @ 2018-11-29 2:18 UTC (permalink / raw) To: Ivan Babrou; +Cc: linux-xfs, Shawn Bohrer On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote: > Hello, > > We're experiencing some interesting issues with memory reclaim, both > kswapd and direct reclaim. > > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems. > Page cache is around 95GB and dirty pages hover around 50MB, rarely > jumping up to 1GB. What is your workload? > The catalyst of our issue is terrible disks. It's not uncommon to see > the following stack in hung task detector: > > Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for > more than 10 seconds. > Nov 15 21:55:13 21m21 kernel: Tainted: G O > 4.14.59-cloudflare-2018.7.5 #1 > Nov 15 21:55:13 21m21 kernel: "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Nov 15 21:55:13 21m21 kernel: some-task D11792 156314 156183 0x00000080 > Nov 15 21:55:13 21m21 kernel: Call Trace: > Nov 15 21:55:13 21m21 kernel: ? __schedule+0x21a/0x820 > Nov 15 21:55:13 21m21 kernel: schedule+0x28/0x80 > Nov 15 21:55:13 21m21 kernel: schedule_preempt_disabled+0xa/0x10 > Nov 15 21:55:13 21m21 kernel: __mutex_lock.isra.2+0x16a/0x490 > Nov 15 21:55:13 21m21 kernel: ? xfs_reclaim_inodes_ag+0x265/0x2d0 > Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_ag+0x265/0x2d0 > Nov 15 21:55:13 21m21 kernel: ? kmem_cache_alloc+0x14d/0x1b0 > Nov 15 21:55:13 21m21 kernel: ? radix_tree_gang_lookup_tag+0xc4/0x130 > Nov 15 21:55:13 21m21 kernel: ? __list_lru_walk_one.isra.5+0x33/0x130 > Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_nr+0x31/0x40 > Nov 15 21:55:13 21m21 kernel: super_cache_scan+0x156/0x1a0 > Nov 15 21:55:13 21m21 kernel: shrink_slab.part.51+0x1d2/0x3a0 > Nov 15 21:55:13 21m21 kernel: shrink_node+0x113/0x2e0 > Nov 15 21:55:13 21m21 kernel: do_try_to_free_pages+0xb3/0x310 > Nov 15 21:55:13 21m21 kernel: try_to_free_pages+0xd2/0x190 > Nov 15 21:55:13 21m21 kernel: __alloc_pages_slowpath+0x3a3/0xdc0 > Nov 15 21:55:13 21m21 kernel: ? ip_output+0x5c/0xc0 > Nov 15 21:55:13 21m21 kernel: ? update_curr+0x141/0x1a0 > Nov 15 21:55:13 21m21 kernel: __alloc_pages_nodemask+0x223/0x240 > Nov 15 21:55:13 21m21 kernel: skb_page_frag_refill+0x93/0xb0 > Nov 15 21:55:13 21m21 kernel: sk_page_frag_refill+0x19/0x80 > Nov 15 21:55:13 21m21 kernel: tcp_sendmsg_locked+0x247/0xdc0 > Nov 15 21:55:13 21m21 kernel: tcp_sendmsg+0x27/0x40 > Nov 15 21:55:13 21m21 kernel: sock_sendmsg+0x36/0x40 > Nov 15 21:55:13 21m21 kernel: sock_write_iter+0x84/0xd0 > Nov 15 21:55:13 21m21 kernel: __vfs_write+0xdd/0x140 > Nov 15 21:55:13 21m21 kernel: vfs_write+0xad/0x1a0 > Nov 15 21:55:13 21m21 kernel: SyS_write+0x42/0x90 > Nov 15 21:55:13 21m21 kernel: do_syscall_64+0x60/0x110 > Nov 15 21:55:13 21m21 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > Here "some-task" is trying to send some bytes over network and it's > stuck in direct reclaim. Naturally, kswapd is not keeping up with its > duties. That's not kswapd causing the problem here, that's direct reclaim. > It seems to me that our terrible disks sometimes take a pause to think > about the meaning of life for a few seconds. 
You've probably got lots of dirty inodes backed up waiting for IO. How big are the xfs inode slab caches (slabinfo output, please)? How big are the logs on the filesystems on the machine (xfs_info output, please)?

> During that time the XFS
> shrinker is stuck, which drives the whole system out of free memory
> and in turn triggers direct reclaim.

It happens the other way around. What you have here is direct reclaim trying to get the reclaim lock for a given AG to be able to reclaim inodes from it. Direct reclaim only gets stuck here when all AGs are already being reclaimed from (i.e. already under direct reclaim).

> One solution to this is to not go into direct reclaim by keeping more
> free pages with vm.watermark_scale_factor, but I'd like to discard
> this and argue that we're going to hit direct reclaim at some point
> anyway.

Right, but the problem is that the mm/ subsystem allows effectively unbounded direct reclaim concurrency. At some point, having tens to hundreds of direct reclaimers all trying to write dirty inodes to disk causes catastrophic IO breakdown and everything grinds to a halt forever. We have to prevent that breakdown from occurring.

i.e. we have to throttle direct reclaim before it reaches IO breakdown /somewhere/. The memory reclaim subsystem does not do it, so we have to do it in XFS itself. The problem here is that if we ignore direct reclaim (i.e. do nothing rather than block waiting on reclaim progress) then the mm/ reclaim algorithms will eventually think they aren't making progress and unleash the OOM killer.

> The solution I have in mind for this is not to try to write anything
> to (disastrously terrible) storage in shrinkers. We have 95GB of page
> cache readily available for reclaim and it seems a lot cheaper to grab
> that.

The mm/ subsystem tries to keep a balance between page cache and shrinker controlled caches. /proc/sys/vm/vfs_cache_pressure controls the balance between page cache and inode/dentry cache reclaim pressure.

> That brings me to the first question around the memory subsystem: are
> shrinkers supposed to flush any dirty data?

No shrinker does that, but if they wanted to they would be perfectly within their rights to write back data.

XFS is writing back /metadata/ here because direct reclaim is putting excessive pressure on the inode cache and not allowing the background cleaners to do their job.

> My gut feeling is that
> they should not do that, because there's already a writeback mechanism
> with its own tunables and limits to take care of that. If a system runs
> out of memory reclaimable without IO and dirty pages are under the limit,
> it's totally fair to OOM somebody.
>
> It's totally possible that I'm wrong about this feeling, but either
> way I think the docs need an update on this matter:
>
> * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
>
> nr_cached_objects: called by the sb cache shrinking function for the
> filesystem to return the number of freeable cached objects it contains.

You are assuming that "freeable" means "instantly freeable object", not "unreferenced object that can be freed in the near future". We use the latter definition in the shrinkers, not the former.

> My second question is conditional on the first one: if filesystems are
> supposed to flush dirty data in response to shrinkers, then how can I
> stop this, given what I know about our combination of lots of available
> page cache and terrible disks?

Filesystems have more caches than just the page cache.
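To make the vm.vfs_cache_pressure knob mentioned above concrete: it defaults to 100, which reclaims dentries and inodes at a rate roughly comparable to page cache reclaim; values below 100 make the kernel prefer to keep them (0 means never reclaim them, which risks OOM), and values above 100 bias reclaim towards them. A sketch of inspecting and raising it:

    $ sysctl vm.vfs_cache_pressure
    vm.vfs_cache_pressure = 100
    $ sudo sysctl -w vm.vfs_cache_pressure=200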
> I've tried two things to address this problem ad-hoc.
>
> 1. Run the following systemtap script to trick shrinkers into thinking
> that XFS has nothing to free:
>
> probe kernel.function("xfs_fs_nr_cached_objects").return {
>     $return = 0
> }
>
> That did the job and shrink_node latency dropped considerably, while
> calls to xfs_fs_free_cached_objects disappeared.

Which effectively turned off direct reclaim for XFS inodes. See above - this just means that when you have no easily reclaimable page cache the system will OOM kill rather than wait for inodes to be reclaimed. i.e. it looks good until everything suddenly goes wrong and then everything dies a horrible death.

> 2. Use vm.vfs_cache_pressure to do the same thing. This failed
> miserably, because of the following code in super_cache_count:
>
> if (sb->s_op && sb->s_op->nr_cached_objects)
>     total_objects = sb->s_op->nr_cached_objects(sb, sc);
>
> total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
> total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
>
> total_objects = vfs_pressure_ratio(total_objects);
> return total_objects;
>
> XFS was doing its job cleaning up inodes with the background mechanisms
> it has (m_reclaim_workqueue), but the kernel also stopped cleaning up
> readily available inodes after XFS.

You mean the kernel stopped handing readily available inodes to XFS to free.

i.e. the VFS has to release unreferenced inodes before they are handed to XFS to clean up and free. IOWs, I suspect you biased vm.vfs_cache_pressure to turn off inode reclaim to effectively do the same thing as above. However, by stopping the feed of unreferenced inodes to XFS you stopped inode reclaim altogether and that caused other problems.

What I suspect you want to do is bias vfs_cache_pressure the other way, so that it /aggressively/ feeds inode reclaim and so prevents large numbers of inodes from building up in the cache. That way reclaim won't have as many dirty inodes to block on when a sustained memory shortage drives large amounts of direct reclaim into the shrinkers...

> I'm not a kernel hacker and to be honest with you I don't even
> understand all the nuances here. All I know is:
>
> 1. I have lots of page cache and terrible disks.
> 2. I want to reclaim page cache and never touch disks in response to
> memory reclaim.
> 3. Direct reclaim will happen at some point, somebody will want a big
> chunk of memory all at once.
> 4. I'm probably ok with reclaiming clean xfs inodes synchronously in
> the reclaim path.
>
> This brings me to my final question: what should I do to avoid latency
> in reclaim (direct or kswapd)?
>
> To reiterate the importance of this issue: we see interactive
> applications that issue zero IO stall for multiple seconds in writes to
> non-blocking sockets and in page faults on newly allocated memory, while
> 95GB of memory sits in page cache.

It really just sounds like you are allowing too many dirty inodes to build up in memory. You have a huge amount of memory but really slow disks - that's just asking for data and metadata writeback latency problems.

It sounds like you need to cause inode cleaning to happen earlier/faster than it is happening now, because your disks are slow and so you can't allow a big backlog to build up. Reducing /proc/sys/fs/xfs/xfssyncd_centisecs can get background metadata writeback started earlier.
As per Documentation/filesystems/xfs.txt:

  fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
	The interval at which the filesystem flushes metadata
	out to disk and runs internal cache cleanup routines.

If this doesn't help, then I'd suggest that your config is simply allowing too many dirty inodes to build up in memory. With XFS, the number of dirty inodes that can be held in memory at any point in time is bound by the size of the XFS log in the filesystem. Hence if you have hundreds of MB of journal space, then you can cache hundreds of thousands of dirty inodes in memory and so can take tens of minutes to write them back to slow disks. So if earlier writeback doesn't help, I'd suggest that using smaller logs with your filesystems is the only real option here....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 16+ messages in thread
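For reference, the tunable Dave suggests lowering can be changed at runtime; a sketch, dropping the flush interval from the default 30 seconds to 5:

    $ sudo sysctl -w fs.xfs.xfssyncd_centisecs=500
    $ # or, equivalently:
    $ echo 500 | sudo tee /proc/sys/fs/xfs/xfssyncd_centisecs

(Ivan reports further down the thread that lowering it to 600 made no difference for his production workload.)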
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() 2018-11-29 2:18 ` Dave Chinner @ 2018-11-29 14:36 ` Shawn Bohrer 2018-11-29 21:20 ` Dave Chinner 2018-11-29 22:22 ` Ivan Babrou 1 sibling, 1 reply; 16+ messages in thread From: Shawn Bohrer @ 2018-11-29 14:36 UTC (permalink / raw) To: Dave Chinner; +Cc: Ivan Babrou, linux-xfs Hi Dave, I've got a few follow up questions below based on your response about this. On Thu, Nov 29, 2018 at 01:18:00PM +1100, Dave Chinner wrote: > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote: > > The catalyst of our issue is terrible disks. It's not uncommon to see > > the following stack in hung task detector: > > > > Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for > > more than 10 seconds. > > Nov 15 21:55:13 21m21 kernel: Tainted: G O > > 4.14.59-cloudflare-2018.7.5 #1 > > Nov 15 21:55:13 21m21 kernel: "echo 0 > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > Nov 15 21:55:13 21m21 kernel: some-task D11792 156314 156183 0x00000080 > > Nov 15 21:55:13 21m21 kernel: Call Trace: > > Nov 15 21:55:13 21m21 kernel: ? __schedule+0x21a/0x820 > > Nov 15 21:55:13 21m21 kernel: schedule+0x28/0x80 > > Nov 15 21:55:13 21m21 kernel: schedule_preempt_disabled+0xa/0x10 > > Nov 15 21:55:13 21m21 kernel: __mutex_lock.isra.2+0x16a/0x490 > > Nov 15 21:55:13 21m21 kernel: ? xfs_reclaim_inodes_ag+0x265/0x2d0 > > Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_ag+0x265/0x2d0 > > Nov 15 21:55:13 21m21 kernel: ? kmem_cache_alloc+0x14d/0x1b0 > > Nov 15 21:55:13 21m21 kernel: ? radix_tree_gang_lookup_tag+0xc4/0x130 > > Nov 15 21:55:13 21m21 kernel: ? __list_lru_walk_one.isra.5+0x33/0x130 > > Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_nr+0x31/0x40 > > Nov 15 21:55:13 21m21 kernel: super_cache_scan+0x156/0x1a0 > > Nov 15 21:55:13 21m21 kernel: shrink_slab.part.51+0x1d2/0x3a0 > > Nov 15 21:55:13 21m21 kernel: shrink_node+0x113/0x2e0 > > Nov 15 21:55:13 21m21 kernel: do_try_to_free_pages+0xb3/0x310 > > Nov 15 21:55:13 21m21 kernel: try_to_free_pages+0xd2/0x190 > > Nov 15 21:55:13 21m21 kernel: __alloc_pages_slowpath+0x3a3/0xdc0 > > Nov 15 21:55:13 21m21 kernel: ? ip_output+0x5c/0xc0 > > Nov 15 21:55:13 21m21 kernel: ? update_curr+0x141/0x1a0 > > Nov 15 21:55:13 21m21 kernel: __alloc_pages_nodemask+0x223/0x240 > > Nov 15 21:55:13 21m21 kernel: skb_page_frag_refill+0x93/0xb0 > > Nov 15 21:55:13 21m21 kernel: sk_page_frag_refill+0x19/0x80 > > Nov 15 21:55:13 21m21 kernel: tcp_sendmsg_locked+0x247/0xdc0 > > Nov 15 21:55:13 21m21 kernel: tcp_sendmsg+0x27/0x40 > > Nov 15 21:55:13 21m21 kernel: sock_sendmsg+0x36/0x40 > > Nov 15 21:55:13 21m21 kernel: sock_write_iter+0x84/0xd0 > > Nov 15 21:55:13 21m21 kernel: __vfs_write+0xdd/0x140 > > Nov 15 21:55:13 21m21 kernel: vfs_write+0xad/0x1a0 > > Nov 15 21:55:13 21m21 kernel: SyS_write+0x42/0x90 > > Nov 15 21:55:13 21m21 kernel: do_syscall_64+0x60/0x110 > > Nov 15 21:55:13 21m21 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > > > Here "some-task" is trying to send some bytes over network and it's > > stuck in direct reclaim. Naturally, kswapd is not keeping up with its > > duties. > > That's not kswapd causing the problem here, that's direct reclaim. It is understood that the above is direct reclaim. When this happens kswapd is also blocked as below. As I'm sure you can imagine many other tasks get blocked in direct reclaim as well. [Thu Nov 15 21:52:06 2018] INFO: task kswapd0:1061 blocked for more than 10 seconds. 
[Thu Nov 15 21:52:06 2018] Tainted: G O 4.14.59-cloudflare-2018.7.5 #1
[Thu Nov 15 21:52:06 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 15 21:52:06 2018] kswapd0 D12848 1061 2 0x80000000
[Thu Nov 15 21:52:06 2018] Call Trace:
[Thu Nov 15 21:52:06 2018] ? __schedule+0x21a/0x820
[Thu Nov 15 21:52:06 2018] schedule+0x28/0x80
[Thu Nov 15 21:52:06 2018] io_schedule+0x12/0x40
[Thu Nov 15 21:52:06 2018] __xfs_iflock+0xe3/0x100
[Thu Nov 15 21:52:06 2018] ? bit_waitqueue+0x30/0x30
[Thu Nov 15 21:52:06 2018] xfs_reclaim_inode+0x141/0x300
[Thu Nov 15 21:52:06 2018] xfs_reclaim_inodes_ag+0x19d/0x2d0
[Thu Nov 15 21:52:06 2018] xfs_reclaim_inodes_nr+0x31/0x40
[Thu Nov 15 21:52:07 2018] super_cache_scan+0x156/0x1a0
[Thu Nov 15 21:52:07 2018] shrink_slab.part.51+0x1d2/0x3a0
[Thu Nov 15 21:52:07 2018] shrink_node+0x113/0x2e0
[Thu Nov 15 21:52:07 2018] kswapd+0x28a/0x6d0
[Thu Nov 15 21:52:07 2018] ? mem_cgroup_shrink_node+0x150/0x150
[Thu Nov 15 21:52:07 2018] kthread+0x119/0x130
[Thu Nov 15 21:52:07 2018] ? kthread_create_on_node+0x40/0x40
[Thu Nov 15 21:52:07 2018] ret_from_fork+0x35/0x40

> > One solution to this is to not go into direct reclaim by keeping more
> > free pages with vm.watermark_scale_factor, but I'd like to discard
> > this and argue that we're going to hit direct reclaim at some point
> > anyway.
>
> Right, but the problem is that the mm/ subsystem allows effectively
> unbounded direct reclaim concurrency. At some point, having tens to
> hundreds of direct reclaimers all trying to write dirty inodes to
> disk causes catastrophic IO breakdown and everything grinds to a
> halt forever. We have to prevent that breakdown from occurring.
>
> i.e. we have to throttle direct reclaim before it reaches IO
> breakdown /somewhere/. The memory reclaim subsystem does not do it,
> so we have to do it in XFS itself. The problem here is that if we
> ignore direct reclaim (i.e. do nothing rather than block waiting on
> reclaim progress) then the mm/ reclaim algorithms will eventually
> think they aren't making progress and unleash the OOM killer.

Here is my naive question. Why does kswapd block? Wouldn't it make sense for kswapd to asynchronously start the xfs_reclaim_inodes process and then continue looking for other pages (perhaps page cache) that it can easily free? In my mind this might prevent us from ever getting to the point of direct reclaim. And if we did get to that point then yes I can see that you might need to synchronously block all tasks in direct reclaim in xfs_reclaim_inodes to prevent the thundering herd problem.

My other question is why do the mm/ reclaim algorithms think that they need to force this metadata reclaim? I think Ivan's main question was: we have 95GB of page cache, maybe 2-3GB of total slab memory in use, and maybe 1GB of dirty pages. Blocking the world for any disk I/O at this point seems insane when there is other quickly freeable memory. I assume the answer is LRU? Our page cache pages are newer or more frequently accessed than this filesystem metadata?

Thanks,
Shawn

^ permalink raw reply [flat|nested] 16+ messages in thread
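One way to check Shawn's premise that direct reclaim is happening alongside kswapd is to watch the split counters in /proc/vmstat: pgscan_kswapd vs. pgscan_direct, plus the allocstall_* counters, which increment every time a task enters direct reclaim. A sketch:

    $ grep -E 'allocstall|pgscan_(kswapd|direct)' /proc/vmstat

These are the same counters that appear in Ivan's full vmstat dump later in the thread, where allocstall_* and pgscan_direct are all zero.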
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() 2018-11-29 14:36 ` Shawn Bohrer @ 2018-11-29 21:20 ` Dave Chinner 0 siblings, 0 replies; 16+ messages in thread From: Dave Chinner @ 2018-11-29 21:20 UTC (permalink / raw) To: Shawn Bohrer; +Cc: Ivan Babrou, linux-xfs On Thu, Nov 29, 2018 at 08:36:48AM -0600, Shawn Bohrer wrote: > Hi Dave, > > I've got a few follow up questions below based on your response about > this. > > On Thu, Nov 29, 2018 at 01:18:00PM +1100, Dave Chinner wrote: > > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote: > > > The catalyst of our issue is terrible disks. It's not uncommon to see > > > the following stack in hung task detector: > > > > > > Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for > > > more than 10 seconds. > > > Nov 15 21:55:13 21m21 kernel: Tainted: G O > > > 4.14.59-cloudflare-2018.7.5 #1 > > > Nov 15 21:55:13 21m21 kernel: "echo 0 > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > Nov 15 21:55:13 21m21 kernel: some-task D11792 156314 156183 0x00000080 > > > Nov 15 21:55:13 21m21 kernel: Call Trace: > > > Nov 15 21:55:13 21m21 kernel: ? __schedule+0x21a/0x820 > > > Nov 15 21:55:13 21m21 kernel: schedule+0x28/0x80 > > > Nov 15 21:55:13 21m21 kernel: schedule_preempt_disabled+0xa/0x10 > > > Nov 15 21:55:13 21m21 kernel: __mutex_lock.isra.2+0x16a/0x490 > > > Nov 15 21:55:13 21m21 kernel: ? xfs_reclaim_inodes_ag+0x265/0x2d0 > > > Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_ag+0x265/0x2d0 > > > Nov 15 21:55:13 21m21 kernel: ? kmem_cache_alloc+0x14d/0x1b0 > > > Nov 15 21:55:13 21m21 kernel: ? radix_tree_gang_lookup_tag+0xc4/0x130 > > > Nov 15 21:55:13 21m21 kernel: ? __list_lru_walk_one.isra.5+0x33/0x130 > > > Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_nr+0x31/0x40 > > > Nov 15 21:55:13 21m21 kernel: super_cache_scan+0x156/0x1a0 > > > Nov 15 21:55:13 21m21 kernel: shrink_slab.part.51+0x1d2/0x3a0 > > > Nov 15 21:55:13 21m21 kernel: shrink_node+0x113/0x2e0 > > > Nov 15 21:55:13 21m21 kernel: do_try_to_free_pages+0xb3/0x310 > > > Nov 15 21:55:13 21m21 kernel: try_to_free_pages+0xd2/0x190 > > > Nov 15 21:55:13 21m21 kernel: __alloc_pages_slowpath+0x3a3/0xdc0 > > > Nov 15 21:55:13 21m21 kernel: ? ip_output+0x5c/0xc0 > > > Nov 15 21:55:13 21m21 kernel: ? update_curr+0x141/0x1a0 > > > Nov 15 21:55:13 21m21 kernel: __alloc_pages_nodemask+0x223/0x240 > > > Nov 15 21:55:13 21m21 kernel: skb_page_frag_refill+0x93/0xb0 > > > Nov 15 21:55:13 21m21 kernel: sk_page_frag_refill+0x19/0x80 > > > Nov 15 21:55:13 21m21 kernel: tcp_sendmsg_locked+0x247/0xdc0 > > > Nov 15 21:55:13 21m21 kernel: tcp_sendmsg+0x27/0x40 > > > Nov 15 21:55:13 21m21 kernel: sock_sendmsg+0x36/0x40 > > > Nov 15 21:55:13 21m21 kernel: sock_write_iter+0x84/0xd0 > > > Nov 15 21:55:13 21m21 kernel: __vfs_write+0xdd/0x140 > > > Nov 15 21:55:13 21m21 kernel: vfs_write+0xad/0x1a0 > > > Nov 15 21:55:13 21m21 kernel: SyS_write+0x42/0x90 > > > Nov 15 21:55:13 21m21 kernel: do_syscall_64+0x60/0x110 > > > Nov 15 21:55:13 21m21 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > > > > > Here "some-task" is trying to send some bytes over network and it's > > > stuck in direct reclaim. Naturally, kswapd is not keeping up with its > > > duties. > > > > That's not kswapd causing the problem here, that's direct reclaim. > > It is understood that the above is direct reclaim. When this happens > kswapd is also blocked as below. As I'm sure you can imagine many > other tasks get blocked in direct reclaim as well. 
Kswapd is allowed to block waiting for IO - it runs in GFP_KERNEL context and so can both issue writeback of dirty page cache pages and wait for that IO to complete. With that reclaim context (GFP_KERNEL) we can also issue and wait for IO from the filesystem shrinkers.

Blocking kswapd is less than ideal; Facebook hit this particular "kswapd is blocked" issue too, but their proposed "don't block kswapd during inode reclaim" patch caused an increase in OOM kills on my low memory test VMs as well as small, repeatable performance regressions on my benchmark workloads so it wasn't merged.

> > > One solution to this is to not go into direct reclaim by keeping more
> > > free pages with vm.watermark_scale_factor, but I'd like to discard
> > > this and argue that we're going to hit direct reclaim at some point
> > > anyway.
> >
> > Right, but the problem is that the mm/ subsystem allows effectively
> > unbounded direct reclaim concurrency. At some point, having tens to
> > hundreds of direct reclaimers all trying to write dirty inodes to
> > disk causes catastrophic IO breakdown and everything grinds to a
> > halt forever. We have to prevent that breakdown from occurring.
> >
> > i.e. we have to throttle direct reclaim before it reaches IO
> > breakdown /somewhere/. The memory reclaim subsystem does not do it,
> > so we have to do it in XFS itself. The problem here is that if we
> > ignore direct reclaim (i.e. do nothing rather than block waiting on
> > reclaim progress) then the mm/ reclaim algorithms will eventually
> > think they aren't making progress and unleash the OOM killer.
>
> Here is my naive question. Why does kswapd block? Wouldn't it make
> sense for kswapd to asynchronously start the xfs_reclaim_inodes
> process and then continue looking for other pages (perhaps page cache)
> that it can easily free?

kswapd blocks because it is the only thing that (almost) guarantees forwards progress in memory reclaim. If the memory is full of dirty pages, it *must* block and wait for writeback to complete and clean the pages it needs to reclaim for the waiting allocators. If it does not block, we trigger the OOM killer prematurely. i.e. waiting a short time for IO to complete avoids false positive transient ENOMEM detection.

> In my mind this might prevent us from ever getting to the point of
> direct reclaim.

Direct reclaim happens first. kswapd is only kicked when direct reclaim isn't making enough progress - kswapd happens in the background. The problem we have is that the primary memory reclaimer is the thundering herd of direct reclaim, and kswapd can then get stuck behind the thundering herd.

> And if we did get to that point then yes I can see
> that you might need to synchronously block all tasks in direct reclaim
> in xfs_reclaim_inodes to prevent the thundering herd problem.

We already had to deal with the thundering herd before kswapd starts working.

> My other question is why do the mm/ reclaim algorithms think that
> they need to force this metadata reclaim? I think Ivan's main
> question was: we have 95GB of page cache, maybe 2-3GB of total slab
> memory in use, and maybe 1GB of dirty pages.

The kernel keeps a balance between the caches. If it scans 1% of the page cache, it also needs to scan 1% of the other caches to keep them in balance.

> Blocking the world for
> any disk I/O at this point seems insane when there is other quickly
> freeable memory. I assume the answer is LRU? Our page cache pages
> are newer or more frequently accessed than this filesystem metadata?
Not blocking for IO in the shrinkers is even worse. I made some mods yesterday to make inode reclaim block less while doing the same work (more efficient IO dispatch == less waiting) and the only thing I succeeded in doing was slowing down my benchmark workloads by 30-40%.

Why? Because blocking less in inode reclaim meant that the memory reclaim did more scanning, and so reclaimed /other caches faster/. That meant it was trashing the working set in the xfs buffer cache (which holds all the metadata buffers) and so that caused a substantial increase in metadata read IO on a /write-only workload/. IOWs, metadata buffers were being reclaimed far too quickly because I made inode cache reclaim faster.

All the caches must be balanced for the system to perform well across a wide range of workloads. And it's an unfortunate fact that the only way we can maintain that balance right now is via harsh throttling of the inode cache.

I'll keep trying to find a different solution, but the reality is that the memory reclaim subsystem has fairly major limitations in what we can and can't do to maintain balance across related and/or dependent caches....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 16+ messages in thread
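The inode cache buildup Dave keeps returning to can be watched via the slab counters quoted elsewhere in this thread; a sketch:

    $ grep -E '^xfs_inode |^xfs_ili ' /proc/slabinfo

xfs_inode counts in-core XFS inodes and xfs_ili counts inode log items; since an inode gets a log item attached when it is dirtied, a large xfs_ili count relative to xfs_inode is a rough hint that much of the cache will need IO before it can be reclaimed.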
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() 2018-11-29 2:18 ` Dave Chinner 2018-11-29 14:36 ` Shawn Bohrer @ 2018-11-29 22:22 ` Ivan Babrou 2018-11-30 2:18 ` Dave Chinner 1 sibling, 1 reply; 16+ messages in thread From: Ivan Babrou @ 2018-11-29 22:22 UTC (permalink / raw) To: david; +Cc: linux-xfs, Shawn Bohrer On Wed, Nov 28, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote: > > Hello, > > > > We're experiencing some interesting issues with memory reclaim, both > > kswapd and direct reclaim. > > > > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems. > > Page cache is around 95GB and dirty pages hover around 50MB, rarely > > jumping up to 1GB. > > What is your workload? My test setup is an empty machine with 256GB of RAM booted from network into memory with just systemd essentials running. I create XFS on a 10TB drive (via LUKS), mount the drive and write 300GiB of randomness: $ sudo mkfs.xfs /dev/mapper/luks-sda $ sudo mount /dev/mapper/luks-sda /mnt $ sudo dd if=/dev/urandom of=/mnt/300g.random bs=1M count=300K status=progress Then I reboot and just mount the drive again to run my test workload: $ dd if=/mnt/300g.random of=/dev/null bs=1M status=progress After running it once and populating page cache I restart it to collect traces. Here's xfs_info: $ sudo xfs_info /mnt meta-data=/dev/mapper/luks-sda isize=512 agcount=10, agsize=268435455 blks = sectsz=4096 attr=2, projid32bit=1 = crc=1 finobt=1 spinodes=0 rmapbt=0 = reflink=0 data = bsize=4096 blocks=2441608704, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=521728, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 And here's slabinfo: $ sudo cat /proc/slabinfo slabinfo - version: 2.1 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0 dm_uevent 0 0 2632 12 8 : tunables 0 0 0 : slabdata 0 0 0 dax_cache 96 504 768 42 8 : tunables 0 0 0 : slabdata 12 12 0 nf_conntrack_expect 0 0 224 36 2 : tunables 0 0 0 : slabdata 0 0 0 nf_conntrack 0 0 320 51 4 : tunables 0 0 0 : slabdata 0 0 0 kvm_async_pf 0 0 136 30 1 : tunables 0 0 0 : slabdata 0 0 0 kvm_vcpu 0 0 19392 1 8 : tunables 0 0 0 : slabdata 0 0 0 kvm_mmu_page_header 0 0 168 48 2 : tunables 0 0 0 : slabdata 0 0 0 pte_list_desc 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0 ip6-frags 0 0 208 39 2 : tunables 0 0 0 : slabdata 0 0 0 RAWv6 168 168 1152 28 8 : tunables 0 0 0 : slabdata 6 6 0 UDPv6 900 900 1280 25 8 : tunables 0 0 0 : slabdata 36 36 0 tw_sock_TCPv6 1122 1122 248 33 2 : tunables 0 0 0 : slabdata 34 34 0 request_sock_TCPv6 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0 TCPv6 420 420 2240 14 8 : tunables 0 0 0 : slabdata 30 30 0 cfq_io_cq 1156 1156 120 34 1 : tunables 0 0 0 : slabdata 34 34 0 mqueue_inode_cache 34 34 960 34 8 : tunables 0 0 0 : slabdata 1 1 0 xfs_dqtrx 0 0 528 31 4 : tunables 0 0 0 : slabdata 0 0 0 xfs_rui_item 0 0 672 48 8 : tunables 0 0 0 : slabdata 0 0 0 xfs_rud_item 0 0 152 53 2 : tunables 0 0 0 : slabdata 0 0 0 xfs_ili 144 144 168 48 2 : tunables 0 0 0 : slabdata 3 3 0 xfs_inode 170 170 960 34 8 : tunables 0 0 0 : slabdata 5 5 0 xfs_efd_item 0 0 416 39 4 : tunables 0 0 0 : slabdata 0 0 0 xfs_buf_item 132 132 248 33 2 : tunables 0 0 0 : slabdata 4 4 0 
xfs_da_state 0 0 480 34 4 : tunables 0 0 0 : slabdata 0 0 0 xfs_btree_cur 420 420 232 35 2 : tunables 0 0 0 : slabdata 12 12 0 xfs_log_ticket 308 308 184 44 2 : tunables 0 0 0 : slabdata 7 7 0 bio-2 51 51 320 51 4 : tunables 0 0 0 : slabdata 1 1 0 nfs_direct_cache 0 0 336 48 4 : tunables 0 0 0 : slabdata 0 0 0 nfs_read_data 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0 nfs_inode_cache 0 0 1080 30 8 : tunables 0 0 0 : slabdata 0 0 0 isofs_inode_cache 0 0 656 49 8 : tunables 0 0 0 : slabdata 0 0 0 fat_inode_cache 0 0 744 44 8 : tunables 0 0 0 : slabdata 0 0 0 fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0 jbd2_journal_head 0 0 120 34 1 : tunables 0 0 0 : slabdata 0 0 0 jbd2_revoke_table_s 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0 ext4_inode_cache 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0 ext4_allocation_context 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0 ext4_io_end 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0 ext4_extent_status 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0 mbcache 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0 fscrypt_info 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0 fscrypt_ctx 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0 userfaultfd_ctx_cache 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0 dio 0 0 640 51 8 : tunables 0 0 0 : slabdata 0 0 0 pid_namespace 0 0 2232 14 8 : tunables 0 0 0 : slabdata 0 0 0 posix_timers_cache 884 884 240 34 2 : tunables 0 0 0 : slabdata 26 26 0 rpc_inode_cache 0 0 704 46 8 : tunables 0 0 0 : slabdata 0 0 0 ip4-frags 185 185 216 37 2 : tunables 0 0 0 : slabdata 5 5 0 xfrm_dst_cache 0 0 448 36 4 : tunables 0 0 0 : slabdata 0 0 0 RAW 1564 1564 960 34 8 : tunables 0 0 0 : slabdata 46 46 0 tw_sock_TCP 990 990 248 33 2 : tunables 0 0 0 : slabdata 30 30 0 request_sock_TCP 520 520 312 52 4 : tunables 0 0 0 : slabdata 10 10 0 TCP 544 544 2048 16 8 : tunables 0 0 0 : slabdata 34 34 0 hugetlbfs_inode_cache 104 104 624 52 8 : tunables 0 0 0 : slabdata 2 2 0 dquot 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0 request_queue 195 195 2144 15 8 : tunables 0 0 0 : slabdata 13 13 0 blkdev_ioc 1560 1560 104 39 1 : tunables 0 0 0 : slabdata 40 40 0 user_namespace 0 0 440 37 4 : tunables 0 0 0 : slabdata 0 0 0 dmaengine-unmap-256 15 15 2112 15 8 : tunables 0 0 0 : slabdata 1 1 0 dmaengine-unmap-128 1230 1230 1088 30 8 : tunables 0 0 0 : slabdata 41 41 0 sock_inode_cache 6808 6808 704 46 8 : tunables 0 0 0 : slabdata 148 148 0 file_lock_cache 1600 1600 200 40 2 : tunables 0 0 0 : slabdata 40 40 0 fsnotify_mark_connector 5610 5610 24 170 1 : tunables 0 0 0 : slabdata 33 33 0 net_namespace 5 5 5888 5 8 : tunables 0 0 0 : slabdata 1 1 0 taskstats 1813 1813 328 49 4 : tunables 0 0 0 : slabdata 37 37 0 proc_inode_cache 13776 13776 680 48 8 : tunables 0 0 0 : slabdata 287 287 0 sigqueue 2040 2040 160 51 2 : tunables 0 0 0 : slabdata 40 40 0 bdev_cache 909 1287 832 39 8 : tunables 0 0 0 : slabdata 33 33 0 shmem_inode_cache 48414 49864 712 46 8 : tunables 0 0 0 : slabdata 1084 1084 0 kernfs_node_cache 54722 54780 136 30 1 : tunables 0 0 0 : slabdata 1826 1826 0 mnt_cache 1890 1890 384 42 4 : tunables 0 0 0 : slabdata 45 45 0 inode_cache 25175 26447 608 53 8 : tunables 0 0 0 : slabdata 499 499 0 dentry 108066 117096 192 42 2 : tunables 0 0 0 : slabdata 2788 2788 0 iint_cache 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0 buffer_head 585 585 104 39 1 : tunables 0 0 0 : slabdata 15 15 0 vm_area_struct 69560 69760 200 40 2 : tunables 0 0 0 : slabdata 1744 1744 0 mm_struct 4920 4920 1088 30 8 : tunables 0 0 0 : slabdata 164 164 0 files_cache 6624 6624 
704 46 8 : tunables 0 0 0 : slabdata 144 144 0 signal_cache 7040 7200 1024 32 8 : tunables 0 0 0 : slabdata 225 225 0 sighand_cache 2670 2670 2112 15 8 : tunables 0 0 0 : slabdata 178 178 0 task_struct 1431 1468 7744 4 8 : tunables 0 0 0 : slabdata 367 367 0 cred_jar 20580 20580 192 42 2 : tunables 0 0 0 : slabdata 490 490 0 Acpi-Operand 21504 21504 72 56 1 : tunables 0 0 0 : slabdata 384 384 0 Acpi-Parse 2920 2920 56 73 1 : tunables 0 0 0 : slabdata 40 40 0 Acpi-State 1836 1836 80 51 1 : tunables 0 0 0 : slabdata 36 36 0 Acpi-Namespace 80882 82416 40 102 1 : tunables 0 0 0 : slabdata 808 808 0 anon_vma_chain 95424 95744 64 64 1 : tunables 0 0 0 : slabdata 1496 1496 0 anon_vma 47886 47886 88 46 1 : tunables 0 0 0 : slabdata 1041 1041 0 pid 7776 7776 128 32 1 : tunables 0 0 0 : slabdata 243 243 0 numa_policy 62 62 264 31 2 : tunables 0 0 0 : slabdata 2 2 0 trace_event_file 2208 2208 88 46 1 : tunables 0 0 0 : slabdata 48 48 0 ftrace_event_field 4675 4675 48 85 1 : tunables 0 0 0 : slabdata 55 55 0 radix_tree_node 1266533 1266720 584 28 4 : tunables 0 0 0 : slabdata 45240 45240 0 task_group 184 184 704 46 8 : tunables 0 0 0 : slabdata 4 4 0 dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0 dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-8192 1292 1292 8192 4 8 : tunables 0 0 0 : slabdata 323 323 0 kmalloc-4096 1056 1056 4096 8 8 : tunables 0 0 0 : slabdata 132 132 0 kmalloc-2048 3246 3312 2048 16 8 : tunables 0 0 0 : slabdata 207 207 0 kmalloc-1024 9056 9056 1024 32 8 : tunables 0 0 0 : slabdata 283 283 0 kmalloc-512 7487 7488 512 32 4 : tunables 0 0 0 : slabdata 234 234 0 kmalloc-256 17544 18528 256 32 2 : tunables 0 0 0 : slabdata 579 579 0 kmalloc-192 10416 10416 192 42 2 : tunables 0 0 0 : slabdata 248 248 0 kmalloc-128 6756 6816 128 32 1 : tunables 0 0 0 : slabdata 213 213 0 kmalloc-96 48655 49308 96 42 1 : tunables 0 0 0 : slabdata 1174 1174 0 kmalloc-64 23650 23808 64 64 1 : tunables 0 0 0 : slabdata 372 372 0 kmalloc-32 34432 34432 32 128 1 : tunables 0 0 0 : slabdata 269 269 0 kmalloc-16 27392 27392 16 256 1 : tunables 0 0 0 : slabdata 107 107 0 kmalloc-8 37376 37376 8 512 1 : tunables 0 0 0 : slabdata 73 73 0 kmem_cache_node 1725 1728 64 64 1 : tunables 0 0 0 : slabdata 27 27 0 kmem_cache 966 966 384 42 4 : tunables 0 0 0 : slabdata 23 23 0 To illustrate the issue I'm issuing the simplest command I can come up with: $ date; sudo touch /mnt/empty Thu Nov 29 20:09:35 UTC 2018 Now let's see what happens in 10s surrounding this event. Please read through all events to see the whole picture. 1. Memory reclaim is happening as it should (my working set is bigger than RAM): $ sudo /usr/share/bcc/tools/funclatency -mTi 1 shrink_node Tracing 1 functions for "shrink_node"... Hit Ctrl-C to end. ... 
20:09:30 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 2 |************************** | 8 -> 15 : 1 |************* | 16 -> 31 : 3 |****************************************| 20:09:31 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 2 |****************************************| 8 -> 15 : 1 |******************** | 20:09:32 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 3 |****************************************| 20:09:33 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 2 |************************** | 8 -> 15 : 1 |************* | 16 -> 31 : 3 |****************************************| 20:09:34 20:09:35 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 2 |****************************************| 8 -> 15 : 2 |****************************************| 16 -> 31 : 2 |****************************************| 20:09:36 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 1 |******************** | 8 -> 15 : 1 |******************** | 16 -> 31 : 2 |****************************************| 32 -> 63 : 2 |****************************************| 20:09:37 20:09:38 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 2 |****************************************| 8 -> 15 : 2 |****************************************| 16 -> 31 : 2 |****************************************| 20:09:39 20:09:40 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 2 |****************************************| 8 -> 15 : 2 |****************************************| 16 -> 31 : 2 |****************************************| 2. XFS reclaim only kicks in when I create an empty file, silence otherwise: $ sudo /usr/share/bcc/tools/funclatency -mTi 1 xfs_fs_free_cached_objects Tracing 1 functions for "xfs_fs_free_cached_objects"... Hit Ctrl-C to end. ... 20:09:30 20:09:31 20:09:32 20:09:33 20:09:34 20:09:35 20:09:36 20:09:37 msecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 2 |****************************************| 20:09:38 20:09:39 20:09:40 3. We can see that XFS inode is actually reclaimed from kswapd0: $ sudo /usr/share/bcc/tools/stackcount -Ti 1 xfs_reclaim_inode Tracing 1 functions for "xfs_reclaim_inode"... Hit Ctrl-C to end. ... 20:09:30 20:09:31 20:09:32 20:09:33 20:09:34 20:09:35 20:09:36 xfs_reclaim_inode xfs_reclaim_inodes_ag xfs_reclaim_inodes_nr kretprobe_trampoline kswapd0 [1862] 1 20:09:37 20:09:38 20:09:39 20:09:40 4. And we can also see that flush on disk is happening from the same kswapd0: $ sudo /usr/share/bcc/tools/stackcount -Ti 1 xfs_iflush Tracing 1 functions for "xfs_iflush"... Hit Ctrl-C to end. ... 20:09:30 20:09:31 20:09:32 20:09:33 20:09:34 20:09:35 20:09:36 20:09:37 xfs_iflush xfs_reclaim_inode xfs_reclaim_inodes_ag xfs_reclaim_inodes_nr kretprobe_trampoline kswapd0 [1862] 1 20:09:38 20:09:39 20:09:40 5. Back to XFS reclaim source locations, both kswapd0 and kswapd1 hit it: ivan@36s336:~$ sudo /usr/share/bcc/tools/stackcount -Ti 1 xfs_fs_free_cached_objects Tracing 1 functions for "xfs_fs_free_cached_objects"... Hit Ctrl-C to end. ... 20:09:30 20:09:31 20:09:32 20:09:33 20:09:34 20:09:35 20:09:36 20:09:37 xfs_fs_free_cached_objects super_cache_scan shrink_slab.part.45 shrink_node kretprobe_trampoline kswapd1 [1863] 1 xfs_fs_free_cached_objects super_cache_scan shrink_slab.part.45 shrink_node kretprobe_trampoline kswapd0 [1862] 1 20:09:38 20:09:39 20:09:40 6. 
And we can see that it was kswapd0 that flushed data to disk synchronously: $ sudo /usr/share/bcc/tools/stackcount -Ti 1 xfs_buf_submit_wait Tracing 1 functions for "xfs_buf_submit_wait"... Hit Ctrl-C to end. ... 20:09:30 20:09:31 20:09:32 20:09:33 20:09:34 20:09:35 20:09:36 xfs_buf_submit_wait xfs_bwrite xfs_reclaim_inode xfs_reclaim_inodes_ag xfs_reclaim_inodes_nr kretprobe_trampoline kswapd0 [1862] 1 20:09:37 20:09:38 20:09:39 20:09:40 7. Finally, we only see xfs_fs_nr_cached_objects return non-zero around the same time: $ sudo /usr/share/bcc/tools/trace -T 'r::xfs_fs_nr_cached_objects (retval > 0) "cached = %d", retval' TIME PID TID COMM FUNC - 20:09:36 1862 1862 kswapd0 xfs_fs_nr_cached_objects cached = 1 20:09:36 1862 1862 kswapd0 xfs_fs_nr_cached_objects cached = 1 20:09:36 1863 1863 kswapd1 xfs_fs_nr_cached_objects cached = 1 20:09:36 1863 1863 kswapd1 xfs_fs_nr_cached_objects cached = 1 You can see that have 249GB of page cache: $ free -h total used free shared buff/cache available Mem: 251G 1.0G 1.1G 656M 249G 248G Swap: 0B 0B 0B And the number of dirty pages is zero: $ cat /proc/vmstat nr_free_pages 295648 nr_zone_inactive_anon 99463 nr_zone_active_anon 76344 nr_zone_inactive_file 56796555 nr_zone_active_file 8297372 nr_zone_unevictable 0 nr_zone_write_pending 0 nr_mlock 0 nr_page_table_pages 1578 nr_kernel_stack 9168 nr_bounce 0 nr_free_cma 0 numa_hit 99220256 numa_miss 52104657 numa_foreign 52104657 numa_interleave 135792 numa_local 99127058 numa_other 52197855 nr_inactive_anon 99463 nr_active_anon 76344 nr_inactive_file 56796555 nr_active_file 8297372 nr_unevictable 0 nr_slab_reclaimable 194565 nr_slab_unreclaimable 43128 nr_isolated_anon 0 nr_isolated_file 0 workingset_refault 69892763 workingset_activate 289419 workingset_nodereclaim 0 nr_anon_pages 7700 nr_mapped 5765 nr_file_pages 65262134 nr_dirty 0 nr_writeback 0 nr_writeback_temp 0 nr_shmem 168119 nr_shmem_hugepages 0 nr_shmem_pmdmapped 0 nr_anon_transparent_hugepages 0 nr_unstable 0 nr_vmscan_write 0 nr_vmscan_immediate_reclaim 0 nr_dirtied 0 nr_written 0 nr_dirty_threshold 13017651 nr_dirty_background_threshold 6500878 pgpgin 594234184 pgpgout 2360 pswpin 0 pswpout 0 pgalloc_dma 0 pgalloc_dma32 348991 pgalloc_normal 151243219 pgalloc_movable 0 allocstall_dma 0 allocstall_dma32 0 allocstall_normal 0 allocstall_movable 0 pgskip_dma 0 pgskip_dma32 0 pgskip_normal 0 pgskip_movable 0 pgfree 151892594 pgactivate 8963375 pgdeactivate 742721 pglazyfree 0 pgfault 3796959 pgmajfault 0 pglazyfreed 0 pgrefill 742721 pgsteal_kswapd 83442196 pgsteal_direct 0 pgscan_kswapd 83442409 pgscan_direct 0 pgscan_direct_throttle 0 zone_reclaim_failed 0 pginodesteal 0 slabs_scanned 26710 kswapd_inodesteal 0 kswapd_low_wmark_hit_quickly 0 kswapd_high_wmark_hit_quickly 0 pageoutrun 1636 pgrotated 0 drop_pagecache 0 drop_slab 0 oom_kill 0 numa_pte_updates 0 numa_huge_pte_updates 0 numa_hint_faults 0 numa_hint_faults_local 0 numa_pages_migrated 0 pgmigrate_success 0 pgmigrate_fail 0 compact_migrate_scanned 0 compact_free_scanned 0 compact_isolated 0 compact_stall 0 compact_fail 0 compact_success 0 compact_daemon_wake 0 compact_daemon_migrate_scanned 0 compact_daemon_free_scanned 0 htlb_buddy_alloc_success 0 htlb_buddy_alloc_fail 0 unevictable_pgs_culled 18354 unevictable_pgs_scanned 0 unevictable_pgs_rescued 19851 unevictable_pgs_mlocked 19851 unevictable_pgs_munlocked 19851 unevictable_pgs_cleared 0 unevictable_pgs_stranded 0 thp_fault_alloc 0 thp_fault_fallback 0 thp_collapse_alloc 0 thp_collapse_alloc_failed 0 thp_file_alloc 0 thp_file_mapped 
0 thp_split_page 0 thp_split_page_failed 0 thp_deferred_split_page 0 thp_split_pmd 0 thp_split_pud 0 thp_zero_page_alloc 0 thp_zero_page_alloc_failed 0 thp_swpout 0 thp_swpout_fallback 0 balloon_inflate 0 balloon_deflate 0 balloon_migrate 0 swap_ra 0 swap_ra_hit 0 It's zero all the time, except for when I touch an empty file. There's just no other write IO on the system. The whole issue here is that synchronous IO from memory reclaim is unbound in time. For all I know we can have 10TB drive that runs at floppy drive speeds or worse. The following can easily happen (correct me if it can't for some reason): 1. kswapd gets stuck because of slow storage and memory is not getting reclaimed 2. memory allocation doesn't have any free pages and kicks in direct reclaim 3. direct reclaim is stuck behind kswapd I'm not sure why you say direct reclaim happens first, allocstall is zero. > > The catalyst of our issue is terrible disks. It's not uncommon to see > > the following stack in hung task detector: > > > > Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for > > more than 10 seconds. > > Nov 15 21:55:13 21m21 kernel: Tainted: G O > > 4.14.59-cloudflare-2018.7.5 #1 > > Nov 15 21:55:13 21m21 kernel: "echo 0 > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > Nov 15 21:55:13 21m21 kernel: some-task D11792 156314 156183 0x00000080 > > Nov 15 21:55:13 21m21 kernel: Call Trace: > > Nov 15 21:55:13 21m21 kernel: ? __schedule+0x21a/0x820 > > Nov 15 21:55:13 21m21 kernel: schedule+0x28/0x80 > > Nov 15 21:55:13 21m21 kernel: schedule_preempt_disabled+0xa/0x10 > > Nov 15 21:55:13 21m21 kernel: __mutex_lock.isra.2+0x16a/0x490 > > Nov 15 21:55:13 21m21 kernel: ? xfs_reclaim_inodes_ag+0x265/0x2d0 > > Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_ag+0x265/0x2d0 > > Nov 15 21:55:13 21m21 kernel: ? kmem_cache_alloc+0x14d/0x1b0 > > Nov 15 21:55:13 21m21 kernel: ? radix_tree_gang_lookup_tag+0xc4/0x130 > > Nov 15 21:55:13 21m21 kernel: ? __list_lru_walk_one.isra.5+0x33/0x130 > > Nov 15 21:55:13 21m21 kernel: xfs_reclaim_inodes_nr+0x31/0x40 > > Nov 15 21:55:13 21m21 kernel: super_cache_scan+0x156/0x1a0 > > Nov 15 21:55:13 21m21 kernel: shrink_slab.part.51+0x1d2/0x3a0 > > Nov 15 21:55:13 21m21 kernel: shrink_node+0x113/0x2e0 > > Nov 15 21:55:13 21m21 kernel: do_try_to_free_pages+0xb3/0x310 > > Nov 15 21:55:13 21m21 kernel: try_to_free_pages+0xd2/0x190 > > Nov 15 21:55:13 21m21 kernel: __alloc_pages_slowpath+0x3a3/0xdc0 > > Nov 15 21:55:13 21m21 kernel: ? ip_output+0x5c/0xc0 > > Nov 15 21:55:13 21m21 kernel: ? update_curr+0x141/0x1a0 > > Nov 15 21:55:13 21m21 kernel: __alloc_pages_nodemask+0x223/0x240 > > Nov 15 21:55:13 21m21 kernel: skb_page_frag_refill+0x93/0xb0 > > Nov 15 21:55:13 21m21 kernel: sk_page_frag_refill+0x19/0x80 > > Nov 15 21:55:13 21m21 kernel: tcp_sendmsg_locked+0x247/0xdc0 > > Nov 15 21:55:13 21m21 kernel: tcp_sendmsg+0x27/0x40 > > Nov 15 21:55:13 21m21 kernel: sock_sendmsg+0x36/0x40 > > Nov 15 21:55:13 21m21 kernel: sock_write_iter+0x84/0xd0 > > Nov 15 21:55:13 21m21 kernel: __vfs_write+0xdd/0x140 > > Nov 15 21:55:13 21m21 kernel: vfs_write+0xad/0x1a0 > > Nov 15 21:55:13 21m21 kernel: SyS_write+0x42/0x90 > > Nov 15 21:55:13 21m21 kernel: do_syscall_64+0x60/0x110 > > Nov 15 21:55:13 21m21 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > > > Here "some-task" is trying to send some bytes over network and it's > > stuck in direct reclaim. Naturally, kswapd is not keeping up with its > > duties. 
> > That's not kswapd causing the problem here, that's direct reclaim. I think direct reclaim happens concurrently with kswapd, see my workload example. > > It seems to me that our terrible disks sometimes take a pause to think > > about the meaning of life for a few seconds. > > You've probably got lots of dirty inodes backed up waiting for IO. I don't have lots of dirty inodes waiting on IO. > How big are xfs inode slab caches (slabinfo output, please)? How big are the logs > on the filesystems on the machine (xfs_info output, please)? See my workload example. > > During that time XFS > > shrinker is stuck, which drives the whole system out of free memory > > and in turns triggers direct reclaim. > > It happens the other way around. What you have here is direct > reclaim trying to get the reclaim lock for a given AG to be able to > reclaim inodes from it. direct reclaim only gets stuck here when all > AGs are already being reclaimed from (i.e. already under direct > reclaim). > > > One solution to this is to not go into direct reclaim by keeping more > > free pages with vm.watermark_scale_factor, but I'd like to discard > > this and argue that we're going to hit direct reclaim at some point > > anyway. > > Right, but the problem is that the mm/ subsystem allows effectively > unbound direct reclaim concurrency. At some point, having tens to > hundreds of direct reclaimers all trying to write dirty inodes to > disk causes catastrophic IO breakdown and everything grinds to a > halt forever. We have to prevent that breakdown from occurring. > > i.e. we have to throttle direct reclaim to before it reaches IO > breakdown /somewhere/. The memory reclaim subsystem does not do it, > so we have to do it in XFS itself. The problem here is that if we > ignore direct reclaim (i.e do nothing rather than block waiting on > reclaim progress) then the mm/ reclaim algorithms will eventually > think they aren't making progress and unleash the OOM killer. > > > The solution I have in mind for this is not to try to write anything > > to (disastrously terrible) storage in shrinkers. We have 95GB of page > > cache readily available for reclaim and it seems a lot cheaper to grab > > that. > > The mm/ subsystem tries to keep a balance between page cache and > shrinker controlled caches. /proc/sys/vm/vfs_cache_pressure > controls the balance between page cache and inode/dentry cache > reclaim pressure.. > > > That brings me to the first question around memory subsystem: are > > shrinkers supposed to flush any dirty data? > > No shrinker does that, but if they wanted to they would be perfectly > within their right to write back data. > > XFS is writing back /metadata/ here because direct reclaim is > putting excessive pressure on the inode cache and not allowing the > background cleaners to do their job. Metadata is still data, XFS still writes it to the same block device. > > My gut feeling is that > > they should not do that, because there's already writeback mechanism > > with own tunables for limits to take care of that. If a system runs > > out of memory reclaimable without IO and dirty pages are under limit, > > it's totally fair to OOM somebody. 
> >
> > It's totally possible that I'm wrong about this feeling, but either
> > way I think docs need an update on this matter:
> >
> > * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
> >
> > nr_cached_objects: called by the sb cache shrinking function for the
> > filesystem to return the number of freeable cached objects it contains.

> You are assuming that "freeable" means "instantly freeable object",
> not "unreferenced object that can be freed in the near future". We
> use the latter definition in the shrinkers, not the former.

I'm only assuming things because documentation leaves room for interpretation. I would love to see this worded in a way that's crystal clear and mentions the possibility of IO.

> > My second question is conditional on the first one: if filesystems are
> > supposed to flush dirty data in response to shrinkers, then how can I
> > stop this, given my knowledge about combination of lots of available
> > page cache and terrible disks?

> Filesystems have more caches than just the page cache.

> > I've tried two things to address this problem ad-hoc.
> >
> > 1. Run the following systemtap script to trick shrinkers into thinking
> > that XFS has nothing to free:
> >
> > probe kernel.function("xfs_fs_nr_cached_objects").return {
> >     $return = 0
> > }
> >
> > That did the job and shrink_node latency dropped considerably, while
> > calls to xfs_fs_free_cached_objects disappeared.

> Which effectively turned off direct reclaim for XFS inodes. See
> above - this just means that when you have no easily reclaimable
> page cache the system will OOM kill rather than wait for inodes to
> be reclaimed. i.e. it looks good until everything suddenly goes
> wrong and then everything dies a horrible death.

We have hundreds of gigabytes of page cache, dirty pages are not allowed to go near that mark. There's a separate limit for dirty data.

What I want to have is a way to tell the kernel to not try to flush data to disk in response to reclaim, because that's choosing a very real horrible life over imaginary horrible death. I can't possibly create enough dirty inodes to cause the horrible death you describe. This may be the case for low-memory machines, but not for high-memory ones (which are arguably a commodity these days).

Can we have a sysctl toggle for no IO during reclaim?

> > 2. Use vm.vfs_cache_pressure to do the same thing. This failed
> > miserably, because of the following code in super_cache_count:
> >
> > if (sb->s_op && sb->s_op->nr_cached_objects)
> >     total_objects = sb->s_op->nr_cached_objects(sb, sc);
> >
> > total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
> > total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
> >
> > total_objects = vfs_pressure_ratio(total_objects);
> > return total_objects;
> >
> > XFS was doing its job cleaning up inodes with the background mechanims
> > it has (m_reclaim_workqueue), but kernel also stopped cleaning up
> > readily available inodes after XFS.

> You mean the kernel stopped handing readily available inodes to XFS
> to free.
>
> ie. The VFS has to release unreferenced inodes before they are
> handed to XFS to clean up and free. IOWs, I suspect you biased
> vm.vfs_cache_pressure to turn off inode reclaim to effectively do
> the same thing as above. However, by stopping the feed of
> unreferenced inodes to XFS you stopped inode reclaim altogether and
> that caused other problems.
>
> What I suspect you want to do is bias vfs_cache_pressure the other
> way, so that it /aggressively/ feeds inode reclaim and so prevents
> large numbers of inodes from building up in the cache. That way
> reclaim won't have as many dirty inodes to block on when a sustained
> memory shortage drives large amounts of direct reclaim into the
> shrinkers...

No, I absolutely do not want to do that, because it still means writes to disk in response to reclaim. That's the opposite of what I want.

> > I'm not a kernel hacker and to be honest with you I don't even
> > understand all the nuances here. All I know is:
> >
> > 1. I have lots of page cache and terrible disks.
> > 2. I want to reclaim page cache and never touch disks in response to
> > memory reclaim.
> > 3. Direct reclaim will happen at some point, somebody will want a big
> > chunk of memory all at once.
> > 4. I'm probably ok with reclaiming clean xfs inodes synchronously in
> > reclaim path.
> >
> > This brings me to my final question: what should I do to avoid latency
> > in reclaim (direct or kswapd)?
> >
> > To reiterate the importance of this issue: we see interactive
> > applications with zero IO stall for multiple seconds in writes to
> > non-blocking sockets and page faults on newly allocated memory, while
> > 95GB of memory is in page cache.

> It really just sounds like you are allowing too many dirty inodes to
> build up in memory. You have a huge amount of memory but really slow
> disks - that's just asking for idata and metadata writeback latency
> problems.

I don't care as much about idata and metadata writeback latency. Writes are asynchronous as long as the dirty page limit is not breached, and I don't breach it. What I care about is system freezes when there's tons of memory to reclaim from page cache, because reclaim writes to storage synchronously.

> It sounds like you need to cause inode cleaning to happen
> earlier/faster than it is happening now, because your disks are slow
> and so you can't allow a big backlog to build up. Reducing
> /proc/sys/fs/xfs/xfssyncd_centisecs can get background metadata
> writeback started earlier. As per Documentation/filesystems/xfs.txt:
>
> fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
>     The interval at which the filesystem flushes metadata
>     out to disk and runs internal cache cleanup routines.
>
> If this doesn't help, then I'd suggest that your config is simply
> allowing too many dirty inodes to build up in memory. With XFS, the
> number of dirty inodes that can be held in memory at any point in
> time is bound by the size of the XFS log in the filesystem. Hence if
> you have hundreds of MB of journal space, then you can cache
> hundreds of thousands of dirty inodes in memory and so can take tens
> of minutes to write them back to slow disks. So if earlier writeback
> doesn't help, I'd suggest that using smaller logs with your
> filesystems is the only real option here....

You are right that our config allows too many dirty inodes to build up in memory. Unfortunately, "too many" means any number larger than zero, and lowering fs.xfs.xfssyncd_centisecs to 600 made no difference in the real production workload.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply [flat|nested] 16+ messages in thread
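A minimal sketch of the two knobs discussed above, for anyone wanting to try them. The values are illustrative only: 600 centisecs mirrors what Ivan tried, and the 64m log size is an arbitrary example, not a recommendation:

$ # Start background metadata writeback every 6s instead of the default 30s:
$ sudo sysctl fs.xfs.xfssyncd_centisecs=600
$
$ # At mkfs time, a smaller log caps how many dirty inodes can sit in
$ # memory (mkfs.xfs otherwise picks a maximally sized log on a large device):
$ sudo mkfs.xfs -f -l size=64m /dev/mapper/luks-sda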
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-29 22:22 ` Ivan Babrou
@ 2018-11-30 2:18 ` Dave Chinner
  2018-11-30 3:31 ` Ivan Babrou
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-30 2:18 UTC (permalink / raw)
To: Ivan Babrou; +Cc: linux-xfs, Shawn Bohrer

On Thu, Nov 29, 2018 at 02:22:53PM -0800, Ivan Babrou wrote:
> On Wed, Nov 28, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > > Hello,
> > >
> > > We're experiencing some interesting issues with memory reclaim, both
> > > kswapd and direct reclaim.
> > >
> > > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> > > Page cache is around 95GB and dirty pages hover around 50MB, rarely
> > > jumping up to 1GB.
> >
> > What is your workload?
>
> My test setup is an empty machine with 256GB of RAM booted from
> network into memory with just systemd essentials running.

What is your root filesystem?

> I create XFS on a 10TB drive (via LUKS), mount the drive and write
> 300GiB of randomness:
>
> $ sudo mkfs.xfs /dev/mapper/luks-sda
> $ sudo mount /dev/mapper/luks-sda /mnt
> $ sudo dd if=/dev/urandom of=/mnt/300g.random bs=1M count=300K status=progress
>
> Then I reboot and just mount the drive again to run my test workload:
>
> $ dd if=/mnt/300g.random of=/dev/null bs=1M status=progress
>
> After running it once and populating page cache I restart it to collect traces.

This isn't your production workload that is demonstrating problems - it's your interpretation of the problem based on how you think everything should work. I need to know what the workload is so I can reproduce and observe the latency problems myself. I do have some clue about how this is all supposed to work, and I have a bunch of workloads that are known to trigger severe memory-reclaim based IO breakdowns if memory reclaim doesn't balance and throttle appropriately.

> Here's xfs_info:
>
> $ sudo xfs_info /mnt
> meta-data=/dev/mapper/luks-sda isize=512 agcount=10, agsize=268435455 blks
>          = sectsz=4096 attr=2, projid32bit=1
>          = crc=1 finobt=1 spinodes=0 rmapbt=0
>          = reflink=0
> data     = bsize=4096 blocks=2441608704, imaxpct=5
>          = sunit=0 swidth=0 blks
> naming   =version 2 bsize=4096 ascii-ci=0 ftype=1
> log      =internal bsize=4096 blocks=521728, version=2

You've got a maximally sized log (2GB), so there's basically no bound on dirty metadata in the filesystem.

> $ sudo cat /proc/slabinfo
....
> slabinfo - version: 2.1
> # name           <active_objs> <num_objs> <objsize> <objperslab>
> xfs_ili             144  144  168  48  2 : tunables 0 0
> xfs_inode           170  170  960  34  8 : tunables 0 0
> xfs_efd_item          0    0  416  39  4 : tunables 0 0
> xfs_buf_item        132  132  248  33  2 : tunables 0 0
> xfs_da_state          0    0  480  34  4 : tunables 0 0
> xfs_btree_cur       420  420  232  35  2 : tunables 0 0
> xfs_log_ticket      308  308  184  44  2 : tunables 0 0

That doesn't add up to a single XFS filesystem with 2 inodes in it. Where are the other 168 cached XFS inodes coming from? And I note that 144 of them are currently or have been previously dirtied, too.

> The following can easily happen (correct me if it can't for some reason):
>
> 1. kswapd gets stuck because of slow storage and memory is not getting reclaimed
> 2. memory allocation doesn't have any free pages and kicks in direct reclaim
> 3. direct reclaim is stuck behind kswapd
>
> I'm not sure why you say direct reclaim happens first, allocstall is zero.
Because I thought we were talking about your production workload that you pasted stack traces from showing direct reclaim blocking. When you have a highly concurrent workload which has tens to hundreds of processes all producing memory pressure, dirtying files and page cache, etc, direct reclaim is almost always occurring.

i.e. your artificial test workload doesn't tell me anything about the problems you are seeing on your production systems....

> > > My gut feeling is that
> > > they should not do that, because there's already writeback mechanism
> > > with own tunables for limits to take care of that. If a system runs
> > > out of memory reclaimable without IO and dirty pages are under limit,
> > > it's totally fair to OOM somebody.
> > >
> > > It's totally possible that I'm wrong about this feeling, but either
> > > way I think docs need an update on this matter:
> > >
> > > * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
> > >
> > > nr_cached_objects: called by the sb cache shrinking function for the
> > > filesystem to return the number of freeable cached objects it contains.
> >
> > You are assuming that "freeable" means "instantly freeable object",
> > not "unreferenced object that can be freed in the near future". We
> > use the latter definition in the shrinkers, not the former.
>
> I'm only assuming things because documentation leaves room for
> interpretation. I would love to see this worded in a way that's crystal
> clear and mentions the possibility of IO.

Send a patch. I wrote that years ago when all the people reviewing the changes understood what "freeable" meant in the shrinker context.

> > > My second question is conditional on the first one: if filesystems are
> > > supposed to flush dirty data in response to shrinkers, then how can I
> > > stop this, given my knowledge about combination of lots of available
> > > page cache and terrible disks?
> >
> > Filesystems have more caches than just the page cache.
> >
> > > I've tried two things to address this problem ad-hoc.
> > >
> > > 1. Run the following systemtap script to trick shrinkers into thinking
> > > that XFS has nothing to free:
> > >
> > > probe kernel.function("xfs_fs_nr_cached_objects").return {
> > >     $return = 0
> > > }
> > >
> > > That did the job and shrink_node latency dropped considerably, while
> > > calls to xfs_fs_free_cached_objects disappeared.
> >
> > Which effectively turned off direct reclaim for XFS inodes. See
> > above - this just means that when you have no easily reclaimable
> > page cache the system will OOM kill rather than wait for inodes to
> > be reclaimed. i.e. it looks good until everything suddenly goes
> > wrong and then everything dies a horrible death.
>
> We have hundreds of gigabytes of page cache, dirty pages are not
> allowed to go near that mark. There's a separate limit for dirty data.

Well, yes, but we're not talking about dirty data here - I'm talking about what happens when we turn off reclaim for a cache that can grow without bound. I can only say "this is a bad idea in general because...." as I have to make the code work for lots of different workloads.

So while it might be a solution that works for your specific workload - which I know nothing about because you haven't described it to me - it's not a solution we can use for the general case.

> What I want to have is a way to tell the kernel to not try to flush
> data to disk in response to reclaim, because that's choosing a very
> real horrible life over imaginary horrible death.
> I can't possibly
> create enough dirty inodes to cause the horrible death you describe.

Sure you can. Just keep filling memory with dirty inodes until the log runs out of space. With disks that are as slow as you say, the system will take hours to recover log space and return to decent steady state performance, if it ever manages to catch up at all.

And this demonstrates the fact that there can be many causes of the symptoms you are describing. But without a description of the production workload that is demonstrating problems, I cannot reproduce it, do any root cause analysis, or even validate that your analysis is correct.

So, please, rather than tell me what you think the problem is and how it should be fixed, first describe the workload that is causing problems in enough detail that I can reproduce it myself.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 16+ messages in thread
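Dave's "keep filling memory with dirty inodes" scenario is straightforward to approximate if you want to see it for yourself. A rough sketch, assuming an XFS filesystem mounted at /mnt: creating a huge number of empty files dirties inodes (metadata) without going anywhere near the dirty page limit:

$ mkdir -p /mnt/inode-flood && cd /mnt/inode-flood
$ # Each touch dirties an inode; with slow disks, reclaim can later end
$ # up waiting on all of this metadata being written back:
$ seq 1 1000000 | xargs -n 1000 touch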
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-30 2:18 ` Dave Chinner
@ 2018-11-30 3:31 ` Ivan Babrou
  2018-11-30 6:49 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Ivan Babrou @ 2018-11-30 3:31 UTC (permalink / raw)
To: david; +Cc: linux-xfs, Shawn Bohrer

On Thu, Nov 29, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Nov 29, 2018 at 02:22:53PM -0800, Ivan Babrou wrote:
> > On Wed, Nov 28, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > > > Hello,
> > > >
> > > > We're experiencing some interesting issues with memory reclaim, both
> > > > kswapd and direct reclaim.
> > > >
> > > > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> > > > Page cache is around 95GB and dirty pages hover around 50MB, rarely
> > > > jumping up to 1GB.
> > >
> > > What is your workload?
> >
> > My test setup is an empty machine with 256GB of RAM booted from
> > network into memory with just systemd essentials running.
>
> What is your root filesystem?

It's whatever initramfs unpacks into, there is no storage involved:

rootfs / rootfs rw,size=131930052k,nr_inodes=32982513 0 0

> > I create XFS on a 10TB drive (via LUKS), mount the drive and write
> > 300GiB of randomness:
> >
> > $ sudo mkfs.xfs /dev/mapper/luks-sda
> > $ sudo mount /dev/mapper/luks-sda /mnt
> > $ sudo dd if=/dev/urandom of=/mnt/300g.random bs=1M count=300K status=progress
> >
> > Then I reboot and just mount the drive again to run my test workload:
> >
> > $ dd if=/mnt/300g.random of=/dev/null bs=1M status=progress
> >
> > After running it once and populating page cache I restart it to collect traces.
>
> This isn't your production workload that is demonstrating problems -
> it's your interpretation of the problem based on how you think
> everything should work.

This is the simplest workload that exhibits the issue. My production workload is similar: serve most of the data from page cache, occasionally write some data to disk.

To make it more real in terms of how I think things work, let's add a bit more fun with the following systemtap script:

$ cat derp.stap
%{ /* guru */
#undef STP_OVERLOAD
#include <linux/delay.h>
%}

function mdelay(ms:long) %{ /* guru */
    mdelay(STAP_ARG_ms);
%}

probe kernel.function("xfs_bwrite").return {
    mdelay(30000)
}

This simulates very slow disk writes by making xfs_bwrite take 30s longer. We only ever write to one empty file, so there's no need to be very selective.

I load the script above, fire up dd to fill up page cache and run my touch command. This is what happens in dmesg:

Nov 30 03:15:31 myhost kernel: INFO: task kswapd1:1863 blocked for more than 2 seconds.
Nov 30 03:15:31 myhost kernel: Tainted: G O 4.14.74-cloudflare-2018.10.1 #1
Nov 30 03:15:31 myhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 03:15:31 myhost kernel: kswapd1 D13856 1863 2 0x80000000
Nov 30 03:15:31 myhost kernel: Call Trace:
Nov 30 03:15:31 myhost kernel: ? __schedule+0x21e/0x7f0
Nov 30 03:15:31 myhost kernel: schedule+0x28/0x80
Nov 30 03:15:31 myhost kernel: schedule_preempt_disabled+0xa/0x10
Nov 30 03:15:31 myhost kernel: __mutex_lock.isra.3+0x2a8/0x490
Nov 30 03:15:31 myhost kernel: ? xfs_perag_get_tag+0x2d/0xc0
Nov 30 03:15:31 myhost kernel: xfs_reclaim_inodes_ag+0x27d/0x2c0
Nov 30 03:15:31 myhost kernel: ? free_hot_cold_page_list+0x3e/0xa0
Nov 30 03:15:31 myhost kernel: ? shrink_page_list+0x42c/0xb30
Nov 30 03:15:31 myhost kernel: ? radix_tree_gang_lookup_tag+0xc3/0x140
Nov 30 03:15:31 myhost kernel: ? __list_lru_walk_one.isra.7+0x33/0x140
Nov 30 03:15:31 myhost kernel: xfs_reclaim_inodes_nr+0x31/0x40
Nov 30 03:15:31 myhost kernel: super_cache_scan+0x152/0x1a0
Nov 30 03:15:31 myhost kernel: shrink_slab.part.45+0x1d3/0x380
Nov 30 03:15:31 myhost kernel: shrink_node+0x116/0x300
Nov 30 03:15:31 myhost kernel: kswapd+0x3af/0x6b0
Nov 30 03:15:31 myhost kernel: ? mem_cgroup_shrink_node+0x150/0x150
Nov 30 03:15:31 myhost kernel: kthread+0x119/0x130
Nov 30 03:15:31 myhost kernel: ? kthread_create_on_node+0x40/0x40
Nov 30 03:15:31 myhost kernel: ret_from_fork+0x35/0x40
Nov 30 03:15:31 myhost kernel: INFO: task dd:15663 blocked for more than 2 seconds.
Nov 30 03:15:31 myhost kernel: Tainted: G O 4.14.74-cloudflare-2018.10.1 #1
Nov 30 03:15:31 myhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 03:15:31 myhost kernel: dd D13952 15663 13243 0x00000000
Nov 30 03:15:31 myhost kernel: Call Trace:
Nov 30 03:15:31 myhost kernel: ? __schedule+0x21e/0x7f0
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel: schedule+0x28/0x80
Nov 30 03:15:31 myhost kernel: schedule_timeout+0x268/0x340
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x40/0x70
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x40/0x70
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel: wait_for_completion+0x97/0x100
Nov 30 03:15:31 myhost kernel: ? wake_up_q+0x70/0x70
Nov 30 03:15:31 myhost kernel: flush_work+0xfc/0x190
Nov 30 03:15:31 myhost kernel: ? flush_workqueue_prep_pwqs+0x130/0x130
Nov 30 03:15:31 myhost kernel: drain_all_pages+0x140/0x190
Nov 30 03:15:31 myhost kernel: __alloc_pages_slowpath+0x41b/0xd50
Nov 30 03:15:31 myhost kernel: ? blk_peek_request+0x62/0x280
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x40/0x70
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x40/0x70
Nov 30 03:15:31 myhost kernel: __alloc_pages_nodemask+0x235/0x250
Nov 30 03:15:31 myhost kernel: __do_page_cache_readahead+0xcd/0x210
Nov 30 03:15:31 myhost kernel: ? __switch_to_asm+0x34/0x70
Nov 30 03:15:31 myhost kernel: ondemand_readahead+0x1f9/0x2c0
Nov 30 03:15:31 myhost kernel: generic_file_read_iter+0x6c1/0x9c0
Nov 30 03:15:31 myhost kernel: ? page_cache_tree_insert+0x90/0x90
Nov 30 03:15:31 myhost kernel: xfs_file_buffered_aio_read+0x47/0xe0
Nov 30 03:15:31 myhost kernel: xfs_file_read_iter+0x5e/0xb0
Nov 30 03:15:31 myhost kernel: __vfs_read+0xee/0x130
Nov 30 03:15:31 myhost kernel: vfs_read+0x8a/0x140
Nov 30 03:15:31 myhost kernel: SyS_read+0x3f/0xa0
Nov 30 03:15:31 myhost kernel: do_syscall_64+0x60/0xf0
Nov 30 03:15:31 myhost kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Nov 30 03:15:31 myhost kernel: RIP: 0033:0x7f8d9264e6d0
Nov 30 03:15:31 myhost kernel: RSP: 002b:00007ffdf8842998 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Nov 30 03:15:31 myhost kernel: RAX: ffffffffffffffda RBX: 0000562ac05b2400 RCX: 00007f8d9264e6d0
Nov 30 03:15:31 myhost kernel: RDX: 0000000000100000 RSI: 00007f8d92a2a000 RDI: 0000000000000000
Nov 30 03:15:31 myhost kernel: RBP: 0000000000100000 R08: 0035153337c601da R09: 00007ffdf8888080
Nov 30 03:15:31 myhost kernel: R10: 0000000000a7aa68 R11: 0000000000000246 R12: 00007f8d92a2a000
Nov 30 03:15:31 myhost kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

The way I read this: kswapd got stuck behind one stalled IO, which later triggered direct reclaim from dd, which also got stuck. 250GB of page cache to reclaim and yet one unlucky write made everything sad. Stack trace is somewhat different, but it illustrates the idea of one write blocking reclaim.

> I need to know what the workload is so I can reproduce and observe
> the latency problems myself. I do have some clue about how this is
> all supposed to work, and I have a bunch of workloads that are known
> to trigger severe memory-reclaim based IO breakdowns if memory
> reclaim doesn't balance and throttle appropriately.

I believe that you do have a clue and workloads. The problem is: those workloads are not mine.

> > Here's xfs_info:
> >
> > $ sudo xfs_info /mnt
> > meta-data=/dev/mapper/luks-sda isize=512 agcount=10, agsize=268435455 blks
> >          = sectsz=4096 attr=2, projid32bit=1
> >          = crc=1 finobt=1 spinodes=0 rmapbt=0
> >          = reflink=0
> > data     = bsize=4096 blocks=2441608704, imaxpct=5
> >          = sunit=0 swidth=0 blks
> > naming   =version 2 bsize=4096 ascii-ci=0 ftype=1
> > log      =internal bsize=4096 blocks=521728, version=2
>
> You've got a maximally sized log (2GB), so there's basically no bound on
> dirty metadata in the filesystem.

I didn't do anything special, mkfs.xfs picked all the values for me.

> > $ sudo cat /proc/slabinfo
> ....
> > slabinfo - version: 2.1
> > # name           <active_objs> <num_objs> <objsize> <objperslab>
> > xfs_ili             144  144  168  48  2 : tunables 0 0
> > xfs_inode           170  170  960  34  8 : tunables 0 0
> > xfs_efd_item          0    0  416  39  4 : tunables 0 0
> > xfs_buf_item        132  132  248  33  2 : tunables 0 0
> > xfs_da_state          0    0  480  34  4 : tunables 0 0
> > xfs_btree_cur       420  420  232  35  2 : tunables 0 0
> > xfs_log_ticket      308  308  184  44  2 : tunables 0 0
>
> That doesn't add up to a single XFS filesystem with 2 inodes in it.
> Where are the other 168 cached XFS inodes coming from? And I note
> that 144 of them are currently or have been previously dirtied, too.

I don't know about this, my only guess is that some of my previous tests with touch created those.
>
> Because I thought we were talking about your production workload
> that you pasted stack traces from showing direct reclaim blocking.
> When you have a highly concurrent workload which has tens to
> hundreds of processes all producing memory pressure, dirtying files
> and page cache, etc, direct reclaim is almost always occurring.
>
> i.e. your artificial test workload doesn't tell me anything about
> the problems you are seeing on your production systems....

> > > > My gut feeling is that
> > > > they should not do that, because there's already writeback mechanism
> > > > with own tunables for limits to take care of that. If a system runs
> > > > out of memory reclaimable without IO and dirty pages are under limit,
> > > > it's totally fair to OOM somebody.
> > > >
> > > > It's totally possible that I'm wrong about this feeling, but either
> > > > way I think docs need an update on this matter:
> > > >
> > > > * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
> > > >
> > > > nr_cached_objects: called by the sb cache shrinking function for the
> > > > filesystem to return the number of freeable cached objects it contains.
> > >
> > > You are assuming that "freeable" means "instantly freeable object",
> > > not "unreferenced object that can be freed in the near future". We
> > > use the latter definition in the shrinkers, not the former.
> >
> > I'm only assuming things because documentation leaves room for
> > interpretation. I would love to see this worded in a way that's crystal
> > clear and mentions the possibility of IO.
>
> Send a patch. I wrote that years ago when all the people reviewing
> the changes understood what "freeable" meant in the shrinker context.

Would you be happy with this patch?

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 5fd325df59e2..297400842152 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -301,6 +301,7 @@ or bottom half).
   nr_cached_objects: called by the sb cache shrinking function for the
 	filesystem to return the number of freeable cached objects it contains.
+	Freeable objects may include dirty ones requiring write IO to be freed.
 	Optional.
 
   free_cache_objects: called by the sb cache shrinking function for the
@@ -319,6 +320,8 @@ or bottom half).
 	implementations will cause holdoff problems due to large scan batch
 	sizes.
 
+	Implementations may issue write IO to free objects.
+
 Whoever sets up the inode is responsible for filling in the "i_op" field. This
 is a pointer to a "struct inode_operations" which describes the methods that
 can be performed on individual inodes.

> > > > My second question is conditional on the first one: if filesystems are
> > > > supposed to flush dirty data in response to shrinkers, then how can I
> > > > stop this, given my knowledge about combination of lots of available
> > > > page cache and terrible disks?
> > >
> > > Filesystems have more caches than just the page cache.
> > >
> > > > I've tried two things to address this problem ad-hoc.
> > > >
> > > > 1. Run the following systemtap script to trick shrinkers into thinking
> > > > that XFS has nothing to free:
> > > >
> > > > probe kernel.function("xfs_fs_nr_cached_objects").return {
> > > >     $return = 0
> > > > }
> > > >
> > > > That did the job and shrink_node latency dropped considerably, while
> > > > calls to xfs_fs_free_cached_objects disappeared.
> > >
> > > Which effectively turned off direct reclaim for XFS inodes. See
See > > > above - this just means that when you have no easily reclaimable > > > page cache the system will OOM kill rather than wait for inodes to > > > be reclaimed. i.e. it looks good until everything suddenly goes > > > wrong and then everything dies a horrible death. > > > > We have hundreds of gigabytes of page cache, dirty pages are not > > allowed to go near that mark. There's a separate limit for dirty data. > > Well, yes, but we're not talking about dirty data here - I'm > talking about what happens when we turn off reclaim for a cache > that can grow without bound. I can only say "this is a bad idea in > general because...." as I have to make the code work for lots of > different workloads. > > So while it might be a solution work for your specific workload - > which I know nothing about because you haven't described it to me - > it's not a solution we can use for the general case. Facebook is still running with their patch that does async reclaim, so my workload is not that special. > > What I want to have is a way to tell the kernel to not try to flush > > data to disk in response to reclaim, because that's choosing a very > > real horrible life over imaginary horrible death. I can't possibly > > create enough dirty inodes to cause the horrible death you describe. > > Sure you can. Just keep filling memory with dirty inodes until the > log runs out of space. With disks that are as slow as you say, the > system will take hours to recover log space and return to decent > steady state performance, if it ever manages to catch up at all. Slow is a broad definition. Disks may sometimes get stuck for some IO, but overall they are very much capable of keeping up with background writes. I mentioned that we typically have ~50MB of dirty pages and very low write demand. Shouldn't dirty inodes also be accounted as dirty pages? > And this deomnstrates the fact that there can be many causes of the > symptoms you are describing. But without a description of the > production workload that is demonstrating problems, I cannot > reproduce it, do any root cause analysis, or even validate that your > analysis is correct. > > So, please, rather than tell me what you think the problem is and > how it should be fixed, frist describe the workload that is causing > problems in enough detail that I can reproduce it myself. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-30 3:31 ` Ivan Babrou
@ 2018-11-30 6:49 ` Dave Chinner
  2018-11-30 7:45 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-30 6:49 UTC (permalink / raw)
To: Ivan Babrou; +Cc: linux-xfs, Shawn Bohrer

On Thu, Nov 29, 2018 at 07:31:44PM -0800, Ivan Babrou wrote:
> On Thu, Nov 29, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Nov 29, 2018 at 02:22:53PM -0800, Ivan Babrou wrote:
> > > On Wed, Nov 28, 2018 at 6:18 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> > > > > Hello,
> > > > >
> > > > > We're experiencing some interesting issues with memory reclaim, both
> > > > > kswapd and direct reclaim.
> > > > >
> > > > > A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> > > > > Page cache is around 95GB and dirty pages hover around 50MB, rarely
> > > > > jumping up to 1GB.
> > > >
> > > > What is your workload?
> > >
> > > My test setup is an empty machine with 256GB of RAM booted from
> > > network into memory with just systemd essentials running.
> >
> > What is your root filesystem?
>
> It's whatever initramfs unpacks into, there is no storage involved:
>
> rootfs / rootfs rw,size=131930052k,nr_inodes=32982513 0 0
>
> > > I create XFS on a 10TB drive (via LUKS), mount the drive and write
> > > 300GiB of randomness:
> > >
> > > $ sudo mkfs.xfs /dev/mapper/luks-sda
> > > $ sudo mount /dev/mapper/luks-sda /mnt
> > > $ sudo dd if=/dev/urandom of=/mnt/300g.random bs=1M count=300K status=progress
> > >
> > > Then I reboot and just mount the drive again to run my test workload:
> > >
> > > $ dd if=/mnt/300g.random of=/dev/null bs=1M status=progress
> > >
> > > After running it once and populating page cache I restart it to collect traces.
> >
> > This isn't your production workload that is demonstrating problems -
> > it's your interpretation of the problem based on how you think
> > everything should work.
>
> This is the simplest workload that exhibits the issue. My production
> workload is similar: serve most of the data from page cache,
> occasionally write some data to disk.

That's not a description of the workload that is generating delays in the order of tens of seconds....

> To make it more real in terms of how I think things work, let's add a
> bit more fun with the following systemtap script:

Stop it already. I know exactly what can happen in that path and you're not helping me at all by trying to tell me what you think the problem is over and over again. I don't care what you think is the problem - I want to know how to reproduce the symptoms you are seeing.

I need to know what load your systems are being placed under when they demonstrate the worst case latencies. I need to know about CPU load, memory pressure, concurrency, read rates, write rates, the actual physical IO rates and patterns. It also helps to know about the hardware, your storage setup and the type of physical disks you are using.

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

[snip traces]

> The way I read this: kswapd got stuck behind one stalled IO, which
> later triggered direct reclaim from dd, which also got stuck.

No, that's a stall waiting for the per-cpu free page magazines to drain back into the free list. i.e.
it's waiting on other CPUs to do scheduled work that has absolutely nothing to do with kswapd and the XFS shrinker waiting on IO to complete. Indeed:

/*
 * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
 *
 * When zone parameter is non-NULL, spill just the single zone's pages.
 *
 * Note that this can be extremely slow as the draining happens in a workqueue.
 */
void drain_all_pages(struct zone *zone)
....

> 250GB of page cache to reclaim and yet one unlucky write made
> everything sad. Stack trace is somewhat different, but it illustrates
> the idea of one write blocking reclaim.

No, they aren't related at all. There are a lot of places where memory reclaim can block, and very few of them have anything to do with shrinkers and filesystems.

> > the latency problems myself. I do have some clue about how this is
> > all supposed to work, and I have a bunch of workloads that are known
> > to trigger severe memory-reclaim based IO breakdowns if memory
> > reclaim doesn't balance and throttle appropriately.
>
> I believe that you do have a clue and workloads. The problem is: those
> workloads are not mine.

Yes, but you won't tell me what your workload is.

> > > I'm only assuming things because documentation leaves room for
> > > interpretation. I would love to see this worded in a way that's crystal
> > > clear and mentions the possibility of IO.
> >
> > Send a patch. I wrote that years ago when all the people reviewing
> > the changes understood what "freeable" meant in the shrinker context.
>
> Would you be happy with this patch?

No objections here. Send it to -fsdevel with the appropriate sign-off.

> > > > above - this just means that when you have no easily reclaimable
> > > > page cache the system will OOM kill rather than wait for inodes to
> > > > be reclaimed. i.e. it looks good until everything suddenly goes
> > > > wrong and then everything dies a horrible death.
> > >
> > > We have hundreds of gigabytes of page cache, dirty pages are not
> > > allowed to go near that mark. There's a separate limit for dirty data.
> >
> > Well, yes, but we're not talking about dirty data here - I'm
> > talking about what happens when we turn off reclaim for a cache
> > that can grow without bound. I can only say "this is a bad idea in
> > general because...." as I have to make the code work for lots of
> > different workloads.
> >
> > So while it might be a solution that works for your specific workload -
> > which I know nothing about because you haven't described it to me -
> > it's not a solution we can use for the general case.
>
> Facebook is still running with their patch that does async reclaim, so
> my workload is not that special.

Your workload is (apparently) *very different* to the FB workload. Their workload was that filesystem traversal was causing severe fluctuations in inode cache pressure - the huge numbers of inodes that were being cycled in and out of memory were causing problems when kswapd occasionally blocked on a single inode out of millions. They were chasing occasional long tail latencies on heavily loaded, highly concurrent workloads, not delays measured in multiples of tens of seconds. i.e. their latency spikes were 50-100ms instead of the typical 5ms.

That's very, very different to what you are reporting and the workload you are implying is causing them. And, really, I know exactly what the FB workload is because the FB devs wrote a workload simulator to reproduce the problem offline.
So while I could reproduce the latency spikes they saw and could make changes to get rid of them, it always came at the expense of dramatic regressions on other memory pressure workloads I have. IOWs, the change that FB made for their specific workload hid the latency issue for their workload, but did not benefit any other workloads. Maybe that change would also work for you, but it may not and it may, in fact, make things worse. But I can't say, because I don't know how to reproduce your problem so I can't test the FB change against it. Seriously: describe your workload in detail for me so I can write a reproducer for it. Without that I cannot help you any further and I am just wasting my time asking you to describe the workload over and over again. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
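An aside for anyone debugging similar stalls: Dave's distinction above - a stall in drain_all_pages() vs a stall in the XFS shrinker - is observable directly. A sketch using function-graph tracing, assuming trace-cmd is available and ftrace is enabled in the kernel:

$ sudo trace-cmd record -p function_graph \
      -g drain_all_pages -g xfs_reclaim_inodes_ag sleep 10
$ trace-cmd report | less   # see which of the two functions accumulates the latency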
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-30 6:49 ` Dave Chinner
@ 2018-11-30 7:45 ` Dave Chinner
  2018-12-19 22:15 ` Ivan Babrou
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-30 7:45 UTC (permalink / raw)
To: Ivan Babrou; +Cc: linux-xfs, Shawn Bohrer

On Fri, Nov 30, 2018 at 05:49:08PM +1100, Dave Chinner wrote:
> Seriously: describe your workload in detail for me so I can write a
> reproducer for it. Without that I cannot help you any further and I
> am just wasting my time asking you to describe the workload over
> and over again.

FWIW, here's the discussion about the FB issue. Go read it, the first few emails are pretty much the same as this thread so far.

https://www.spinics.net/lists/linux-xfs/msg01541.html

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-11-30 7:45 ` Dave Chinner
@ 2018-12-19 22:15 ` Ivan Babrou
  2018-12-21 4:00 ` Kenton Varda
  2019-01-02 10:34 ` Arkadiusz Miśkiewicz
  0 siblings, 2 replies; 16+ messages in thread
From: Ivan Babrou @ 2018-12-19 22:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, Shawn Bohrer

We're sticking with the following patch that allows runtime switching between XFS memory reclaim strategies:

* https://github.com/bobrik/linux/pull/2

There are some tests and graphs describing the issue and how it can be solved.

Let me know if you think this can be incorporated upstream, I'm fine if not.

On Thu, Nov 29, 2018 at 11:45 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Nov 30, 2018 at 05:49:08PM +1100, Dave Chinner wrote:
> > Seriously: describe your workload in detail for me so I can write a
> > reproducer for it. Without that I cannot help you any further and I
> > am just wasting my time asking you to describe the workload over
> > and over again.
>
> FWIW, here's the discussion about the FB issue. Go read it,
> the first few emails are pretty much the same as this thread so far.
>
> https://www.spinics.net/lists/linux-xfs/msg01541.html
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() 2018-12-19 22:15 ` Ivan Babrou @ 2018-12-21 4:00 ` Kenton Varda 2018-12-25 23:47 ` Dave Chinner 2019-01-02 10:34 ` Arkadiusz Miśkiewicz 1 sibling, 1 reply; 16+ messages in thread From: Kenton Varda @ 2018-12-21 4:00 UTC (permalink / raw) To: Ivan Babrou; +Cc: Dave Chinner, linux-xfs, Shawn Bohrer Hi all, I'm a coworker of Ivan's and wanted to add some background here -- in particular to answer Dave's question about our workload. For the purpose of this discussion, we can describe our workload as a giant, glorified HTTP caching proxy. (We receive HTTP requests in. We check if we have a cached response. If so, we return it to the client, otherwise we forward the request on to its "origin" server. When the origin responds, if the response is cacheable, we save it, and either way we return it to the client.) Roughly speaking, each HTTP cache entry is stored as a file on disk. Hence, we have a very large number of files with files being added and removed frequently. We also rely heavily on page cache for performance, rather than some more complicated database scheme. The HTTP requests we serve almost always come from live end users interacting with a web site. So, any kind of delay means someone is sitting and waiting. When delays get up over 15 seconds, we start hitting timeouts, meaning someone's web site doesn't load at all or loads "broken". Also note that any particular machine may serve thousands of requests per second, so blocking one machine may affect thousands of users. When XFS blocks direct reclaim, our service pretty much grinds to a halt on that machine, because everything is trying to allocate memory all the time. For example, as alluded by the subject of this thread, writing to a socket allocates memory, and thus will block waiting for XFS to write back inodes. What we find really frustrating is that we almost always have over 100GB of clean page cache that could be reclaimed immediately, without blocking, yet we end up waiting for the much-smaller inode cache to be written back to disk. We really can't accept random multi-second pauses. Our current plan is to roll out the patch Ivan linked to. But, if you have any other suggestions, we'd love to hear them. It would be great if we could agree on an upstream solution, and maybe solve Facebook's problem too. Hope that helps elucidate things. Thanks, -Kenton On Wed, Dec 19, 2018 at 2:15 PM Ivan Babrou <ivan@cloudflare.com> wrote: > > We're sticking with the following patch that allows runtime switching > between XFS memory reclaim strategies: > > * https://github.com/bobrik/linux/pull/2 > > There are some tests and graphs describing the issue and how it can be solved. > > Let me know if you think this can be incorporated upstream, I'm fine if not. > > On Thu, Nov 29, 2018 at 11:45 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Nov 30, 2018 at 05:49:08PM +1100, Dave Chinner wrote: > > > Seriously: describe your workload in detail for me so I can write a > > > reproducer for it. Without that I cannot help you any further and I > > > am just wasting my time asking you to describe the workload over > > > and over again. > > > > FWIW, here's the discussion that about the FB issue. Go read it, > > the first few emails are pretty much the same as this thread so far. > > > > https://www.spinics.net/lists/linux-xfs/msg01541.html > > > > Cheers, > > > > Dave. 
> > -- > > Dave Chinner > > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
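Putting Kenton's description into runnable form: the access pattern amounts to a large read-mostly page cache plus constant create/delete churn of small cache-entry files. A very rough sketch - paths, sizes and rates are made up, and this is nowhere near a faithful simulator of the proxy:

$ mkdir -p /mnt/cache
$ i=0
$ while true; do
    i=$((i + 1))
    # insert a new cache entry (small file: dirty inode plus dirty pages):
    dd if=/dev/zero of=/mnt/cache/entry.$i bs=64k count=4 2>/dev/null
    # evict an old one, keeping a rolling window of ~1000 entries:
    rm -f /mnt/cache/entry.$((i - 1000))
  done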
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-21 4:00 ` Kenton Varda
@ 2018-12-25 23:47 ` Dave Chinner
  2018-12-26 3:16 ` Kenton Varda
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-12-25 23:47 UTC (permalink / raw)
To: Kenton Varda; +Cc: Ivan Babrou, linux-xfs, Shawn Bohrer

Hi Kenton,

On Thu, Dec 20, 2018 at 08:00:21PM -0800, Kenton Varda wrote:
> When XFS blocks direct reclaim, our service pretty much grinds to a
> halt on that machine, because everything is trying to allocate memory
> all the time. For example, as alluded by the subject of this thread,
> writing to a socket allocates memory, and thus will block waiting for
> XFS to write back inodes. What we find really frustrating is that we
> almost always have over 100GB of clean page cache that could be
> reclaimed immediately, without blocking, yet we end up waiting for the
> much-smaller inode cache to be written back to disk.

Sure, it's frustrating. It frustrates me that I've been trying for years to get memory reclaim behaviour changed so that we don't have to do this, but we are still stuck with it.

But taking out your frustrations on the people who are trying to fix the problems you are seeing isn't productive. We are only a small team and we can't fix every problem that everyone reports immediately. Some things take time to fix.

> We really can't accept random multi-second pauses.

And that's the problem here. For the vast majority of XFS users, the alternative (i.e. not blocking reclaim) leads to substantially lower performance and the high risk of premature OOM kills. It basically moves the reclaim blocking problem to a different context (i.e. to journal space allocation) and that has even worse latency and has global filesystem scope rather than just the process doing reclaim.

IOWs, there are relatively few applications that have such a significant dependency on memory reclaim having extremely low latency, but there are a lot that are dependent on memory reclaim being throttled harshly to keep the filesystem out of even worse breakdown conditions.....

> Our current plan is
> to roll out the patch Ivan linked to.

Which is perfectly fine by me. I read the link, and it looks like it works just fine in your environment. In contrast, I ran the same patch on my performance benchmarks and saw a 15-30% degradation in performance on my inode cache heavy and mixed inode/page cache heavy memory pressure benchmarks. IOWs, that change still doesn't work for XFS in general.

This is the beauty of open source software - you can make tweaks to it to refine it for your specific workloads easily when such changes aren't really suitable for the wider user base. I encourage people to make tweaks like this to optimise their systems for their workloads. However, I also encourage people then to discuss the problems that lead to needing such tweaks with upstream so we are aware of the issues and can work towards either incorporating them or modifying infrastructure to avoid them altogether.

Further, there's no need to be combative or get upset when upstream determines that the tweak isn't generally applicable or is hiding something deeper that needs fixing. All it means is that the upstream developers know there is a deeper underlying problem and want to fix the underlying problem rather than try to hide the problem for a specific workload. The fact we know there's a problem (and it's not just one workload situation that it affects) helps us prioritise what we need to fix.
> But, if you have any other
> suggestions, we'd love to hear them. It would be great if we could
> agree on an upstream solution, and maybe solve Facebook's problem too.

I've already mentioned the things we are working on to Shawn to keep the ball rolling on this. e.g. Darrick's pipelined background inode inactivation work:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=85baea68d8e87803e6831bda7b5a3773cf0d8820

i.e. as I indicated above, inode reclaim can block on lots more things than just inode writeback. We can do transactions in inode reclaim, so we can block on log space and that might require *thousands* of IOs to be completed before it can make progress. The async kswapd mod doesn't address this problem at all because inactivation occurs in the prune_icache() VFS shrinker context (i.e. ->destroy_inode()) rather than the XFS inode cache context that the FB and CF patches address.

Also, the upcoming inode cache xarray conversion will provide us with many more per-inode state tags in the xarray that will allow us to track and execute multiple different post-VFS reclaim inode states directly. This will allow us to efficiently separate inodes that need transactions and/or IO to be reclaimed from inodes that can be reclaimed immediately, and it will allow efficient concurrent async processing of inodes that need IO to be reclaimed.

IOWs, we're trying to solve *all* the blocking problems that we know can occur in inode reclaim so that it all just works for everyone without tweaks being necessary. Yes, this takes longer than just addressing the specific symptom that is causing you problems, but the reality is that while fixing things properly takes time to get right, everyone will benefit from it being fixed and not just one or two very specific, latency-sensitive workloads.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-25 23:47 ` Dave Chinner
@ 2018-12-26 3:16 ` Kenton Varda
  2018-12-29 19:05 ` Darrick J. Wong
  2019-01-01 23:48 ` Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread
From: Kenton Varda @ 2018-12-26 3:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: Ivan Babrou, linux-xfs, Shawn Bohrer

On Tue, Dec 25, 2018 at 3:47 PM Dave Chinner <david@fromorbit.com> wrote:
> But taking out your frustrations on the people who are trying to fix
> the problems you are seeing isn't productive. We are only a small
> team and we can't fix every problem that everyone reports
> immediately. Some things take time to fix.

I agree. My hope is that explaining our use case helps you make XFS better, but you don't owe us anything. It's our problem to solve and any help you give us is a favor.

> IOWs, there are relatively few applications that have such a
> significant dependency on memory reclaim having extremely low
> latency,

Hmm, I'm confused by this. Isn't low-latency memory allocation a common requirement for any kind of interactive workload? I don't see what's unique about our use case in this respect. Any desktop and most web servers, I would think, have similar requirements.

I'm sure there's something about our use case that's unusual, but it doesn't seem to me that requiring low-latency memory allocation is unique.

Maybe the real thing that's odd about us is that we constantly create and delete files at a high rate, and that means we have an excessive number of dirty inodes to flush?

> IOWs, we're trying to solve *all* the blocking problems that we know
> can occur in inode reclaim so that it all just works for
> everyone without tweaks being necessary. Yes, this takes longer than
> just addressing the specific symptom that is causing you problems,
> but the reality is that while fixing things properly takes time to get
> right, everyone will benefit from it being fixed and not just one or
> two very specific, latency-sensitive workloads.

Great, it's good to hear that this problem is expected to be fixed eventually. We can patch our way around it in the meantime.

-Kenton

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() 2018-12-26 3:16 ` Kenton Varda @ 2018-12-29 19:05 ` Darrick J. Wong 2019-01-01 23:48 ` Dave Chinner 1 sibling, 0 replies; 16+ messages in thread From: Darrick J. Wong @ 2018-12-29 19:05 UTC (permalink / raw) To: Kenton Varda; +Cc: Dave Chinner, Ivan Babrou, linux-xfs, Shawn Bohrer On Tue, Dec 25, 2018 at 07:16:25PM -0800, Kenton Varda wrote: > On Tue, Dec 25, 2018 at 3:47 PM Dave Chinner <david@fromorbit.com> wrote: > > But taking out your frustrations on the people who are trying to fix > > the problems you are seeing isn't productive. We are only a small > > team and we can't fix every problem that everyone reports > > immediately. Some things take time to fix. > > I agree. My hope is that explaining our use case helps you make XFS > better, but you don't owe us anything. It's our problem to solve and > any help you give us is a favor. > > > IOWs, there are relatively few applications that have such a > > significant dependency on memory reclaim having extremely low > > latency, > > Hmm, I'm confused by this. Isn't low-latency memory allocation is a > common requirement for any kind of interactive workload? I don't see > what's unique about our use case in this respect. Any desktop and most > web servers I would think have similar requirements. > > I'm sure there's something about our use case that's unusual, but it > doesn't seem to me that requiring low-latency memory allocation is > unique. > > Maybe the real thing that's odd about us is that we constantly create > and delete files at a high rate, and that means we have an excessive > number of dirty inodes to flush? > > > IOWs, we're trying to solve *all* the blocking problems that we know > > that can occur in inode reclaim so that it all just works for > > everyone without tweaks being necessary. Yes, this takes longer than > > just addressing the specific symptom that is causing you problems, > > but the reality is while fixing things properly takes time to get > > right, everyone will benefit from it being fixed and not just one or > > two very specific, latency sensitive workloads. > > Great, it's good to hear that this problem is expected to be fixed > eventually. We can patch our way around it in the meantime. FWIW I /was/ planning to patchbomb every feature that's sitting around in my xfs development tree on NYE for everyone's enjoyment^Wreview. ;) Concretely, those features are: - Scrub fixes - The eas(ier) parts of online repair - Deferred inode inactivation (i.e. the thing you're talking about) - The hard parts of online repair - Hoisting inode operations to libxfs - Metadata inode directory tree - Reverse mapping for realtime devices --D > -Kenton ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-26 3:16 ` Kenton Varda
  2018-12-29 19:05 ` Darrick J. Wong
@ 2019-01-01 23:48 ` Dave Chinner
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2019-01-01 23:48 UTC (permalink / raw)
To: Kenton Varda; +Cc: Ivan Babrou, linux-xfs, Shawn Bohrer

On Tue, Dec 25, 2018 at 07:16:25PM -0800, Kenton Varda wrote:
> On Tue, Dec 25, 2018 at 3:47 PM Dave Chinner <david@fromorbit.com> wrote:
> > But taking out your frustrations on the people who are trying to fix
> > the problems you are seeing isn't productive. We are only a small
> > team and we can't fix every problem that everyone reports
> > immediately. Some things take time to fix.
>
> I agree. My hope is that explaining our use case helps you make XFS
> better, but you don't owe us anything. It's our problem to solve and
> any help you give us is a favor.
>
> > IOWs, there are relatively few applications that have such a
> > significant dependency on memory reclaim having extremely low
> > latency,
>
> Hmm, I'm confused by this. Isn't low-latency memory allocation a
> common requirement for any kind of interactive workload?

"interactive" tends to mean "human does not see noticeable delays", which means acceptable latency for an operation is measured in hundreds of milliseconds, not microseconds. And it's relatively rare for interactive users to have heavily overloaded IO subsystems such that a single IO takes more than a couple of hundred milliseconds, let alone have enough memory demand and concurrent memory reclaim IO that direct reclaim backs up for seconds on it.

> I don't see
> what's unique about our use case in this respect. Any desktop and most
> web servers, I would think, have similar requirements.

Interactive latency deficiencies are almost always caused by "need to get something off disk", not memory reclaim. And even when it is caused by "memory reclaim needs to write something", that tends to mean the "get something from disk" latency is even higher and more noticeable....

> I'm sure there's something about our use case that's unusual, but it
> doesn't seem to me that requiring low-latency memory allocation is
> unique.
>
> Maybe the real thing that's odd about us is that we constantly create
> and delete files at a high rate, and that means we have an excessive
> number of dirty inodes to flush?

Most likely. It is unusual to combine a huge amount of clean page cache with an inode cache that really only has dirty reclaimable inodes in it. It basically implies the common way inodes age out of the in-memory cache is by being unlinked, not by having their page cache fully reclaimed by memory pressure...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag()
  2018-12-19 22:15 ` Ivan Babrou
  2018-12-21 4:00 ` Kenton Varda
@ 2019-01-02 10:34 ` Arkadiusz Miśkiewicz
  1 sibling, 0 replies; 16+ messages in thread
From: Arkadiusz Miśkiewicz @ 2019-01-02 10:34 UTC (permalink / raw)
Cc: linux-xfs

On 19/12/2018 23:15, Ivan Babrou wrote:
> We're sticking with the following patch that allows runtime switching
> between XFS memory reclaim strategies:
>
> * https://github.com/bobrik/linux/pull/2
>
> There are some tests and graphs describing the issue and how it can be solved.
>
> Let me know if you think this can be incorporated upstream, I'm fine if not.

This hack and fs.xfs.memory_reclaim=0 solved my problem where a backup server (doing rsnapshot, so tons of rsync+hardlinks+rm+cp of millions of files) was barely usable. Things like ssh or other network services were stuck for 10-30 minutes doing reclaim.

It's a hack, but for now the server behaves much better than with a vanilla kernel on this type of workload.

Thanks,
(resend)
--
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )

^ permalink raw reply [flat|nested] 16+ messages in thread
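For anyone trying the same thing: with the patch from Ivan's tree applied, the strategy switch is a plain sysctl. A sketch - note that fs.xfs.memory_reclaim does not exist in vanilla kernels, only with that out-of-tree patch:

$ # 0 selects the reclaim strategy Arkadiusz reports using above
$ # (the knob only exists with the linked patch applied):
$ sudo sysctl fs.xfs.memory_reclaim=0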
end of thread, other threads:[~2019-01-02 10:34 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-11-29 0:36 Non-blocking socket stuck for multiple seconds on xfs_reclaim_inodes_ag() Ivan Babrou 2018-11-29 2:18 ` Dave Chinner 2018-11-29 14:36 ` Shawn Bohrer 2018-11-29 21:20 ` Dave Chinner 2018-11-29 22:22 ` Ivan Babrou 2018-11-30 2:18 ` Dave Chinner 2018-11-30 3:31 ` Ivan Babrou 2018-11-30 6:49 ` Dave Chinner 2018-11-30 7:45 ` Dave Chinner 2018-12-19 22:15 ` Ivan Babrou 2018-12-21 4:00 ` Kenton Varda 2018-12-25 23:47 ` Dave Chinner 2018-12-26 3:16 ` Kenton Varda 2018-12-29 19:05 ` Darrick J. Wong 2019-01-01 23:48 ` Dave Chinner 2019-01-02 10:34 ` Arkadiusz Miśkiewicz