* Recurring hang in XFS inode reclaim on 4.10
@ 2019-05-21 16:21 Jeffrey Baker
2019-05-21 22:49 ` Dave Chinner
0 siblings, 1 reply; 6+ messages in thread
From: Jeffrey Baker @ 2019-05-21 16:21 UTC
To: linux-xfs
I have a problem of recurring hangs on machines where I get hung task
warnings for kswapd and many user threads stuck in lstat. At the time,
I/O grinds to a halt, although not quite to zero. These hangs happen
1-2 times per day on a fleet of several thousand machines. We'd like
to understand the root cause, if it is already known, so we can figure
out the minimum kernel to which we might want to update.
These are the hung task stacks:
kswapd0 D 0 279 2 0x00000000
Call Trace:
__schedule+0x232/0x700
schedule+0x36/0x80
schedule_preempt_disabled+0xe/0x10
__mutex_lock_slowpath+0x193/0x290
mutex_lock+0x2f/0x40
xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
? enqueue_entity+0x118/0x640
? check_preempt_curr+0x54/0x90
? ttwu_do_wakeup+0x19/0xe0
? ttwu_do_activate+0x6f/0x80
? try_to_wake_up+0x59/0x3e0
? wake_up_process+0x15/0x20
xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
xfs_fs_free_cached_objects+0x19/0x20 [xfs]
super_cache_scan+0x191/0x1a0
shrink_slab.part.40+0x1fa/0x430
shrink_slab+0x29/0x30
shrink_node+0x108/0x320
kswapd+0x34b/0x720
kthread+0x109/0x140
? mem_cgroup_shrink_node+0x180/0x180
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x2c/0x40
kswapd1 D 0 280 2 0x00000000
Call Trace:
__schedule+0x232/0x700
schedule+0x36/0x80
schedule_timeout+0x235/0x3f0
? blk_finish_plug+0x2c/0x40
? _xfs_buf_ioapply+0x334/0x460 [xfs]
wait_for_completion+0xb4/0x140
? wake_up_q+0x70/0x70
? xfs_bwrite+0x24/0x60 [xfs]
xfs_buf_submit_wait+0x7f/0x210 [xfs]
xfs_bwrite+0x24/0x60 [xfs]
xfs_reclaim_inode+0x313/0x340 [xfs]
xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
? sched_clock+0x9/0x10
? sched_clock_cpu+0x8f/0xa0
? try_to_wake_up+0x59/0x3e0
? wake_up_process+0x15/0x20
xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
xfs_fs_free_cached_objects+0x19/0x20 [xfs]
super_cache_scan+0x191/0x1a0
shrink_slab.part.40+0x1fa/0x430
shrink_slab+0x29/0x30
shrink_node+0x108/0x320
kswapd+0x34b/0x720
kthread+0x109/0x140
? mem_cgroup_shrink_node+0x180/0x180
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x2c/0x40
mysqld D 0 89015 116527 0x00000080
Call Trace:
__schedule+0x232/0x700
? __remove_hrtimer+0x3c/0x70
schedule+0x36/0x80
rwsem_down_read_failed+0xf9/0x150
? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
call_rwsem_down_read_failed+0x18/0x30
? xfs_trans_roll+0x40/0x40 [xfs]
down_read+0x20/0x40
xfs_ilock+0xe5/0x110 [xfs]
xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
xfs_attr_get+0xd3/0x180 [xfs]
xfs_xattr_get+0x4b/0x70 [xfs]
__vfs_getxattr+0x57/0x70
get_vfs_caps_from_disk+0x59/0x100
audit_copy_inode+0x6d/0xb0
__audit_inode+0x1bb/0x320
filename_lookup+0x128/0x180
? __check_object_size+0x108/0x1e3
? path_get+0x27/0x30
? __audit_getname+0x96/0xb0
user_path_at_empty+0x36/0x40
vfs_fstatat+0x66/0xc0
SYSC_newlstat+0x31/0x60
? syscall_trace_enter+0x1d9/0x2f0
? __audit_syscall_exit+0x230/0x2c0
SyS_newlstat+0xe/0x10
do_syscall_64+0x5b/0xc0
entry_SYSCALL64_slow_path+0x25/0x25
All other hung threads are stuck in the third stack.
We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
#44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
RAID 0. The filesystem info is:
# xfs_info /dev/md0
meta-data=/dev/md0 isize=256 agcount=6, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=0
= crc=0 finobt=0 spinodes=0
data = bsize=4096 blocks=1562366976, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The tasks above were reported as hung at 22:51:55, so they'd been dead
since 22:49:55 at the latest. The stats from around that time seem
pretty normal.
# atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -m
22:48:02 memtotal memfree buffers cached dirty slabmem swptotal swpfree _mem_
22:48:07 385598M 2199M 115M 23773M 14M 2435M 0M 0M
22:48:12 385598M 2151M 115M 23777M 15M 2435M 0M 0M
22:48:17 385598M 2222M 115M 23781M 16M 2435M 0M 0M
22:48:22 385598M 2213M 115M 23785M 5M 2435M 0M 0M
22:48:27 385598M 2198M 115M 23790M 11M 2435M 0M 0M
22:48:32 385598M 2209M 115M 23791M 9M 2435M 0M 0M
22:48:37 385598M 2205M 115M 23794M 0M 2435M 0M 0M
22:48:42 385598M 2199M 115M 23798M 2M 2435M 0M 0M
22:48:47 385598M 2119M 115M 23799M 1M 2435M 0M 0M
22:48:52 385598M 2106M 115M 23804M 4M 2450M 0M 0M
22:48:57 385598M 2095M 115M 23807M 4M 2450M 0M 0M
22:49:02 385598M 1997M 116M 23814M 6M 2450M 0M 0M
22:49:07 385598M 1966M 116M 23839M 4M 2451M 0M 0M
22:49:12 385598M 1989M 116M 23856M 22M 2451M 0M 0M
22:49:17 385598M 1797M 117M 23863M 29M 2452M 0M 0M
22:49:22 385598M 1920M 118M 23902M 82M 2451M 0M 0M
22:49:27 385598M 1875M 118M 23905M 83M 2451M 0M 0M
22:49:32 385598M 1794M 120M 23925M 94M 2458M 0M 0M
22:49:37 385598M 1707M 120M 23953M 35M 2459M 0M 0M
22:49:42 385598M 1618M 120M 23985M 66M 2460M 0M 0M
22:49:47 385598M 1535M 120M 24016M 95M 2460M 0M 0M
22:49:52 385598M 1499M 121M 24036M 15M 2460M 0M 0M
22:49:57 385598M 1440M 121M 24070M 48M 2461M 0M 0M
22:50:02 385598M 1368M 121M 24095M 73M 2462M 0M 0M
22:50:07 385598M 1355M 121M 24096M 73M 2461M 0M 0M
22:50:12 385598M 1368M 121M 24101M 77M 2462M 0M 0M
22:50:17 385598M 1362M 121M 24102M 78M 2461M 0M 0M
22:50:22 385598M 1339M 121M 24106M 1M 2461M 0M 0M
22:50:27 385598M 1329M 121M 24108M 2M 2461M 0M 0M
22:50:32 385598M 1354M 121M 24109M 2M 2461M 0M 0M
22:50:37 385598M 1337M 121M 24113M 6M 2461M 0M 0M
22:50:42 385598M 1316M 121M 24114M 6M 2461M 0M 0M
22:50:47 385598M 1309M 121M 24119M 11M 2461M 0M 0M
22:50:52 385598M 1310M 121M 24119M 2M 2461M 0M 0M
22:50:57 385598M 1305M 121M 24123M 6M 2461M 0M 0M
22:51:02 385598M 1289M 121M 24123M 7M 2461M 0M 0M
22:51:07 385598M 1286M 121M 24127M 8M 2462M 0M 0M
22:51:12 385598M 1312M 121M 24128M 8M 2462M 0M 0M
22:51:17 385598M 1291M 121M 24128M 9M 2462M 0M 0M
22:51:22 385598M 1270M 121M 24132M 13M 2462M 0M 0M
22:51:27 385598M 1283M 121M 24112M 1M 2462M 0M 0M
22:51:32 385598M 1279M 121M 24116M 5M 2462M 0M 0M
22:51:37 385598M 1273M 121M 24116M 6M 2462M 0M 0M
22:51:42 385598M 1275M 121M 24115M 4M 2462M 0M 0M
22:51:47 385598M 1302M 121M 24081M 6M 2462M 0M 0M
22:51:52 385598M 1289M 121M 24085M 10M 2462M 0M 0M
22:51:57 385598M 1285M 121M 24086M 3M 2462M 0M 0M
Something is obviously wrong with atop's analysis of md0 I/O rate
here, but you get the idea: I/O slides over a minute from a reasonable
rate to near zero, except for one big spike of output.
# atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -f
22:48:02 disk busy read/s KB/read writ/s KB/writ avque avserv _mdd_
22:48:07 md0 0% 1208.4 6.9 2569.0 6.6 0.0 0.00 ms
22:48:12 md0 0% 1294.1 7.0 2642.5 6.6 0.0 0.00 ms
22:48:17 md0 0% 1289.0 6.9 2823.5 6.8 0.0 0.00 ms
22:48:22 md0 0% 1376.3 6.9 2662.6 6.7 0.0 0.00 ms
22:48:27 md0 0% 1332.3 6.8 2578.7 6.3 0.0 0.00 ms
22:48:32 md0 0% 1338.2 6.9 2601.7 6.5 0.0 0.00 ms
22:48:37 md0 0% 1133.7 6.8 3172.3 6.9 0.0 0.00 ms
22:48:42 md0 0% 1377.4 6.8 2609.3 6.8 0.0 0.00 ms
22:48:47 md0 0% 1353.4 6.9 2293.6 6.5 0.0 0.00 ms
22:48:52 md0 0% 1291.8 6.9 2200.5 6.3 0.0 0.00 ms
22:48:57 md0 0% 1332.2 7.0 3066.1 6.9 0.0 0.00 ms
22:49:02 md0 0% 1275.6 6.9 3021.2 6.8 0.0 0.00 ms
22:49:07 md0 0% 1306.6 6.4 4407.2 10.4 0.0 0.00 ms
22:49:12 md0 0% 1172.4 6.6 2740.4 7.1 0.0 0.00 ms
22:49:17 md0 0% 468.4 4.9 2769.7 6.6 0.0 0.00 ms
22:49:22 md0 0% 422.3 4.9 2950.9 7.0 0.0 0.00 ms
22:49:27 md0 0% 364.7 4.9 2744.3 7.0 0.0 0.00 ms
22:49:32 md0 0% 254.6 4.7 6271.1 8.8 0.0 0.00 ms
22:49:37 md0 0% 289.6 4.6 2108.7 6.8 0.0 0.00 ms
22:49:42 md0 0% 276.1 4.7 2278.3 6.7 0.0 0.00 ms
22:49:47 md0 0% 266.2 4.7 1990.3 7.0 0.0 0.00 ms
22:49:52 md0 0% 274.9 4.7 2051.3 7.0 0.0 0.00 ms
22:49:57 md0 0% 222.7 4.6 1770.4 7.1 0.0 0.00 ms
22:50:02 md0 0% 104.1 4.5 22339.5 9.5 0.0 0.00 ms
22:50:07 md0 0% 132.9 4.4 6208.9 9.1 0.0 0.00 ms
22:50:12 md0 0% 147.2 4.5 755.7 5.1 0.0 0.00 ms
22:50:17 md0 0% 127.0 4.4 722.6 5.3 0.0 0.00 ms
22:50:22 md0 0% 120.5 4.4 703.7 4.9 0.0 0.00 ms
22:50:27 md0 0% 106.5 4.5 681.7 5.0 0.0 0.00 ms
22:50:32 md0 0% 121.1 4.5 756.5 4.7 0.0 0.00 ms
22:50:37 md0 0% 131.9 4.4 696.7 4.8 0.0 0.00 ms
22:50:42 md0 0% 100.9 4.4 669.6 4.7 0.0 0.00 ms
22:50:47 md0 0% 72.2 4.4 691.1 6.7 0.0 0.00 ms
22:50:52 md0 0% 0.0 0.0 320.4 7.6 0.0 0.00 ms
22:50:57 md0 0% 0.2 4.0 115.1 6.5 0.0 0.00 ms
22:51:02 md0 0% 0.0 0.0 62.5 5.7 0.0 0.00 ms
22:51:07 md0 0% 0.2 4.0 23.0 4.2 0.0 0.00 ms
22:51:12 md0 0% 0.2 4.0 22.6 4.1 0.0 0.00 ms
22:51:17 md0 0% 0.0 0.0 22.4 4.1 0.0 0.00 ms
22:51:22 md0 0% 0.2 4.0 26.0 4.4 0.0 0.00 ms
22:51:27 md0 0% 0.0 0.0 25.4 5.6 0.0 0.00 ms
22:51:32 md0 0% 0.0 0.0 22.6 4.1 0.0 0.00 ms
22:51:37 md0 0% 0.0 0.0 21.5 4.1 0.0 0.00 ms
22:51:42 md0 0% 0.0 0.0 26.0 5.4 0.0 0.00 ms
22:51:47 md0 0% 0.0 0.0 22.2 4.1 0.0 0.00 ms
22:51:52 md0 0% 0.0 0.0 23.4 4.1 0.0 0.00 ms
22:51:57 md0 0% 0.2 4.0 26.0 5.6 0.0 0.00 ms
Any help would be appreciated.
-jwb
* Re: Recurring hang in XFS inode reclaim on 4.10
2019-05-21 16:21 Recurring hang in XFS inode reclaim on 4.10 Jeffrey Baker
@ 2019-05-21 22:49 ` Dave Chinner
2019-05-24 20:34 ` Jeffrey Baker
0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2019-05-21 22:49 UTC
To: Jeffrey Baker; +Cc: linux-xfs
On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> I have a problem of recurring hangs on machines where I get hung task
> warnings for kswapd and many user threads stuck in lstat. At the time,
> I/O grinds to a halt, although not quite to zero. These hangs happen
> 1-2 times per day on a fleet of several thousand machines. We'd like
> to understand the root cause, if it is already known, so we can figure
> out the minimum kernel to which we might want to update.
>
> These are the hung task stacks:
>
> kswapd0 D 0 279 2 0x00000000
> Call Trace:
> __schedule+0x232/0x700
> schedule+0x36/0x80
> schedule_preempt_disabled+0xe/0x10
> __mutex_lock_slowpath+0x193/0x290
> mutex_lock+0x2f/0x40
> xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
You're basically running the machine out of memory. There are so
many direct reclaimers that all the inode reclaim parallelism in
the filesystem has been exhausted, and kswapd is blocking waiting
for other reclaim to complete.
> kswapd1 D 0 280 2 0x00000000
> Call Trace:
> __schedule+0x232/0x700
> schedule+0x36/0x80
> schedule_timeout+0x235/0x3f0
> ? blk_finish_plug+0x2c/0x40
> ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> wait_for_completion+0xb4/0x140
> ? wake_up_q+0x70/0x70
> ? xfs_bwrite+0x24/0x60 [xfs]
> xfs_buf_submit_wait+0x7f/0x210 [xfs]
> xfs_bwrite+0x24/0x60 [xfs]
> xfs_reclaim_inode+0x313/0x340 [xfs]
> xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
Yup, memory reclaim is pushing so hard it is doing direct writeback
of dirty inodes.
> mysqld D 0 89015 116527 0x00000080
> Call Trace:
> __schedule+0x232/0x700
> ? __remove_hrtimer+0x3c/0x70
> schedule+0x36/0x80
> rwsem_down_read_failed+0xf9/0x150
> ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> call_rwsem_down_read_failed+0x18/0x30
> ? xfs_trans_roll+0x40/0x40 [xfs]
> down_read+0x20/0x40
> xfs_ilock+0xe5/0x110 [xfs]
> xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> xfs_attr_get+0xd3/0x180 [xfs]
> xfs_xattr_get+0x4b/0x70 [xfs]
> __vfs_getxattr+0x57/0x70
> get_vfs_caps_from_disk+0x59/0x100
> audit_copy_inode+0x6d/0xb0
> __audit_inode+0x1bb/0x320
> filename_lookup+0x128/0x180
Somebody else has the inode locked doing something with it, so the
audit code has to wait for it to finish before it can continue. This
likely has nothing to do with memory reclaim, and more likely is
backed up behind a transaction or something doing IO....
> ? __check_object_size+0x108/0x1e3
> ? path_get+0x27/0x30
> ? __audit_getname+0x96/0xb0
> user_path_at_empty+0x36/0x40
> vfs_fstatat+0x66/0xc0
> SYSC_newlstat+0x31/0x60
> ? syscall_trace_enter+0x1d9/0x2f0
> ? __audit_syscall_exit+0x230/0x2c0
> SyS_newlstat+0xe/0x10
> do_syscall_64+0x5b/0xc0
> entry_SYSCALL64_slow_path+0x25/0x25
>
> All other hung threads are stuck in the third stack.
>
> We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> RAID 0. The filesystem info is:
>
> # xfs_info /dev/md0
> meta-data=/dev/md0 isize=256 agcount=6, agsize=268435455 blks
> = sectsz=512 attr=2, projid32bit=0
> = crc=0 finobt=0 spinodes=0
> data = bsize=4096 blocks=1562366976, imaxpct=5
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> log =internal bsize=4096 blocks=521728, version=2
> = sectsz=512 sunit=1 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> The tasks above were reported as hung at 22:51:55, so they'd been dead
> since 22:49:55 at the latest. The stats from around that time seem
> pretty normal.
So you've got processes stuck waiting a couple of minutes for IO
on nvme drives? That doesn't sound like a filesystem problem - that
sounds more like lost IOs, a hung MD array, or hardware that's gone
AWOL....
> 22:48:02 memtotal memfree buffers cached dirty slabmem swptotal swpfree _mem_
> 22:48:07 385598M 2199M 115M 23773M 14M 2435M 0M 0M
Nothing unusual there - still lots of reclaimable page cache, and
some slabmem, too.
> Something is obviously wrong with atop's analysis of md0 I/O rate
> here, but you get the idea: I/O slides over a minute from a reasonable
> rate to near zero, except for one big spike of output.
>
> # atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -f
>
> 22:48:02 disk busy read/s KB/read writ/s KB/writ avque avserv _mdd_
> 22:48:07 md0 0% 1208.4 6.9 2569.0 6.6 0.0 0.00 ms
> 22:48:12 md0 0% 1294.1 7.0 2642.5 6.6 0.0 0.00 ms
> 22:48:17 md0 0% 1289.0 6.9 2823.5 6.8 0.0 0.00 ms
> 22:48:22 md0 0% 1376.3 6.9 2662.6 6.7 0.0 0.00 ms
> 22:48:27 md0 0% 1332.3 6.8 2578.7 6.3 0.0 0.00 ms
> 22:48:32 md0 0% 1338.2 6.9 2601.7 6.5 0.0 0.00 ms
> 22:48:37 md0 0% 1133.7 6.8 3172.3 6.9 0.0 0.00 ms
> 22:48:42 md0 0% 1377.4 6.8 2609.3 6.8 0.0 0.00 ms
> 22:48:47 md0 0% 1353.4 6.9 2293.6 6.5 0.0 0.00 ms
> 22:48:52 md0 0% 1291.8 6.9 2200.5 6.3 0.0 0.00 ms
> 22:48:57 md0 0% 1332.2 7.0 3066.1 6.9 0.0 0.00 ms
> 22:49:02 md0 0% 1275.6 6.9 3021.2 6.8 0.0 0.00 ms
> 22:49:07 md0 0% 1306.6 6.4 4407.2 10.4 0.0 0.00 ms
> 22:49:12 md0 0% 1172.4 6.6 2740.4 7.1 0.0 0.00 ms
> 22:49:17 md0 0% 468.4 4.9 2769.7 6.6 0.0 0.00 ms
IO rates take a hit here, about 40s before the processes got stuck.
> 22:49:22 md0 0% 422.3 4.9 2950.9 7.0 0.0 0.00 ms
> 22:49:27 md0 0% 364.7 4.9 2744.3 7.0 0.0 0.00 ms
> 22:49:32 md0 0% 254.6 4.7 6271.1 8.8 0.0 0.00 ms
> 22:49:37 md0 0% 289.6 4.6 2108.7 6.8 0.0 0.00 ms
> 22:49:42 md0 0% 276.1 4.7 2278.3 6.7 0.0 0.00 ms
> 22:49:47 md0 0% 266.2 4.7 1990.3 7.0 0.0 0.00 ms
> 22:49:52 md0 0% 274.9 4.7 2051.3 7.0 0.0 0.00 ms
> 22:49:57 md0 0% 222.7 4.6 1770.4 7.1 0.0 0.00 ms
Processes stick here just as the read rate takes another hit, and...
> 22:50:02 md0 0% 104.1 4.5 22339.5 9.5 0.0 0.00 ms
That's likely a burst of 8KB inode writes from the XFS inode
reclaim code.
> 22:50:07 md0 0% 132.9 4.4 6208.9 9.1 0.0 0.00 ms
> 22:50:12 md0 0% 147.2 4.5 755.7 5.1 0.0 0.00 ms
> 22:50:17 md0 0% 127.0 4.4 722.6 5.3 0.0 0.00 ms
> 22:50:22 md0 0% 120.5 4.4 703.7 4.9 0.0 0.00 ms
> 22:50:27 md0 0% 106.5 4.5 681.7 5.0 0.0 0.00 ms
> 22:50:32 md0 0% 121.1 4.5 756.5 4.7 0.0 0.00 ms
> 22:50:37 md0 0% 131.9 4.4 696.7 4.8 0.0 0.00 ms
> 22:50:42 md0 0% 100.9 4.4 669.6 4.7 0.0 0.00 ms
> 22:50:47 md0 0% 72.2 4.4 691.1 6.7 0.0 0.00 ms
> 22:50:52 md0 0% 0.0 0.0 320.4 7.6 0.0 0.00 ms
And a minute later everything has ground to a halt.
> 22:50:57 md0 0% 0.2 4.0 115.1 6.5 0.0 0.00 ms
> 22:51:02 md0 0% 0.0 0.0 62.5 5.7 0.0 0.00 ms
> 22:51:07 md0 0% 0.2 4.0 23.0 4.2 0.0 0.00 ms
> 22:51:12 md0 0% 0.2 4.0 22.6 4.1 0.0 0.00 ms
> 22:51:17 md0 0% 0.0 0.0 22.4 4.1 0.0 0.00 ms
> 22:51:22 md0 0% 0.2 4.0 26.0 4.4 0.0 0.00 ms
> 22:51:27 md0 0% 0.0 0.0 25.4 5.6 0.0 0.00 ms
> 22:51:32 md0 0% 0.0 0.0 22.6 4.1 0.0 0.00 ms
> 22:51:37 md0 0% 0.0 0.0 21.5 4.1 0.0 0.00 ms
> 22:51:42 md0 0% 0.0 0.0 26.0 5.4 0.0 0.00 ms
> 22:51:47 md0 0% 0.0 0.0 22.2 4.1 0.0 0.00 ms
> 22:51:52 md0 0% 0.0 0.0 23.4 4.1 0.0 0.00 ms
> 22:51:57 md0 0% 0.2 4.0 26.0 5.6 0.0 0.00 ms
This looks to me like something below the filesystem choking up and
grinding to a halt. What are all the nvme drives doing over this
period?
Hard to know exactly what is going on at this point, but nothing
at the filesystem or memory reclaim level should be stuck on IO for
long periods of time on nvme SSDs...
/me wonders if 4.10 had the block layer writeback throttle code in
it, and if it does whether that is what has gone haywire here.
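For what it's worth, whether that throttle is present can be probed from
sysfs; this is only a sketch, assuming the wbt_lat_usec knob the upstream
writeback throttle exposes (the fake sysfs tree in the demo is purely
illustrative):

```shell
# Report the writeback-throttle latency target for each block queue.
# The sysfs root is a parameter so the function can be exercised against
# a synthetic tree; on a live box, call it with the default /sys/block.
wbt_status() {
    local base=${1:-/sys/block} q dev
    for q in "$base"/*/queue; do
        [ -e "$q/wbt_lat_usec" ] || continue
        dev=$(basename "$(dirname "$q")")
        echo "$dev: wbt_lat_usec=$(cat "$q/wbt_lat_usec")"
    done
}

# Demo against a synthetic layout (hypothetical device and value):
mkdir -p /tmp/fakesys/nvme0n1/queue
echo 75000 > /tmp/fakesys/nvme0n1/queue/wbt_lat_usec
wbt_status /tmp/fakesys
```

If the knob is absent, the throttle either isn't built into the kernel or
isn't active for that queue.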
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Recurring hang in XFS inode reclaim on 4.10
2019-05-21 22:49 ` Dave Chinner
@ 2019-05-24 20:34 ` Jeffrey Baker
2019-05-27 3:32 ` Dave Chinner
0 siblings, 1 reply; 6+ messages in thread
From: Jeffrey Baker @ 2019-05-24 20:34 UTC
To: Dave Chinner; +Cc: linux-xfs
On Tue, May 21, 2019 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> > I have a problem of recurring hangs on machines where I get hung task
> > warnings for kswapd and many user threads stuck in lstat. At the time,
> > I/O grinds to a halt, although not quite to zero. These hangs happen
> > 1-2 times per day on a fleet of several thousand machines. We'd like
> > to understand the root cause, if it is already known, so we can figure
> > out the minimum kernel to which we might want to update.
> >
> > These are the hung task stacks:
> >
> > kswapd0 D 0 279 2 0x00000000
> > Call Trace:
> > __schedule+0x232/0x700
> > schedule+0x36/0x80
> > schedule_preempt_disabled+0xe/0x10
> > __mutex_lock_slowpath+0x193/0x290
> > mutex_lock+0x2f/0x40
> > xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
>
> You're basically running the machine out of memory and there
> are so many direct reclaimers that all the inode reclaim parallelism in
> the filesystem has been exhausted and it's blocking waiting for
> other reclaim to complete.
Sort of. "Out of memory" here means > 2GB free, admittedly that's
less than 1% of the machine.
>
> > kswapd1 D 0 280 2 0x00000000
> > Call Trace:
> > __schedule+0x232/0x700
> > schedule+0x36/0x80
> > schedule_timeout+0x235/0x3f0
> > ? blk_finish_plug+0x2c/0x40
> > ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> > wait_for_completion+0xb4/0x140
> > ? wake_up_q+0x70/0x70
> > ? xfs_bwrite+0x24/0x60 [xfs]
> > xfs_buf_submit_wait+0x7f/0x210 [xfs]
> > xfs_bwrite+0x24/0x60 [xfs]
> > xfs_reclaim_inode+0x313/0x340 [xfs]
> > xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
>
> Yup, memory reclaim is pushing so hard it is doing direct writeback
> of dirty inodes.
OK. Is that reflected in vmstat somewhere? I see
"nr_vmscan_immediate_reclaim" but it's not obvious to me what that is,
and it wasn't increasing at the moment of this event.
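In the absence of a dedicated counter, one low-tech way to see which
reclaim counters actually move during an event is to diff /proc/vmstat
snapshots; a sketch (the snapshot paths and demo values are hypothetical):

```shell
# Print counters that changed between two /proc/vmstat-style snapshots.
vmstat_diff() {
    awk 'NR==FNR { before[$1] = $2; next }
         ($1 in before) && $2 != before[$1] {
             printf "%s %d -> %d (%+d)\n", $1, before[$1], $2, $2 - before[$1]
         }' "$1" "$2"
}

# In practice: cat /proc/vmstat > before; sleep 5; cat /proc/vmstat > after
# Demo with synthetic snapshots:
printf 'nr_free_pages 500\nnr_vmscan_write 10\n' > /tmp/vm.before
printf 'nr_free_pages 480\nnr_vmscan_write 250\n' > /tmp/vm.after
vmstat_diff /tmp/vm.before /tmp/vm.after
```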
> > mysqld D 0 89015 116527 0x00000080
> > Call Trace:
> > __schedule+0x232/0x700
> > ? __remove_hrtimer+0x3c/0x70
> > schedule+0x36/0x80
> > rwsem_down_read_failed+0xf9/0x150
> > ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> > call_rwsem_down_read_failed+0x18/0x30
> > ? xfs_trans_roll+0x40/0x40 [xfs]
> > down_read+0x20/0x40
> > xfs_ilock+0xe5/0x110 [xfs]
> > xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> > xfs_attr_get+0xd3/0x180 [xfs]
> > xfs_xattr_get+0x4b/0x70 [xfs]
> > __vfs_getxattr+0x57/0x70
> > get_vfs_caps_from_disk+0x59/0x100
> > audit_copy_inode+0x6d/0xb0
> > __audit_inode+0x1bb/0x320
> > filename_lookup+0x128/0x180
>
> Somebody else has the inode locked doing something with it, so the
> audit code has to wait for it to finish before it can continue. This
> likely has nothing to do with memory reclaim, and more likely is
> backed up behind a transaction or something doing IO....
>
> > ? __check_object_size+0x108/0x1e3
> > ? path_get+0x27/0x30
> > ? __audit_getname+0x96/0xb0
> > user_path_at_empty+0x36/0x40
> > vfs_fstatat+0x66/0xc0
> > SYSC_newlstat+0x31/0x60
> > ? syscall_trace_enter+0x1d9/0x2f0
> > ? __audit_syscall_exit+0x230/0x2c0
> > SyS_newlstat+0xe/0x10
> > do_syscall_64+0x5b/0xc0
> > entry_SYSCALL64_slow_path+0x25/0x25
> >
> > All other hung threads are stuck in the third stack.
> >
> > We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> > #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> > Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> > RAID 0. The filesystem info is:
> >
> > # xfs_info /dev/md0
> > meta-data=/dev/md0 isize=256 agcount=6, agsize=268435455 blks
> > = sectsz=512 attr=2, projid32bit=0
> > = crc=0 finobt=0 spinodes=0
> > data = bsize=4096 blocks=1562366976, imaxpct=5
> > = sunit=0 swidth=0 blks
> > naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> > log =internal bsize=4096 blocks=521728, version=2
> > = sectsz=512 sunit=1 blks, lazy-count=1
> > realtime =none extsz=4096 blocks=0, rtextents=0
> >
> > The tasks above were reported as hung at 22:51:55, so they'd been dead
> > since 22:49:55 at the latest. The stats from around that time seem
> > pretty normal.
>
> So you've got processes stuck waiting a couple of minutes for IO
> on nvme drives? That doesn't sound like a filesystem problem - that
> sounds more like lost IOs, a hung MD array, or hardware that's gone
> AWOL....
Sure, it could be any of these things. We're trying to get to the
bottom of why the software appears to do nothing while the hardware
appears to be working perfectly. In particular why the kswapd thread
is waiting on I/O but there's none/very little happening.
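When a task looks stuck while I/O is near idle, repeatedly sampling its
kernel stack usually shows whether it is parked on the same wait the whole
time. A sketch, with the path parameterized since reading
/proc/&lt;pid&gt;/stack on a live kswapd needs root (the demo content is
synthetic):

```shell
# Dump a kernel stack file several times; identical samples suggest the
# task is genuinely stuck rather than cycling through short waits.
sample_stack() {
    local stackfile=$1 n=${2:-3} i
    for i in $(seq 1 "$n"); do
        echo "--- sample $i ---"
        cat "$stackfile"
        # On a live task, add a sleep here between samples.
    done
}

# In practice: sample_stack /proc/$(pgrep -o kswapd0)/stack 5
# Demo with a saved copy:
printf '[<0>] xfs_buf_submit_wait\n[<0>] xfs_bwrite\n' > /tmp/kswapd.stack
sample_stack /tmp/kswapd.stack 2
```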
>
> > 22:48:02 memtotal memfree buffers cached dirty slabmem swptotal swpfree _mem_
> > 22:48:07 385598M 2199M 115M 23773M 14M 2435M 0M 0M
>
> Nothing unusual there - still lots of reclaimable page cache, and
> some slabmem, too.
>
> > Something is obviously wrong with atop's analysis of md0 I/O rate
> > here, but you get the idea: I/O slides over a minute from a reasonable
> > rate to near zero, except for one big spike of output.
> >
> > # atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -f
> >
> > 22:48:02 disk busy read/s KB/read writ/s KB/writ avque avserv _mdd_
> > 22:48:07 md0 0% 1208.4 6.9 2569.0 6.6 0.0 0.00 ms
> > 22:48:12 md0 0% 1294.1 7.0 2642.5 6.6 0.0 0.00 ms
> > 22:48:17 md0 0% 1289.0 6.9 2823.5 6.8 0.0 0.00 ms
> > 22:48:22 md0 0% 1376.3 6.9 2662.6 6.7 0.0 0.00 ms
> > 22:48:27 md0 0% 1332.3 6.8 2578.7 6.3 0.0 0.00 ms
> > 22:48:32 md0 0% 1338.2 6.9 2601.7 6.5 0.0 0.00 ms
> > 22:48:37 md0 0% 1133.7 6.8 3172.3 6.9 0.0 0.00 ms
> > 22:48:42 md0 0% 1377.4 6.8 2609.3 6.8 0.0 0.00 ms
> > 22:48:47 md0 0% 1353.4 6.9 2293.6 6.5 0.0 0.00 ms
> > 22:48:52 md0 0% 1291.8 6.9 2200.5 6.3 0.0 0.00 ms
> > 22:48:57 md0 0% 1332.2 7.0 3066.1 6.9 0.0 0.00 ms
> > 22:49:02 md0 0% 1275.6 6.9 3021.2 6.8 0.0 0.00 ms
> > 22:49:07 md0 0% 1306.6 6.4 4407.2 10.4 0.0 0.00 ms
> > 22:49:12 md0 0% 1172.4 6.6 2740.4 7.1 0.0 0.00 ms
> > 22:49:17 md0 0% 468.4 4.9 2769.7 6.6 0.0 0.00 ms
>
> IO rates take a hit here, about 40s before the processes got stuck.
>
> > 22:49:22 md0 0% 422.3 4.9 2950.9 7.0 0.0 0.00 ms
> > 22:49:27 md0 0% 364.7 4.9 2744.3 7.0 0.0 0.00 ms
> > 22:49:32 md0 0% 254.6 4.7 6271.1 8.8 0.0 0.00 ms
> > 22:49:37 md0 0% 289.6 4.6 2108.7 6.8 0.0 0.00 ms
> > 22:49:42 md0 0% 276.1 4.7 2278.3 6.7 0.0 0.00 ms
> > 22:49:47 md0 0% 266.2 4.7 1990.3 7.0 0.0 0.00 ms
> > 22:49:52 md0 0% 274.9 4.7 2051.3 7.0 0.0 0.00 ms
> > 22:49:57 md0 0% 222.7 4.6 1770.4 7.1 0.0 0.00 ms
>
> Processes stick here just as the read rate takes another hit, and...
>
> > 22:50:02 md0 0% 104.1 4.5 22339.5 9.5 0.0 0.00 ms
>
> That's likely a burst of 8KB inode writes from the XFS inode
> reclaim code.
>
> > 22:50:07 md0 0% 132.9 4.4 6208.9 9.1 0.0 0.00 ms
> > 22:50:12 md0 0% 147.2 4.5 755.7 5.1 0.0 0.00 ms
> > 22:50:17 md0 0% 127.0 4.4 722.6 5.3 0.0 0.00 ms
> > 22:50:22 md0 0% 120.5 4.4 703.7 4.9 0.0 0.00 ms
> > 22:50:27 md0 0% 106.5 4.5 681.7 5.0 0.0 0.00 ms
> > 22:50:32 md0 0% 121.1 4.5 756.5 4.7 0.0 0.00 ms
> > 22:50:37 md0 0% 131.9 4.4 696.7 4.8 0.0 0.00 ms
> > 22:50:42 md0 0% 100.9 4.4 669.6 4.7 0.0 0.00 ms
> > 22:50:47 md0 0% 72.2 4.4 691.1 6.7 0.0 0.00 ms
> > 22:50:52 md0 0% 0.0 0.0 320.4 7.6 0.0 0.00 ms
>
> And a minute later everything has ground to a halt.
>
> > 22:50:57 md0 0% 0.2 4.0 115.1 6.5 0.0 0.00 ms
> > 22:51:02 md0 0% 0.0 0.0 62.5 5.7 0.0 0.00 ms
> > 22:51:07 md0 0% 0.2 4.0 23.0 4.2 0.0 0.00 ms
> > 22:51:12 md0 0% 0.2 4.0 22.6 4.1 0.0 0.00 ms
> > 22:51:17 md0 0% 0.0 0.0 22.4 4.1 0.0 0.00 ms
> > 22:51:22 md0 0% 0.2 4.0 26.0 4.4 0.0 0.00 ms
> > 22:51:27 md0 0% 0.0 0.0 25.4 5.6 0.0 0.00 ms
> > 22:51:32 md0 0% 0.0 0.0 22.6 4.1 0.0 0.00 ms
> > 22:51:37 md0 0% 0.0 0.0 21.5 4.1 0.0 0.00 ms
> > 22:51:42 md0 0% 0.0 0.0 26.0 5.4 0.0 0.00 ms
> > 22:51:47 md0 0% 0.0 0.0 22.2 4.1 0.0 0.00 ms
> > 22:51:52 md0 0% 0.0 0.0 23.4 4.1 0.0 0.00 ms
> > 22:51:57 md0 0% 0.2 4.0 26.0 5.6 0.0 0.00 ms
>
> This looks to me like something below the filesystem choking up and
> grinding to a halt. What are all the nvme drives doing over this
> period?
Write rates on the nvme drives are all exactly the md0 rates / 4, so
that seems normal.
>
> Hard to know what is going on at this point, though, but nothing
> at the filesystem or memory reclaim level should be stuck on IO for
> long periods of time on nvme SSDs...
>
> /me wonders if 4.10 had the block layer writeback throttle code in
> it, and if it does whether that is what has gone haywire here.
That is an interesting point. I do see this on a different, healthy box
# cat /sys/kernel/debug/bdi/9\:0/stats
BdiWriteback: 4320 kB
BdiReclaimable: 2304 kB
BdiDirtyThresh: 4834552 kB
DirtyThresh: 4834552 kB
BackgroundThresh: 2414324 kB
BdiDirtied: 5316993504 kB
BdiWritten: 4732763040 kB
BdiWriteBandwidth: 6060 kBps
One hopes that no throttling is based on this estimate of 6MB/s since
this array is capable of more like 10000MB/s. At this time we're
focusing on system tunables that might prevent the machine from
getting into serious trouble with dirty pages or reclaim or whatever.
I know XFS has only a handful of tunables. Do you think it's
advisable to lower the xfssyncd timer from 30s to something else?
Our workload _is_ metadata-heavy; I see about 100 create/remove ops
per second and the log stats are rapidly increasing. By contrast we're
not really pushing the extent allocator at all.
extent_alloc 10755989 500829395 3030459 4148636182
abt 0 0 0 0
blk_map 3012410686 1743133993 196385287 18176676 100618849 659563291 0
bmbt 0 0 0 0
dir 112134 99359626 99358346 49393915
trans 98 4036033830 0
ig 99320520 98796299 5890 524221 0 523020 140279
log 1339041912 2184819896 0 1341870387 1341704946
push_ail 4109383649 0 37995385 2614017 0 136962 25152 3301200 0 47933
xstrat 589377 0
rw 3561425929 1342108567
attr 3297942617 0 0 0
icluster 1452111 945722 3364798
vnodes 1201 0 0 0 99307909 99307909 99307909 0
buf 779252853 2349573 776924344 978078 41302 2328509 0 2491287 1090
abtb2 13886237 90593055 270338 266221 17 15 4555 1481 1188 2529 30 22 47 37 93926009
abtc2 28760802 203677378 13739372 13735284 45 43 3270 1292 1935 1691 57 50 102 93 1413028741
bmbt2 14355745 140867292 7158285 7115081 8 2 4074 4122 25396 4074 230 48 238 50 6974741
ibt2 397071627 776203283 67846 67820 0 0 0 0 0 0 0 0 0 0 413
fibt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qm 0 0 0 0 0 0 0 0
xpc 17666569097216 111928912437311 80399355018480
debug 0
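As a sanity check on the log pressure, the trans and log lines above reduce
to a force-per-transaction ratio with a little awk (field meanings assumed
from the 4.x /proc/fs/xfs/stat layout):

```shell
# trans line: sync, async, empty transactions;
# log line: writes, blocks, noiclogs, forces, force_waits
# (positions assumed from the 4.x xfs stats layout).
awk '/^trans /{ t = $2 + $3 + $4 }
     /^log /  { f = $5 }
     END      { printf "forces=%d trans=%d ratio=%.2f\n", f, t, f / t }' <<'EOF'
trans 98 4036033830 0
log 1339041912 2184819896 0 1341870387 1341704946
EOF
```

which works out to roughly one log force per three transactions.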
Thanks for your thoughts so far.
-jwb
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: Recurring hang in XFS inode reclaim on 4.10
2019-05-24 20:34 ` Jeffrey Baker
@ 2019-05-27 3:32 ` Dave Chinner
2019-05-28 17:16 ` Jeffrey Baker
0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2019-05-27 3:32 UTC
To: Jeffrey Baker; +Cc: linux-xfs
On Fri, May 24, 2019 at 01:34:58PM -0700, Jeffrey Baker wrote:
> On Tue, May 21, 2019 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> > > I have a problem of recurring hangs on machines where I get hung task
> > > warnings for kswapd and many user threads stuck in lstat. At the time,
> > > I/O grinds to a halt, although not quite to zero. These hangs happen
> > > 1-2 times per day on a fleet of several thousand machines. We'd like
> > > to understand the root cause, if it is already known, so we can figure
> > > out the minimum kernel to which we might want to update.
> > >
> > > These are the hung task stacks:
> > >
> > > kswapd0 D 0 279 2 0x00000000
> > > Call Trace:
> > > __schedule+0x232/0x700
> > > schedule+0x36/0x80
> > > schedule_preempt_disabled+0xe/0x10
> > > __mutex_lock_slowpath+0x193/0x290
> > > mutex_lock+0x2f/0x40
> > > xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
> >
> > You're basically running the machine out of memory and there
> > are so many direct reclaimers that all the inode reclaim parallelism in
> > the filesystem has been exhausted and it's blocking waiting for
> > other reclaim to complete.
>
> Sort of. "Out of memory" here means > 2GB free, admittedly that's
> less than 1% of the machine.
By "out of memory" I meant that your machine appears to have been at
or below the memory reclaim watermarks for an extended period of
time, not that there is no free memory at all.
> > > kswapd1 D 0 280 2 0x00000000
> > > Call Trace:
> > > __schedule+0x232/0x700
> > > schedule+0x36/0x80
> > > schedule_timeout+0x235/0x3f0
> > > ? blk_finish_plug+0x2c/0x40
> > > ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> > > wait_for_completion+0xb4/0x140
> > > ? wake_up_q+0x70/0x70
> > > ? xfs_bwrite+0x24/0x60 [xfs]
> > > xfs_buf_submit_wait+0x7f/0x210 [xfs]
> > > xfs_bwrite+0x24/0x60 [xfs]
> > > xfs_reclaim_inode+0x313/0x340 [xfs]
> > > xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
> >
> > Yup, memory reclaim is pushing so hard it is doing direct writeback
> > of dirty inodes.
>
> OK. Is that reflected in vmstat somewhere? I see
> "nr_vmscan_immediate_reclaim" but it's not obvious to me what that is,
> and it wasn't increasing at the moment of this event.
Nope, not for low level filesystem inode reclaim like this. Getting
the filesystem to do writeback from the inode shrinker is relatively
uncommon.
> > > All other hung threads are stuck in the third stack.
> > >
> > > We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> > > #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> > > Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> > > RAID 0. The filesystem info is:
> > >
> > > # xfs_info /dev/md0
> > > meta-data=/dev/md0 isize=256 agcount=6, agsize=268435455 blks
> > > = sectsz=512 attr=2, projid32bit=0
> > > = crc=0 finobt=0 spinodes=0
> > > data = bsize=4096 blocks=1562366976, imaxpct=5
> > > = sunit=0 swidth=0 blks
FWIW, I just noticed you don't have a sunit/swidth set for your
md RAID 0, which means you only have 6 AGs rather than 32. So
there's a lot less allocation and reclaim concurrency in this
filesystem than we'd normally see for a 4x RAID0 stripe....
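For anyone re-making such a filesystem, the stripe geometry can be spelled
out by hand; a sketch with an assumed 512k md chunk size (the command is
printed rather than executed, since running it would destroy the array's
contents):

```shell
# su = md chunk size (assumed 512k here), sw = number of data drives.
# agcount=32 matches what mkfs's multidisk heuristics would pick when
# stripe geometry is visible. Echoed, not run: mkfs destroys data.
echo mkfs.xfs -d su=512k,sw=4,agcount=32 /dev/md0
```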
> > This looks to me like something below the filesystem choking up and
> > grinding to a halt. What are all the nvme drives doing over this
> > period?
>
> Write rates on the nvme drives are all exactly the md0 rates / 4, so
> that seems normal.
Write rates aren't that important - what do the io latencies, queue
depths and device utilisation figures look like?
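Those can be pulled from the raw per-device counters without extra tooling;
a sketch over /sys/block/&lt;dev&gt;/stat, with field positions assumed from
the kernel's block stat documentation and a synthetic demo line:

```shell
# Rough per-device figures from a /sys/block/<dev>/stat line:
# read ios ($1), read ticks ($4), write ios ($5), write ticks ($8),
# in_flight ($9), io_ticks ($10), time_in_queue ($11).
# avg_ms_per_io is total wait time divided by completed IOs.
dev_stat() {
    awk '{ ios = $1 + $5; ticks = $4 + $8;
           printf "ios=%d avg_ms_per_io=%.2f in_flight=%d io_ticks_ms=%d queue_ms=%d\n",
                  ios, (ios ? ticks / ios : 0), $9, $10, $11 }' "$1"
}

# In practice: dev_stat /sys/block/nvme0n1/stat
# Demo with a synthetic sample:
echo "1000 0 8000 400 3000 0 24000 1200 2 1500 1600" > /tmp/devstat
dev_stat /tmp/devstat
```

Diffing io_ticks between two samples gives utilisation; time_in_queue
growing while ios stand still is the signature of requests going in and
never completing.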
> > Hard to know what is going on at this point, though, but nothing
> > at the filesystem or memory reclaim level should be stuck on IO for
> > long periods of time on nvme SSDs...
> >
> > /me wonders if 4.10 had the block layer writeback throttle code in
> > it, and if it does whether that is what has gone haywire here.
>
> That is an interesting point. I do see this on a different, healthy box
>
> # cat /sys/kernel/debug/bdi/9\:0/stats
> BdiWriteback: 4320 kB
> BdiReclaimable: 2304 kB
> BdiDirtyThresh: 4834552 kB
> DirtyThresh: 4834552 kB
> BackgroundThresh: 2414324 kB
> BdiDirtied: 5316993504 kB
> BdiWritten: 4732763040 kB
> BdiWriteBandwidth: 6060 kBps
That's the currently used write bandwidth of the backing device. The
block layer writeback throttle is below this. i.e. This simply
reflects the rate at which data is currently being written to the
block device.
> One hopes that no throttling is based on this estimate of 6MB/s since
> this array is capable of more like 10000MB/s. At this time we're
> focusing on system tunables that might prevent the machine from
> getting into serious trouble with dirty pages or reclaim or whatever.
> I know XFS has only a handful of tunables. Do you think it's
> recommendable to lower the xfssyncd timer from 30s to something else?
> Our workload _is_ metadata-heavy; I see about 100 create/remove ops
> per second
FWIW, that's not what I'd call metadata heavy. XFS on a 4-SSD array with
20 CPU cores can sustain somewhere around 300,000 create/removes a
second before it runs out of CPU. 100/s is barely a blip above
idle....
> and the log stats are rapidly increasing. By contrast we're
> not really pushing the extent allocator at all.
>
> extent_alloc 10755989 500829395 3030459 4148636182
> abt 0 0 0 0
> blk_map 3012410686 1743133993 196385287 18176676 100618849 659563291 0
> bmbt 0 0 0 0
> dir 112134 99359626 99358346 49393915
Ok, so 100M files created and removed since boot.
> trans 98 4036033830 0
All async transactions.
> ig 99320520 98796299 5890 524221 0 523020 140279
> log 1339041912 2184819896 0 1341870387 1341704946
writes blocks | forces force_waits
noiclogs
Ok, so why are there so many log forces (1.34 /billion/)? There's
roughly a log force in every 3rd transaction, but they aren't from
the transaction commit (because they are async transactions).
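The ratio is straightforward to check from the counters quoted above:

```python
# Quick check of the log force rate using the counters quoted above.
log_forces = 1_341_870_387    # 4th field of the "log" line (forces)
transactions = 4_036_033_830  # async transaction count from "trans"

print(transactions / log_forces)  # ~3.0, i.e. a log force every ~3rd transaction
```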
> push_ail 4109383649 0 37995385 2614017 0 136962 25152 3301200 0 47933
try_logspace | pushes success | pinned | flushing | flush
sleep_logspace pushbuf locked restarts
They aren't coming from the metadata writeback code (pinned count),
and the log is not blocking on free space. i.e. there's no metadata
writeback bottleneck occurring.
> xstrat 589377 0
> rw 3561425929 1342108567
Oh, 1.34 billion writes. You're doing O_DSYNC writes, yes? And lots
of them to every file that is created?
> attr 3297942617 0 0 0
> icluster 1452111 945722 3364798
> vnodes 1201 0 0 0 99307909 99307909 99307909 0
> buf 779252853 2349573 776924344 978078 41302 2328509 0 2491287 1090
> abtb2 13886237 90593055 270338 266221 17 15 4555 1481 1188 2529 30 22
> 47 37 93926009
> abtc2 28760802 203677378 13739372 13735284 45 43 3270 1292 1935 1691
> 57 50 102 93 1413028741
> bmbt2 14355745 140867292 7158285 7115081 8 2 4074 4122 25396 4074 230
> 48 238 50 6974741
> ibt2 397071627 776203283 67846 67820 0 0 0 0 0 0 0 0 0 0 413
For 100M files created/removed, to only see ~68000 inode btree
record inserts and removes implies that the filesystem is
efficiently reusing the freed inodes. i.e. there's pretty much a
steady state of inodes in use in the workload....
> fibt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> qm 0 0 0 0 0 0 0 0
> xpc 17666569097216 111928912437311 80399355018480
write_bytes read_bytes
Assuming this is the only substantially active XFS filesystem on the
system, that's 101TiB written to those drives, in 1.34 billion writes,
which is an average of 82KiB per write. Assuming that the 32 bit
write counter has not wrapped. Given that the transaction counter is
nearly at the wrap point, and the read counter is at 3.5B for ~75TiB
(average 22KiB), let's assume that the write counter has wrapped
and we have ~5.5 billion writes. That gives an average write of just
on 20KiB which kinda matches the read....
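For reference, the arithmetic (counter values copied from the rw/xpc lines
above; the single-wrap adjustment is an assumption, not something the stats
can confirm):

```python
# Average write size with and without assuming one wrap of the 32-bit
# write counter; values copied from the "rw" and "xpc" lines above.
KiB = 1024
write_bytes = 111_928_912_437_311  # xpc write_bytes, ~101.8 TiB
writes = 1_342_108_567             # rw write count (32-bit, may have wrapped)
reads = 3_561_425_929              # rw read count
read_bytes = 80_399_355_018_480    # xpc read_bytes, ~73 TiB

print(write_bytes / writes / KiB)            # ~81 KiB/write if no wrap
print(write_bytes / (writes + 2**32) / KiB)  # ~19 KiB/write assuming one wrap
print(read_bytes / reads / KiB)              # ~22 KiB/read
```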
So the drives look like they are being constantly hammered with
small, synchronous IOs. This means write amplification is likely to
be a substantial problem for them. They've had a /lot/ of data
written to them and are under constant load so there's no time for
idle cleanup, so is it possible that the drives themselves are
having internal garbage collection related slowdowns?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Recurring hang in XFS inode reclaim on 4.10
2019-05-27 3:32 ` Dave Chinner
@ 2019-05-28 17:16 ` Jeffrey Baker
2019-05-29 22:20 ` Dave Chinner
0 siblings, 1 reply; 6+ messages in thread
From: Jeffrey Baker @ 2019-05-28 17:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Sun, May 26, 2019 at 8:32 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, May 24, 2019 at 01:34:58PM -0700, Jeffrey Baker wrote:
> > On Tue, May 21, 2019 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> > > > I have a problem of recurring hangs on machines where I get hung task
> > > > warnings for kswapd and many user threads stuck in lstat. At the time,
> > > > I/O grinds to a halt, although not quite to zero. These hangs happen
> > > > 1-2 times per day on a fleet of several thousand machines. We'd like
> > > > to understand the root cause, if it is already known, so we can figure
> > > > out the minimum kernel to which we might want to update.
> > > >
> > > > These are the hung task stacks:
> > > >
> > > > kswapd0 D 0 279 2 0x00000000
> > > > Call Trace:
> > > > __schedule+0x232/0x700
> > > > schedule+0x36/0x80
> > > > schedule_preempt_disabled+0xe/0x10
> > > > __mutex_lock_slowpath+0x193/0x290
> > > > mutex_lock+0x2f/0x40
> > > > xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
> > >
> > > You're basically running the machine out of memory and there
> > > are so many direct reclaimers that all the inode reclaim parallelism in
> > > the filesystem has been exhausted and it's blocking waiting for
> > > other reclaim to complete.
> >
> > Sort of. "Out of memory" here means > 2GB free, admittedly that's
> > less than 1% of the machine.
>
> By "out of memory" I meant that your machine appears to have been at
> or below the memory reclaim watermarks for an extended period of
> time, not that there is no free memory at all.
Indeed, we always have one NUMA zone at the low water mark, although
we have zone reclaim disabled.
>
> > > > kswapd1 D 0 280 2 0x00000000
> > > > Call Trace:
> > > > __schedule+0x232/0x700
> > > > schedule+0x36/0x80
> > > > schedule_timeout+0x235/0x3f0
> > > > ? blk_finish_plug+0x2c/0x40
> > > > ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> > > > wait_for_completion+0xb4/0x140
> > > > ? wake_up_q+0x70/0x70
> > > > ? xfs_bwrite+0x24/0x60 [xfs]
> > > > xfs_buf_submit_wait+0x7f/0x210 [xfs]
> > > > xfs_bwrite+0x24/0x60 [xfs]
> > > > xfs_reclaim_inode+0x313/0x340 [xfs]
> > > > xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
> > >
> > > Yup, memory reclaim is pushing so hard it is doing direct writeback
> > > of dirty inodes.
> >
> > OK. Is that reflected in vmstat somewhere? I see
> > "nr_vmscan_immediate_reclaim" but it's not obvious to me what that is,
> > and it wasn't increasing at the moment of this event.
>
> Nope, not for low level filesystem inode reclaim like this. Getting
> the filesystem to do writeback from the inode shrinker is relatively
> uncommon.
>
> > > > All other hung threads are stuck in the third stack.
> > > >
> > > > We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> > > > #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> > > > Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> > > > RAID 0. The filesystem info is:
> > > >
> > > > # xfs_info /dev/md0
> > > > meta-data=/dev/md0 isize=256 agcount=6, agsize=268435455 blks
> > > > = sectsz=512 attr=2, projid32bit=0
> > > > = crc=0 finobt=0 spinodes=0
> > > > data = bsize=4096 blocks=1562366976, imaxpct=5
> > > > = sunit=0 swidth=0 blks
>
> FWIW, I just noticed you don't have a sunit/swidth set for your
> md RAID 0, which means you only have 6 AGs rather than 32. So
> there's a lot less allocation and reclaim concurrency in this
> filesystem than we'd normally see for a 4x RAID0 stripe....
To some extent this has been a forensic exercise for me to figure out
how and why these filesystems were deployed like this. I think it was
believed at the time that su/sw has a magic automatic value of 0.
>
> > > This looks to me like something below the filesystem choking up and
> > > grinding to a halt. What are all the nvme drives doing over this
> > > period?
> >
> > Write rates on the nvme drives are all exactly the md0 rates / 4, so
> > that seems normal.
>
> Write rates aren't that important - what do the io latencies, queue
> depths and device utilisation figures look like?
10s of microseconds, ~zero, and ~zero respectively.
> > > Hard to know what is going on at this point, though, but nothing
> > > at the filesystem or memory reclaim level should be stuck on IO for
> > > long periods of time on nvme SSDs...
> > >
> > > /me wonders if 4.10 had the block layer writeback throttle code in
> > > it, and if it does whether that is what has gone haywire here.
> >
> > That is an interesting point. I do see this on a different, healthy box
> >
> > # cat /sys/kernel/debug/bdi/9\:0/stats
> > BdiWriteback: 4320 kB
> > BdiReclaimable: 2304 kB
> > BdiDirtyThresh: 4834552 kB
> > DirtyThresh: 4834552 kB
> > BackgroundThresh: 2414324 kB
> > BdiDirtied: 5316993504 kB
> > BdiWritten: 4732763040 kB
> > BdiWriteBandwidth: 6060 kBps
>
> That's the currently used write bandwidth of the backing device. The
> block layer writeback throttle is below this. i.e. This simply
> reflects the rate at which data is currently being written to the
> block device.
>
> > One hopes that no throttling is based on this estimate of 6MB/s since
> > this array is capable of more like 10000MB/s. At this time we're
> > focusing on system tunables that might prevent the machine from
> > getting into serious trouble with dirty pages or reclaim or whatever.
> > I know XFS has only a handful of tunables. Do you think it's
> > recommendable to lower the xfssyncd timer from 30s to something else?
> > Our workload _is_ metadata-heavy; I see about 100 create/remove ops
> > per second
>
> FWIW, that's not what I'd call metadata heavy. XFS on a 4-SSD array with
> 20 CPU cores can sustain somewhere around 300,000 create/removes a
> second before it runs out of CPU. 100/s is barely a blip above
> idle....
Quite. That's part of the mystery: why the machine grinds to a halt
while barely using its capabilities. What I meant, though, was
that the metadata rate is fairly high in my experience in proportion
to the amount of other I/O, which is also pretty low here.
>
> > and the log stats are rapidly increasing. By contrast we're
> > not really pushing the extent allocator at all.
> >
> > extent_alloc 10755989 500829395 3030459 4148636182
> > abt 0 0 0 0
> > blk_map 3012410686 1743133993 196385287 18176676 100618849 659563291 0
> > bmbt 0 0 0 0
> > dir 112134 99359626 99358346 49393915
>
> Ok, so 100M files created and removed since boot.
>
> > trans 98 4036033830 0
>
> All async transactions.
>
> > ig 99320520 98796299 5890 524221 0 523020 140279
> > log 1339041912 2184819896 0 1341870387 1341704946
> writes blocks | forces force_waits
> noiclogs
>
> Ok, so why are there so many log forces (1.34 /billion/)? There's
> roughly a log force in every 3rd transaction, but they aren't from
> the transaction commit (because they are async transactions).
>
> > push_ail 4109383649 0 37995385 2614017 0 136962 25152 3301200 0 47933
> try_logspace | pushes success | pinned | flushing | flush
> sleep_logspace pushbuf locked restarts
>
> They aren't coming from the metadata writeback code (pinned count),
> and the log is not blocking on free space. i.e. there's no metadata
> writeback bottleneck occurring.
>
> > xstrat 589377 0
> > rw 3561425929 1342108567
>
> Oh, 1.34 billion writes. You're doing O_DSYNC writes, yes? And lots
> of them to every file that is created?
>
> > attr 3297942617 0 0 0
> > icluster 1452111 945722 3364798
> > vnodes 1201 0 0 0 99307909 99307909 99307909 0
> > buf 779252853 2349573 776924344 978078 41302 2328509 0 2491287 1090
> > abtb2 13886237 90593055 270338 266221 17 15 4555 1481 1188 2529 30 22
> > 47 37 93926009
> > abtc2 28760802 203677378 13739372 13735284 45 43 3270 1292 1935 1691
> > 57 50 102 93 1413028741
> > bmbt2 14355745 140867292 7158285 7115081 8 2 4074 4122 25396 4074 230
> > 48 238 50 6974741
> > ibt2 397071627 776203283 67846 67820 0 0 0 0 0 0 0 0 0 0 413
>
> For 100M files created/removed, to only see ~68000 inode btree
> record inserts and removes implies that the filesystem is
> efficiently reusing the freed inodes. i.e. there's pretty much a
> steady state of inodes in use in the workload....
>
> > fibt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > qm 0 0 0 0 0 0 0 0
> > xpc 17666569097216 111928912437311 80399355018480
> write_bytes read_bytes
>
> Assuming this is the only substantially active XFS filesystem on the
> system, that's 101TiB written to those drives, in 1.34 billion writes,
> which is an average of 82KiB per write. Assuming that the 32 bit
> write counter has not wrapped. Given that the transaction counter is
> nearly at the wrap point, and the read counter is at 3.5B for ~75TiB
> (average 22KiB), let's assume that the write counter has wrapped
> and we have ~5.5 billion writes. That gives an average write of just
> on 20KiB which kinda matches the read....
>
> So the drives look like they are being constantly hammered with
> small, synchronous IOs. This means write amplification is likely to
> be a substantial problem for them. They've had a /lot/ of data
> written to them and are under constant load so there's no time for
> idle cleanup, so is it possible that the drives themselves are
> having internal garbage collection related slowdowns?
That's certainly not impossible, although these are Very Fancy(tm)
devices. I'll have a look at the logs.
Thanks again,
jwb
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: Recurring hang in XFS inode reclaim on 4.10
2019-05-28 17:16 ` Jeffrey Baker
@ 2019-05-29 22:20 ` Dave Chinner
0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2019-05-29 22:20 UTC (permalink / raw)
To: Jeffrey Baker; +Cc: linux-xfs
On Tue, May 28, 2019 at 10:16:39AM -0700, Jeffrey Baker wrote:
> On Sun, May 26, 2019 at 8:32 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, May 24, 2019 at 01:34:58PM -0700, Jeffrey Baker wrote:
> > > Write rates on the nvme drives are all exactly the md0 rates / 4, so
> > > that seems normal.
> >
> > Write rates aren't that important - what do the io latencies, queue
> > depths and device utilisation figures look like?
>
> 10s of microseconds, ~zero, and ~zero respectively.
So it sounds like the devices aren't showing signs of slowness, but
the filesystem and memory reclaim are sitting there twiddling their
thumbs waiting for IO to complete?
Perhaps it might be worthwhile running a latency trace (e.g. with
ftrace or one of the newfangled bpf tools) to find out where
everything is blocking and how long they are waiting for things to
occur. That might give some more insight into what resource we are
waiting to be released to make progress...
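For example, a trivial post-processor over a function_graph ftrace dump can
pull out the worst per-function latencies. The line format assumed here
varies between kernels, so treat the regex as a starting point:

```python
# Sketch: find the longest-running functions in a function_graph ftrace
# dump. Assumes duration annotations like "1234.567 us" on return lines
# and "/* func_name */" comments on collapsed function exits.
import re

DURATION = re.compile(r'([\d.]+)\s*us\s*\|.*/\*\s*(\S+)\s*\*/')

def worst_latencies(trace_text, top=5):
    """Return [(function, worst_duration_us)] sorted worst-first."""
    worst = {}
    for line in trace_text.splitlines():
        m = DURATION.search(line)
        if m:
            dur, fn = float(m.group(1)), m.group(2)
            worst[fn] = max(worst.get(fn, 0.0), dur)
    return sorted(worst.items(), key=lambda kv: -kv[1])[:top]
```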
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
2019-05-21 16:21 Recurring hang in XFS inode reclaim on 4.10 Jeffrey Baker
2019-05-21 22:49 ` Dave Chinner
2019-05-24 20:34 ` Jeffrey Baker
2019-05-27 3:32 ` Dave Chinner
2019-05-28 17:16 ` Jeffrey Baker
2019-05-29 22:20 ` Dave Chinner