* Recurring hang in XFS inode reclaim on 4.10
@ 2019-05-21 16:21 Jeffrey Baker
  2019-05-21 22:49 ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Jeffrey Baker @ 2019-05-21 16:21 UTC (permalink / raw)
  To: linux-xfs

I have a problem of recurring hangs on machines where I get hung task
warnings for kswapd and many user threads stuck in lstat. At the time,
I/O grinds to a halt, although not quite to zero. These hangs happen
1-2 times per day on a fleet of several thousand machines. We'd like
to understand the root cause, if it is already known, so we can figure
out the minimum kernel to which we might want to update.

These are the hung task stacks:

kswapd0         D    0   279      2 0x00000000
Call Trace:
 __schedule+0x232/0x700
 schedule+0x36/0x80
 schedule_preempt_disabled+0xe/0x10
 __mutex_lock_slowpath+0x193/0x290
 mutex_lock+0x2f/0x40
 xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
 ? enqueue_entity+0x118/0x640
 ? check_preempt_curr+0x54/0x90
 ? ttwu_do_wakeup+0x19/0xe0
 ? ttwu_do_activate+0x6f/0x80
 ? try_to_wake_up+0x59/0x3e0
 ? wake_up_process+0x15/0x20
 xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
 xfs_fs_free_cached_objects+0x19/0x20 [xfs]
 super_cache_scan+0x191/0x1a0
 shrink_slab.part.40+0x1fa/0x430
 shrink_slab+0x29/0x30
 shrink_node+0x108/0x320
 kswapd+0x34b/0x720
 kthread+0x109/0x140
 ? mem_cgroup_shrink_node+0x180/0x180
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x2c/0x40

kswapd1         D    0   280      2 0x00000000
Call Trace:
 __schedule+0x232/0x700
 schedule+0x36/0x80
 schedule_timeout+0x235/0x3f0
 ? blk_finish_plug+0x2c/0x40
 ? _xfs_buf_ioapply+0x334/0x460 [xfs]
 wait_for_completion+0xb4/0x140
 ? wake_up_q+0x70/0x70
 ? xfs_bwrite+0x24/0x60 [xfs]
 xfs_buf_submit_wait+0x7f/0x210 [xfs]
 xfs_bwrite+0x24/0x60 [xfs]
 xfs_reclaim_inode+0x313/0x340 [xfs]
 xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
 ? sched_clock+0x9/0x10
 ? sched_clock_cpu+0x8f/0xa0
 ? try_to_wake_up+0x59/0x3e0
 ? wake_up_process+0x15/0x20
 xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
 xfs_fs_free_cached_objects+0x19/0x20 [xfs]
 super_cache_scan+0x191/0x1a0
 shrink_slab.part.40+0x1fa/0x430
 shrink_slab+0x29/0x30
 shrink_node+0x108/0x320
 kswapd+0x34b/0x720
 kthread+0x109/0x140
 ? mem_cgroup_shrink_node+0x180/0x180
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x2c/0x40

mysqld          D    0 89015 116527 0x00000080
Call Trace:
 __schedule+0x232/0x700
 ? __remove_hrtimer+0x3c/0x70
 schedule+0x36/0x80
 rwsem_down_read_failed+0xf9/0x150
 ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
 call_rwsem_down_read_failed+0x18/0x30
 ? xfs_trans_roll+0x40/0x40 [xfs]
 down_read+0x20/0x40
 xfs_ilock+0xe5/0x110 [xfs]
 xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
 xfs_attr_get+0xd3/0x180 [xfs]
 xfs_xattr_get+0x4b/0x70 [xfs]
 __vfs_getxattr+0x57/0x70
 get_vfs_caps_from_disk+0x59/0x100
 audit_copy_inode+0x6d/0xb0
 __audit_inode+0x1bb/0x320
 filename_lookup+0x128/0x180
 ? __check_object_size+0x108/0x1e3
 ? path_get+0x27/0x30
 ? __audit_getname+0x96/0xb0
 user_path_at_empty+0x36/0x40
 vfs_fstatat+0x66/0xc0
 SYSC_newlstat+0x31/0x60
 ? syscall_trace_enter+0x1d9/0x2f0
 ? __audit_syscall_exit+0x230/0x2c0
 SyS_newlstat+0xe/0x10
 do_syscall_64+0x5b/0xc0
 entry_SYSCALL64_slow_path+0x25/0x25

All other hung threads are stuck in the third stack.

We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
#44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
RAID 0.  The filesystem info is:

# xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=6, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=1562366976, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The tasks above were reported as hung at 22:51:55, so they'd been dead
since 22:49:55 at the latest.  The stats from around that time seem
pretty normal.

# atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -m

22:48:02  memtotal memfree buffers cached dirty slabmem  swptotal swpfree _mem_
22:48:07   385598M   2199M    115M 23773M   14M   2435M        0M      0M
22:48:12   385598M   2151M    115M 23777M   15M   2435M        0M      0M
22:48:17   385598M   2222M    115M 23781M   16M   2435M        0M      0M
22:48:22   385598M   2213M    115M 23785M    5M   2435M        0M      0M
22:48:27   385598M   2198M    115M 23790M   11M   2435M        0M      0M
22:48:32   385598M   2209M    115M 23791M    9M   2435M        0M      0M
22:48:37   385598M   2205M    115M 23794M    0M   2435M        0M      0M
22:48:42   385598M   2199M    115M 23798M    2M   2435M        0M      0M
22:48:47   385598M   2119M    115M 23799M    1M   2435M        0M      0M
22:48:52   385598M   2106M    115M 23804M    4M   2450M        0M      0M
22:48:57   385598M   2095M    115M 23807M    4M   2450M        0M      0M
22:49:02   385598M   1997M    116M 23814M    6M   2450M        0M      0M
22:49:07   385598M   1966M    116M 23839M    4M   2451M        0M      0M
22:49:12   385598M   1989M    116M 23856M   22M   2451M        0M      0M
22:49:17   385598M   1797M    117M 23863M   29M   2452M        0M      0M
22:49:22   385598M   1920M    118M 23902M   82M   2451M        0M      0M
22:49:27   385598M   1875M    118M 23905M   83M   2451M        0M      0M
22:49:32   385598M   1794M    120M 23925M   94M   2458M        0M      0M
22:49:37   385598M   1707M    120M 23953M   35M   2459M        0M      0M
22:49:42   385598M   1618M    120M 23985M   66M   2460M        0M      0M
22:49:47   385598M   1535M    120M 24016M   95M   2460M        0M      0M
22:49:52   385598M   1499M    121M 24036M   15M   2460M        0M      0M
22:49:57   385598M   1440M    121M 24070M   48M   2461M        0M      0M
22:50:02   385598M   1368M    121M 24095M   73M   2462M        0M      0M
22:50:07   385598M   1355M    121M 24096M   73M   2461M        0M      0M
22:50:12   385598M   1368M    121M 24101M   77M   2462M        0M      0M
22:50:17   385598M   1362M    121M 24102M   78M   2461M        0M      0M
22:50:22   385598M   1339M    121M 24106M    1M   2461M        0M      0M
22:50:27   385598M   1329M    121M 24108M    2M   2461M        0M      0M
22:50:32   385598M   1354M    121M 24109M    2M   2461M        0M      0M
22:50:37   385598M   1337M    121M 24113M    6M   2461M        0M      0M
22:50:42   385598M   1316M    121M 24114M    6M   2461M        0M      0M
22:50:47   385598M   1309M    121M 24119M   11M   2461M        0M      0M
22:50:52   385598M   1310M    121M 24119M    2M   2461M        0M      0M
22:50:57   385598M   1305M    121M 24123M    6M   2461M        0M      0M
22:51:02   385598M   1289M    121M 24123M    7M   2461M        0M      0M
22:51:07   385598M   1286M    121M 24127M    8M   2462M        0M      0M
22:51:12   385598M   1312M    121M 24128M    8M   2462M        0M      0M
22:51:17   385598M   1291M    121M 24128M    9M   2462M        0M      0M
22:51:22   385598M   1270M    121M 24132M   13M   2462M        0M      0M
22:51:27   385598M   1283M    121M 24112M    1M   2462M        0M      0M
22:51:32   385598M   1279M    121M 24116M    5M   2462M        0M      0M
22:51:37   385598M   1273M    121M 24116M    6M   2462M        0M      0M
22:51:42   385598M   1275M    121M 24115M    4M   2462M        0M      0M
22:51:47   385598M   1302M    121M 24081M    6M   2462M        0M      0M
22:51:52   385598M   1289M    121M 24085M   10M   2462M        0M      0M
22:51:57   385598M   1285M    121M 24086M    3M   2462M        0M      0M

Something is obviously wrong with atop's analysis of md0 I/O rate
here, but you get the idea: I/O slides over a minute from a reasonable
rate to near zero, except for one big spike of output.

# atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -f

22:48:02  disk           busy read/s KB/read  writ/s KB/writ avque avserv _mdd_
22:48:07  md0              0% 1208.4     6.9  2569.0     6.6   0.0   0.00 ms
22:48:12  md0              0% 1294.1     7.0  2642.5     6.6   0.0   0.00 ms
22:48:17  md0              0% 1289.0     6.9  2823.5     6.8   0.0   0.00 ms
22:48:22  md0              0% 1376.3     6.9  2662.6     6.7   0.0   0.00 ms
22:48:27  md0              0% 1332.3     6.8  2578.7     6.3   0.0   0.00 ms
22:48:32  md0              0% 1338.2     6.9  2601.7     6.5   0.0   0.00 ms
22:48:37  md0              0% 1133.7     6.8  3172.3     6.9   0.0   0.00 ms
22:48:42  md0              0% 1377.4     6.8  2609.3     6.8   0.0   0.00 ms
22:48:47  md0              0% 1353.4     6.9  2293.6     6.5   0.0   0.00 ms
22:48:52  md0              0% 1291.8     6.9  2200.5     6.3   0.0   0.00 ms
22:48:57  md0              0% 1332.2     7.0  3066.1     6.9   0.0   0.00 ms
22:49:02  md0              0% 1275.6     6.9  3021.2     6.8   0.0   0.00 ms
22:49:07  md0              0% 1306.6     6.4  4407.2    10.4   0.0   0.00 ms
22:49:12  md0              0% 1172.4     6.6  2740.4     7.1   0.0   0.00 ms
22:49:17  md0              0%  468.4     4.9  2769.7     6.6   0.0   0.00 ms
22:49:22  md0              0%  422.3     4.9  2950.9     7.0   0.0   0.00 ms
22:49:27  md0              0%  364.7     4.9  2744.3     7.0   0.0   0.00 ms
22:49:32  md0              0%  254.6     4.7  6271.1     8.8   0.0   0.00 ms
22:49:37  md0              0%  289.6     4.6  2108.7     6.8   0.0   0.00 ms
22:49:42  md0              0%  276.1     4.7  2278.3     6.7   0.0   0.00 ms
22:49:47  md0              0%  266.2     4.7  1990.3     7.0   0.0   0.00 ms
22:49:52  md0              0%  274.9     4.7  2051.3     7.0   0.0   0.00 ms
22:49:57  md0              0%  222.7     4.6  1770.4     7.1   0.0   0.00 ms
22:50:02  md0              0%  104.1     4.5 22339.5     9.5   0.0   0.00 ms
22:50:07  md0              0%  132.9     4.4  6208.9     9.1   0.0   0.00 ms
22:50:12  md0              0%  147.2     4.5   755.7     5.1   0.0   0.00 ms
22:50:17  md0              0%  127.0     4.4   722.6     5.3   0.0   0.00 ms
22:50:22  md0              0%  120.5     4.4   703.7     4.9   0.0   0.00 ms
22:50:27  md0              0%  106.5     4.5   681.7     5.0   0.0   0.00 ms
22:50:32  md0              0%  121.1     4.5   756.5     4.7   0.0   0.00 ms
22:50:37  md0              0%  131.9     4.4   696.7     4.8   0.0   0.00 ms
22:50:42  md0              0%  100.9     4.4   669.6     4.7   0.0   0.00 ms
22:50:47  md0              0%   72.2     4.4   691.1     6.7   0.0   0.00 ms
22:50:52  md0              0%    0.0     0.0   320.4     7.6   0.0   0.00 ms
22:50:57  md0              0%    0.2     4.0   115.1     6.5   0.0   0.00 ms
22:51:02  md0              0%    0.0     0.0    62.5     5.7   0.0   0.00 ms
22:51:07  md0              0%    0.2     4.0    23.0     4.2   0.0   0.00 ms
22:51:12  md0              0%    0.2     4.0    22.6     4.1   0.0   0.00 ms
22:51:17  md0              0%    0.0     0.0    22.4     4.1   0.0   0.00 ms
22:51:22  md0              0%    0.2     4.0    26.0     4.4   0.0   0.00 ms
22:51:27  md0              0%    0.0     0.0    25.4     5.6   0.0   0.00 ms
22:51:32  md0              0%    0.0     0.0    22.6     4.1   0.0   0.00 ms
22:51:37  md0              0%    0.0     0.0    21.5     4.1   0.0   0.00 ms
22:51:42  md0              0%    0.0     0.0    26.0     5.4   0.0   0.00 ms
22:51:47  md0              0%    0.0     0.0    22.2     4.1   0.0   0.00 ms
22:51:52  md0              0%    0.0     0.0    23.4     4.1   0.0   0.00 ms
22:51:57  md0              0%    0.2     4.0    26.0     5.6   0.0   0.00 ms

Any help would be appreciated.

-jwb


* Re: Recurring hang in XFS inode reclaim on 4.10
  2019-05-21 16:21 Recurring hang in XFS inode reclaim on 4.10 Jeffrey Baker
@ 2019-05-21 22:49 ` Dave Chinner
  2019-05-24 20:34   ` Jeffrey Baker
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2019-05-21 22:49 UTC (permalink / raw)
  To: Jeffrey Baker; +Cc: linux-xfs

On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> I have a problem of recurring hangs on machines where I get hung task
> warnings for kswapd and many user threads stuck in lstat. At the time,
> I/O grinds to a halt, although not quite to zero. These hangs happen
> 1-2 times per day on a fleet of several thousand machines. We'd like
> to understand the root cause, if it is already known, so we can figure
> out the minimum kernel to which we might want to update.
> 
> These are the hung task stacks:
> 
> kswapd0         D    0   279      2 0x00000000
> Call Trace:
>  __schedule+0x232/0x700
>  schedule+0x36/0x80
>  schedule_preempt_disabled+0xe/0x10
>  __mutex_lock_slowpath+0x193/0x290
>  mutex_lock+0x2f/0x40
>  xfs_reclaim_inodes_ag+0x288/0x330 [xfs]

You're basically running the machine out of memory and there
are so many direct reclaimers that all the inode reclaim parallelism in
the filesystem has been exhausted and it's blocking waiting for
other reclaim to complete.
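
If you want to confirm that from the host side, the direct reclaim
counters in /proc/vmstat are the easiest place to look. A rough
sketch - the exact counter names vary a bit between kernel versions:

# grep -E 'allocstall|pgscan_direct|pgsteal_direct|pgscan_kswapd' /proc/vmstat

If allocstall and pgscan_direct are climbing rapidly in the lead-up
to the hang, that's lots of direct reclaimers piling into the
shrinkers.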

> kswapd1         D    0   280      2 0x00000000
> Call Trace:
>  __schedule+0x232/0x700
>  schedule+0x36/0x80
>  schedule_timeout+0x235/0x3f0
>  ? blk_finish_plug+0x2c/0x40
>  ? _xfs_buf_ioapply+0x334/0x460 [xfs]
>  wait_for_completion+0xb4/0x140
>  ? wake_up_q+0x70/0x70
>  ? xfs_bwrite+0x24/0x60 [xfs]
>  xfs_buf_submit_wait+0x7f/0x210 [xfs]
>  xfs_bwrite+0x24/0x60 [xfs]
>  xfs_reclaim_inode+0x313/0x340 [xfs]
>  xfs_reclaim_inodes_ag+0x208/0x330 [xfs]

Yup, memory reclaim is pushing so hard it is doing direct writeback
of dirty inodes.

> mysqld          D    0 89015 116527 0x00000080
> Call Trace:
>  __schedule+0x232/0x700
>  ? __remove_hrtimer+0x3c/0x70
>  schedule+0x36/0x80
>  rwsem_down_read_failed+0xf9/0x150
>  ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
>  call_rwsem_down_read_failed+0x18/0x30
>  ? xfs_trans_roll+0x40/0x40 [xfs]
>  down_read+0x20/0x40
>  xfs_ilock+0xe5/0x110 [xfs]
>  xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
>  xfs_attr_get+0xd3/0x180 [xfs]
>  xfs_xattr_get+0x4b/0x70 [xfs]
>  __vfs_getxattr+0x57/0x70
>  get_vfs_caps_from_disk+0x59/0x100
>  audit_copy_inode+0x6d/0xb0
>  __audit_inode+0x1bb/0x320
>  filename_lookup+0x128/0x180

Somebody else has the inode locked doing something with it, so the
audit code has to wait for it to finish before it can continue. This
likely has nothing to do with memory reclaim, and more likely is
backed up behind a transaction or something doing IO....

>  ? __check_object_size+0x108/0x1e3
>  ? path_get+0x27/0x30
>  ? __audit_getname+0x96/0xb0
>  user_path_at_empty+0x36/0x40
>  vfs_fstatat+0x66/0xc0
>  SYSC_newlstat+0x31/0x60
>  ? syscall_trace_enter+0x1d9/0x2f0
>  ? __audit_syscall_exit+0x230/0x2c0
>  SyS_newlstat+0xe/0x10
>  do_syscall_64+0x5b/0xc0
>  entry_SYSCALL64_slow_path+0x25/0x25
> 
> All other hung threads are stuck in the third stack.
> 
> We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> RAID 0.  The filesystem info is:
> 
> # xfs_info /dev/md0
> meta-data=/dev/md0               isize=256    agcount=6, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=0
>          =                       crc=0        finobt=0 spinodes=0
> data     =                       bsize=4096   blocks=1562366976, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> The tasks above were reported as hung at 22:51:55, so they'd been dead
> since 22:49:55 at the latest.  The stats from around that time seem
> pretty normal.

So you've got processes stuck waiting a couple of minutes for IO
on nvme drives? That doesn't sound like a filesystem problem - that
sounds more like lost IOs, a hung MD array, or hardware that's gone
AWOL....

> 22:48:02  memtotal memfree buffers cached dirty slabmem  swptotal swpfree _mem_
> 22:48:07   385598M   2199M    115M 23773M   14M   2435M        0M      0M

Nothing unusual there - still lots of reclaimable page cache, and
some slabmem, too.

> Something is obviously wrong with atop's analysis of md0 I/O rate
> here, but you get the idea: I/O slides over a minute from a reasonable
> rate to near zero, except for one big spike of output.
> 
> # atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -f
> 
> 22:48:02  disk           busy read/s KB/read  writ/s KB/writ avque avserv _mdd_
> 22:48:07  md0              0% 1208.4     6.9  2569.0     6.6   0.0   0.00 ms
> 22:48:12  md0              0% 1294.1     7.0  2642.5     6.6   0.0   0.00 ms
> 22:48:17  md0              0% 1289.0     6.9  2823.5     6.8   0.0   0.00 ms
> 22:48:22  md0              0% 1376.3     6.9  2662.6     6.7   0.0   0.00 ms
> 22:48:27  md0              0% 1332.3     6.8  2578.7     6.3   0.0   0.00 ms
> 22:48:32  md0              0% 1338.2     6.9  2601.7     6.5   0.0   0.00 ms
> 22:48:37  md0              0% 1133.7     6.8  3172.3     6.9   0.0   0.00 ms
> 22:48:42  md0              0% 1377.4     6.8  2609.3     6.8   0.0   0.00 ms
> 22:48:47  md0              0% 1353.4     6.9  2293.6     6.5   0.0   0.00 ms
> 22:48:52  md0              0% 1291.8     6.9  2200.5     6.3   0.0   0.00 ms
> 22:48:57  md0              0% 1332.2     7.0  3066.1     6.9   0.0   0.00 ms
> 22:49:02  md0              0% 1275.6     6.9  3021.2     6.8   0.0   0.00 ms
> 22:49:07  md0              0% 1306.6     6.4  4407.2    10.4   0.0   0.00 ms
> 22:49:12  md0              0% 1172.4     6.6  2740.4     7.1   0.0   0.00 ms
> 22:49:17  md0              0%  468.4     4.9  2769.7     6.6   0.0   0.00 ms

IO rates take a hit here, about 40s before the processes got stuck.

> 22:49:22  md0              0%  422.3     4.9  2950.9     7.0   0.0   0.00 ms
> 22:49:27  md0              0%  364.7     4.9  2744.3     7.0   0.0   0.00 ms
> 22:49:32  md0              0%  254.6     4.7  6271.1     8.8   0.0   0.00 ms
> 22:49:37  md0              0%  289.6     4.6  2108.7     6.8   0.0   0.00 ms
> 22:49:42  md0              0%  276.1     4.7  2278.3     6.7   0.0   0.00 ms
> 22:49:47  md0              0%  266.2     4.7  1990.3     7.0   0.0   0.00 ms
> 22:49:52  md0              0%  274.9     4.7  2051.3     7.0   0.0   0.00 ms
> 22:49:57  md0              0%  222.7     4.6  1770.4     7.1   0.0   0.00 ms

Processes stick here just as the read rate takes another hit, and...

> 22:50:02  md0              0%  104.1     4.5 22339.5     9.5   0.0   0.00 ms

That's likely a burst of 8KB inode writes from the XFS inode
reclaim code.

> 22:50:07  md0              0%  132.9     4.4  6208.9     9.1   0.0   0.00 ms
> 22:50:12  md0              0%  147.2     4.5   755.7     5.1   0.0   0.00 ms
> 22:50:17  md0              0%  127.0     4.4   722.6     5.3   0.0   0.00 ms
> 22:50:22  md0              0%  120.5     4.4   703.7     4.9   0.0   0.00 ms
> 22:50:27  md0              0%  106.5     4.5   681.7     5.0   0.0   0.00 ms
> 22:50:32  md0              0%  121.1     4.5   756.5     4.7   0.0   0.00 ms
> 22:50:37  md0              0%  131.9     4.4   696.7     4.8   0.0   0.00 ms
> 22:50:42  md0              0%  100.9     4.4   669.6     4.7   0.0   0.00 ms
> 22:50:47  md0              0%   72.2     4.4   691.1     6.7   0.0   0.00 ms
> 22:50:52  md0              0%    0.0     0.0   320.4     7.6   0.0   0.00 ms

And a minute later everything has ground to a halt.

> 22:50:57  md0              0%    0.2     4.0   115.1     6.5   0.0   0.00 ms
> 22:51:02  md0              0%    0.0     0.0    62.5     5.7   0.0   0.00 ms
> 22:51:07  md0              0%    0.2     4.0    23.0     4.2   0.0   0.00 ms
> 22:51:12  md0              0%    0.2     4.0    22.6     4.1   0.0   0.00 ms
> 22:51:17  md0              0%    0.0     0.0    22.4     4.1   0.0   0.00 ms
> 22:51:22  md0              0%    0.2     4.0    26.0     4.4   0.0   0.00 ms
> 22:51:27  md0              0%    0.0     0.0    25.4     5.6   0.0   0.00 ms
> 22:51:32  md0              0%    0.0     0.0    22.6     4.1   0.0   0.00 ms
> 22:51:37  md0              0%    0.0     0.0    21.5     4.1   0.0   0.00 ms
> 22:51:42  md0              0%    0.0     0.0    26.0     5.4   0.0   0.00 ms
> 22:51:47  md0              0%    0.0     0.0    22.2     4.1   0.0   0.00 ms
> 22:51:52  md0              0%    0.0     0.0    23.4     4.1   0.0   0.00 ms
> 22:51:57  md0              0%    0.2     4.0    26.0     5.6   0.0   0.00 ms

This looks to me like something below the filesystem choking up and
grinding to a halt. What are all the nvme drives doing over this
period?
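
Something like this across the event window would tell us (a sketch -
substitute whatever the md member devices are actually called):

# iostat -x 5 nvme0n1 nvme1n1 nvme2n1 nvme3n1

It's the per-device await, avgqu-sz and %util columns that are
interesting here, not the throughput numbers.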

Hard to know what is going on at this point, though, but nothing
at the filesystem or memory reclaim level should be stuck on IO for
long periods of time on nvme SSDs...

/me wonders if 4.10 had the block layer writeback throttle code in
it, and if it does whether that is what has gone haywire here.
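
A quick way to check, assuming the stock Ubuntu config location and
that this kernel already exposes the sysfs knob:

# grep BLK_WBT /boot/config-$(uname -r)
# cat /sys/block/nvme0n1/queue/wbt_lat_usec

If CONFIG_BLK_WBT is set, writing 0 to wbt_lat_usec on the member
devices should disable the throttle, which would at least rule it in
or out as the culprit.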

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Recurring hang in XFS inode reclaim on 4.10
  2019-05-21 22:49 ` Dave Chinner
@ 2019-05-24 20:34   ` Jeffrey Baker
  2019-05-27  3:32     ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Jeffrey Baker @ 2019-05-24 20:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 21, 2019 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> > I have a problem of recurring hangs on machines where I get hung task
> > warnings for kswapd and many user threads stuck in lstat. At the time,
> > I/O grinds to a halt, although not quite to zero. These hangs happen
> > 1-2 times per day on a fleet of several thousand machines. We'd like
> > to understand the root cause, if it is already known, so we can figure
> > out the minimum kernel to which we might want to update.
> >
> > These are the hung task stacks:
> >
> > kswapd0         D    0   279      2 0x00000000
> > Call Trace:
> >  __schedule+0x232/0x700
> >  schedule+0x36/0x80
> >  schedule_preempt_disabled+0xe/0x10
> >  __mutex_lock_slowpath+0x193/0x290
> >  mutex_lock+0x2f/0x40
> >  xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
>
> You're basically running the machine out of memory and there
> are so many direct reclaimers that all the inode reclaim parallelism in
> the filesystem has been exhausted and it's blocking waiting for
> other reclaim to complete.

Sort of.  "Out of memory" here means > 2GB free, admittedly that's
less than 1% of the machine.

>
> > kswapd1         D    0   280      2 0x00000000
> > Call Trace:
> >  __schedule+0x232/0x700
> >  schedule+0x36/0x80
> >  schedule_timeout+0x235/0x3f0
> >  ? blk_finish_plug+0x2c/0x40
> >  ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> >  wait_for_completion+0xb4/0x140
> >  ? wake_up_q+0x70/0x70
> >  ? xfs_bwrite+0x24/0x60 [xfs]
> >  xfs_buf_submit_wait+0x7f/0x210 [xfs]
> >  xfs_bwrite+0x24/0x60 [xfs]
> >  xfs_reclaim_inode+0x313/0x340 [xfs]
> >  xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
>
> Yup, memory reclaim is pushing so hard it is doing direct writeback
> of dirty inodes.

OK.  Is that reflected in vmstat somewhere?  I see
"nr_vmscan_immediate_reclaim" but it's not obvious to me what that is,
and it wasn't increasing at the moment of this event.

> > mysqld          D    0 89015 116527 0x00000080
> > Call Trace:
> >  __schedule+0x232/0x700
> >  ? __remove_hrtimer+0x3c/0x70
> >  schedule+0x36/0x80
> >  rwsem_down_read_failed+0xf9/0x150
> >  ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> >  call_rwsem_down_read_failed+0x18/0x30
> >  ? xfs_trans_roll+0x40/0x40 [xfs]
> >  down_read+0x20/0x40
> >  xfs_ilock+0xe5/0x110 [xfs]
> >  xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> >  xfs_attr_get+0xd3/0x180 [xfs]
> >  xfs_xattr_get+0x4b/0x70 [xfs]
> >  __vfs_getxattr+0x57/0x70
> >  get_vfs_caps_from_disk+0x59/0x100
> >  audit_copy_inode+0x6d/0xb0
> >  __audit_inode+0x1bb/0x320
> >  filename_lookup+0x128/0x180
>
> Somebody else has the inode locked doing something with it, so the
> audit code has to wait for it to finish before it can continue. This
> likely has nothing to do with memory reclaim, and more likely is
> backed up behind a transaction or something doing IO....
>
> >  ? __check_object_size+0x108/0x1e3
> >  ? path_get+0x27/0x30
> >  ? __audit_getname+0x96/0xb0
> >  user_path_at_empty+0x36/0x40
> >  vfs_fstatat+0x66/0xc0
> >  SYSC_newlstat+0x31/0x60
> >  ? syscall_trace_enter+0x1d9/0x2f0
> >  ? __audit_syscall_exit+0x230/0x2c0
> >  SyS_newlstat+0xe/0x10
> >  do_syscall_64+0x5b/0xc0
> >  entry_SYSCALL64_slow_path+0x25/0x25
> >
> > All other hung threads are stuck in the third stack.
> >
> > We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> > #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> > Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> > RAID 0.  The filesystem info is:
> >
> > # xfs_info /dev/md0
> > meta-data=/dev/md0               isize=256    agcount=6, agsize=268435455 blks
> >          =                       sectsz=512   attr=2, projid32bit=0
> >          =                       crc=0        finobt=0 spinodes=0
> > data     =                       bsize=4096   blocks=1562366976, imaxpct=5
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal               bsize=4096   blocks=521728, version=2
> >          =                       sectsz=512   sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
> > The tasks above were reported as hung at 22:51:55, so they'd been dead
> > since 22:49:55 at the latest.  The stats from around that time seem
> > pretty normal.
>
> So you've got processes stuck waiting a couple of minutes for IO
> on nvme drives? That doesn't sound like a filesystem problem - that
> sounds more like lost IOs, a hung MD array, or hardware that's gone
> AWOL....

Sure, it could be any of these things. We're trying to get to the
bottom of why the software appears to do nothing while the hardware
appears to be working perfectly. In particular why the kswapd thread
is waiting on I/O but there's none/very little happening.

>
> > 22:48:02  memtotal memfree buffers cached dirty slabmem  swptotal swpfree _mem_
> > 22:48:07   385598M   2199M    115M 23773M   14M   2435M        0M      0M
>
> Nothing unusual there - still lots of reclaimable page cache, and
> some slabmem, too.
>
> > Something is obviously wrong with atop's analysis of md0 I/O rate
> > here, but you get the idea: I/O slides over a minute from a reasonable
> > rate to near zero, except for one big spike of output.
> >
> > # atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -f
> >
> > 22:48:02  disk           busy read/s KB/read  writ/s KB/writ avque avserv _mdd_
> > 22:48:07  md0              0% 1208.4     6.9  2569.0     6.6   0.0   0.00 ms
> > 22:48:12  md0              0% 1294.1     7.0  2642.5     6.6   0.0   0.00 ms
> > 22:48:17  md0              0% 1289.0     6.9  2823.5     6.8   0.0   0.00 ms
> > 22:48:22  md0              0% 1376.3     6.9  2662.6     6.7   0.0   0.00 ms
> > 22:48:27  md0              0% 1332.3     6.8  2578.7     6.3   0.0   0.00 ms
> > 22:48:32  md0              0% 1338.2     6.9  2601.7     6.5   0.0   0.00 ms
> > 22:48:37  md0              0% 1133.7     6.8  3172.3     6.9   0.0   0.00 ms
> > 22:48:42  md0              0% 1377.4     6.8  2609.3     6.8   0.0   0.00 ms
> > 22:48:47  md0              0% 1353.4     6.9  2293.6     6.5   0.0   0.00 ms
> > 22:48:52  md0              0% 1291.8     6.9  2200.5     6.3   0.0   0.00 ms
> > 22:48:57  md0              0% 1332.2     7.0  3066.1     6.9   0.0   0.00 ms
> > 22:49:02  md0              0% 1275.6     6.9  3021.2     6.8   0.0   0.00 ms
> > 22:49:07  md0              0% 1306.6     6.4  4407.2    10.4   0.0   0.00 ms
> > 22:49:12  md0              0% 1172.4     6.6  2740.4     7.1   0.0   0.00 ms
> > 22:49:17  md0              0%  468.4     4.9  2769.7     6.6   0.0   0.00 ms
>
> IO rates take a hit here, about 40s before the processes got stuck.
>
> > 22:49:22  md0              0%  422.3     4.9  2950.9     7.0   0.0   0.00 ms
> > 22:49:27  md0              0%  364.7     4.9  2744.3     7.0   0.0   0.00 ms
> > 22:49:32  md0              0%  254.6     4.7  6271.1     8.8   0.0   0.00 ms
> > 22:49:37  md0              0%  289.6     4.6  2108.7     6.8   0.0   0.00 ms
> > 22:49:42  md0              0%  276.1     4.7  2278.3     6.7   0.0   0.00 ms
> > 22:49:47  md0              0%  266.2     4.7  1990.3     7.0   0.0   0.00 ms
> > 22:49:52  md0              0%  274.9     4.7  2051.3     7.0   0.0   0.00 ms
> > 22:49:57  md0              0%  222.7     4.6  1770.4     7.1   0.0   0.00 ms
>
> Processes stick here just as the read rate takes another hit, and...
>
> > 22:50:02  md0              0%  104.1     4.5 22339.5     9.5   0.0   0.00 ms
>
> That's likely a burst of 8KB inode writes from the XFS inode
> reclaim code.
>
> > 22:50:07  md0              0%  132.9     4.4  6208.9     9.1   0.0   0.00 ms
> > 22:50:12  md0              0%  147.2     4.5   755.7     5.1   0.0   0.00 ms
> > 22:50:17  md0              0%  127.0     4.4   722.6     5.3   0.0   0.00 ms
> > 22:50:22  md0              0%  120.5     4.4   703.7     4.9   0.0   0.00 ms
> > 22:50:27  md0              0%  106.5     4.5   681.7     5.0   0.0   0.00 ms
> > 22:50:32  md0              0%  121.1     4.5   756.5     4.7   0.0   0.00 ms
> > 22:50:37  md0              0%  131.9     4.4   696.7     4.8   0.0   0.00 ms
> > 22:50:42  md0              0%  100.9     4.4   669.6     4.7   0.0   0.00 ms
> > 22:50:47  md0              0%   72.2     4.4   691.1     6.7   0.0   0.00 ms
> > 22:50:52  md0              0%    0.0     0.0   320.4     7.6   0.0   0.00 ms
>
> And a minute later everything has ground to a halt.
>
> > 22:50:57  md0              0%    0.2     4.0   115.1     6.5   0.0   0.00 ms
> > 22:51:02  md0              0%    0.0     0.0    62.5     5.7   0.0   0.00 ms
> > 22:51:07  md0              0%    0.2     4.0    23.0     4.2   0.0   0.00 ms
> > 22:51:12  md0              0%    0.2     4.0    22.6     4.1   0.0   0.00 ms
> > 22:51:17  md0              0%    0.0     0.0    22.4     4.1   0.0   0.00 ms
> > 22:51:22  md0              0%    0.2     4.0    26.0     4.4   0.0   0.00 ms
> > 22:51:27  md0              0%    0.0     0.0    25.4     5.6   0.0   0.00 ms
> > 22:51:32  md0              0%    0.0     0.0    22.6     4.1   0.0   0.00 ms
> > 22:51:37  md0              0%    0.0     0.0    21.5     4.1   0.0   0.00 ms
> > 22:51:42  md0              0%    0.0     0.0    26.0     5.4   0.0   0.00 ms
> > 22:51:47  md0              0%    0.0     0.0    22.2     4.1   0.0   0.00 ms
> > 22:51:52  md0              0%    0.0     0.0    23.4     4.1   0.0   0.00 ms
> > 22:51:57  md0              0%    0.2     4.0    26.0     5.6   0.0   0.00 ms
>
> This looks to me like something below the filesystem choking up and
> grinding to a halt. What are all the nvme drives doing over this
> period?

Write rates on the nvme drives are all exactly the md0 rates / 4, so
that seems normal.

>
> Hard to know what is going on at this point, though, but nothing
> at the filesystem or memory reclaim level should be stuck on IO for
> long periods of time on nvme SSDs...
>
> /me wonders if 4.10 had the block layer writeback throttle code in
> it, and if it does whether that is what has gone haywire here.

That is an interesting point.  I do see this on a different, healthy box

# cat /sys/kernel/debug/bdi/9\:0/stats
BdiWriteback:             4320 kB
BdiReclaimable:           2304 kB
BdiDirtyThresh:        4834552 kB
DirtyThresh:           4834552 kB
BackgroundThresh:      2414324 kB
BdiDirtied:         5316993504 kB
BdiWritten:         4732763040 kB
BdiWriteBandwidth:        6060 kBps

One hopes that no throttling is based on this estimate of 6MB/s since
this array is capable of more like 10000MB/s. At this time we're
focusing on system tunables that might prevent the machine from
getting into serious trouble with dirty pages or reclaim or whatever.
I know XFS has only a handful of tunables. Do you think it's
recommendable to lower the xfssyncd timer from 30s to something else?
Our workload _is_ metadata-heavy; I see about 100 create/remove ops
per second and the log stats are rapidly increasing. By contrast we're
not really pushing the extent allocator at all.

extent_alloc 10755989 500829395 3030459 4148636182
abt 0 0 0 0
blk_map 3012410686 1743133993 196385287 18176676 100618849 659563291 0
bmbt 0 0 0 0
dir 112134 99359626 99358346 49393915
trans 98 4036033830 0
ig 99320520 98796299 5890 524221 0 523020 140279
log 1339041912 2184819896 0 1341870387 1341704946
push_ail 4109383649 0 37995385 2614017 0 136962 25152 3301200 0 47933
xstrat 589377 0
rw 3561425929 1342108567
attr 3297942617 0 0 0
icluster 1452111 945722 3364798
vnodes 1201 0 0 0 99307909 99307909 99307909 0
buf 779252853 2349573 776924344 978078 41302 2328509 0 2491287 1090
abtb2 13886237 90593055 270338 266221 17 15 4555 1481 1188 2529 30 22
47 37 93926009
abtc2 28760802 203677378 13739372 13735284 45 43 3270 1292 1935 1691
57 50 102 93 1413028741
bmbt2 14355745 140867292 7158285 7115081 8 2 4074 4122 25396 4074 230
48 238 50 6974741
ibt2 397071627 776203283 67846 67820 0 0 0 0 0 0 0 0 0 0 413
fibt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qm 0 0 0 0 0 0 0 0
xpc 17666569097216 111928912437311 80399355018480
debug 0
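
For clarity, the timer I'm referring to above is (assuming I have the
sysctl name right; the default is 3000 centisecs, i.e. 30s):

# sysctl fs.xfs.xfssyncd_centisecs

and on the VM side we're looking at the vm.dirty_background_bytes /
vm.dirty_bytes limits - just sketching the knobs we're considering,
nothing we've validated yet.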

Thanks for your thoughts so far.

-jwb


>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: Recurring hang in XFS inode reclaim on 4.10
  2019-05-24 20:34   ` Jeffrey Baker
@ 2019-05-27  3:32     ` Dave Chinner
  2019-05-28 17:16       ` Jeffrey Baker
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2019-05-27  3:32 UTC (permalink / raw)
  To: Jeffrey Baker; +Cc: linux-xfs

On Fri, May 24, 2019 at 01:34:58PM -0700, Jeffrey Baker wrote:
> On Tue, May 21, 2019 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> > > I have a problem of recurring hangs on machines where I get hung task
> > > warnings for kswapd and many user threads stuck in lstat. At the time,
> > > I/O grinds to a halt, although not quite to zero. These hangs happen
> > > 1-2 times per day on a fleet of several thousand machines. We'd like
> > > to understand the root cause, if it is already known, so we can figure
> > > out the minimum kernel to which we might want to update.
> > >
> > > These are the hung task stacks:
> > >
> > > kswapd0         D    0   279      2 0x00000000
> > > Call Trace:
> > >  __schedule+0x232/0x700
> > >  schedule+0x36/0x80
> > >  schedule_preempt_disabled+0xe/0x10
> > >  __mutex_lock_slowpath+0x193/0x290
> > >  mutex_lock+0x2f/0x40
> > >  xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
> >
> > You're basically running the machine out of memory and there
> > are so many direct reclaimers that all the inode reclaim parallelism in
> > the filesystem has been exhausted and it's blocking waiting for
> > other reclaim to complete.
> 
> Sort of.  "Out of memory" here means > 2GB free, admittedly that's
> less than 1% of the machine.

By "out of memory" I meant that your machine appears to have been at
or below the memory reclaim watermarks for an extended period of
time, not that there is no free memory at all.
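
You can see where each zone is sitting relative to its watermarks
with something like this - the exact layout varies a little between
kernel versions:

# grep -E 'Node|pages free|min|low|high' /proc/zoneinfo

If "pages free" on a zone hovers around the "low" watermark for
minutes at a time, that's the state I'm talking about.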

> > > kswapd1         D    0   280      2 0x00000000
> > > Call Trace:
> > >  __schedule+0x232/0x700
> > >  schedule+0x36/0x80
> > >  schedule_timeout+0x235/0x3f0
> > >  ? blk_finish_plug+0x2c/0x40
> > >  ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> > >  wait_for_completion+0xb4/0x140
> > >  ? wake_up_q+0x70/0x70
> > >  ? xfs_bwrite+0x24/0x60 [xfs]
> > >  xfs_buf_submit_wait+0x7f/0x210 [xfs]
> > >  xfs_bwrite+0x24/0x60 [xfs]
> > >  xfs_reclaim_inode+0x313/0x340 [xfs]
> > >  xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
> >
> > Yup, memory reclaim is pushing so hard it is doing direct writeback
> > of dirty inodes.
> 
> OK.  Is that reflected in vmstat somewhere?  I see
> "nr_vmscan_immediate_reclaim" but it's not obvious to me what that is,
> and it wasn't increasing at the moment of this event.

Nope, not for low level filesystem inode reclaim like this. Getting
the filesystem to do writeback from the inode shrinker is relatively
uncommon.

> > > All other hung threads are stuck in the third stack.
> > >
> > > We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> > > #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> > > Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> > > RAID 0.  The filesystem info is:
> > >
> > > # xfs_info /dev/md0
> > > meta-data=/dev/md0               isize=256    agcount=6, agsize=268435455 blks
> > >          =                       sectsz=512   attr=2, projid32bit=0
> > >          =                       crc=0        finobt=0 spinodes=0
> > > data     =                       bsize=4096   blocks=1562366976, imaxpct=5
> > >          =                       sunit=0      swidth=0 blks

FWIW, I just noticed you don't have a sunit/swidth set for your
md RAID 0, which means you only have 6 AGs rather than 32. So
there's a lot less allocation and reclaim concurrency in this
filesystem than we'd normally see for a 4x RAID0 stripe....
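
For the record, that geometry is normally set at mkfs time, e.g.
something like this for a 4-way stripe (the 512k chunk size here is
just a guess - substitute whatever the md array actually uses):

# mkfs.xfs -d su=512k,sw=4,agcount=32 /dev/md0

Not something you can change on the existing filesystem, but worth
keeping in mind for future deployments.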

> > This looks to me like something below the filesystem choking up and
> > grinding to a halt. What are all the nvme drives doing over this
> > period?
> 
> Write rates on the nvme drives are all exactly the md0 rates / 4, so
> that seems normal.

Write rates aren't that important - what do the io latencies, queue
depths and device utilisation figures look like?

> > Hard to know what is going on at this point, though, but nothing
> > at the filesystem or memory reclaim level should be stuck on IO for
> > long periods of time on nvme SSDs...
> >
> > /me wonders if 4.10 had the block layer writeback throttle code in
> > it, and if it does whether that is what has gone haywire here.
> 
> That is an interesting point.  I do see this on a different, healthy box
> 
> # cat /sys/kernel/debug/bdi/9\:0/stats
> BdiWriteback:             4320 kB
> BdiReclaimable:           2304 kB
> BdiDirtyThresh:        4834552 kB
> DirtyThresh:           4834552 kB
> BackgroundThresh:      2414324 kB
> BdiDirtied:         5316993504 kB
> BdiWritten:         4732763040 kB
> BdiWriteBandwidth:        6060 kBps

That's the currently used write bandwidth of the backing device. The
block layer writeback throttle is below this. i.e. This simply
reflects the rate at which data is currently being written to the
block device.

> One hopes that no throttling is based on this estimate of 6MB/s since
> this array is capable of more like 10000MB/s. At this time we're
> focusing on system tunables that might prevent the machine from
> getting into serious trouble with dirty pages or reclaim or whatever.
> I know XFS has only a handful of tunables. Do you think it's
> recommendable to lower the xfssyncd timer from 30s to something else?
> Our workload _is_ metadata-heavy; I see about 100 create/remove ops
> per second

FWIW, that's not what I'd call metadata heavy. XFS on a 4-SSD array with
20 CPU cores can sustain somewhere around 300,000 create/removes a
second before it runs out of CPU. 100/s is barely a blip above
idle....

> and the log stats are rapidly increasing. By contrast we're
> not really pushing the extent allocator at all.
> 
> extent_alloc 10755989 500829395 3030459 4148636182
> abt 0 0 0 0
> blk_map 3012410686 1743133993 196385287 18176676 100618849 659563291 0
> bmbt 0 0 0 0
> dir 112134 99359626 99358346 49393915

Ok, so 100M files created and removed since boot.

> trans 98 4036033830 0

All async transactions.

> ig 99320520 98796299 5890 524221 0 523020 140279
> log 1339041912 2184819896 0 1341870387 1341704946
      writes     blocks     |   forces   force_waits
                         noiclogs

Ok, so why are there so many log forces (1.34 /billion/)? There's
roughly a log force in every 3rd transaction, but they aren't from
the transaction commit (because they are async transactions).
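
(That ratio is easy to watch over time, by the way - a rough sketch
using the field positions annotated above:

# awk '$1 == "trans" { t = $2 + $3 } $1 == "log" { f = $5 }
       END { printf "%.2f log forces per transaction\n", f / t }' /proc/fs/xfs/stat

Run it a few times while the machine is healthy and again during a
slowdown and see if the ratio moves.)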

> push_ail 4109383649 0 37995385 2614017 0 136962 25152 3301200 0 47933
         try_logspace | pushes   success | pinned   |  flushing | flush
               sleep_logspace		pushbuf   locked     restarts

They aren't coming from the metadata writeback code (pinned count),
and the log is not blocking on free space. i.e. there's no metadata
writeback bottleneck occurring.

> xstrat 589377 0
> rw 3561425929 1342108567

Oh, 1.34 billion writes. You're doing O_DSYNC writes, yes? And lots
of them to every file that is created?

> attr 3297942617 0 0 0
> icluster 1452111 945722 3364798
> vnodes 1201 0 0 0 99307909 99307909 99307909 0
> buf 779252853 2349573 776924344 978078 41302 2328509 0 2491287 1090
> abtb2 13886237 90593055 270338 266221 17 15 4555 1481 1188 2529 30 22
> 47 37 93926009
> abtc2 28760802 203677378 13739372 13735284 45 43 3270 1292 1935 1691
> 57 50 102 93 1413028741
> bmbt2 14355745 140867292 7158285 7115081 8 2 4074 4122 25396 4074 230
> 48 238 50 6974741
> ibt2 397071627 776203283 67846 67820 0 0 0 0 0 0 0 0 0 0 413

For 100M files created/removed, to only see ~68000 inode btree
record inserts and removes implies that the filesystem is
efficiently reusing the freed inodes. i.e. there's pretty much a
steady state of inodes in use in the workload....

> fibt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> qm 0 0 0 0 0 0 0 0
> xpc 17666569097216 111928912437311 80399355018480
                       write_bytes    read_bytes

Assuming this is the only substantially active XFS filesystem on the
system, that's 101TiB written to those drives, in 1.34 billion writes,
which is an average of 82KiB per write. Assuming that the 32 bit
write counter has not wrapped. Given that the transaction counter is
nearly at the wrap point, and the read counter is at 3.5B for ~75TiB
(average 22KiB), let's assume that the write counter has wrapped
and we have ~5.5 billion writes. That gives an average write of just
on 20KiB which kinda matches the read....
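
(Back-of-the-envelope only, e.g.:

# awk 'BEGIN { w = 111928912437311; n = 1342108567
               printf "%.0f KiB/write unwrapped, %.0f KiB/write wrapped\n",
                      w / n / 1024, w / (n + 4294967296) / 1024 }'

so treat the per-write averages as rough estimates.)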

So the drives look like they are being constantly hammered with
small, synchronous IOs. This means write amplification is likely to
be a substantial problem for them. They've had a /lot/ of data
written to them and are under constant load so there's no time for
idle cleanup, so is it possible that the drives themselves are
having internal garbage collection related slowdowns?
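
The drive-side counters would tell you - assuming nvme-cli is
installed, something like:

# nvme smart-log /dev/nvme0

and watch percentage_used, media_errors and the data units
written/read counters across a few of these events.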

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Recurring hang in XFS inode reclaim on 4.10
  2019-05-27  3:32     ` Dave Chinner
@ 2019-05-28 17:16       ` Jeffrey Baker
  2019-05-29 22:20         ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Jeffrey Baker @ 2019-05-28 17:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sun, May 26, 2019 at 8:32 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, May 24, 2019 at 01:34:58PM -0700, Jeffrey Baker wrote:
> > On Tue, May 21, 2019 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> > > > I have a problem of recurring hangs on machines where I get hung task
> > > > warnings for kswapd and many user threads stuck in lstat. At the time,
> > > > I/O grinds to a halt, although not quite to zero. These hangs happen
> > > > 1-2 times per day on a fleet of several thousand machines. We'd like
> > > > to understand the root cause, if it is already known, so we can figure
> > > > out the minimum kernel to which we might want to update.
> > > >
> > > > These are the hung task stacks:
> > > >
> > > > kswapd0         D    0   279      2 0x00000000
> > > > Call Trace:
> > > >  __schedule+0x232/0x700
> > > >  schedule+0x36/0x80
> > > >  schedule_preempt_disabled+0xe/0x10
> > > >  __mutex_lock_slowpath+0x193/0x290
> > > >  mutex_lock+0x2f/0x40
> > > >  xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
> > >
> > > You're basically running the machine out of memory and there
> > > are so many direct reclaimers that all the inode reclaim parallelism in
> > > the filesystem has been exhausted and it's blocking waiting for
> > > other reclaim to complete.
> >
> > Sort of.  "Out of memory" here means > 2GB free, admittedly that's
> > less than 1% of the machine.
>
> By "out of memory" I meant that your machine appears to have been at
> or below the memory reclaim watermarks for an extended period of
> time, not that there is no free memory at all.

Indeed, we always have one NUMA zone at the low water mark, although
we have zone reclaim disabled.

>
> > > > kswapd1         D    0   280      2 0x00000000
> > > > Call Trace:
> > > >  __schedule+0x232/0x700
> > > >  schedule+0x36/0x80
> > > >  schedule_timeout+0x235/0x3f0
> > > >  ? blk_finish_plug+0x2c/0x40
> > > >  ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> > > >  wait_for_completion+0xb4/0x140
> > > >  ? wake_up_q+0x70/0x70
> > > >  ? xfs_bwrite+0x24/0x60 [xfs]
> > > >  xfs_buf_submit_wait+0x7f/0x210 [xfs]
> > > >  xfs_bwrite+0x24/0x60 [xfs]
> > > >  xfs_reclaim_inode+0x313/0x340 [xfs]
> > > >  xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
> > >
> > > Yup, memory reclaim is pushing so hard it is doing direct writeback
> > > of dirty inodes.
> >
> > OK.  Is that reflected in vmstat somewhere?  I see
> > "nr_vmscan_immediate_reclaim" but it's not obvious to me what that is,
> > and it wasn't increasing at the moment of this event.
>
> Nope, not for low level filesystem inode reclaim like this. Getting
> the filesystem to do writeback from the inode shrinker is relatively
> uncommon.
>
> > > > All other hung threads are stuck in the third stack.
> > > >
> > > > We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> > > > #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> > > > Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> > > > RAID 0.  The filesystem info is:
> > > >
> > > > # xfs_info /dev/md0
> > > > meta-data=/dev/md0               isize=256    agcount=6, agsize=268435455 blks
> > > >          =                       sectsz=512   attr=2, projid32bit=0
> > > >          =                       crc=0        finobt=0 spinodes=0
> > > > data     =                       bsize=4096   blocks=1562366976, imaxpct=5
> > > >          =                       sunit=0      swidth=0 blks
>
> FWIW, I just noticed you don't have a sunit/swidth set for your
> md RAID 0, which means you only have 6 AGs rather than 32. So
> there's a lot less allocation and reclaim concurrency in this
> filesystem than we'd normally see for a 4x RAID0 stripe....

To some extent this has been a forensic exercise for me to figure out
how and why these filesystems were deployed like this. I think it was
believed at the time that su/sw has a magic automatic value of 0.

>
> > > This looks to me like something below the filesystem choking up and
> > > grinding to a halt. What are all the nvme drives doing over this
> > > period?
> >
> > Write rates on the nvme drives are all exactly the md0 rates / 4, so
> > that seems normal.
>
> Write rates aren't that important - what do the io latencies, queue
> depths and device utilisation figures look like?

10s of microseconds, ~zero, and ~zero respectively.

> > > Hard to know what is going on at this point, though, but nothing
> > > at the filesystem or memory reclaim level should be stuck on IO for
> > > long periods of time on nvme SSDs...
> > >
> > > /me wonders if 4.10 had the block layer writeback throttle code in
> > > it, and if it does whether that is what has gone haywire here.
> >
> > That is an interesting point.  I do see this on a different, healthy box
> >
> > # cat /sys/kernel/debug/bdi/9\:0/stats
> > BdiWriteback:             4320 kB
> > BdiReclaimable:           2304 kB
> > BdiDirtyThresh:        4834552 kB
> > DirtyThresh:           4834552 kB
> > BackgroundThresh:      2414324 kB
> > BdiDirtied:         5316993504 kB
> > BdiWritten:         4732763040 kB
> > BdiWriteBandwidth:        6060 kBps
>
> That's the currently used write bandwidth of the backing device. The
> block layer writeback throttle is below this. i.e. This simply
> reflects the rate at which data is currently being written to the
> block device.
>
> > One hopes that no throttling is based on this estimate of 6MB/s since
> > this array is capable of more like 10000MB/s. At this time we're
> > focusing on system tunables that might prevent the machine from
> > getting into serious trouble with dirty pages or reclaim or whatever.
> > I know XFS has only a handful of tunables. Do you think it's
> > recommendable to lower the xfssyncd timer from 30s to something else?
> > Our workload _is_ metadata-heavy; I see about 100 create/remove ops
> > per second
>
> FWIW, that's not what I'd call metadata heavy. XFS on a 4-SSD array with
> 20 CPU cores can sustain somewhere around 300,000 create/removes a
> second before it runs out of CPU. 100/s is barely a blip above
> idle....

Quite.  That's part of the mystery: why the machine grinds to a halt
while barely using its capacity. What I meant, though, was that the
metadata rate is fairly high in my experience in proportion to the
amount of other I/O, which is also pretty low here.

>
> > and the log stats are rapidly increasing. By contrast we're
> > not really pushing the extent allocator at all.
> >
> > extent_alloc 10755989 500829395 3030459 4148636182
> > abt 0 0 0 0
> > blk_map 3012410686 1743133993 196385287 18176676 100618849 659563291 0
> > bmbt 0 0 0 0
> > dir 112134 99359626 99358346 49393915
>
> Ok, so 100M files created and removed since boot.
>
> > trans 98 4036033830 0
>
> All async transactions.
>
> > ig 99320520 98796299 5890 524221 0 523020 140279
> > log 1339041912 2184819896 0 1341870387 1341704946
>       writes     blocks     |   forces   force_waits
>                          noiclogs
>
> Ok, so why are there so many log forces (1.34 /billion/)? There's
> roughly a log force in every 3rd transaction, but they aren't from
> the transaction commit (because they are async transactions).
>
> > push_ail 4109383649 0 37995385 2614017 0 136962 25152 3301200 0 47933
>          try_logspace | pushes   success | pinned   |  flushing | flush
>                sleep_logspace           pushbuf   locked     restarts
>
> They aren't coming from the metadata writeback code (pinned count),
> and the log is not blocking on free space. i.e. there's no metadata
> writeback bottleneck occurring.
>
> > xstrat 589377 0
> > rw 3561425929 1342108567
>
> Oh, 1.34 billion writes. You're doing O_DSYNC writes, yes? And lots
> of them to every file that is created?
>
> > attr 3297942617 0 0 0
> > icluster 1452111 945722 3364798
> > vnodes 1201 0 0 0 99307909 99307909 99307909 0
> > buf 779252853 2349573 776924344 978078 41302 2328509 0 2491287 1090
> > abtb2 13886237 90593055 270338 266221 17 15 4555 1481 1188 2529 30 22
> > 47 37 93926009
> > abtc2 28760802 203677378 13739372 13735284 45 43 3270 1292 1935 1691
> > 57 50 102 93 1413028741
> > bmbt2 14355745 140867292 7158285 7115081 8 2 4074 4122 25396 4074 230
> > 48 238 50 6974741
> > ibt2 397071627 776203283 67846 67820 0 0 0 0 0 0 0 0 0 0 413
>
> For 100M files created/removed, to only see ~68000 inode btree
> record inserts and removes implies that the filesystem is
> efficiently reusing the freed inodes. i.e. there's pretty much a
> steady state of inodes in use in the workload....
>
> > fibt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > qm 0 0 0 0 0 0 0 0
> > xpc 17666569097216 111928912437311 80399355018480
>                        write_bytes    read_bytes
>
> Assuming this is the only substantially active XFS filesystem on the
> system, that's 101TiB written to those drives, in 1.34 billion writes,
> which is an average of 82KiB per write. Assuming that the 32 bit
> write counter has not wrapped. Given that the transaction counter is
> nearly at the wrap point, and the read counter is at 3.5B for ~75TiB
> (average 22KiB), let's assume that the write counter has wrapped
> and we have ~5.5 billion writes. That gives an average write of just
> on 20KiB which kinda matches the read....
>
> So the drives look like they are being constantly hammered with
> small, synchronous IOs. This means write amplification is likely to
> be a substantial problem for them. They've had a /lot/ of data
> written to them and are under constant load so there's no time for
> idle cleanup, so is it possible that the drives themselves are
> having internal garbage collection related slowdowns?

That's certainly not impossible, although these are Very Fancy(tm)
devices.  I'll have a look at the logs.

Thanks again,
jwb

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: Recurring hang in XFS inode reclaim on 4.10
  2019-05-28 17:16       ` Jeffrey Baker
@ 2019-05-29 22:20         ` Dave Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2019-05-29 22:20 UTC (permalink / raw)
  To: Jeffrey Baker; +Cc: linux-xfs

On Tue, May 28, 2019 at 10:16:39AM -0700, Jeffrey Baker wrote:
> On Sun, May 26, 2019 at 8:32 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, May 24, 2019 at 01:34:58PM -0700, Jeffrey Baker wrote:
> > > Write rates on the nvme drives are all exactly the md0 rates / 4, so
> > > that seems normal.
> >
> > Write rates aren't that important - what do the io latencies, queue
> > depths and device utilisation figures look like?
> 
> 10s of microseconds, ~zero, and ~zero respectively.

So it sounds like the devices aren't showing signs of slowness, but
the filesystem and memory reclaim are sitting there twiddling their
thumbs waiting for IO to complete?

Perhaps it might be worthwhile running a latency trace (e.g. with
ftrace or one of the newfangled bpf tools) to find out where
everything is blocking and how long it is waiting for things to
occur. That might give some more insight into which resource we are
waiting on to be released before progress can be made...
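
For example, a rough sketch with the function graph tracer - pick
whichever function the hung stacks point at:

# cd /sys/kernel/debug/tracing
# echo xfs_buf_submit_wait > set_graph_function
# echo function_graph > current_tracer
# cat trace_pipe > /tmp/xfs-wait.trace &

That would show how long each buffer submission actually spends
waiting, which should tell us whether the latency is below the
filesystem or inside it. The bcc offcputime/offwaketime tools would
give similar information with full stacks, if they run on that
kernel.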

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

