* xfsaild in D state seems to be blocking all other i/o sporadically
@ 2017-04-19 10:58 Michael Weissenbacher
2017-04-19 12:12 ` Carlos Maiolino
0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 10:58 UTC (permalink / raw)
To: linux-xfs
Hi List!
I have a storage server which primarily runs around 15-20 parallel
rsyncs, nothing special. Sometimes (3-4 times a day) I notice that all
I/O on the file system suddenly comes to a halt and the only process
that continues to do any I/O (according to iotop) is xfsaild/md127.
When this happens, xfsaild does only reads (according to iotop) and is
consistently in D state (according to top).
Unfortunately it can sometimes stay like this for 5-15 minutes. During
this time even a simple "ls" or "touch" blocks and gets stuck in D
state. All other running processes accessing the fs are of course also
stuck in D state. It is an XFS V5 filesystem.
Then, as suddenly as it began, everything goes back to normal and I/O
continues. The problem is accompanied by several "task blocked for
more than xxx seconds" messages in dmesg and also some dropped
connections due to network timeouts.
I've tried several things to remedy the problem, including:
- changing I/O schedulers (tried noop, deadline and cfq); deadline
seems to work best (the stall clears sooner than with the others)
- removing all mount options (defaults + usrquota, grpquota)
- upgrading to the latest 4.11.0-rc kernel (before that I was on 4.9.x)
None of the above made a significant difference.
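For reference, this is how I confirm what is stuck: a quick way to list
which tasks are in D state at a given moment and (with root) the kernel
stack each one is blocked in, using only standard ps and procfs. This is
just a sketch; the output paths of /proc are the usual ones.

```shell
# List uninterruptible (D-state) tasks; needs only ps + awk.
ps -eo pid,stat,comm | awk '$2 ~ /^D/'

# With root, print the kernel stack each D-state task is blocked in.
for pid in $(ps -eo pid,stat | awk 'NR > 1 && $2 ~ /^D/ { print $1 }'); do
    echo "== PID $pid =="
    cat "/proc/$pid/stack" 2>/dev/null
done
```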
xfs_info output of the fs in question:
meta-data=/dev/md127 isize=512 agcount=33, agsize=268435440 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1 spinodes=0 rmapbt=0
= reflink=0
data = bsize=4096 blocks=8789917696, imaxpct=10
= sunit=16 swidth=96 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=521728, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
RAID-10, latest Firmware Updates
I would be happy to dig out more information if needed. How can I find
out whether the RAID controller itself gets stuck? Nothing bad shows up
in the hardware and SCSI controller logs.
Thanks,
Michael
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 10:58 xfsaild in D state seems to be blocking all other i/o sporadically Michael Weissenbacher
@ 2017-04-19 12:12 ` Carlos Maiolino
2017-04-19 12:37 ` Brian Foster
2017-04-19 12:40 ` Michael Weissenbacher
0 siblings, 2 replies; 19+ messages in thread
From: Carlos Maiolino @ 2017-04-19 12:12 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: linux-xfs
Hi,
On Wed, Apr 19, 2017 at 12:58:05PM +0200, Michael Weissenbacher wrote:
> Hi List!
> I have a storage server which primarily does around 15-20 parallel
> rsync's, nothing special. Sometimes (3-4 times a day) i notice that all
> I/O on the file system suddenly comes to a halt and the only process
> that continues to do any I/O (according to iotop) is the process
> xfsaild/md127. When this happens, xfsaild only does reads (according to
> iotop) and consistently in D State (according to top).
> Unfortunately this can sometimes stay like this for 5-15 minutes. During
> this time even a simple "ls" our "touch" would block and be stuck in D
> state. All other running processes accessing the fs are of course also
> stuck in D state. It is a XFS V5 filesystem.
> Then again, as sudden as it began, everything goes back to normal and
> I/O continues. The problem is accompanied with several "process blocked
> for xxx seconds" in dmesg and also some dropped connections due to
> network timeouts.
>
> I've tried several things to remedy the problem, including:
> - changing I/O schedulers (tried noop, deadline and cfq). Deadline
> seems to be best (the block goes away in less time compared with the
> others).
> - removing all mount options (defaults + usrquota, grpquota)
> - upgrading to the latest 4.11.0-rc kernel (before that i was on 4.9.x)
>
> Nothing of the above seemed to have made a significant change to the
> problem.
>
> xfs_info output of the fs in question:
> meta-data=/dev/md127 isize=512 agcount=33,
> agsize=268435440 blks
> = sectsz=4096 attr=2, projid32bit=1
> = crc=1 finobt=1 spinodes=0 rmapbt=0
> = reflink=0
> data = bsize=4096 blocks=8789917696, imaxpct=10
> = sunit=16 swidth=96 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> log =internal bsize=4096 blocks=521728, version=2
> = sectsz=4096 sunit=1 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
This is really not enough to give any idea of what might be happening,
although it looks like slow storage while xfsaild is flushing the log.
We really need more information to get a better picture of what is
going on; please look at:
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Especially: the storage layout (RAID arrays, LVMs, thin provisioning,
etc.) and the dmesg output with the traces from the hung tasks.
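(If the hung-task messages have already scrolled out of the ring buffer,
the same stacks can be forced on demand with sysrq-w. A sketch, run as
root; it assumes CONFIG_MAGIC_SYSRQ is enabled and the output file name
is arbitrary:)

```shell
# Dump the kernel stacks of all blocked (D-state) tasks into dmesg,
# then save the tail for the report. Needs root and CONFIG_MAGIC_SYSRQ.
echo w > /proc/sysrq-trigger
dmesg | tail -n 300 > /tmp/blocked-task-stacks.txt
```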
Cheers.
> Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
> RAID-10, latest Firmware Updates
>
> I would be happy to dig out more information if needed. How can i find
> out if the RAID Controller itself gets stuck? Nothing bad shows up in
> the hardware and SCSI controller logs.
>
--
Carlos
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 12:12 ` Carlos Maiolino
@ 2017-04-19 12:37 ` Brian Foster
2017-04-19 12:40 ` Michael Weissenbacher
1 sibling, 0 replies; 19+ messages in thread
From: Brian Foster @ 2017-04-19 12:37 UTC (permalink / raw)
To: Michael Weissenbacher, linux-xfs
On Wed, Apr 19, 2017 at 02:12:02PM +0200, Carlos Maiolino wrote:
> Hi,
>
> On Wed, Apr 19, 2017 at 12:58:05PM +0200, Michael Weissenbacher wrote:
> > Hi List!
> > I have a storage server which primarily does around 15-20 parallel
> > rsync's, nothing special. Sometimes (3-4 times a day) i notice that all
> > I/O on the file system suddenly comes to a halt and the only process
> > that continues to do any I/O (according to iotop) is the process
> > xfsaild/md127. When this happens, xfsaild only does reads (according to
> > iotop) and consistently in D State (according to top).
> > Unfortunately this can sometimes stay like this for 5-15 minutes. During
> > this time even a simple "ls" our "touch" would block and be stuck in D
> > state. All other running processes accessing the fs are of course also
> > stuck in D state. It is a XFS V5 filesystem.
> > Then again, as sudden as it began, everything goes back to normal and
> > I/O continues. The problem is accompanied with several "process blocked
> > for xxx seconds" in dmesg and also some dropped connections due to
> > network timeouts.
> >
> > I've tried several things to remedy the problem, including:
> > - changing I/O schedulers (tried noop, deadline and cfq). Deadline
> > seems to be best (the block goes away in less time compared with the
> > others).
> > - removing all mount options (defaults + usrquota, grpquota)
> > - upgrading to the latest 4.11.0-rc kernel (before that i was on 4.9.x)
> >
> > Nothing of the above seemed to have made a significant change to the
> > problem.
> >
> > xfs_info output of the fs in question:
> > meta-data=/dev/md127 isize=512 agcount=33,
> > agsize=268435440 blks
> > = sectsz=4096 attr=2, projid32bit=1
> > = crc=1 finobt=1 spinodes=0 rmapbt=0
> > = reflink=0
> > data = bsize=4096 blocks=8789917696, imaxpct=10
> > = sunit=16 swidth=96 blks
> > naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> > log =internal bsize=4096 blocks=521728, version=2
> > = sectsz=4096 sunit=1 blks, lazy-count=1
> > realtime =none extsz=4096 blocks=0, rtextents=0
> >
>
> This is really not enough to give any idea of what might be happening, although
> this looks more like a slow storage while xfsaild is flushing the log, but we
> really need more information to try to give a better idea of what is going on,
> please look at:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> Specially for: storage layout (RAID arrays, LVMs, thin provisioning, etc), and
> the dmesg output with the traces from the hang tasks.
>
Information around memory usage might be particularly interesting here
as well. E.g., /proc/meminfo and /proc/slabinfo..
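(A sketch of how those could be captured automatically so the data
covering the next stall isn't missed; the directory, interval, and
function name are arbitrary, and /proc/slabinfo needs root to read:)

```shell
# Snapshot /proc/meminfo and /proc/slabinfo periodically.
# Usage: snapshot_mem [dir] [interval_seconds] [count]  (count 0 = forever)
snapshot_mem() {
    dir=${1:-/tmp/memsnap}; interval=${2:-30}; count=${3:-0}
    mkdir -p "$dir"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        ts=$(date +%Y%m%d-%H%M%S)
        cp /proc/meminfo  "$dir/meminfo.$ts"
        cp /proc/slabinfo "$dir/slabinfo.$ts" 2>/dev/null  # root only
        i=$((i + 1))
        sleep "$interval"
    done
}
```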
Brian
> Cheers.
>
> > Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
> > RAID-10, latest Firmware Updates
> >
> > I would be happy to dig out more information if needed. How can i find
> > out if the RAID Controller itself gets stuck? Nothing bad shows up in
> > the hardware and SCSI controller logs.
> >
>
> --
> Carlos
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 12:12 ` Carlos Maiolino
2017-04-19 12:37 ` Brian Foster
@ 2017-04-19 12:40 ` Michael Weissenbacher
2017-04-19 13:01 ` Michael Weissenbacher
1 sibling, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 12:40 UTC (permalink / raw)
To: linux-xfs
On 19.04.2017 14:12, Carlos Maiolino wrote:
> Hi,
>
> On Wed, Apr 19, 2017 at 12:58:05PM +0200, Michael Weissenbacher wrote:
>> Hi List!
>> I have a storage server which primarily does around 15-20 parallel
>> rsync's, nothing special. Sometimes (3-4 times a day) i notice that all
>> I/O on the file system suddenly comes to a halt and the only process
>> that continues to do any I/O (according to iotop) is the process
>> xfsaild/md127. When this happens, xfsaild only does reads (according to
>> iotop) and consistently in D State (according to top).
>> Unfortunately this can sometimes stay like this for 5-15 minutes. During
>> this time even a simple "ls" our "touch" would block and be stuck in D
>> state. All other running processes accessing the fs are of course also
>> stuck in D state. It is a XFS V5 filesystem.
>> Then again, as sudden as it began, everything goes back to normal and
>> I/O continues. The problem is accompanied with several "process blocked
>> for xxx seconds" in dmesg and also some dropped connections due to
>> network timeouts.
>>
>> I've tried several things to remedy the problem, including:
>> - changing I/O schedulers (tried noop, deadline and cfq). Deadline
>> seems to be best (the block goes away in less time compared with the
>> others).
>> - removing all mount options (defaults + usrquota, grpquota)
>> - upgrading to the latest 4.11.0-rc kernel (before that i was on 4.9.x)
>>
>> Nothing of the above seemed to have made a significant change to the
>> problem.
>>
>> xfs_info output of the fs in question:
>> meta-data=/dev/md127 isize=512 agcount=33,
>> agsize=268435440 blks
>> = sectsz=4096 attr=2, projid32bit=1
>> = crc=1 finobt=1 spinodes=0 rmapbt=0
>> = reflink=0
>> data = bsize=4096 blocks=8789917696, imaxpct=10
>> = sunit=16 swidth=96 blks
>> naming =version 2 bsize=4096 ascii-ci=0 ftype=1
>> log =internal bsize=4096 blocks=521728, version=2
>> = sectsz=4096 sunit=1 blks, lazy-count=1
>> realtime =none extsz=4096 blocks=0, rtextents=0
>>
>
> This is really not enough to give any idea of what might be happening, although
> this looks more like a slow storage while xfsaild is flushing the log, but we
> really need more information to try to give a better idea of what is going on,
> please look at:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> Specially for: storage layout (RAID arrays, LVMs, thin provisioning, etc), and
> the dmesg output with the traces from the hang tasks.
>
Ok, including all of the info above:
* Kernel version: 4.11.0-rc5 (currently), but also reproducible on 4.9.20+
* xfsprogs version: 4.9.0
* number of cpus: 1 socket, 6 cores
* contents of /proc/meminfo, /proc/mounts, /proc/partitions: see below
* RAID layout: Hardware RAID-10 + linear MD device (sdb == md127); XFS
on top of md127
md127 : active linear sdb[0]
35159670784 blocks super 1.2 64k rounding
* LVM: no LVM
* type of disks: 12x 6TB SAS disks (enterprise class)
* write cache state of disks: disabled
* write cache of BBWC controller: enabled, 2GB
* xfs_info output: already included
* dmesg output: samples see below
* iostat output: samples see below
* vmstat output: will capture when it occurs the next time
* D state processes: will capture when it occurs the next time; also see
top output below
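(To make sure the next occurrence is recorded, something like the
following could be left running; the `stamp` helper is just a portable
way to timestamp each line with plain `date`, so no gawk strftime is
needed, and the log paths are arbitrary examples:)

```shell
# Prefix every line of a command's output with a timestamp.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%F %T')" "$line"
    done
}

# Leave these running until the next stall (device name as above):
# iostat -x 5 sdb | stamp >> /tmp/iostat.log &
# vmstat 5        | stamp >> /tmp/vmstat.log &
```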
>> Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
>> RAID-10, latest Firmware Updates
>>
>> I would be happy to dig out more information if needed. How can i find
>> out if the RAID Controller itself gets stuck? Nothing bad shows up in
>> the hardware and SCSI controller logs.
>>
>
# cat /proc/meminfo
MemTotal: 65886208 kB
MemFree: 3490480 kB
MemAvailable: 41531280 kB
Buffers: 84 kB
Cached: 26125688 kB
SwapCached: 64 kB
Active: 20010544 kB
Inactive: 6142828 kB
Active(anon): 856 kB
Inactive(anon): 27576 kB
Active(file): 20009688 kB
Inactive(file): 6115252 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 64511996 kB
SwapFree: 64505852 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 27536 kB
Mapped: 14000 kB
Shmem: 832 kB
Slab: 27037624 kB
SReclaimable: 20569360 kB
SUnreclaim: 6468264 kB
KernelStack: 2976 kB
PageTables: 6228 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 97455100 kB
Committed_AS: 210904 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 88820 kB
DirectMap2M: 7155712 kB
DirectMap1G: 61865984 kB
# cat /proc/mounts
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=10240k,nr_inodes=8233000,mode=755 0 0
devpts /dev/pts devpts
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=6588624k,mode=755 0 0
/dev/sda4 / xfs rw,relatime,attr2,inode64,usrquota,grpquota 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
pstore /sys/fs/pstore pstore rw,relatime 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=26079640k 0 0
/dev/sda2 /boot ext2 rw,relatime,block_validity,barrier,user_xattr,acl 0 0
-> /dev/md127 /backup xfs
rw,noatime,attr2,inode64,sunit=128,swidth=768,usrquota,grpquota 0 0
none /sys/kernel/config configfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
# cat /proc/partitions
major minor #blocks name
8 0 292421632 sda
8 1 1024 sda1
8 2 1021952 sda2
8 3 64512000 sda3
8 4 226884608 sda4
8 16 35159801856 sdb
9 127 35159670784 md127
top - 12:30:03 up 11 days, 7:07, 1 user, load average: 22.89, 19.86, 12.53
Tasks: 247 total, 1 running, 246 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.4 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 65886208 total, 62923220 used, 2962988 free, 84 buffers
KiB Swap: 64511996 total, 6144 used, 64505852 free. 38382400 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
1609 root 20 0 0 0 0 D 2.0 0.0 54:39.51
xfsaild/md127
1703883 root 20 0 0 0 0 D 0.0 0.0 2:28.31
kworker/u12:0
1704625 checker+ 20 0 37512 11700 2416 D 0.0 0.0 1:19.68
rsync
1704627 checker+ 20 0 67104 9396 1628 D 0.0 0.0 1:10.44
rsync
1704845 10003 20 0 9148 3836 1744 D 0.0 0.0 0:33.56 rm
1705051 nospam.+ 20 0 24392 2240 1616 D 0.0 0.0 0:22.67
rsync
1705262 atacama+ 20 0 20740 9120 2436 D 0.0 0.0 0:11.27
rsync
1705265 10021 20 0 38144 11404 3800 D 0.0 0.0 0:00.06
rsnapshot
1705371 smtp1.n+ 20 0 7088 1892 1604 D 0.0 0.0 0:06.39 rm
1705373 kalahar+ 20 0 38144 11440 3840 D 0.0 0.0 0:00.07
rsnapshot
1705475 cvs.net+ 20 0 17248 4996 2332 D 0.0 0.0 0:04.66
rsync
1705584 hoth.ne+ 20 0 38040 11488 3888 D 0.0 0.0 0:00.06
rsnapshot
1705880 10035 20 0 38120 11372 3768 D 0.0 0.0 0:00.07
rsnapshot
1705983 smtp2.n+ 20 0 38144 11444 3844 D 0.0 0.0 0:00.07
rsnapshot
1706080 endor.n+ 20 0 38124 11456 3844 D 0.0 0.0 0:00.07
rsnapshot
1706081 10031 20 0 38040 11400 3800 D 0.0 0.0 0:00.06
rsnapshot
1706181 10032 20 0 38144 11380 3780 D 0.0 0.0 0:00.07
rsnapshot
1706182 10004 20 0 38144 11384 3780 D 0.0 0.0 0:00.06
rsnapshot
1706388 dns2.ne+ 20 0 38120 11376 3768 D 0.0 0.0 0:00.06
rsnapshot
1706485 10034 20 0 38144 11392 3788 D 0.0 0.0 0:00.06
rsnapshot
1706491 10022 20 0 38124 11372 3768 D 0.0 0.0 0:00.06
rsnapshot
1706786 wiki.ne+ 20 0 38124 11404 3800 D 0.0 0.0 0:00.06
rsnapshot
iostat output when the problem occurs:
04/19/2017 12:30:17 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.03 0.00 0.17 0.00 0.00 99.80
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 190.40 0.00 3046.40 0.00
32.00 0.98 5.16 5.16 0.00 5.14 97.92
04/19/2017 12:30:22 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.20 0.00 0.00 99.80
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 159.80 0.00 2556.80 0.00
32.00 0.98 6.15 6.15 0.00 6.14 98.16
04/19/2017 12:30:27 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.03 0.00 0.23 0.03 0.00 99.70
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 235.80 0.00 3772.80 0.00
32.00 0.97 4.13 4.13 0.00 4.13 97.36
04/19/2017 12:30:32 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.93 0.00 1.17 0.07 0.00 97.83
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 161.00 3.40 2576.00 54.40
32.00 0.98 5.98 6.10 0.24 5.98 98.24
04/19/2017 12:30:37 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.03 0.00 0.20 0.00 0.00 99.77
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 194.80 0.00 3116.80 0.00
32.00 0.98 5.02 5.02 0.00 5.02 97.84
04/19/2017 12:30:42 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.20 0.00 0.00 99.80
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 192.20 0.00 3075.20 0.00
32.00 0.99 5.14 5.14 0.00 5.14 98.72
...
(after it suddenly stops blocking and starts writing again)
04/19/2017 12:31:37 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.03 0.00 2.51 3.38 0.00 94.08
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 4911.80 55.60 4664.00 889.60 137985.60
58.85 122.10 25.18 8.95 25.37 0.20 94.80
04/19/2017 12:31:42 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.03 0.00 0.81 7.86 0.00 91.30
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 2239.80 0.00 2592.20 0.00 70144.80
54.12 271.25 105.12 0.00 105.12 0.39 100.00
04/19/2017 12:31:47 PM
avg-cpu: %user %nice %system %iowait %steal %idle
1.14 0.00 1.57 7.90 0.00 89.39
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 2620.40 0.00 2632.20 0.00 77292.00
58.73 270.84 102.98 0.00 102.98 0.38 100.00
04/19/2017 12:31:52 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 1.11 7.82 0.00 91.08
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 2592.80 0.00 2661.40 0.00 76693.60
57.63 270.95 101.39 0.00 101.39 0.38 100.00
04/19/2017 12:32:32 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.03 0.00 0.87 7.43 0.00 91.66
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 2007.60 0.00 2321.80 0.00 58285.60
50.21 271.18 116.05 0.00 116.05 0.43 100.00
04/19/2017 12:32:37 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.80 0.00 4.09 13.70 0.00 81.41
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.20 1234.00 6.40 1912.40 43.20 66422.40
69.28 224.64 118.18 135.38 118.13 0.52 100.00
...
followed by "normal" I/O with both reads and writes
04/19/2017 12:32:42 PM
avg-cpu: %user %nice %system %iowait %steal %idle
3.98 0.00 10.23 35.92 0.00 49.87
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 2.00 0.00 423.20 374.80 9460.00 66360.00
190.03 18.15 25.78 47.33 1.44 1.25 100.00
04/19/2017 12:32:47 PM
avg-cpu: %user %nice %system %iowait %steal %idle
9.26 0.00 7.95 56.36 0.00 26.43
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 24.60 0.00 1826.00 188.00 53184.80 5222.40
58.00 17.24 8.25 9.08 0.21 0.50 100.00
with kind regards,
Michael
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 12:40 ` Michael Weissenbacher
@ 2017-04-19 13:01 ` Michael Weissenbacher
2017-04-19 14:04 ` Carlos Maiolino
0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 13:01 UTC (permalink / raw)
To: linux-xfs
> * dmesg output from hung tasks
[25012.200938] INFO: task kworker/u12:0:39771 blocked for more than 120
seconds.
[25012.201036] Tainted: G E 4.11.0-rc5-mw #1
[25012.201125] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.201220] kworker/u12:0 D 0 39771 2 0x00000000
[25012.201229] Workqueue: writeback wb_workfn (flush-9:127)
[25012.201231] Call Trace:
[25012.201241] ? __schedule+0x2e0/0x910
[25012.201295] ? xfs_buf_rele+0x5b/0x390 [xfs]
[25012.201300] ? schedule+0x32/0x80
[25012.201344] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.201385] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.201428] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.201468] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.201507] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.201548] ? xfs_iomap_write_allocate+0x1ea/0x360 [xfs]
[25012.201578] ? xfs_bmapi_read+0x29f/0x320 [xfs]
[25012.201617] ? xfs_map_blocks+0x1c8/0x240 [xfs]
[25012.201654] ? xfs_do_writepage+0x257/0x710 [xfs]
[25012.201657] ? invalid_page_referenced_vma+0x80/0x80
[25012.201661] ? write_cache_pages+0x207/0x480
[25012.201696] ? xfs_submit_ioend+0x170/0x170 [xfs]
[25012.201731] ? xfs_vm_writepages+0xba/0xf0 [xfs]
[25012.201735] ? __writeback_single_inode+0x3d/0x330
[25012.201737] ? writeback_sb_inodes+0x23d/0x470
[25012.201741] ? __writeback_inodes_wb+0x87/0xb0
[25012.201743] ? wb_writeback+0x28e/0x320
[25012.201746] ? wb_workfn+0x2d2/0x3a0
[25012.201748] ? wb_workfn+0x2d2/0x3a0
[25012.201752] ? process_one_work+0x15e/0x420
[25012.201754] ? worker_thread+0x65/0x4b0
[25012.201756] ? rescuer_thread+0x390/0x390
[25012.201759] ? kthread+0x104/0x140
[25012.201762] ? kthread_park+0x80/0x80
[25012.201764] ? do_group_exit+0x39/0xb0
[25012.201767] ? ret_from_fork+0x26/0x40
[25012.201777] INFO: task rm:44141 blocked for more than 120 seconds.
[25012.201868] Tainted: G E 4.11.0-rc5-mw #1
[25012.201957] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.202051] rm D 0 44141 43912 0x00000000
[25012.202054] Call Trace:
[25012.202059] ? __schedule+0x2e0/0x910
[25012.202097] ? xfs_dir_open+0x59/0x70 [xfs]
[25012.202102] ? schedule+0x32/0x80
[25012.202144] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.202183] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.202223] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.202262] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.202299] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.202339] ? xfs_remove+0x13d/0x310 [xfs]
[25012.202379] ? xfs_vn_unlink+0x55/0xa0 [xfs]
[25012.202383] ? vfs_rmdir+0xac/0x120
[25012.202386] ? do_rmdir+0x1c9/0x1e0
[25012.202389] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.202392] INFO: task rm:44149 blocked for more than 120 seconds.
[25012.202483] Tainted: G E 4.11.0-rc5-mw #1
[25012.202571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.202665] rm D 0 44149 43923 0x00000000
[25012.202668] Call Trace:
[25012.202672] ? __schedule+0x2e0/0x910
[25012.202676] ? schedule+0x32/0x80
[25012.202716] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.202754] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.202793] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.202831] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.202867] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.202906] ? xfs_remove+0x13d/0x310 [xfs]
[25012.202945] ? xfs_vn_unlink+0x55/0xa0 [xfs]
[25012.202949] ? vfs_unlink+0xe7/0x180
[25012.202952] ? do_unlinkat+0x289/0x310
[25012.202955] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.202958] INFO: task rsync:44358 blocked for more than 120 seconds.
[25012.203050] Tainted: G E 4.11.0-rc5-mw #1
[25012.203138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.203233] rsync D 0 44358 44355 0x00000000
[25012.203235] Call Trace:
[25012.203239] ? __schedule+0x2e0/0x910
[25012.203242] ? walk_component+0x4b/0x630
[25012.203245] ? schedule+0x32/0x80
[25012.203284] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.203322] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.203361] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.203400] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.203436] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.203476] ? xfs_link+0x123/0x2d0 [xfs]
[25012.203478] ? __d_lookup+0x60/0x130
[25012.203517] ? xfs_vn_link+0x65/0xa0 [xfs]
[25012.203520] ? vfs_link+0x21a/0x2e0
[25012.203524] ? SyS_link+0x22a/0x260
[25012.203527] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.203530] INFO: task rsync:44370 blocked for more than 120 seconds.
[25012.203622] Tainted: G E 4.11.0-rc5-mw #1
[25012.203710] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.203804] rsync D 0 44370 44365 0x00000000
[25012.203806] Call Trace:
[25012.203811] ? __schedule+0x2e0/0x910
[25012.203814] ? schedule+0x32/0x80
[25012.203852] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.203889] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.203928] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.203965] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.204002] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.204041] ? xfs_vn_update_time+0x5d/0x150 [xfs]
[25012.204044] ? file_update_time+0xb7/0x110
[25012.204083] ? xfs_file_aio_write_checks+0x176/0x1d0 [xfs]
[25012.204120] ? xfs_file_buffered_aio_write+0x7a/0x260 [xfs]
[25012.204156] ? xfs_file_write_iter+0x10b/0x150 [xfs]
[25012.204159] ? __vfs_write+0xeb/0x160
[25012.204162] ? vfs_write+0xb3/0x1a0
[25012.204165] ? SyS_write+0x52/0xc0
[25012.204167] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.204171] INFO: task rsync:44480 blocked for more than 120 seconds.
[25012.204283] Tainted: G E 4.11.0-rc5-mw #1
[25012.204393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.204529] rsync D 0 44480 44477 0x00000000
[25012.204531] Call Trace:
[25012.204536] ? __schedule+0x2e0/0x910
[25012.204539] ? walk_component+0x4b/0x630
[25012.204542] ? schedule+0x32/0x80
[25012.204581] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.204619] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.204658] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.204696] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.204732] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.204772] ? xfs_link+0x123/0x2d0 [xfs]
[25012.204774] ? __d_lookup+0x60/0x130
[25012.204814] ? xfs_vn_link+0x65/0xa0 [xfs]
[25012.204817] ? vfs_link+0x21a/0x2e0
[25012.204820] ? SyS_link+0x22a/0x260
[25012.204823] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.204827] INFO: task rsync:44581 blocked for more than 120 seconds.
[25012.204951] Tainted: G E 4.11.0-rc5-mw #1
[25012.205062] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.205200] rsync D 0 44581 44580 0x00000000
[25012.205202] Call Trace:
[25012.205207] ? __schedule+0x2e0/0x910
[25012.205245] ? xfs_trans_free_items+0x75/0xa0 [xfs]
[25012.205249] ? schedule+0x32/0x80
[25012.205286] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.205321] ? xlog_grant_head_check+0xf8/0x110 [xfs]
[25012.205362] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.205401] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.205438] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.205469] ? xfs_attr_set+0x1d4/0x440 [xfs]
[25012.205473] ? d_splice_alias+0x185/0x3c0
[25012.205509] ? xfs_xattr_set+0x4e/0x90 [xfs]
[25012.205515] ? __vfs_setxattr+0x64/0x80
[25012.205520] ? __vfs_setxattr_noperm+0x62/0x190
[25012.205523] ? vfs_setxattr+0xa0/0xb0
[25012.205526] ? setxattr+0x141/0x190
[25012.205531] ? __check_object_size+0x173/0x19e
[25012.205535] ? strncpy_from_user+0x48/0x160
[25012.205539] ? path_setxattr+0xad/0xe0
[25012.205542] ? SyS_lsetxattr+0xd/0x10
[25012.205545] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25133.028169] INFO: task kworker/1:0:24106 blocked for more than 120
seconds.
[25133.028288] Tainted: G E 4.11.0-rc5-mw #1
[25133.028397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25133.028535] kworker/1:0 D 0 24106 2 0x00000000
[25133.028595] Workqueue: xfs-eofblocks/md127 xfs_eofblocks_worker [xfs]
[25133.028598] Call Trace:
[25133.028606] ? __schedule+0x2e0/0x910
[25133.028609] ? schedule+0x32/0x80
[25133.028655] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25133.028697] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25133.028739] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25133.028779] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25133.028817] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25133.028857] ? xfs_free_eofblocks+0x120/0x1e0 [xfs]
[25133.028897] ? xfs_inode_free_eofblocks+0x11e/0x140 [xfs]
[25133.028936] ? xfs_inode_ag_walk.isra.13+0x19b/0x2b0 [xfs]
[25133.028973] ? xfs_inode_clear_eofblocks_tag+0x70/0x70 [xfs]
[25133.028978] ? intel_cpufreq_target+0x170/0x170
[25133.028982] ? update_load_avg+0x858/0xa00
[25133.028985] ? radix_tree_next_chunk+0x101/0x310
[25133.028989] ? radix_tree_gang_lookup_tag+0xe4/0x150
[25133.029024] ? xfs_inode_clear_eofblocks_tag+0x70/0x70 [xfs]
[25133.029062] ? xfs_perag_get_tag+0x3d/0xe0 [xfs]
[25133.029097] ? xfs_inode_clear_eofblocks_tag+0x70/0x70 [xfs]
[25133.029132] ? xfs_inode_ag_iterator_tag+0x70/0xa0 [xfs]
[25133.029166] ? xfs_eofblocks_worker+0x23/0x30 [xfs]
[25133.029169] ? process_one_work+0x15e/0x420
[25133.029171] ? worker_thread+0x65/0x4b0
[25133.029173] ? rescuer_thread+0x390/0x390
[25133.029176] ? kthread+0x104/0x140
[25133.029179] ? kthread_park+0x80/0x80
[25133.029182] ? do_group_exit+0x39/0xb0
[25133.029185] ? ret_from_fork+0x26/0x40
[25133.029188] INFO: task kworker/u12:0:39771 blocked for more than 120
seconds.
[25133.029303] Tainted: G E 4.11.0-rc5-mw #1
[25133.029412] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25133.029549] kworker/u12:0 D 0 39771 2 0x00000000
[25133.029556] Workqueue: writeback wb_workfn (flush-9:127)
[25133.029558] Call Trace:
[25133.029563] ? __schedule+0x2e0/0x910
[25133.029600] ? xfs_buf_rele+0x5b/0x390 [xfs]
[25133.029605] ? schedule+0x32/0x80
[25133.029647] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25133.029687] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25133.029728] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25133.029768] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25133.029807] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25133.029846] ? xfs_iomap_write_allocate+0x1ea/0x360 [xfs]
[25133.029876] ? xfs_bmapi_read+0x29f/0x320 [xfs]
[25133.029914] ? xfs_map_blocks+0x1c8/0x240 [xfs]
[25133.029950] ? xfs_do_writepage+0x257/0x710 [xfs]
[25133.029954] ? invalid_page_referenced_vma+0x80/0x80
[25133.029957] ? write_cache_pages+0x207/0x480
[25133.029992] ? xfs_submit_ioend+0x170/0x170 [xfs]
[25133.030027] ? xfs_vm_writepages+0xba/0xf0 [xfs]
[25133.030031] ? __writeback_single_inode+0x3d/0x330
[25133.030033] ? writeback_sb_inodes+0x23d/0x470
[25133.030037] ? __writeback_inodes_wb+0x87/0xb0
[25133.030039] ? wb_writeback+0x28e/0x320
[25133.030042] ? wb_workfn+0x2d2/0x3a0
[25133.030044] ? wb_workfn+0x2d2/0x3a0
[25133.030047] ? process_one_work+0x15e/0x420
[25133.030049] ? worker_thread+0x65/0x4b0
[25133.030051] ? rescuer_thread+0x390/0x390
[25133.030054] ? kthread+0x104/0x140
[25133.030057] ? kthread_park+0x80/0x80
[25133.030059] ? do_group_exit+0x39/0xb0
[25133.030062] ? ret_from_fork+0x26/0x40
[25133.030070] INFO: task rm:44141 blocked for more than 120 seconds.
[25133.030182] Tainted: G E 4.11.0-rc5-mw #1
[25133.030291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25133.030428] rm D 0 44141 43912 0x00000000
[25133.030430] Call Trace:
[25133.030435] ? __schedule+0x2e0/0x910
[25133.030472] ? xfs_dir_open+0x59/0x70 [xfs]
[25133.030477] ? schedule+0x32/0x80
[25133.030518] ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25133.030557] ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25133.030597] ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25133.030635] ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25133.030671] ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25133.030711] ? xfs_remove+0x13d/0x310 [xfs]
[25133.030751] ? xfs_vn_unlink+0x55/0xa0 [xfs]
[25133.030755] ? vfs_rmdir+0xac/0x120
[25133.030758] ? do_rmdir+0x1c9/0x1e0
[25133.030761] ? entry_SYSCALL_64_fastpath+0x1e/0xad
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 13:01 ` Michael Weissenbacher
@ 2017-04-19 14:04 ` Carlos Maiolino
2017-04-19 14:20 ` Carlos Maiolino
2017-04-19 16:36 ` Michael Weissenbacher
0 siblings, 2 replies; 19+ messages in thread
From: Carlos Maiolino @ 2017-04-19 14:04 UTC (permalink / raw)
To: linux-xfs; +Cc: mw
Unless I'm blind, the task currently owning the CPU isn't listed there; I
suppose that's because it is still running.
Next time you hit it, could you please paste the stack trace of the task that
owns the CPU? According to your report, I suppose that is xfsaild.
A simple `cat /proc/<pid>/stack` is probably enough
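(A small loop along these lines could sample it repeatedly for the duration of a stall. This is only a sketch: the thread name "xfsaild/md127" is taken from this report, and the helper names and output path are examples, not anything from the thread.)

```shell
# stack_path builds the procfs path for a pid's kernel stack.
stack_path() {
    printf '/proc/%s/stack' "$1"
}

# sample_stacks: print a timestamped stack sample once a second, 30 times.
# Needs root; $1 is the pattern matched against the process command line.
sample_stacks() {
    pid=$(pgrep -f "$1") || return 1
    for _ in $(seq 1 30); do
        date
        cat "$(stack_path "$pid")"
        sleep 1
    done
}

# Usage (as root, while the stall is happening):
#   sample_stacks 'xfsaild/md127' > /tmp/xfsaild-stacks.txt
```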
thanks
> > * dmesg output from hung tasks
> [25012.200938] INFO: task kworker/u12:0:39771 blocked for more than 120
> seconds.
> [25012.201036] Tainted: G E 4.11.0-rc5-mw #1
> [25012.201125] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.201220] kworker/u12:0 D 0 39771 2 0x00000000
> [25012.201229] Workqueue: writeback wb_workfn (flush-9:127)
> [25012.201231] Call Trace:
> [25012.201241] ? __schedule+0x2e0/0x910
> [25012.201295] ? xfs_buf_rele+0x5b/0x390 [xfs]
> [25012.201300] ? schedule+0x32/0x80
> [25012.201344] ? xlog_grant_head_wait+0xca/0x200 [xfs]
> [25012.201385] ? xlog_grant_head_check+0x9a/0x110 [xfs]
> [25012.201777] INFO: task rm:44141 blocked for more than 120 seconds.
> [25012.201868] Tainted: G E 4.11.0-rc5-mw #1
> [25012.201957] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.202051] rm D 0 44141 43912 0x00000000
> [25012.202054] Call Trace:
> [25012.202059] ? __schedule+0x2e0/0x910
> [25012.202097] ? xfs_dir_open+0x59/0x70 [xfs]
> [25012.202102] ? schedule+0x32/0x80
> [25012.202392] INFO: task rm:44149 blocked for more than 120 seconds.
> [25012.202483] Tainted: G E 4.11.0-rc5-mw #1
> [25012.202571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.202665] rm D 0 44149 43923 0x00000000
> [25012.202668] Call Trace:
> [25012.202672] ? __schedule+0x2e0/0x910
> [25012.202676] ? schedule+0x32/0x80
> [25012.202958] INFO: task rsync:44358 blocked for more than 120 seconds.
> [25012.203050] Tainted: G E 4.11.0-rc5-mw #1
> [25012.203138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.203233] rsync D 0 44358 44355 0x00000000
> [25012.203235] Call Trace:
> [25012.203239] ? __schedule+0x2e0/0x910
> [25012.203242] ? walk_component+0x4b/0x630
> [25012.203245] ? schedule+0x32/0x80
> [25012.203284] ? xlog_grant_head_wait+0xca/0x200 [xfs]
> [25012.203530] INFO: task rsync:44370 blocked for more than 120 seconds.
> [25012.203622] Tainted: G E 4.11.0-rc5-mw #1
> [25012.203710] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.203804] rsync D 0 44370 44365 0x00000000
> [25012.203806] Call Trace:
> [25012.203811] ? __schedule+0x2e0/0x910
> [25012.203814] ? schedule+0x32/0x80
> [25012.203852] ? xlog_grant_head_wait+0xca/0x200 [xfs]
> [25012.203889] ? xlog_grant_head_check+0x9a/0x110 [xfs]
> [25012.204171] INFO: task rsync:44480 blocked for more than 120 seconds.
> [25012.204283] Tainted: G E 4.11.0-rc5-mw #1
> [25012.204393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.204529] rsync D 0 44480 44477 0x00000000
> [25012.204531] Call Trace:
> [25012.204536] ? __schedule+0x2e0/0x910
> [25012.204539] ? walk_component+0x4b/0x630
> [25012.204542] ? schedule+0x32/0x80
> [25012.204827] INFO: task rsync:44581 blocked for more than 120 seconds.
> [25012.204951] Tainted: G E 4.11.0-rc5-mw #1
> [25012.205062] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.205200] rsync D 0 44581 44580 0x00000000
> [25012.205202] Call Trace:
> [25012.205207] ? __schedule+0x2e0/0x910
> [25012.205245] ? xfs_trans_free_items+0x75/0xa0 [xfs]
> [25012.205249] ? schedule+0x32/0x80
> [25012.205286] ? xlog_grant_head_wait+0xca/0x200 [xfs]
> [25133.028169] INFO: task kworker/1:0:24106 blocked for more than 120
> seconds.
> [25133.028288] Tainted: G E 4.11.0-rc5-mw #1
> [25133.028397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25133.028535] kworker/1:0 D 0 24106 2 0x00000000
> [25133.028595] Workqueue: xfs-eofblocks/md127 xfs_eofblocks_worker [xfs]
> [25133.028598] Call Trace:
> [25133.028606] ? __schedule+0x2e0/0x910
> [25133.028609] ? schedule+0x32/0x80
> [25133.028655] ? xlog_grant_head_wait+0xca/0x200 [xfs]
> [25133.029188] INFO: task kworker/u12:0:39771 blocked for more than 120
> seconds.
> [25133.029303] Tainted: G E 4.11.0-rc5-mw #1
> [25133.029412] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25133.029549] kworker/u12:0 D 0 39771 2 0x00000000
> [25133.029556] Workqueue: writeback wb_workfn (flush-9:127)
> [25133.029558] Call Trace:
> [25133.029563] ? __schedule+0x2e0/0x910
> [25133.029600] ? xfs_buf_rele+0x5b/0x390 [xfs]
> [25133.029605] ? schedule+0x32/0x80
> [25133.029647] ? xlog_grant_head_wait+0xca/0x200 [xfs]
> [25133.030070] INFO: task rm:44141 blocked for more than 120 seconds.
> [25133.030182] Tainted: G E 4.11.0-rc5-mw #1
> [25133.030291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25133.030428] rm D 0 44141 43912 0x00000000
> [25133.030430] Call Trace:
> [25133.030435] ? __schedule+0x2e0/0x910
> [25133.030472] ? xfs_dir_open+0x59/0x70 [xfs]
> [25133.030477] ? schedule+0x32/0x80
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Carlos
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 14:04 ` Carlos Maiolino
@ 2017-04-19 14:20 ` Carlos Maiolino
2017-04-19 16:40 ` Michael Weissenbacher
2017-04-19 16:36 ` Michael Weissenbacher
1 sibling, 1 reply; 19+ messages in thread
From: Carlos Maiolino @ 2017-04-19 14:20 UTC (permalink / raw)
To: linux-xfs, mw
On Wed, Apr 19, 2017 at 04:04:46PM +0200, Carlos Maiolino wrote:
> If I am not blind, the task currently owning the cpu isn't listed there, which I
> suppose to be already running.
>
> Next time you hit it, could you please paste the stack trace of the task owning
> the cpu? Which I suppose to be xfsaild according to your report.
>
> A simple `cat /proc/<pid>/stack` is probably enough
>
> thanks
>
Btw, I'm not very well versed in the XFS journaling code, but judging by the
stack traces you provided, I wonder whether your journal is too small. With
your workload, it might be forced to flush the log very frequently, keeping
the disk busy and slowing down the rest of the I/O.
Folks with more knowledge of the journaling code may be able to confirm or
rule this out, but seeing that many tasks waiting in:
xlog_grant_head_wait()
actually made me wonder whether your journal is too small for the FS size and
the current workload you have.
Again, this is just a guess, but maybe somebody else can put another pair of
eyes on it.
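(For what it's worth, the xfs_info output in the first mail lets one compute the current log size directly; if I remember right, roughly 2 GiB is already the maximum for an XFS log, so there would be little headroom to grow it in any case. A back-of-the-envelope check, with the values copied from the report:)

```shell
# Internal log size from the xfs_info at the top of the thread:
#   log = internal, bsize=4096, blocks=521728
log_blocks=521728
log_bsize=4096
log_mib=$(( log_blocks * log_bsize / 1024 / 1024 ))
echo "internal log: ${log_mib} MiB"   # prints "internal log: 2038 MiB"
```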
cheers
--
Carlos
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 14:04 ` Carlos Maiolino
2017-04-19 14:20 ` Carlos Maiolino
@ 2017-04-19 16:36 ` Michael Weissenbacher
2017-04-19 18:08 ` Brian Foster
1 sibling, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 16:36 UTC (permalink / raw)
To: linux-xfs
On 19.04.2017 16:04, Carlos Maiolino wrote:
> If I am not blind, the task currently owning the cpu isn't listed there, which I
> suppose to be already running.
Correct, the only process utilizing the CPU is xfsaild. It is also the
only process doing I/O, and it looks like it's doing only reads. It
(xfsaild) never shows up in dmesg as blocked.
> Next time you hit it, could you please paste the stack trace of the task owning
> the cpu? Which I suppose to be xfsaild according to your report.
>
> A simple `cat /proc/<pid>/stack` is probably enough
Sure, will do.
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 14:20 ` Carlos Maiolino
@ 2017-04-19 16:40 ` Michael Weissenbacher
0 siblings, 0 replies; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 16:40 UTC (permalink / raw)
To: linux-xfs
>
> Btw, I'm not very good at xfs journaling code, but I wonder by the stack traces
> you provided, if you don't have a too small journal, which, with your workload,
> it might be getting to flush the logs too many times, making the disk too busy
> and slowing down the rest of the IO.
>
A good hint. Unfortunately, I don't think there is a way to increase the size
of an internal log after mkfs, is there?
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 16:36 ` Michael Weissenbacher
@ 2017-04-19 18:08 ` Brian Foster
2017-04-19 20:10 ` Michael Weissenbacher
0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2017-04-19 18:08 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: linux-xfs
On Wed, Apr 19, 2017 at 06:36:00PM +0200, Michael Weissenbacher wrote:
> On 19.04.2017 16:04, Carlos Maiolino wrote:
> > If I am not blind, the task currently owning the cpu isn't listed there, which I
> > suppose to be already running.
> Correct, the only process utilizing the CPU is xfsaild. It is also the
> only process doing I/O, and it looks like it's doing only reads. It
> (xfsaild) never shows up in dmesg as blocked.
>
A snippet of tracepoint output from the fs might also be interesting
when you hit this state. E.g., 'trace-cmd start -e xfs:*' and save the
output of /sys/kernel/debug/tracing/trace_pipe to a file for a few
seconds or so.
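(Concretely, something like the following would do it. A sketch only: it assumes the standard debugfs mount point, and the output file name is just an example.)

```shell
# trace_pipe_path echoes the ftrace pipe location, honoring an optional
# alternate debugfs mount point passed as $1.
trace_pipe_path() {
    printf '%s/tracing/trace_pipe' "${1:-/sys/kernel/debug}"
}

# capture_xfs_trace: enable all XFS tracepoints, read the pipe for ~10s,
# then stop and clear tracing. Needs root and trace-cmd installed.
capture_xfs_trace() {
    trace-cmd start -e 'xfs:*'
    timeout 10 cat "$(trace_pipe_path)" > /tmp/xfs-trace.txt
    trace-cmd stop
    trace-cmd reset
}

# Usage (as root, while the stall is happening):
#   capture_xfs_trace
```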
> > Next time you hit it, could you please paste the stack trace of the task owning
> > the cpu? Which I suppose to be xfsaild according to your report.
> >
> > A simple `cat /proc/<pid>/stack` is probably enough
> Sure, will do.
BTW, is the original /proc/meminfo output you posted from the system in
a normal state or when the problem occurs? If the former, could you
include it while the problem occurs as well?
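(Something like this could grab a timestamped snapshot of both in one go; the output path is just an example, and /proc/slabinfo usually needs root, so a failure there is tolerated.)

```shell
# Snapshot memory state while the stall is in progress.
out=/tmp/mem-during-stall.txt
{
    date
    cat /proc/meminfo
    cat /proc/slabinfo 2>/dev/null || echo '(slabinfo needs root)'
} > "$out"
```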
Brian
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 18:08 ` Brian Foster
@ 2017-04-19 20:10 ` Michael Weissenbacher
2017-04-19 20:55 ` Darrick J. Wong
0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 20:10 UTC (permalink / raw)
To: linux-xfs
Hi Brian!
On 2017-04-19 20:08, Brian Foster wrote:
> On Wed, Apr 19, 2017 at 06:36:00PM +0200, Michael Weissenbacher wrote:
>> On 19.04.2017 16:04, Carlos Maiolino wrote:
>>> If I am not blind, the task currently owning the cpu isn't listed there, which I
>>> suppose to be already running.
>> Correct, the only process utilizing the CPU is xfsaild. It is also the
>> only process doing I/O, and it looks like it's doing only reads. It
>> (xfsaild) never shows up in dmesg as blocked.
>>
>
> A snippet of tracepoint output from the fs might also be interesting
> when you hit this state. E.g., 'trace-cmd start -e xfs:*' and save the
> output of /sys/kernel/debug/tracing/trace_pipe to a file for a few
> seconds or so.
>
See:
https://we.tl/7AAyBZxt4R
> BTW, is the original /proc/meminfo output you posted from the system in
> a normal state or when the problem occurs? If the former, could you
> include it while the problem occurs as well?
It was from a normal state.
Everything below was captured while the problem was occurring.
*** snip ***
/proc/meminfo
Wed Apr 19 20:36:50 CEST 2017
MemTotal: 65886208 kB
MemFree: 2923552 kB
MemAvailable: 43018524 kB
Buffers: 52 kB
Cached: 32416864 kB
SwapCached: 580 kB
Active: 30685760 kB
Inactive: 2051532 kB
Active(anon): 165844 kB
Inactive(anon): 155516 kB
Active(file): 30519916 kB
Inactive(file): 1896016 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 64511996 kB
SwapFree: 64506540 kB
Dirty: 202368 kB
Writeback: 0 kB
AnonPages: 319948 kB
Mapped: 18672 kB
Shmem: 868 kB
Slab: 22817492 kB
SReclaimable: 16332540 kB
SUnreclaim: 6484952 kB
KernelStack: 4048 kB
PageTables: 16516 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 97455100 kB
Committed_AS: 516532 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 88820 kB
DirectMap2M: 7155712 kB
DirectMap1G: 61865984 kB
*** snip ***
*** snip ***
Wed Apr 19 20:36:51 CEST 2017
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata
<active_slabs> <num_slabs> <sharedavail>
kcopyd_job 0 0 3240 2 2 : tunables 24 12
8 : slabdata 0 0 0
dm_uevent 0 0 2632 3 2 : tunables 24 12
8 : slabdata 0 0 0
t10_pr_reg_cache 0 0 704 11 2 : tunables 54 27
8 : slabdata 0 0 0
ext4_groupinfo_4k 8 28 144 28 1 : tunables 120 60
8 : slabdata 1 1 0
ext4_inode_cache 328 330 1072 3 1 : tunables 24 12
8 : slabdata 110 110 0
ext4_allocation_context 0 0 128 32 1 : tunables 120
60 8 : slabdata 0 0 0
ext4_io_end 0 0 64 64 1 : tunables 120 60
8 : slabdata 0 0 0
ext4_extent_status 36 198 40 99 1 : tunables 120 60
8 : slabdata 2 2 0
jbd2_journal_head 0 0 112 36 1 : tunables 120 60
8 : slabdata 0 0 0
jbd2_revoke_table_s 0 0 16 240 1 : tunables 120 60
8 : slabdata 0 0 0
jbd2_revoke_record_s 0 0 32 124 1 : tunables 120 60
8 : slabdata 0 0 0
fscrypt_info 0 0 24 163 1 : tunables 120 60
8 : slabdata 0 0 0
fscrypt_ctx 0 0 48 83 1 : tunables 120 60
8 : slabdata 0 0 0
mbcache 0 0 56 71 1 : tunables 120 60
8 : slabdata 0 0 0
kvm_async_pf 0 0 136 30 1 : tunables 120 60
8 : slabdata 0 0 0
kvm_vcpu 0 0 19136 1 8 : tunables 8 4
0 : slabdata 0 0 0
kvm_mmu_page_header 0 0 168 24 1 : tunables 120 60
8 : slabdata 0 0 0
xfs_dqtrx 14 14 528 7 1 : tunables 54 27
8 : slabdata 2 2 0
xfs_dquot 149 168 472 8 1 : tunables 54 27
8 : slabdata 21 21 0
xfs_rui_item 0 0 664 6 1 : tunables 54 27
8 : slabdata 0 0 0
xfs_rud_item 0 0 144 28 1 : tunables 120 60
8 : slabdata 0 0 0
xfs_icr 0 0 152 26 1 : tunables 120 60
8 : slabdata 0 0 0
xfs_inode 12898650 12907268 960 4 1 : tunables 54
27 8 : slabdata 3226817 3226817 0
xfs_efd_item 0 0 408 10 1 : tunables 54 27
8 : slabdata 0 0 0
xfs_da_state 0 0 480 8 1 : tunables 54 27
8 : slabdata 0 0 0
xfs_btree_cur 105 153 232 17 1 : tunables 120 60
8 : slabdata 9 9 0
bio-2 52 96 320 12 1 : tunables 54 27
8 : slabdata 8 8 0
ip6-frags 0 0 200 20 1 : tunables 120 60
8 : slabdata 0 0 0
UDPv6 18 28 1152 7 2 : tunables 24 12
8 : slabdata 4 4 0
tw_sock_TCPv6 0 0 240 17 1 : tunables 120 60
8 : slabdata 0 0 0
request_sock_TCPv6 0 0 304 13 1 : tunables 54 27
8 : slabdata 0 0 0
TCPv6 7 12 2112 3 2 : tunables 24 12
8 : slabdata 4 4 0
cfq_queue 99928 99977 240 17 1 : tunables 120 60
8 : slabdata 5881 5881 0
mqueue_inode_cache 1 4 896 4 1 : tunables 54 27
8 : slabdata 1 1 0
userfaultfd_ctx_cache 0 0 128 32 1 : tunables 120
60 8 : slabdata 0 0 0
dio 0 0 640 6 1 : tunables 54 27
8 : slabdata 0 0 0
pid_namespace 0 0 2232 3 2 : tunables 24 12
8 : slabdata 0 0 0
posix_timers_cache 0 0 216 18 1 : tunables 120 60
8 : slabdata 0 0 0
ip4-frags 34 132 184 22 1 : tunables 120 60
8 : slabdata 6 6 0
PING 0 0 896 4 1 : tunables 54 27
8 : slabdata 0 0 0
UDP 44 44 960 4 1 : tunables 54 27
8 : slabdata 11 11 0
tw_sock_TCP 1 17 240 17 1 : tunables 120 60
8 : slabdata 1 1 0
request_sock_TCP 0 0 304 13 1 : tunables 54 27
8 : slabdata 0 0 0
TCP 18 22 1920 2 1 : tunables 24 12
8 : slabdata 11 11 0
hugetlbfs_inode_cache 2 6 600 6 1 : tunables 54
27 8 : slabdata 1 1 0
dquot 0 0 256 16 1 : tunables 120 60
8 : slabdata 0 0 0
request_queue 3 15 1488 5 2 : tunables 24 12
8 : slabdata 3 3 0
blkdev_ioc 100 234 104 39 1 : tunables 120 60
8 : slabdata 6 6 0
user_namespace 0 0 448 9 1 : tunables 54 27
8 : slabdata 0 0 0
dmaengine-unmap-256 1 3 2112 3 2 : tunables 24 12
8 : slabdata 1 1 0
dmaengine-unmap-128 1 7 1088 7 2 : tunables 24 12
8 : slabdata 1 1 0
sock_inode_cache 179 204 640 6 1 : tunables 54 27
8 : slabdata 34 34 0
file_lock_cache 83 95 208 19 1 : tunables 120 60
8 : slabdata 5 5 0
net_namespace 0 0 5888 1 2 : tunables 8 4
0 : slabdata 0 0 0
shmem_inode_cache 775 825 688 11 2 : tunables 54 27
8 : slabdata 75 75 0
taskstats 2 12 328 12 1 : tunables 54 27
8 : slabdata 1 1 0
proc_inode_cache 2024 2190 656 6 1 : tunables 54 27
8 : slabdata 343 365 200
sigqueue 12104736 12110275 160 25 1 : tunables 120
60 8 : slabdata 484411 484411 0
bdev_cache 8 20 832 4 1 : tunables 54 27
8 : slabdata 5 5 0
kernfs_node_cache 29210 29240 120 34 1 : tunables 120 60
8 : slabdata 860 860 0
mnt_cache 652927 827520 384 10 1 : tunables 54 27
8 : slabdata 82752 82752 0
inode_cache 12968 12985 584 7 1 : tunables 54 27
8 : slabdata 1854 1855 0
dentry 12741855 12742065 192 21 1 : tunables 120
60 8 : slabdata 606757 606765 480
iint_cache 0 0 72 56 1 : tunables 120 60
8 : slabdata 0 0 0
buffer_head 127140 127140 104 39 1 : tunables 120 60
8 : slabdata 3260 3260 0
fs_cache 171 378 64 63 1 : tunables 120 60
8 : slabdata 6 6 0
files_cache 158 220 704 11 2 : tunables 54 27
8 : slabdata 20 20 0
signal_cache 425 508 1024 4 1 : tunables 54 27
8 : slabdata 127 127 0
sighand_cache 261 294 2112 3 2 : tunables 24 12
8 : slabdata 98 98 0
task_struct 278 295 3648 1 1 : tunables 24 12
8 : slabdata 278 295 0
cred_jar 15411 17451 192 21 1 : tunables 120 60
8 : slabdata 831 831 60
Acpi-Operand 147223 147504 72 56 1 : tunables 120 60
8 : slabdata 2634 2634 0
Acpi-Parse 69 355 56 71 1 : tunables 120 60
8 : slabdata 5 5 0
Acpi-State 0 0 80 51 1 : tunables 120 60
8 : slabdata 0 0 0
Acpi-Namespace 13264 13365 40 99 1 : tunables 120 60
8 : slabdata 135 135 0
anon_vma_chain 11212 14336 64 64 1 : tunables 120 60
8 : slabdata 224 224 60
anon_vma 6167 7000 72 56 1 : tunables 120 60
8 : slabdata 125 125 0
pid 312 576 128 32 1 : tunables 120 60
8 : slabdata 18 18 0
numa_policy 5 163 24 163 1 : tunables 120 60
8 : slabdata 1 1 0
trace_event_file 2418 2530 88 46 1 : tunables 120 60
8 : slabdata 55 55 0
ftrace_event_field 3691 3901 48 83 1 : tunables 120 60
8 : slabdata 47 47 0
pool_workqueue 26 96 256 16 1 : tunables 120 60
8 : slabdata 6 6 0
radix_tree_node 1644760 1707538 576 7 1 : tunables 54 27
8 : slabdata 243934 243934 0
task_group 43 84 576 7 1 : tunables 54 27
8 : slabdata 12 12 0
dma-kmalloc-4194304 0 0 4194304 1 1024 : tunables 1 1
0 : slabdata 0 0 0
dma-kmalloc-2097152 0 0 2097152 1 512 : tunables 1 1
0 : slabdata 0 0 0
dma-kmalloc-1048576 0 0 1048576 1 256 : tunables 1 1
0 : slabdata 0 0 0
dma-kmalloc-524288 0 0 524288 1 128 : tunables 1 1
0 : slabdata 0 0 0
dma-kmalloc-262144 0 0 262144 1 64 : tunables 1 1
0 : slabdata 0 0 0
dma-kmalloc-131072 0 0 131072 1 32 : tunables 8 4
0 : slabdata 0 0 0
dma-kmalloc-65536 0 0 65536 1 16 : tunables 8 4
0 : slabdata 0 0 0
dma-kmalloc-32768 0 0 32768 1 8 : tunables 8 4
0 : slabdata 0 0 0
dma-kmalloc-16384 0 0 16384 1 4 : tunables 8 4
0 : slabdata 0 0 0
dma-kmalloc-8192 0 0 8192 1 2 : tunables 8 4
0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 1 1 : tunables 24 12
8 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 2 1 : tunables 24 12
8 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 4 1 : tunables 54 27
8 : slabdata 0 0 0
dma-kmalloc-512 4 8 512 8 1 : tunables 54 27
8 : slabdata 1 1 0
dma-kmalloc-256 0 0 256 16 1 : tunables 120 60
8 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 120 60
8 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 120 60
8 : slabdata 0 0 0
dma-kmalloc-32 4 124 32 124 1 : tunables 120 60
8 : slabdata 1 1 0
dma-kmalloc-192 0 0 192 21 1 : tunables 120 60
8 : slabdata 0 0 0
dma-kmalloc-96 0 0 128 32 1 : tunables 120 60
8 : slabdata 0 0 0
kmalloc-4194304 0 0 4194304 1 1024 : tunables 1 1
0 : slabdata 0 0 0
kmalloc-2097152 0 0 2097152 1 512 : tunables 1 1
0 : slabdata 0 0 0
kmalloc-1048576 0 0 1048576 1 256 : tunables 1 1
0 : slabdata 0 0 0
kmalloc-524288 1 1 524288 1 128 : tunables 1 1
0 : slabdata 1 1 0
kmalloc-262144 3 3 262144 1 64 : tunables 1 1
0 : slabdata 3 3 0
kmalloc-131072 3 3 131072 1 32 : tunables 8 4
0 : slabdata 3 3 0
kmalloc-65536 13 13 65536 1 16 : tunables 8 4
0 : slabdata 13 13 0
kmalloc-32768 4 4 32768 1 8 : tunables 8 4
0 : slabdata 4 4 0
kmalloc-16384 23 23 16384 1 4 : tunables 8 4
0 : slabdata 23 23 0
kmalloc-8192 44121 44121 8192 1 2 : tunables 8 4
0 : slabdata 44121 44121 0
kmalloc-4096 11850 11850 4096 1 1 : tunables 24 12
8 : slabdata 11850 11850 24
kmalloc-2048 18655 18718 2048 2 1 : tunables 24 12
8 : slabdata 9359 9359 0
kmalloc-1024 584913 603244 1024 4 1 : tunables 54 27
8 : slabdata 150811 150811 0
kmalloc-512 856079 1020576 512 8 1 : tunables 54 27
8 : slabdata 127572 127572 0
kmalloc-256 114928 141968 256 16 1 : tunables 120 60
8 : slabdata 8873 8873 480
kmalloc-192 244249 272643 192 21 1 : tunables 120 60
8 : slabdata 12983 12983 0
kmalloc-96 3139278 3139328 128 32 1 : tunables 120 60
8 : slabdata 98104 98104 0
kmalloc-64 25714408 33297024 64 64 1 : tunables 120
60 8 : slabdata 520266 520266 0
kmalloc-32 423132 631036 32 124 1 : tunables 120 60
8 : slabdata 5089 5089 0
kmalloc-128 213657 269952 128 32 1 : tunables 120 60
8 : slabdata 8436 8436 0
*** snip ***
*** snip ***
/proc/<pidof_xfsaild>/stack
Wed Apr 19 20:36:52 CEST 2017
[<ffffffff8ef3881e>] ring_buffer_unlock_commit+0x1e/0xa0
[<ffffffff8ef419a7>] trace_buffer_unlock_commit_regs+0x37/0x90
[<ffffffff8eea1e20>] wake_up_q+0x60/0x60
[<ffffffffc0719b5b>] xfs_buf_read_map+0x10b/0x180 [xfs]
[<ffffffffc0719888>] xfs_buf_submit_wait+0x78/0x210 [xfs]
[<ffffffffc074d2bd>] xfs_trans_read_buf_map+0xed/0x300 [xfs]
[<ffffffffc0719b5b>] xfs_buf_read_map+0x10b/0x180 [xfs]
[<ffffffffc074d2bd>] xfs_trans_read_buf_map+0xed/0x300 [xfs]
[<ffffffffc0702b77>] xfs_imap_to_bp+0x67/0xe0 [xfs]
[<ffffffffc072e84b>] xfs_iflush+0xdb/0x240 [xfs]
[<ffffffffc07437f9>] xfs_inode_item_push+0x69/0x140 [xfs]
[<ffffffffc0743886>] xfs_inode_item_push+0xf6/0x140 [xfs]
[<ffffffffc074bf25>] xfsaild+0x385/0x7e0 [xfs]
[<ffffffffc074bba0>] xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[<ffffffff8ee96d64>] kthread+0x104/0x140
[<ffffffff8ee96d64>] kthread+0x104/0x140
[<ffffffff8ee96c60>] kthread_park+0x80/0x80
[<ffffffff8ee7ba79>] do_group_exit+0x39/0xb0
[<ffffffff8f4370b6>] ret_from_fork+0x26/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
*** snip ***
*** snip ***
iotop --batch -o output
Total DISK READ : 1948.60 K/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 974.30 K/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 1948.60 K/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 1924.56 K/s | Total DISK WRITE : 3.94 K/s
Actual DISK READ: 1956.11 K/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 1924.56 K/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 2.53 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 2.51 M/s | Actual DISK WRITE: 63.77 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 2.53 M/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 2.09 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 2.10 M/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 2.09 M/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 12.12 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 12.10 M/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 12.12 M/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 2.30 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 2.28 M/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 2.30 M/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 4.86 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 4.88 M/s | Actual DISK WRITE: 2025.26 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 4.86 M/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 2.64 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 2.62 M/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 2.64 M/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 1899.36 K/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 1899.36 K/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 1899.36 K/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
Total DISK READ : 1837.53 K/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 1837.53 K/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1609 be/4 root 1837.53 K/s 0.00 B/s 0.00 % 0.00 %
[xfsaild/md127]
*** snip ***
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 20:10 ` Michael Weissenbacher
@ 2017-04-19 20:55 ` Darrick J. Wong
2017-04-19 21:47 ` Michael Weissenbacher
0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2017-04-19 20:55 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: linux-xfs
On Wed, Apr 19, 2017 at 10:10:55PM +0200, Michael Weissenbacher wrote:
> Hi Brian!
>
> On 2017-04-19 20:08, Brian Foster wrote:
> > On Wed, Apr 19, 2017 at 06:36:00PM +0200, Michael Weissenbacher wrote:
> >> On 19.04.2017 16:04, Carlos Maiolino wrote:
> >>> If I am not blind, the task currently owning the cpu isn't listed there, which I
> >>> suppose to be already running.
> >> Correct, the only process utilizing the CPU is xfsaild. It is also the
> >> only process doing I/O, and it looks like it's doing only reads. It
> >> (xfsaild) never shows up in dmesg as blocked.
> >>
> >
> > A snippet of tracepoint output from the fs might also be interesting
> > when you hit this state. E.g., 'trace-cmd start -e xfs:*' and save the
> > output of /sys/kernel/debug/tracing/trace_pipe to a file for a few
> > seconds or so.
> >
> See:
> https://we.tl/7AAyBZxt4R
>
> > BTW, is the original /proc/meminfo output you posted from the system in
> > a normal state or when the problem occurs? If the former, could you
> > include it while the problem occurs as well?
> It was from a normal state.
>
> All the following was captured during the problem occurs.
>
> *** snip ***
> /proc/meminfo
>
> Wed Apr 19 20:36:50 CEST 2017
> MemTotal: 65886208 kB
> MemFree: 2923552 kB
> MemAvailable: 43018524 kB
> Buffers: 52 kB
> Cached: 32416864 kB
> SwapCached: 580 kB
> Active: 30685760 kB
> Inactive: 2051532 kB
> Active(anon): 165844 kB
> Inactive(anon): 155516 kB
> Active(file): 30519916 kB
> Inactive(file): 1896016 kB
> Unevictable: 0 kB
> Mlocked: 0 kB
> SwapTotal: 64511996 kB
> SwapFree: 64506540 kB
> Dirty: 202368 kB
> Writeback: 0 kB
> AnonPages: 319948 kB
> Mapped: 18672 kB
> Shmem: 868 kB
> Slab: 22817492 kB
> SReclaimable: 16332540 kB
> SUnreclaim: 6484952 kB
> KernelStack: 4048 kB
> PageTables: 16516 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 97455100 kB
> Committed_AS: 516532 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 0 kB
> VmallocChunk: 0 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 0 kB
> ShmemHugePages: 0 kB
> ShmemPmdMapped: 0 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 88820 kB
> DirectMap2M: 7155712 kB
> DirectMap1G: 61865984 kB
> *** snip ***
>
> *** snip ***
> Wed Apr 19 20:36:51 CEST 2017
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata
> <active_slabs> <num_slabs> <sharedavail>
> kcopyd_job 0 0 3240 2 2 : tunables 24 12
> 8 : slabdata 0 0 0
> dm_uevent 0 0 2632 3 2 : tunables 24 12
> 8 : slabdata 0 0 0
> t10_pr_reg_cache 0 0 704 11 2 : tunables 54 27
> 8 : slabdata 0 0 0
> ext4_groupinfo_4k 8 28 144 28 1 : tunables 120 60
> 8 : slabdata 1 1 0
> ext4_inode_cache 328 330 1072 3 1 : tunables 24 12
> 8 : slabdata 110 110 0
> ext4_allocation_context 0 0 128 32 1 : tunables 120
> 60 8 : slabdata 0 0 0
> ext4_io_end 0 0 64 64 1 : tunables 120 60
> 8 : slabdata 0 0 0
> ext4_extent_status 36 198 40 99 1 : tunables 120 60
> 8 : slabdata 2 2 0
> jbd2_journal_head 0 0 112 36 1 : tunables 120 60
> 8 : slabdata 0 0 0
> jbd2_revoke_table_s 0 0 16 240 1 : tunables 120 60
> 8 : slabdata 0 0 0
> jbd2_revoke_record_s 0 0 32 124 1 : tunables 120 60
> 8 : slabdata 0 0 0
> fscrypt_info 0 0 24 163 1 : tunables 120 60
> 8 : slabdata 0 0 0
> fscrypt_ctx 0 0 48 83 1 : tunables 120 60
> 8 : slabdata 0 0 0
> mbcache 0 0 56 71 1 : tunables 120 60
> 8 : slabdata 0 0 0
> kvm_async_pf 0 0 136 30 1 : tunables 120 60
> 8 : slabdata 0 0 0
> kvm_vcpu 0 0 19136 1 8 : tunables 8 4
> 0 : slabdata 0 0 0
> kvm_mmu_page_header 0 0 168 24 1 : tunables 120 60
> 8 : slabdata 0 0 0
> xfs_dqtrx 14 14 528 7 1 : tunables 54 27
> 8 : slabdata 2 2 0
> xfs_dquot 149 168 472 8 1 : tunables 54 27
> 8 : slabdata 21 21 0
> xfs_rui_item 0 0 664 6 1 : tunables 54 27
> 8 : slabdata 0 0 0
> xfs_rud_item 0 0 144 28 1 : tunables 120 60
> 8 : slabdata 0 0 0
> xfs_icr 0 0 152 26 1 : tunables 120 60
> 8 : slabdata 0 0 0
> xfs_inode 12898650 12907268 960 4 1 : tunables 54
> 27 8 : slabdata 3226817 3226817 0
> xfs_efd_item 0 0 408 10 1 : tunables 54 27
> 8 : slabdata 0 0 0
> xfs_da_state 0 0 480 8 1 : tunables 54 27
> 8 : slabdata 0 0 0
> xfs_btree_cur 105 153 232 17 1 : tunables 120 60
> 8 : slabdata 9 9 0
> bio-2 52 96 320 12 1 : tunables 54 27
> 8 : slabdata 8 8 0
> ip6-frags 0 0 200 20 1 : tunables 120 60
> 8 : slabdata 0 0 0
> UDPv6 18 28 1152 7 2 : tunables 24 12
> 8 : slabdata 4 4 0
> tw_sock_TCPv6 0 0 240 17 1 : tunables 120 60
> 8 : slabdata 0 0 0
> request_sock_TCPv6 0 0 304 13 1 : tunables 54 27
> 8 : slabdata 0 0 0
> TCPv6 7 12 2112 3 2 : tunables 24 12
> 8 : slabdata 4 4 0
> cfq_queue 99928 99977 240 17 1 : tunables 120 60
> 8 : slabdata 5881 5881 0
> mqueue_inode_cache 1 4 896 4 1 : tunables 54 27
> 8 : slabdata 1 1 0
> userfaultfd_ctx_cache 0 0 128 32 1 : tunables 120
> 60 8 : slabdata 0 0 0
> dio 0 0 640 6 1 : tunables 54 27
> 8 : slabdata 0 0 0
> pid_namespace 0 0 2232 3 2 : tunables 24 12
> 8 : slabdata 0 0 0
> posix_timers_cache 0 0 216 18 1 : tunables 120 60
> 8 : slabdata 0 0 0
> ip4-frags 34 132 184 22 1 : tunables 120 60
> 8 : slabdata 6 6 0
> PING 0 0 896 4 1 : tunables 54 27
> 8 : slabdata 0 0 0
> UDP 44 44 960 4 1 : tunables 54 27
> 8 : slabdata 11 11 0
> tw_sock_TCP 1 17 240 17 1 : tunables 120 60
> 8 : slabdata 1 1 0
> request_sock_TCP 0 0 304 13 1 : tunables 54 27
> 8 : slabdata 0 0 0
> TCP 18 22 1920 2 1 : tunables 24 12
> 8 : slabdata 11 11 0
> hugetlbfs_inode_cache 2 6 600 6 1 : tunables 54
> 27 8 : slabdata 1 1 0
> dquot 0 0 256 16 1 : tunables 120 60
> 8 : slabdata 0 0 0
> request_queue 3 15 1488 5 2 : tunables 24 12
> 8 : slabdata 3 3 0
> blkdev_ioc 100 234 104 39 1 : tunables 120 60
> 8 : slabdata 6 6 0
> user_namespace 0 0 448 9 1 : tunables 54 27
> 8 : slabdata 0 0 0
> dmaengine-unmap-256 1 3 2112 3 2 : tunables 24 12
> 8 : slabdata 1 1 0
> dmaengine-unmap-128 1 7 1088 7 2 : tunables 24 12
> 8 : slabdata 1 1 0
> sock_inode_cache 179 204 640 6 1 : tunables 54 27
> 8 : slabdata 34 34 0
> file_lock_cache 83 95 208 19 1 : tunables 120 60
> 8 : slabdata 5 5 0
> net_namespace 0 0 5888 1 2 : tunables 8 4
> 0 : slabdata 0 0 0
> shmem_inode_cache 775 825 688 11 2 : tunables 54 27
> 8 : slabdata 75 75 0
> taskstats 2 12 328 12 1 : tunables 54 27
> 8 : slabdata 1 1 0
> proc_inode_cache 2024 2190 656 6 1 : tunables 54 27
> 8 : slabdata 343 365 200
> sigqueue 12104736 12110275 160 25 1 : tunables 120
> 60 8 : slabdata 484411 484411 0
> bdev_cache 8 20 832 4 1 : tunables 54 27
> 8 : slabdata 5 5 0
> kernfs_node_cache 29210 29240 120 34 1 : tunables 120 60
> 8 : slabdata 860 860 0
> mnt_cache 652927 827520 384 10 1 : tunables 54 27
> 8 : slabdata 82752 82752 0
> inode_cache 12968 12985 584 7 1 : tunables 54 27
> 8 : slabdata 1854 1855 0
> dentry 12741855 12742065 192 21 1 : tunables 120
> 60 8 : slabdata 606757 606765 480
> iint_cache 0 0 72 56 1 : tunables 120 60
> 8 : slabdata 0 0 0
> buffer_head 127140 127140 104 39 1 : tunables 120 60
> 8 : slabdata 3260 3260 0
> fs_cache 171 378 64 63 1 : tunables 120 60
> 8 : slabdata 6 6 0
> files_cache 158 220 704 11 2 : tunables 54 27
> 8 : slabdata 20 20 0
> signal_cache 425 508 1024 4 1 : tunables 54 27
> 8 : slabdata 127 127 0
> sighand_cache 261 294 2112 3 2 : tunables 24 12
> 8 : slabdata 98 98 0
> task_struct 278 295 3648 1 1 : tunables 24 12
> 8 : slabdata 278 295 0
> cred_jar 15411 17451 192 21 1 : tunables 120 60
> 8 : slabdata 831 831 60
> Acpi-Operand 147223 147504 72 56 1 : tunables 120 60
> 8 : slabdata 2634 2634 0
> Acpi-Parse 69 355 56 71 1 : tunables 120 60
> 8 : slabdata 5 5 0
> Acpi-State 0 0 80 51 1 : tunables 120 60
> 8 : slabdata 0 0 0
> Acpi-Namespace 13264 13365 40 99 1 : tunables 120 60
> 8 : slabdata 135 135 0
> anon_vma_chain 11212 14336 64 64 1 : tunables 120 60
> 8 : slabdata 224 224 60
> anon_vma 6167 7000 72 56 1 : tunables 120 60
> 8 : slabdata 125 125 0
> pid 312 576 128 32 1 : tunables 120 60
> 8 : slabdata 18 18 0
> numa_policy 5 163 24 163 1 : tunables 120 60
> 8 : slabdata 1 1 0
> trace_event_file 2418 2530 88 46 1 : tunables 120 60
> 8 : slabdata 55 55 0
> ftrace_event_field 3691 3901 48 83 1 : tunables 120 60
> 8 : slabdata 47 47 0
> pool_workqueue 26 96 256 16 1 : tunables 120 60
> 8 : slabdata 6 6 0
> radix_tree_node 1644760 1707538 576 7 1 : tunables 54 27
> 8 : slabdata 243934 243934 0
> task_group 43 84 576 7 1 : tunables 54 27
> 8 : slabdata 12 12 0
> dma-kmalloc-4194304 0 0 4194304 1 1024 : tunables 1 1
> 0 : slabdata 0 0 0
> dma-kmalloc-2097152 0 0 2097152 1 512 : tunables 1 1
> 0 : slabdata 0 0 0
> dma-kmalloc-1048576 0 0 1048576 1 256 : tunables 1 1
> 0 : slabdata 0 0 0
> dma-kmalloc-524288 0 0 524288 1 128 : tunables 1 1
> 0 : slabdata 0 0 0
> dma-kmalloc-262144 0 0 262144 1 64 : tunables 1 1
> 0 : slabdata 0 0 0
> dma-kmalloc-131072 0 0 131072 1 32 : tunables 8 4
> 0 : slabdata 0 0 0
> dma-kmalloc-65536 0 0 65536 1 16 : tunables 8 4
> 0 : slabdata 0 0 0
> dma-kmalloc-32768 0 0 32768 1 8 : tunables 8 4
> 0 : slabdata 0 0 0
> dma-kmalloc-16384 0 0 16384 1 4 : tunables 8 4
> 0 : slabdata 0 0 0
> dma-kmalloc-8192 0 0 8192 1 2 : tunables 8 4
> 0 : slabdata 0 0 0
> dma-kmalloc-4096 0 0 4096 1 1 : tunables 24 12
> 8 : slabdata 0 0 0
> dma-kmalloc-2048 0 0 2048 2 1 : tunables 24 12
> 8 : slabdata 0 0 0
> dma-kmalloc-1024 0 0 1024 4 1 : tunables 54 27
> 8 : slabdata 0 0 0
> dma-kmalloc-512 4 8 512 8 1 : tunables 54 27
> 8 : slabdata 1 1 0
> dma-kmalloc-256 0 0 256 16 1 : tunables 120 60
> 8 : slabdata 0 0 0
> dma-kmalloc-128 0 0 128 32 1 : tunables 120 60
> 8 : slabdata 0 0 0
> dma-kmalloc-64 0 0 64 64 1 : tunables 120 60
> 8 : slabdata 0 0 0
> dma-kmalloc-32 4 124 32 124 1 : tunables 120 60
> 8 : slabdata 1 1 0
> dma-kmalloc-192 0 0 192 21 1 : tunables 120 60
> 8 : slabdata 0 0 0
> dma-kmalloc-96 0 0 128 32 1 : tunables 120 60
> 8 : slabdata 0 0 0
> kmalloc-4194304 0 0 4194304 1 1024 : tunables 1 1
> 0 : slabdata 0 0 0
> kmalloc-2097152 0 0 2097152 1 512 : tunables 1 1
> 0 : slabdata 0 0 0
> kmalloc-1048576 0 0 1048576 1 256 : tunables 1 1
> 0 : slabdata 0 0 0
> kmalloc-524288 1 1 524288 1 128 : tunables 1 1
> 0 : slabdata 1 1 0
> kmalloc-262144 3 3 262144 1 64 : tunables 1 1
> 0 : slabdata 3 3 0
> kmalloc-131072 3 3 131072 1 32 : tunables 8 4
> 0 : slabdata 3 3 0
> kmalloc-65536 13 13 65536 1 16 : tunables 8 4
> 0 : slabdata 13 13 0
> kmalloc-32768 4 4 32768 1 8 : tunables 8 4
> 0 : slabdata 4 4 0
> kmalloc-16384 23 23 16384 1 4 : tunables 8 4
> 0 : slabdata 23 23 0
> kmalloc-8192 44121 44121 8192 1 2 : tunables 8 4
> 0 : slabdata 44121 44121 0
> kmalloc-4096 11850 11850 4096 1 1 : tunables 24 12
> 8 : slabdata 11850 11850 24
> kmalloc-2048 18655 18718 2048 2 1 : tunables 24 12
> 8 : slabdata 9359 9359 0
> kmalloc-1024 584913 603244 1024 4 1 : tunables 54 27
> 8 : slabdata 150811 150811 0
> kmalloc-512 856079 1020576 512 8 1 : tunables 54 27
> 8 : slabdata 127572 127572 0
> kmalloc-256 114928 141968 256 16 1 : tunables 120 60
> 8 : slabdata 8873 8873 480
> kmalloc-192 244249 272643 192 21 1 : tunables 120 60
> 8 : slabdata 12983 12983 0
> kmalloc-96 3139278 3139328 128 32 1 : tunables 120 60
> 8 : slabdata 98104 98104 0
> kmalloc-64 25714408 33297024 64 64 1 : tunables 120
> 60 8 : slabdata 520266 520266 0
> kmalloc-32 423132 631036 32 124 1 : tunables 120 60
> 8 : slabdata 5089 5089 0
> kmalloc-128 213657 269952 128 32 1 : tunables 120 60
> 8 : slabdata 8436 8436 0
> *** snip ***
>
> *** snip ***
> /proc/<pidof_xfsaild>/stack
> Wed Apr 19 20:36:52 CEST 2017
> [<ffffffff8ef3881e>] ring_buffer_unlock_commit+0x1e/0xa0
> [<ffffffff8ef419a7>] trace_buffer_unlock_commit_regs+0x37/0x90
> [<ffffffff8eea1e20>] wake_up_q+0x60/0x60
> [<ffffffffc0719b5b>] xfs_buf_read_map+0x10b/0x180 [xfs]
> [<ffffffffc0719888>] xfs_buf_submit_wait+0x78/0x210 [xfs]
> [<ffffffffc074d2bd>] xfs_trans_read_buf_map+0xed/0x300 [xfs]
> [<ffffffffc0719b5b>] xfs_buf_read_map+0x10b/0x180 [xfs]
> [<ffffffffc074d2bd>] xfs_trans_read_buf_map+0xed/0x300 [xfs]
> [<ffffffffc0702b77>] xfs_imap_to_bp+0x67/0xe0 [xfs]
> [<ffffffffc072e84b>] xfs_iflush+0xdb/0x240 [xfs]
> [<ffffffffc07437f9>] xfs_inode_item_push+0x69/0x140 [xfs]
> [<ffffffffc0743886>] xfs_inode_item_push+0xf6/0x140 [xfs]
> [<ffffffffc074bf25>] xfsaild+0x385/0x7e0 [xfs]
> [<ffffffffc074bba0>] xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
> [<ffffffff8ee96d64>] kthread+0x104/0x140
> [<ffffffff8ee96d64>] kthread+0x104/0x140
> [<ffffffff8ee96c60>] kthread_park+0x80/0x80
> [<ffffffff8ee7ba79>] do_group_exit+0x39/0xb0
> [<ffffffff8f4370b6>] ret_from_fork+0x26/0x40
> [<ffffffffffffffff>] 0xffffffffffffffff
So... (speculating a bit here) you're running 20 different copies of
rsnapshot and rsync, which are generating a lot of dirty data and dirty
inodes. Each of the threads reported by the hung task timeout are
trying to reserve space in the log for metadata updates, and clearly
those log reservation requests have poked the AIL to flush itself to
clear space for new transactions.
The AIL thread is flushing inode updates to disk, which is a lot of RMW
of dirty inodes, hence the high amounts of io reads you see. The log is
already 2GB in size on disk, so it cannot be made larger. TBH, a
/smaller/ log might help here since at least when we have to push the
AIL to free up log space there'll be less of it that has to be freed up.
OTOH I wouldn't expect the system to stall for 5-15 minutes ... but I
guess a quarter of a 2GB log is 500MB, and if that's mostly dirty inode
records, that's a /lot/ of inode updates that the AIL has to do (single
threaded!) to clean out the space. You /do/ seem to have 13 million
inodes in slab, so it's not implausible...
...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
~2.5m inodes / 10 minutes = ~4200 inodes per second...
That's probably within the ballpark of what 12 drives can handle.
Considering that the inodes get allocated out in chunks of 64, we can
take advantage of /some/ locality...
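The arithmetic above can be sketched in shell (assuming the ~200-byte dirty inode record size and the 10-minute push window from this ballpark, not exact on-disk sizes):

```shell
# Back-of-envelope: a quarter of a 2GB log full of ~200-byte dirty
# inode records, drained over a ~10-minute stall.
push_bytes=$((2 * 1024 * 1024 * 1024 / 4))   # ~500MB AIL push target
inodes=$((push_bytes / 200))                  # dirty inode records to flush
per_sec=$((inodes / 600))                     # spread over 10 minutes
echo "~${inodes} inodes, ~${per_sec}/s"
```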
IOWs, the xfs is flushing big chunks of log all at once. You might look
at the io stats of each drive, to see if they're all busy, and how many
io operations they're managing per second. (But since you know the
exact model of drives, you'll have to take my ballpark figures and work
out whether or not they map to what you'd expect iops-wise from each
drive.)
--D
> *** snip ***
>
> *** snip ***
> iotop --batch -o output
>
> Total DISK READ : 1948.60 K/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 974.30 K/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 1948.60 K/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 1924.56 K/s | Total DISK WRITE : 3.94 K/s
> Actual DISK READ: 1956.11 K/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 1924.56 K/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 2.53 M/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 2.51 M/s | Actual DISK WRITE: 63.77 K/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 2.53 M/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 2.09 M/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 2.10 M/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 2.09 M/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 12.12 M/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 12.10 M/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 12.12 M/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 2.30 M/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 2.28 M/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 2.30 M/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 4.86 M/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 4.88 M/s | Actual DISK WRITE: 2025.26 K/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 4.86 M/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 2.64 M/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 2.62 M/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 2.64 M/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 1899.36 K/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 1899.36 K/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 1899.36 K/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> Total DISK READ : 1837.53 K/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 1837.53 K/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
> 1609 be/4 root 1837.53 K/s 0.00 B/s 0.00 % 0.00 %
> [xfsaild/md127]
> *** snip ***
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 20:55 ` Darrick J. Wong
@ 2017-04-19 21:47 ` Michael Weissenbacher
2017-04-19 23:48 ` Dave Chinner
0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 21:47 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Hi Darrick!
Thanks for your input.
>
> So... (speculating a bit here) you're running 20 different copies of
> rsnapshot and rsync, which are generating a lot of dirty data and dirty
> inodes. Each of the threads reported by the hung task timeout are
> trying to reserve space in the log for metadata updates, and clearly
> those log reservation requests have poked the AIL to flush itself to
> clear space for new transactions.
>
Yes, that is a pretty precise description of what the system is doing.
Although 20 parallel rsync's are pretty much the worst case scenario
(the cron jobs are designed to start at different times). Usually
there are roughly 5 rsync's going on at the same time. The majority of
the files are not copied but hard linked during this process.
> The AIL thread is flushing inode updates to disk, which is a lot of RMW
> of dirty inodes, hence the high amounts of io reads you see. The log is
> already 2GB in size on disk, so it cannot be made larger. TBH, a
> /smaller/ log might help here since at least when we have to push the
> AIL to free up log space there'll be less of it that has to be freed up.
>
That's interesting. I could try to create an external log of 500MB size
and see if that changes anything?
> OTOH I wouldn't expect the system to stall for 5-15 minutes ... but I
> guess a quarter of a 2GB log is 500MB, and if that's mostly dirty inode
> records, that's a /lot/ of inode updates that the AIL has to do (single
> threaded!) to clean out the space. You /do/ seem to have 13 million
> inodes in slab, so it's not implausible...
>
OK, do i understand you correctly that the xfsaild does all the actual
work of updating the inodes? And it's doing that single-threaded,
reading both the log and the inodes themselves?
> ...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
> ~2.5m inodes / 10 minutes = ~4200 inodes per second...
Yes, these numbers sound plausible. When everything is running smoothly
(i.e. most of the time) i do see up to 4500 combined reads+writes per
second. But when the problem arises, all i see is xfsaild doing around
120 reads/second, which is roughly the performance of a single 7200rpm
drive. Directly after this "read only phase" (which lasts for about 5-10
minutes most of the time) there is a short burst of up to 10000+
writes/sec (writing to the RAID controller cache I suppose). And then
the system goes back to the "normal phase" where it is doing both reads
and writes and all those blocked D state processes continue working.
>
> That's probably within the ballpark of what 12 drives can handle.
> Considering that the inodes get allocated out in chunks of 64, we can
> take advantage of /some/ locality...
>
> IOWs, the xfs is flushing big chunks of log all at once. You might look
> at the io stats of each drive, to see if they're all busy, and how many
> io operations they're managing per second. (But since you know the
> exact model of drives, you'll have to take my ballpark figures and work
> out whether or not they map to what you'd expect iops-wise from each
> drive.)
>
Hm, i don't think i can query the io stats of the individual drives
behind the hardware RAID controller. If your theory is correct i should
only see a single drive busy during the "problem phase". Will look into
that. Maybe simply by eyeballing the drive led's ;-)
thanks,
Michael
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 21:47 ` Michael Weissenbacher
@ 2017-04-19 23:48 ` Dave Chinner
2017-04-20 7:11 ` Michael Weissenbacher
0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2017-04-19 23:48 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: Darrick J. Wong, linux-xfs
On Wed, Apr 19, 2017 at 11:47:43PM +0200, Michael Weissenbacher wrote:
> OK, do i understand you correctly that the xfsaild does all the actual
> work of updating the inodes? And it's doing that single-threaded,
> reading both the log and the inodes themselves?
The problem is that the backing buffers that are used for flushing
inodes have been reclaimed due to memory pressure, but the inodes in
cache are still dirty. Hence to write the dirty inodes, we first
have to read the inode buffer back into memory.
Essentially, the log pins more dirty inodes than writeback
can keep up with given the memory pressure that is occurring.
>
> > ...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
> > ~2.5m inodes / 10 minutes = ~4200 inodes per second...
> Yes, these numbers sound plausible. When everything is running smoothly
> (i.e. most of the time) i do see up to 4500 combined reads+writes per
> second. But when the problem arises, all i see is xfsaild doing around
> 120 reads/second, which is roughly the performance of a single 7200rpm
> drive. Directly after this "read only phase" (which lasts for about 5-10
> minutes most of the time) there is a short burst of up to 10000+
> writes/sec (writing to the RAID controller cache I suppose).
Yes. Inode buffer reads from the xfsaild are sequential, so they block
the aild and log tail pushing while they are ongoing. It then queues
the writes on a delayed write list and they are issued all in one
go. Essentially, the system can dirty far more metadata in memory
than it can effectively cache and flush to disk in a timely manner.
> And then
> the system goes back to the "normal phase" where it is doing both reads
> and writes and all those blocked D state processes continue working.
Yup, because the xfsaild has moved the tail of the log, so there is
now log space available and ongoing transactions can run at CPU
speed again. The xfsaild will still be doing RMW inode writeback,
but nothing is blocking waiting for it to reach its target now...
I'd suggest that the only thing you can realistically do to mitigate
this stall without changing the log size is change
/proc/sys/fs/xfs/xfssyncd_centisecs down to one second so that the
CIL is flushed to disk every second, and the AIL push target is
updated immediately after the CIL flush is done. This will result in
the xfsaild trying to push metadata almost as quickly as it is
committed to the journal, hence it should be trying to flush inodes
while the inode buffer is still cached in memory and hence avoid the
read IO. This will help avoid AIL pushing from getting so far behind
that inode buffers are reclaimed before the dirty inodes are
written. It may not completely avoid the stalls, but it should make
them less frequent...
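Concretely, that suggestion is a one-line tunable change (needs root; the file exists only while an XFS filesystem is mounted, and reverts to the 3000-centisecond default on reboot unless persisted; the sysctl.d path below is illustrative):

```shell
# Flush the CIL and update the AIL push target every second instead of
# every 30 seconds (value is in centiseconds).
echo 100 > /proc/sys/fs/xfs/xfssyncd_centisecs
# Optionally persist across reboots:
echo 'fs.xfs.xfssyncd_centisecs = 100' >> /etc/sysctl.d/90-xfs.conf
```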
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-19 23:48 ` Dave Chinner
@ 2017-04-20 7:11 ` Michael Weissenbacher
2017-04-20 23:16 ` Dave Chinner
0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-20 7:11 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs
On 20.04.2017 01:48, Dave Chinner wrote:
> On Wed, Apr 19, 2017 at 11:47:43PM +0200, Michael Weissenbacher wrote:
>> OK, do i understand you correctly that the xfsaild does all the actual
>> work of updating the inodes? And it's doing that single-threaded,
>> reading both the log and the inodes themselves?
>
> The problem is that the backing buffers that are used for flushing
> inodes have been reclaimed due to memory pressure, but the inodes in
> cache are still dirty. Hence to write the dirty inodes, we first
> have to read the inode buffer back into memory.
>
Interesting find. Is there a way to prevent those buffers from getting
reclaimed? Would adjusting vfs_cache_pressure help? Or adding more
memory to the system? In fact the best thing would be to disable file
content caching completely. Because of the use-case (backup server) it's
worthless to cache file content.
My primary objective is to avoid those stalls and reduce latency, at the
expense of throughput.
> Essentially, the log pins more dirty inodes than writeback
> can keep up with given the memory pressure that is occurring.
>
>>
>>> ...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
>>> ~2.5m inodes / 10 minutes = ~4200 inodes per second...
>> Yes, these numbers sound plausible. When everything is running smoothly
>> (i.e. most of the time) i do see up to 4500 combined reads+writes per
>> second. But when the problem arises, all i see is xfsaild doing around
>> 120 reads/second, which is roughly the performance of a single 7200rpm
>> drive. Directly after this "read only phase" (which lasts for about 5-10
>> minutes most of the time) there is a short burst of up to 10000+
>> writes/sec (writing to the RAID controller cache I suppose).
>
> Yes. Inode buffer reads from the xfsaild are sequential, so they block
> the aild and log tail pushing while they are ongoing. It then queues
> the writes on a delayed write list and they are issued all in one
> go. Essentially, the system can dirty far more metadata in memory
> than it can effectively cache and flush to disk in a timely manner.
>
>> And then
>> the system goes back to the "normal phase" where it is doing both reads
>> and writes and all those blocked D state processes continue working.
>
> Yup, because the xfsaild has moved the tail of the log, so there is
> now log space available and ongoing transactions can run at CPU
> speed again. The xfsaild will still be doing RMW inode writeback,
> but nothing is blocking waiting for it to reach its target now...
>
> I'd suggest that the only thing you can realistically do to mitigate
> this stall without changing the log size is change
> /proc/sys/fs/xfs/xfssyncd_centisecs down to one second so that the
> CIL is flushed to disk every second, and the AIL push target is
> updated immediately after the CIL flush is done. This will result in
> the xfsaild trying to push metadata almost as quickly as it is
> committed to the journal, hence it should be trying to flush inodes
> while the inode buffer is still cached in memory and hence avoid the
> read IO. This will help avoid AIL pushing from getting so far behind
> that inode buffers are reclaimed before the dirty inodes are
> written. It may not completely avoid the stalls, but it should make
> them less frequent...
>
Ok, i just did that (set xfssyncd_centisecs=100). Will see if it helps.
thanks,
Michael
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-20 7:11 ` Michael Weissenbacher
@ 2017-04-20 23:16 ` Dave Chinner
2017-04-21 7:43 ` Michael Weissenbacher
0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2017-04-20 23:16 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: Darrick J. Wong, linux-xfs
On Thu, Apr 20, 2017 at 09:11:22AM +0200, Michael Weissenbacher wrote:
> On 20.04.2017 01:48, Dave Chinner wrote:
> > On Wed, Apr 19, 2017 at 11:47:43PM +0200, Michael Weissenbacher wrote:
> >> OK, do i understand you correctly that the xfsaild does all the actual
> >> work of updating the inodes? And it's doing that single-threaded,
> >> reading both the log and the indoes themselves?
> >
> > The problem is that the backing buffers that are used for flushing
> > inodes have been reclaimed due to memory pressure, but the inodes in
> > cache are still dirty. Hence to write the dirty inodes, we first
> > have to read the inode buffer back into memory.
> >
> Interesting find. Is there a way to prevent those buffers from getting
> reclaimed?
Not really. It's simply a side effect of memory reclaim not being
able to reclaim inodes or the page cache because they are dirty, and
so it goes and puts lots more pressure on clean caches. The working
set in those other caches gets trashed, and thus it's a downward
spiral, because it means dirty inodes and pages take longer and
require blocking IO to refill on demand...
> Would adjusting vfs_cache_pressure help?
Unlikely.
> Or adding more
> memory to the system?
Unlikely - that'll just lead to bigger stalls.
> In fact the best thing would be to disable file
> content caching completely. Because of the use-case (backup server) it's
> worthless to cache file content.
> My primary objective is to avoid those stalls and reduce latency, at the
> expense of throughput.
Set up dirty page cache writeback thresholds to be low (a couple of
hundred MB instead of 10/20% of memory) so that data writeback
starts early and throttles dirty pages to a small amount of memory.
This will help keep the page cache clean and immediately
reclaimable, hence it shouldn't put as much pressure on other caches
when memory reclaim is required.
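A minimal sketch of such a setting (the values are illustrative, need root, and the *_bytes knobs override the percentage-based dirty_ratio defaults):

```shell
# Start background writeback at 256MB of dirty page cache and
# hard-throttle writers at 512MB, instead of 10%/20% of RAM.
sysctl -w vm.dirty_background_bytes=$((256 * 1024 * 1024))
sysctl -w vm.dirty_bytes=$((512 * 1024 * 1024))
```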
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-20 23:16 ` Dave Chinner
@ 2017-04-21 7:43 ` Michael Weissenbacher
2017-04-21 9:18 ` Shan Hai
0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-21 7:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs
Hi Dave!
On 21.04.2017 01:16, Dave Chinner wrote:
> On Thu, Apr 20, 2017 at 09:11:22AM +0200, Michael Weissenbacher wrote:
>> On 20.04.2017 01:48, Dave Chinner wrote:
>>>
>>> The problem is that the backing buffers that are used for flushing
>>> inodes have been reclaimed due to memory pressure, but the inodes in
>>> cache are still dirty. Hence to write the dirty inodes, we first
>>> have to read the inode buffer back into memory.
>>>
>> Interesting find. Is there a way to prevent those buffers from getting
>> reclaimed?
>
> Not really. It's simply a side effect of memory reclaim not being
> able to reclaim inodes or the page cache because they are dirty, and
> so it goes and puts lots more pressure on clean caches. The working
> set in those other caches gets trashed, and thus it's a downward
> spiral, because it means dirty inodes and pages take longer and
> require blocking IO to refill on demand...
>
Yesterday i found this patch for rsync:
http://insights.oetiker.ch/linux/fadvise/
It adds the option "--drop-cache" to rsync which sets
POSIX_FADV_DONTNEED to prevent caching. Been running with this option
since yesterday. Unfortunately, even this hasn't changed my problem.
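For ad-hoc testing, the same fadvise effect is reachable without patching rsync, e.g. via GNU dd's nocache flag (a rough per-file stand-in for what the patch does, not the patch itself; the file path is a throwaway example):

```shell
# Create a small file, pull it into the page cache, then ask the kernel
# to drop its cached pages again (posix_fadvise(POSIX_FADV_DONTNEED)).
f=/tmp/fadvise-demo.$$
dd if=/dev/zero of="$f" bs=1M count=4 status=none
cat "$f" > /dev/null                          # now cached
dd if="$f" iflag=nocache count=0 status=none  # advise: drop whole-file cache
rm -f "$f"
```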
>> In fact the best thing would be to disable file
>> content caching completely. Because of the use-case (backup server) it's
>> worthless to cache file content.
>> My primary objective is to avoid those stalls and reduce latency, at the
>> expense of throughput.
>
> Set up dirty page cache writeback thresholds to be low (a couple of
> hundred MB instead of 10/20% of memory) so that data writeback
> starts early and throttles dirty pages to a small amount of memory.
> This will help keep the page cache clean and immediately
> reclaimable, hence it shouldn't put as much pressure on other caches
> when memory reclaim is required.
>
In fact i already turned those down to 1024MiB/512MiB - not much change
here. I also set xfssyncd_centisecs to 100 as you advised. Also not
much change after that.
I also noticed that unmounting the file system takes a really long time
after the problem occurred (up to around 5 minutes!). Even when there was
nothing at all going on before the unmount. Would it help to capture the
unmount with trace-cmd?
Here is another theory. Could it be that not the rsync's, but the rm's
issued by rsnapshot are causing the problem? Would it help to serialize
all "rm -Rf" calls? Those always delete the oldest backups, which can't
possibly be in the cache and because of that all those inodes need to be
read into memory during deletion. Maybe those rm's are filling up the
XFS log?
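Serializing the deletions could be as simple as a flock wrapper around each expiry (hypothetical sketch; lock path and snapshot path are made up):

```shell
# Only one expiry deletion runs at a time, no matter how many
# rsnapshot cron jobs fire concurrently.
expire() {
    flock /tmp/rsnapshot-expire.lock rm -rf -- "$1"
}
expire "/backups/host1/daily.6"   # hypothetical oldest snapshot
```

If memory serves, rsnapshot's cmd_rm config option could point at such a wrapper instead of plain rm.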
If that doesn't work out either i guess my only chance would be to
partition the device with LVM and create a separate XFS for every
rsnapshot instance. In that scenario every file system should get its
own xfsaild, allowing them to run in parallel without blocking each other?
cheers,
Michael
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-21 7:43 ` Michael Weissenbacher
@ 2017-04-21 9:18 ` Shan Hai
2017-04-22 8:38 ` Michael Weissenbacher
0 siblings, 1 reply; 19+ messages in thread
From: Shan Hai @ 2017-04-21 9:18 UTC (permalink / raw)
To: Michael Weissenbacher, Dave Chinner; +Cc: Darrick J. Wong, linux-xfs
On 2017年04月21日 15:43, Michael Weissenbacher wrote:
> Hi Dave!
> On 21.04.2017 01:16, Dave Chinner wrote:
>> On Thu, Apr 20, 2017 at 09:11:22AM +0200, Michael Weissenbacher wrote:
>>> On 20.04.2017 01:48, Dave Chinner wrote:
>>>> The problem is that the backing buffers that are used for flushing
>>>> inodes have been reclaimed due to memory pressure, but the inodes in
>>>> cache are still dirty. Hence to write the dirty inodes, we first
>>>> have to read the inode buffer back into memory.
>>>>
>>> Interesting find. Is there a way to prevent those buffers from getting
>>> reclaimed?
>> Not really. It's simply a side effect of memory reclaim not being
>> able to reclaim inodes or the page cache because they are dirty, and
>> so it goes and puts lots more pressure on clean caches. The working
>> set in those other caches gets trashed, and thus it's a downward
>> spiral, because it means dirty inodes and pages take longer and
>> require blocking IO to refill on demand...
>>
> Yesterday i found this patch for rsync:
> http://insights.oetiker.ch/linux/fadvise/
> It adds the option "--drop-cache" to rsync which sets
> POSIX_FADV_DONTNEED to prevent caching. Been running with this option
> since yesterday. Unfortunately, even this hasn't changed my problem.
>
>>> In fact the best thing would be to disable file
>>> content caching completely. Because of the use-case (backup server) it's
>>> worthless to cache file content.
>>> My primary objective is to avoid those stalls and reduce latency, at the
>>> expense of throughput.
>> Set up dirty page cache writeback thresholds to be low (a couple of
>> hundred MB instead of 10/20% of memory) so that data writeback
>> starts early and throttles dirty pages to a small amount of memory.
>> This will help keep the page cache clean and immediately
>> reclaimable, hence it shouldn't put as much pressure on other caches
>> when memory reclaim is required.
>>
> In fact i already turned those down to 1024MiB/512MiB - not much change
> here. I also set xfssyncd_centisecs to 100 like advised by you. Also not
> much change after that.
>
> I also noticed that unmounting the file system takes a really long time
> after the problem occurred (up to around 5 minutes!). Even when there was
> nothing at all going on before the unmount. Would it help to capture the
> unmount with trace-cmd?
A huge number of negative dcache entries can cause a slow umount;
please check it with 'cat /proc/sys/fs/dentry-state'. The first column
is the count of used (active) entries, while the second is the unused
(negative) entries.
It can be dropped by 'echo 2 > /proc/sys/vm/drop_caches'.
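The check can be scripted, following the field interpretation above (first field active dentries, second unused/mostly negative):

```shell
# Print active vs. unused (mostly negative) dentry counts.
if [ -r /proc/sys/fs/dentry-state ]; then
    read -r used unused _ < /proc/sys/fs/dentry-state
    echo "active=${used} unused=${unused}"
fi
```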
Thanks
Shan Hai
> Here is another theory. Could it be that not the rsync's, but the rm's
> issued by rsnapshot are causing the problem? Would it help to serialize
> all "rm -Rf" calls? Those always delete the oldest backups, which can't
> possibly be in the cache and because of that all those inodes need to be
> read into memory during deletion. Maybe those rm's are filling up the
> XFS log?
>
> If that doesn't work out either i guess my only chance would be to
> partition the device with LVM and create a separate XFS for every
> rsnapshot instance. In that scenario every file system should get its
> own xfsaild, allowing them to run in parallel and not blocking each other?
>
> cheers,
> Michael
* Re: xfsaild in D state seems to be blocking all other i/o sporadically
2017-04-21 9:18 ` Shan Hai
@ 2017-04-22 8:38 ` Michael Weissenbacher
0 siblings, 0 replies; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-22 8:38 UTC (permalink / raw)
To: Shan Hai, Dave Chinner; +Cc: Darrick J. Wong, linux-xfs
On 21.04.2017 11:18, Shan Hai wrote:
>
>>
>> I also noticed that unmounting the file system takes a really long time
>> after the problem has occurred (up to around 5 minutes!). Even when there
>> was nothing at all going on before the unmount. Would it help to capture
>> the unmount with trace-cmd?
>
> A huge number of negative dentry cache entries can cause a slow umount;
> please check this with 'cat /proc/sys/fs/dentry-state'. The first column
> shows the used (active) entries, while the second shows the unused
> (negative) entries.
>
Does this qualify as "huge"?
# cat /proc/sys/fs/dentry-state
10689551 10539266 45 0 0 0
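The two leading numbers above can be read as follows (a small sketch using the sample output, with the column meaning Shan Hai describes):

```shell
# Parse the dentry-state sample above: column 1 is the total number of
# cached dentries, column 2 the unused (largely negative) ones. The
# remaining columns are not needed here.
set -- 10689551 10539266 45 0 0 0
echo "total=$1 unused=$2 (~$((100 * $2 / $1))% unused)"
```

Roughly 98% of the cached dentries here are unused, which does look "huge".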
> They can be dropped with 'echo 2 > /proc/sys/vm/drop_caches'.
>
Won't that drop all dentries and inodes from the cache, which is what I
should be trying to avoid?
>> Here is another theory. Could it be that not the rsync's, but the rm's
>> issued by rsnapshot are causing the problem? Would it help to serialize
>> all "rm -Rf" calls? Those always delete the oldest backups, which can't
>> possibly be in the cache, so all of their inodes need to be read into
>> memory during deletion. Maybe those rm's are filling up the XFS log?
>>
Answering myself here: after removing the "rm" calls from rsnapshot the
problem disappeared completely. I've now written a separate script which
processes all "rm -Rf *" calls sequentially. I would deduce that the
parallel deletion of big directory trees was pushing the filesystem
(and xfsaild) hard enough to cause those 10-15 minute stalls.
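The serialization script itself isn't shown in the thread; a minimal sketch of the idea (the function name and lock path are my assumptions) could look like:

```shell
# Serialize expensive tree removals: flock(1) ensures only one rm -rf
# runs at a time, so parallel rsnapshot instances queue up instead of
# hammering xfsaild together. LOCKFILE is an assumed, overridable path.
serial_rm() {
    lock="${LOCKFILE:-/tmp/serial-rm.lock}"
    for dir in "$@"; do
        # blocks until any other serial_rm holding the lock finishes
        flock "$lock" rm -rf -- "$dir"
    done
}
```

Each rsnapshot instance would then call serial_rm instead of rm -Rf, so deletions of old backup trees never overlap.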
Thanks to everyone who contributed to help me solve the problem.
Have a nice weekend,
Michael
Thread overview: 19+ messages
2017-04-19 10:58 xfsaild in D state seems to be blocking all other i/o sporadically Michael Weissenbacher
2017-04-19 12:12 ` Carlos Maiolino
2017-04-19 12:37 ` Brian Foster
2017-04-19 12:40 ` Michael Weissenbacher
2017-04-19 13:01 ` Michael Weissenbacher
2017-04-19 14:04 ` Carlos Maiolino
2017-04-19 14:20 ` Carlos Maiolino
2017-04-19 16:40 ` Michael Weissenbacher
2017-04-19 16:36 ` Michael Weissenbacher
2017-04-19 18:08 ` Brian Foster
2017-04-19 20:10 ` Michael Weissenbacher
2017-04-19 20:55 ` Darrick J. Wong
2017-04-19 21:47 ` Michael Weissenbacher
2017-04-19 23:48 ` Dave Chinner
2017-04-20 7:11 ` Michael Weissenbacher
2017-04-20 23:16 ` Dave Chinner
2017-04-21 7:43 ` Michael Weissenbacher
2017-04-21 9:18 ` Shan Hai
2017-04-22 8:38 ` Michael Weissenbacher