* xfsaild in D state seems to be blocking all other i/o sporadically
@ 2017-04-19 10:58 Michael Weissenbacher
  2017-04-19 12:12 ` Carlos Maiolino
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 10:58 UTC (permalink / raw)
  To: linux-xfs

Hi List!
I have a storage server which primarily does around 15-20 parallel
rsyncs, nothing special. Sometimes (3-4 times a day) I notice that all
I/O on the file system suddenly comes to a halt and the only process
that continues to do any I/O (according to iotop) is the process
xfsaild/md127. When this happens, xfsaild only does reads (according to
iotop) and is consistently in D state (according to top).
Unfortunately this can stay like that for 5-15 minutes. During this
time even a simple "ls" or "touch" blocks and gets stuck in D state.
All other running processes accessing the fs are of course also stuck
in D state. It is an XFS V5 filesystem.
Then, as suddenly as it began, everything goes back to normal and I/O
continues. The problem is accompanied by several "process blocked for
xxx seconds" messages in dmesg and also some dropped connections due to
network timeouts.

I've tried several things to remedy the problem, including:
  - changing I/O schedulers (tried noop, deadline and cfq). Deadline
seems to work best (the stall clears sooner than with the others).
  - removing all mount options (defaults + usrquota, grpquota)
  - upgrading to the latest 4.11.0-rc kernel (before that I was on 4.9.x)

None of the above seemed to make a significant difference.

xfs_info output of the fs in question:
meta-data=/dev/md127             isize=512    agcount=33,
agsize=268435440 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=8789917696, imaxpct=10
         =                       sunit=16     swidth=96 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
RAID-10, latest Firmware Updates

I would be happy to dig out more information if needed. How can I find
out if the RAID controller itself gets stuck? Nothing bad shows up in
the hardware and SCSI controller logs.

Thanks,
Michael


* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 10:58 xfsaild in D state seems to be blocking all other i/o sporadically Michael Weissenbacher
@ 2017-04-19 12:12 ` Carlos Maiolino
  2017-04-19 12:37   ` Brian Foster
  2017-04-19 12:40   ` Michael Weissenbacher
  0 siblings, 2 replies; 19+ messages in thread
From: Carlos Maiolino @ 2017-04-19 12:12 UTC (permalink / raw)
  To: Michael Weissenbacher; +Cc: linux-xfs

Hi,

On Wed, Apr 19, 2017 at 12:58:05PM +0200, Michael Weissenbacher wrote:
> Hi List!
> I have a storage server which primarily does around 15-20 parallel
> rsync's, nothing special. Sometimes (3-4 times a day) i notice that all
> I/O on the file system suddenly comes to a halt and the only process
> that continues to do any I/O (according to iotop) is the process
> xfsaild/md127. When this happens, xfsaild only does reads (according to
> iotop) and consistently in D State (according to top).
> Unfortunately this can sometimes stay like this for 5-15 minutes. During
> this time even a simple "ls" our "touch" would block and be stuck in D
> state. All other running processes accessing the fs are of course also
> stuck in D state. It is a XFS V5 filesystem.
> Then again, as sudden as it began, everything goes back to normal and
> I/O continues. The problem is accompanied with several "process blocked
> for xxx seconds" in dmesg and also some dropped connections due to
> network timeouts.
> 
> I've tried several things to remedy the problem, including:
>   - changing I/O schedulers (tried noop, deadline and cfq). Deadline
> seems to be best (the block goes away in less time compared with the
> others).
>   - removing all mount options (defaults + usrquota, grpquota)
>   - upgrading to the latest 4.11.0-rc kernel (before that i was on 4.9.x)
> 
> Nothing of the above seemed to have made a significant change to the
> problem.
> 
> xfs_info output of the fs in question:
> meta-data=/dev/md127             isize=512    agcount=33,
> agsize=268435440 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=8789917696, imaxpct=10
>          =                       sunit=16     swidth=96 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>

This is really not enough to give any idea of what might be happening.
It looks more like slow storage while xfsaild is flushing the log, but
we really need more information to get a better picture of what is
going on. Please have a look at:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Especially: the storage layout (RAID arrays, LVMs, thin provisioning,
etc.) and the dmesg output with the traces from the hung tasks.
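
If the traces have already rotated out of the dmesg buffer, something
like this (run as root, assuming sysrq is enabled on your kernel) will
dump the stacks of all blocked (D state) tasks on demand while the
stall is happening:

  echo w > /proc/sysrq-trigger
  dmesg | tail -n 300 > /tmp/blocked-tasks.txt

Just a suggestion for collecting the traces; adjust the output path as
you see fit.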

Cheers.

> Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
> RAID-10, latest Firmware Updates
> 
> I would be happy to dig out more information if needed. How can i find
> out if the RAID Controller itself gets stuck? Nothing bad shows up in
> the hardware and SCSI controller logs.
> 

-- 
Carlos


* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 12:12 ` Carlos Maiolino
@ 2017-04-19 12:37   ` Brian Foster
  2017-04-19 12:40   ` Michael Weissenbacher
  1 sibling, 0 replies; 19+ messages in thread
From: Brian Foster @ 2017-04-19 12:37 UTC (permalink / raw)
  To: Michael Weissenbacher, linux-xfs

On Wed, Apr 19, 2017 at 02:12:02PM +0200, Carlos Maiolino wrote:
> Hi,
> 
> On Wed, Apr 19, 2017 at 12:58:05PM +0200, Michael Weissenbacher wrote:
> > Hi List!
> > I have a storage server which primarily does around 15-20 parallel
> > rsync's, nothing special. Sometimes (3-4 times a day) i notice that all
> > I/O on the file system suddenly comes to a halt and the only process
> > that continues to do any I/O (according to iotop) is the process
> > xfsaild/md127. When this happens, xfsaild only does reads (according to
> > iotop) and consistently in D State (according to top).
> > Unfortunately this can sometimes stay like this for 5-15 minutes. During
> > this time even a simple "ls" our "touch" would block and be stuck in D
> > state. All other running processes accessing the fs are of course also
> > stuck in D state. It is a XFS V5 filesystem.
> > Then again, as sudden as it began, everything goes back to normal and
> > I/O continues. The problem is accompanied with several "process blocked
> > for xxx seconds" in dmesg and also some dropped connections due to
> > network timeouts.
> > 
> > I've tried several things to remedy the problem, including:
> >   - changing I/O schedulers (tried noop, deadline and cfq). Deadline
> > seems to be best (the block goes away in less time compared with the
> > others).
> >   - removing all mount options (defaults + usrquota, grpquota)
> >   - upgrading to the latest 4.11.0-rc kernel (before that i was on 4.9.x)
> > 
> > Nothing of the above seemed to have made a significant change to the
> > problem.
> > 
> > xfs_info output of the fs in question:
> > meta-data=/dev/md127             isize=512    agcount=33,
> > agsize=268435440 blks
> >          =                       sectsz=4096  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=8789917696, imaxpct=10
> >          =                       sunit=16     swidth=96 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal               bsize=4096   blocks=521728, version=2
> >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
> 
> This is really not enough to give any idea of what might be happening, although
> this looks more like a slow storage while xfsaild is flushing the log, but we
> really need more information to try to give a better idea of what is going on,
> please look at:
> 
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> Specially for: storage layout (RAID arrays, LVMs, thin provisioning, etc), and
> the dmesg output with the traces from the hang tasks.
> 

Information around memory usage might be particularly interesting here
as well. E.g., /proc/meminfo and /proc/slabinfo.
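
A rough sketch (untested, interval and output path are arbitrary) for
capturing both every few seconds across one of the stalls:

  while true; do
      date
      cat /proc/meminfo /proc/slabinfo
      sleep 5
  done > /tmp/mem-slab.log

...then just pull out the window that covers the stall.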

Brian

> Cheers.
> 
> > Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
> > RAID-10, latest Firmware Updates
> > 
> > I would be happy to dig out more information if needed. How can i find
> > out if the RAID Controller itself gets stuck? Nothing bad shows up in
> > the hardware and SCSI controller logs.
> > 
> 
> -- 
> Carlos


* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 12:12 ` Carlos Maiolino
  2017-04-19 12:37   ` Brian Foster
@ 2017-04-19 12:40   ` Michael Weissenbacher
  2017-04-19 13:01     ` Michael Weissenbacher
  1 sibling, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 12:40 UTC (permalink / raw)
  To: linux-xfs



On 19.04.2017 14:12, Carlos Maiolino wrote:
> Hi,
> 
> On Wed, Apr 19, 2017 at 12:58:05PM +0200, Michael Weissenbacher wrote:
>> Hi List!
>> I have a storage server which primarily does around 15-20 parallel
>> rsync's, nothing special. Sometimes (3-4 times a day) i notice that all
>> I/O on the file system suddenly comes to a halt and the only process
>> that continues to do any I/O (according to iotop) is the process
>> xfsaild/md127. When this happens, xfsaild only does reads (according to
>> iotop) and consistently in D State (according to top).
>> Unfortunately this can sometimes stay like this for 5-15 minutes. During
>> this time even a simple "ls" our "touch" would block and be stuck in D
>> state. All other running processes accessing the fs are of course also
>> stuck in D state. It is a XFS V5 filesystem.
>> Then again, as sudden as it began, everything goes back to normal and
>> I/O continues. The problem is accompanied with several "process blocked
>> for xxx seconds" in dmesg and also some dropped connections due to
>> network timeouts.
>>
>> I've tried several things to remedy the problem, including:
>>   - changing I/O schedulers (tried noop, deadline and cfq). Deadline
>> seems to be best (the block goes away in less time compared with the
>> others).
>>   - removing all mount options (defaults + usrquota, grpquota)
>>   - upgrading to the latest 4.11.0-rc kernel (before that i was on 4.9.x)
>>
>> Nothing of the above seemed to have made a significant change to the
>> problem.
>>
>> xfs_info output of the fs in question:
>> meta-data=/dev/md127             isize=512    agcount=33,
>> agsize=268435440 blks
>>          =                       sectsz=4096  attr=2, projid32bit=1
>>          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
>>          =                       reflink=0
>> data     =                       bsize=4096   blocks=8789917696, imaxpct=10
>>          =                       sunit=16     swidth=96 blks
>> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
>> log      =internal               bsize=4096   blocks=521728, version=2
>>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
> 
> This is really not enough to give any idea of what might be happening, although
> this looks more like a slow storage while xfsaild is flushing the log, but we
> really need more information to try to give a better idea of what is going on,
> please look at:
> 
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> Specially for: storage layout (RAID arrays, LVMs, thin provisioning, etc), and
> the dmesg output with the traces from the hang tasks.
> 
Ok, including all of the info requested above:
* Kernel version: 4.11.0-rc5 (currently), but also reproducible on 4.9.20+
* xfsprogs version: 4.9.0
* number of CPUs: 1 socket, 6 cores
* contents of /proc/meminfo, /proc/mounts, /proc/partitions: see below
* RAID layout: hardware RAID-10 + linear MD device (sdb == md127); XFS
on top of md127
md127 : active linear sdb[0]
      35159670784 blocks super 1.2 64k rounding
* LVM: no LVM
* type of disks: 12x 6TB SAS disks (enterprise class)
* write cache state of disks: disabled
* write cache of BBWC controller: enabled, 2GB
* xfs_info output: already included
* dmesg output: samples see below
* iostat output: samples see below
* vmstat output: will capture when it occurs the next time
* D state processes: will capture when it occurs the next time (see the
ps one-liner below this list); also see top output below
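
For the D state processes I plan to grab something like this the next
time it happens (just an idea, one line per blocked task plus its wait
channel; happy to capture something else if there is a preferred way):

  ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'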


>> Storage Subsystem: Dell Perc H730P Controller 2GB NVCACHE, 12 6TB Disks,
>> RAID-10, latest Firmware Updates
>>
>> I would be happy to dig out more information if needed. How can i find
>> out if the RAID Controller itself gets stuck? Nothing bad shows up in
>> the hardware and SCSI controller logs.
>>
> 

# cat /proc/meminfo
MemTotal:       65886208 kB
MemFree:         3490480 kB
MemAvailable:   41531280 kB
Buffers:              84 kB
Cached:         26125688 kB
SwapCached:           64 kB
Active:         20010544 kB
Inactive:        6142828 kB
Active(anon):        856 kB
Inactive(anon):    27576 kB
Active(file):   20009688 kB
Inactive(file):  6115252 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      64511996 kB
SwapFree:       64505852 kB
Dirty:                 8 kB
Writeback:             0 kB
AnonPages:         27536 kB
Mapped:            14000 kB
Shmem:               832 kB
Slab:           27037624 kB
SReclaimable:   20569360 kB
SUnreclaim:      6468264 kB
KernelStack:        2976 kB
PageTables:         6228 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    97455100 kB
Committed_AS:     210904 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       88820 kB
DirectMap2M:     7155712 kB
DirectMap1G:    61865984 kB


# cat /proc/mounts
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=10240k,nr_inodes=8233000,mode=755 0 0
devpts /dev/pts devpts
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=6588624k,mode=755 0 0
/dev/sda4 / xfs rw,relatime,attr2,inode64,usrquota,grpquota 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
pstore /sys/fs/pstore pstore rw,relatime 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=26079640k 0 0
/dev/sda2 /boot ext2 rw,relatime,block_validity,barrier,user_xattr,acl 0 0
-> /dev/md127 /backup xfs
rw,noatime,attr2,inode64,sunit=128,swidth=768,usrquota,grpquota 0 0
none /sys/kernel/config configfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0


# cat /proc/partitions
major minor  #blocks  name

   8        0  292421632 sda
   8        1       1024 sda1
   8        2    1021952 sda2
   8        3   64512000 sda3
   8        4  226884608 sda4
   8       16 35159801856 sdb
   9      127 35159670784 md127



top - 12:30:03 up 11 days,  7:07,  1 user,  load average: 22.89, 19.86, 12.53
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.4 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  65886208 total, 62923220 used,  2962988 free,       84 buffers
KiB Swap: 64511996 total,     6144 used, 64505852 free. 38382400 cached Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   1609 root      20   0       0      0      0 D   2.0  0.0  54:39.51 xfsaild/md127
1703883 root      20   0       0      0      0 D   0.0  0.0   2:28.31 kworker/u12:0
1704625 checker+  20   0   37512  11700   2416 D   0.0  0.0   1:19.68 rsync
1704627 checker+  20   0   67104   9396   1628 D   0.0  0.0   1:10.44 rsync
1704845 10003     20   0    9148   3836   1744 D   0.0  0.0   0:33.56 rm
1705051 nospam.+  20   0   24392   2240   1616 D   0.0  0.0   0:22.67 rsync
1705262 atacama+  20   0   20740   9120   2436 D   0.0  0.0   0:11.27 rsync
1705265 10021     20   0   38144  11404   3800 D   0.0  0.0   0:00.06 rsnapshot
1705371 smtp1.n+  20   0    7088   1892   1604 D   0.0  0.0   0:06.39 rm
1705373 kalahar+  20   0   38144  11440   3840 D   0.0  0.0   0:00.07 rsnapshot
1705475 cvs.net+  20   0   17248   4996   2332 D   0.0  0.0   0:04.66 rsync
1705584 hoth.ne+  20   0   38040  11488   3888 D   0.0  0.0   0:00.06 rsnapshot
1705880 10035     20   0   38120  11372   3768 D   0.0  0.0   0:00.07 rsnapshot
1705983 smtp2.n+  20   0   38144  11444   3844 D   0.0  0.0   0:00.07 rsnapshot
1706080 endor.n+  20   0   38124  11456   3844 D   0.0  0.0   0:00.07 rsnapshot
1706081 10031     20   0   38040  11400   3800 D   0.0  0.0   0:00.06 rsnapshot
1706181 10032     20   0   38144  11380   3780 D   0.0  0.0   0:00.07 rsnapshot
1706182 10004     20   0   38144  11384   3780 D   0.0  0.0   0:00.06 rsnapshot
1706388 dns2.ne+  20   0   38120  11376   3768 D   0.0  0.0   0:00.06 rsnapshot
1706485 10034     20   0   38144  11392   3788 D   0.0  0.0   0:00.06 rsnapshot
1706491 10022     20   0   38124  11372   3768 D   0.0  0.0   0:00.06 rsnapshot
1706786 wiki.ne+  20   0   38124  11404   3800 D   0.0  0.0   0:00.06 rsnapshot


iostat output when the problem occurs:
04/19/2017 12:30:17 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.17    0.00    0.00   99.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  190.40    0.00  3046.40     0.00    32.00     0.98    5.16    5.16    0.00   5.14  97.92

04/19/2017 12:30:22 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.20    0.00    0.00   99.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  159.80    0.00  2556.80     0.00    32.00     0.98    6.15    6.15    0.00   6.14  98.16

04/19/2017 12:30:27 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.23    0.03    0.00   99.70

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  235.80    0.00  3772.80     0.00    32.00     0.97    4.13    4.13    0.00   4.13  97.36

04/19/2017 12:30:32 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.93    0.00    1.17    0.07    0.00   97.83

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  161.00    3.40  2576.00    54.40    32.00     0.98    5.98    6.10    0.24   5.98  98.24

04/19/2017 12:30:37 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.20    0.00    0.00   99.77

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  194.80    0.00  3116.80     0.00    32.00     0.98    5.02    5.02    0.00   5.02  97.84

04/19/2017 12:30:42 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.20    0.00    0.00   99.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  192.20    0.00  3075.20     0.00    32.00     0.99    5.14    5.14    0.00   5.14  98.72
...
(after it suddenly stops blocking and starts writing again)
04/19/2017 12:31:37 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    2.51    3.38    0.00   94.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00  4911.80   55.60 4664.00   889.60 137985.60    58.85   122.10   25.18    8.95   25.37   0.20  94.80

04/19/2017 12:31:42 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.81    7.86    0.00   91.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00  2239.80    0.00 2592.20     0.00 70144.80    54.12   271.25  105.12    0.00  105.12   0.39 100.00

04/19/2017 12:31:47 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.14    0.00    1.57    7.90    0.00   89.39

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00  2620.40    0.00 2632.20     0.00 77292.00    58.73   270.84  102.98    0.00  102.98   0.38 100.00

04/19/2017 12:31:52 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.11    7.82    0.00   91.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00  2592.80    0.00 2661.40     0.00 76693.60    57.63   270.95  101.39    0.00  101.39   0.38 100.00

04/19/2017 12:32:32 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.87    7.43    0.00   91.66

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00  2007.60    0.00 2321.80     0.00 58285.60    50.21   271.18  116.05    0.00  116.05   0.43 100.00

04/19/2017 12:32:37 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.80    0.00    4.09   13.70    0.00   81.41

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.20  1234.00    6.40 1912.40    43.20 66422.40    69.28   224.64  118.18  135.38  118.13   0.52 100.00
...
followed by "normal" I/O with both reads and writes
04/19/2017 12:32:42 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.98    0.00   10.23   35.92    0.00   49.87

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               2.00     0.00  423.20  374.80  9460.00 66360.00   190.03    18.15   25.78   47.33    1.44   1.25 100.00

04/19/2017 12:32:47 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.26    0.00    7.95   56.36    0.00   26.43

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb              24.60     0.00 1826.00  188.00 53184.80  5222.40    58.00    17.24    8.25    9.08    0.21   0.50 100.00


with kind regards,
Michael


* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 12:40   ` Michael Weissenbacher
@ 2017-04-19 13:01     ` Michael Weissenbacher
  2017-04-19 14:04       ` Carlos Maiolino
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 13:01 UTC (permalink / raw)
  To: linux-xfs

> * dmesg output from hung tasks
[25012.200938] INFO: task kworker/u12:0:39771 blocked for more than 120
seconds.
[25012.201036]       Tainted: G            E   4.11.0-rc5-mw #1
[25012.201125] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.201220] kworker/u12:0   D    0 39771      2 0x00000000
[25012.201229] Workqueue: writeback wb_workfn (flush-9:127)
[25012.201231] Call Trace:
[25012.201241]  ? __schedule+0x2e0/0x910
[25012.201295]  ? xfs_buf_rele+0x5b/0x390 [xfs]
[25012.201300]  ? schedule+0x32/0x80
[25012.201344]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.201385]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.201428]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.201468]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.201507]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.201548]  ? xfs_iomap_write_allocate+0x1ea/0x360 [xfs]
[25012.201578]  ? xfs_bmapi_read+0x29f/0x320 [xfs]
[25012.201617]  ? xfs_map_blocks+0x1c8/0x240 [xfs]
[25012.201654]  ? xfs_do_writepage+0x257/0x710 [xfs]
[25012.201657]  ? invalid_page_referenced_vma+0x80/0x80
[25012.201661]  ? write_cache_pages+0x207/0x480
[25012.201696]  ? xfs_submit_ioend+0x170/0x170 [xfs]
[25012.201731]  ? xfs_vm_writepages+0xba/0xf0 [xfs]
[25012.201735]  ? __writeback_single_inode+0x3d/0x330
[25012.201737]  ? writeback_sb_inodes+0x23d/0x470
[25012.201741]  ? __writeback_inodes_wb+0x87/0xb0
[25012.201743]  ? wb_writeback+0x28e/0x320
[25012.201746]  ? wb_workfn+0x2d2/0x3a0
[25012.201748]  ? wb_workfn+0x2d2/0x3a0
[25012.201752]  ? process_one_work+0x15e/0x420
[25012.201754]  ? worker_thread+0x65/0x4b0
[25012.201756]  ? rescuer_thread+0x390/0x390
[25012.201759]  ? kthread+0x104/0x140
[25012.201762]  ? kthread_park+0x80/0x80
[25012.201764]  ? do_group_exit+0x39/0xb0
[25012.201767]  ? ret_from_fork+0x26/0x40
[25012.201777] INFO: task rm:44141 blocked for more than 120 seconds.
[25012.201868]       Tainted: G            E   4.11.0-rc5-mw #1
[25012.201957] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.202051] rm              D    0 44141  43912 0x00000000
[25012.202054] Call Trace:
[25012.202059]  ? __schedule+0x2e0/0x910
[25012.202097]  ? xfs_dir_open+0x59/0x70 [xfs]
[25012.202102]  ? schedule+0x32/0x80
[25012.202144]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.202183]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.202223]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.202262]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.202299]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.202339]  ? xfs_remove+0x13d/0x310 [xfs]
[25012.202379]  ? xfs_vn_unlink+0x55/0xa0 [xfs]
[25012.202383]  ? vfs_rmdir+0xac/0x120
[25012.202386]  ? do_rmdir+0x1c9/0x1e0
[25012.202389]  ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.202392] INFO: task rm:44149 blocked for more than 120 seconds.
[25012.202483]       Tainted: G            E   4.11.0-rc5-mw #1
[25012.202571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.202665] rm              D    0 44149  43923 0x00000000
[25012.202668] Call Trace:
[25012.202672]  ? __schedule+0x2e0/0x910
[25012.202676]  ? schedule+0x32/0x80
[25012.202716]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.202754]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.202793]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.202831]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.202867]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.202906]  ? xfs_remove+0x13d/0x310 [xfs]
[25012.202945]  ? xfs_vn_unlink+0x55/0xa0 [xfs]
[25012.202949]  ? vfs_unlink+0xe7/0x180
[25012.202952]  ? do_unlinkat+0x289/0x310
[25012.202955]  ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.202958] INFO: task rsync:44358 blocked for more than 120 seconds.
[25012.203050]       Tainted: G            E   4.11.0-rc5-mw #1
[25012.203138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.203233] rsync           D    0 44358  44355 0x00000000
[25012.203235] Call Trace:
[25012.203239]  ? __schedule+0x2e0/0x910
[25012.203242]  ? walk_component+0x4b/0x630
[25012.203245]  ? schedule+0x32/0x80
[25012.203284]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.203322]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.203361]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.203400]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.203436]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.203476]  ? xfs_link+0x123/0x2d0 [xfs]
[25012.203478]  ? __d_lookup+0x60/0x130
[25012.203517]  ? xfs_vn_link+0x65/0xa0 [xfs]
[25012.203520]  ? vfs_link+0x21a/0x2e0
[25012.203524]  ? SyS_link+0x22a/0x260
[25012.203527]  ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.203530] INFO: task rsync:44370 blocked for more than 120 seconds.
[25012.203622]       Tainted: G            E   4.11.0-rc5-mw #1
[25012.203710] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.203804] rsync           D    0 44370  44365 0x00000000
[25012.203806] Call Trace:
[25012.203811]  ? __schedule+0x2e0/0x910
[25012.203814]  ? schedule+0x32/0x80
[25012.203852]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.203889]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.203928]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.203965]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.204002]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.204041]  ? xfs_vn_update_time+0x5d/0x150 [xfs]
[25012.204044]  ? file_update_time+0xb7/0x110
[25012.204083]  ? xfs_file_aio_write_checks+0x176/0x1d0 [xfs]
[25012.204120]  ? xfs_file_buffered_aio_write+0x7a/0x260 [xfs]
[25012.204156]  ? xfs_file_write_iter+0x10b/0x150 [xfs]
[25012.204159]  ? __vfs_write+0xeb/0x160
[25012.204162]  ? vfs_write+0xb3/0x1a0
[25012.204165]  ? SyS_write+0x52/0xc0
[25012.204167]  ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.204171] INFO: task rsync:44480 blocked for more than 120 seconds.
[25012.204283]       Tainted: G            E   4.11.0-rc5-mw #1
[25012.204393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.204529] rsync           D    0 44480  44477 0x00000000
[25012.204531] Call Trace:
[25012.204536]  ? __schedule+0x2e0/0x910
[25012.204539]  ? walk_component+0x4b/0x630
[25012.204542]  ? schedule+0x32/0x80
[25012.204581]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.204619]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25012.204658]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.204696]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.204732]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.204772]  ? xfs_link+0x123/0x2d0 [xfs]
[25012.204774]  ? __d_lookup+0x60/0x130
[25012.204814]  ? xfs_vn_link+0x65/0xa0 [xfs]
[25012.204817]  ? vfs_link+0x21a/0x2e0
[25012.204820]  ? SyS_link+0x22a/0x260
[25012.204823]  ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25012.204827] INFO: task rsync:44581 blocked for more than 120 seconds.
[25012.204951]       Tainted: G            E   4.11.0-rc5-mw #1
[25012.205062] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25012.205200] rsync           D    0 44581  44580 0x00000000
[25012.205202] Call Trace:
[25012.205207]  ? __schedule+0x2e0/0x910
[25012.205245]  ? xfs_trans_free_items+0x75/0xa0 [xfs]
[25012.205249]  ? schedule+0x32/0x80
[25012.205286]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25012.205321]  ? xlog_grant_head_check+0xf8/0x110 [xfs]
[25012.205362]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25012.205401]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25012.205438]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25012.205469]  ? xfs_attr_set+0x1d4/0x440 [xfs]
[25012.205473]  ? d_splice_alias+0x185/0x3c0
[25012.205509]  ? xfs_xattr_set+0x4e/0x90 [xfs]
[25012.205515]  ? __vfs_setxattr+0x64/0x80
[25012.205520]  ? __vfs_setxattr_noperm+0x62/0x190
[25012.205523]  ? vfs_setxattr+0xa0/0xb0
[25012.205526]  ? setxattr+0x141/0x190
[25012.205531]  ? __check_object_size+0x173/0x19e
[25012.205535]  ? strncpy_from_user+0x48/0x160
[25012.205539]  ? path_setxattr+0xad/0xe0
[25012.205542]  ? SyS_lsetxattr+0xd/0x10
[25012.205545]  ? entry_SYSCALL_64_fastpath+0x1e/0xad
[25133.028169] INFO: task kworker/1:0:24106 blocked for more than 120
seconds.
[25133.028288]       Tainted: G            E   4.11.0-rc5-mw #1
[25133.028397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25133.028535] kworker/1:0     D    0 24106      2 0x00000000
[25133.028595] Workqueue: xfs-eofblocks/md127 xfs_eofblocks_worker [xfs]
[25133.028598] Call Trace:
[25133.028606]  ? __schedule+0x2e0/0x910
[25133.028609]  ? schedule+0x32/0x80
[25133.028655]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25133.028697]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25133.028739]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25133.028779]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25133.028817]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25133.028857]  ? xfs_free_eofblocks+0x120/0x1e0 [xfs]
[25133.028897]  ? xfs_inode_free_eofblocks+0x11e/0x140 [xfs]
[25133.028936]  ? xfs_inode_ag_walk.isra.13+0x19b/0x2b0 [xfs]
[25133.028973]  ? xfs_inode_clear_eofblocks_tag+0x70/0x70 [xfs]
[25133.028978]  ? intel_cpufreq_target+0x170/0x170
[25133.028982]  ? update_load_avg+0x858/0xa00
[25133.028985]  ? radix_tree_next_chunk+0x101/0x310
[25133.028989]  ? radix_tree_gang_lookup_tag+0xe4/0x150
[25133.029024]  ? xfs_inode_clear_eofblocks_tag+0x70/0x70 [xfs]
[25133.029062]  ? xfs_perag_get_tag+0x3d/0xe0 [xfs]
[25133.029097]  ? xfs_inode_clear_eofblocks_tag+0x70/0x70 [xfs]
[25133.029132]  ? xfs_inode_ag_iterator_tag+0x70/0xa0 [xfs]
[25133.029166]  ? xfs_eofblocks_worker+0x23/0x30 [xfs]
[25133.029169]  ? process_one_work+0x15e/0x420
[25133.029171]  ? worker_thread+0x65/0x4b0
[25133.029173]  ? rescuer_thread+0x390/0x390
[25133.029176]  ? kthread+0x104/0x140
[25133.029179]  ? kthread_park+0x80/0x80
[25133.029182]  ? do_group_exit+0x39/0xb0
[25133.029185]  ? ret_from_fork+0x26/0x40
[25133.029188] INFO: task kworker/u12:0:39771 blocked for more than 120
seconds.
[25133.029303]       Tainted: G            E   4.11.0-rc5-mw #1
[25133.029412] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25133.029549] kworker/u12:0   D    0 39771      2 0x00000000
[25133.029556] Workqueue: writeback wb_workfn (flush-9:127)
[25133.029558] Call Trace:
[25133.029563]  ? __schedule+0x2e0/0x910
[25133.029600]  ? xfs_buf_rele+0x5b/0x390 [xfs]
[25133.029605]  ? schedule+0x32/0x80
[25133.029647]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25133.029687]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25133.029728]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25133.029768]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25133.029807]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25133.029846]  ? xfs_iomap_write_allocate+0x1ea/0x360 [xfs]
[25133.029876]  ? xfs_bmapi_read+0x29f/0x320 [xfs]
[25133.029914]  ? xfs_map_blocks+0x1c8/0x240 [xfs]
[25133.029950]  ? xfs_do_writepage+0x257/0x710 [xfs]
[25133.029954]  ? invalid_page_referenced_vma+0x80/0x80
[25133.029957]  ? write_cache_pages+0x207/0x480
[25133.029992]  ? xfs_submit_ioend+0x170/0x170 [xfs]
[25133.030027]  ? xfs_vm_writepages+0xba/0xf0 [xfs]
[25133.030031]  ? __writeback_single_inode+0x3d/0x330
[25133.030033]  ? writeback_sb_inodes+0x23d/0x470
[25133.030037]  ? __writeback_inodes_wb+0x87/0xb0
[25133.030039]  ? wb_writeback+0x28e/0x320
[25133.030042]  ? wb_workfn+0x2d2/0x3a0
[25133.030044]  ? wb_workfn+0x2d2/0x3a0
[25133.030047]  ? process_one_work+0x15e/0x420
[25133.030049]  ? worker_thread+0x65/0x4b0
[25133.030051]  ? rescuer_thread+0x390/0x390
[25133.030054]  ? kthread+0x104/0x140
[25133.030057]  ? kthread_park+0x80/0x80
[25133.030059]  ? do_group_exit+0x39/0xb0
[25133.030062]  ? ret_from_fork+0x26/0x40
[25133.030070] INFO: task rm:44141 blocked for more than 120 seconds.
[25133.030182]       Tainted: G            E   4.11.0-rc5-mw #1
[25133.030291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[25133.030428] rm              D    0 44141  43912 0x00000000
[25133.030430] Call Trace:
[25133.030435]  ? __schedule+0x2e0/0x910
[25133.030472]  ? xfs_dir_open+0x59/0x70 [xfs]
[25133.030477]  ? schedule+0x32/0x80
[25133.030518]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
[25133.030557]  ? xlog_grant_head_check+0x9a/0x110 [xfs]
[25133.030597]  ? xfs_log_reserve+0xd4/0x1f0 [xfs]
[25133.030635]  ? xfs_trans_reserve+0x182/0x1f0 [xfs]
[25133.030671]  ? xfs_trans_alloc+0xb4/0x130 [xfs]
[25133.030711]  ? xfs_remove+0x13d/0x310 [xfs]
[25133.030751]  ? xfs_vn_unlink+0x55/0xa0 [xfs]
[25133.030755]  ? vfs_rmdir+0xac/0x120
[25133.030758]  ? do_rmdir+0x1c9/0x1e0
[25133.030761]  ? entry_SYSCALL_64_fastpath+0x1e/0xad



* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 13:01     ` Michael Weissenbacher
@ 2017-04-19 14:04       ` Carlos Maiolino
  2017-04-19 14:20         ` Carlos Maiolino
  2017-04-19 16:36         ` Michael Weissenbacher
  0 siblings, 2 replies; 19+ messages in thread
From: Carlos Maiolino @ 2017-04-19 14:04 UTC (permalink / raw)
  To: linux-xfs; +Cc: mw

If I am not blind, the task currently owning the CPU isn't listed
there; I suppose it is already running.

Next time you hit it, could you please paste the stack trace of the
task owning the CPU? According to your report I suppose that is
xfsaild.

A simple `cat /proc/<pid>/stack` is probably enough.
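
For example (untested sketch; sample count and output path are
arbitrary), grabbing it a few times in a row while the stall lasts
would be even better:

  pid=$(pgrep -x 'xfsaild/md127')
  for i in $(seq 1 10); do
      date
      cat /proc/$pid/stack
      sleep 1
  done > /tmp/xfsaild-stack.log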

thanks

> > * dmesg output from hung tasks
> [25012.200938] INFO: task kworker/u12:0:39771 blocked for more than 120
> seconds.
> [25012.201036]       Tainted: G            E   4.11.0-rc5-mw #1
> [25012.201125] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.201220] kworker/u12:0   D    0 39771      2 0x00000000
> [25012.201229] Workqueue: writeback wb_workfn (flush-9:127)
> [25012.201231] Call Trace:
> [25012.201241]  ? __schedule+0x2e0/0x910
> [25012.201295]  ? xfs_buf_rele+0x5b/0x390 [xfs]
> [25012.201300]  ? schedule+0x32/0x80
> [25012.201344]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
> [25012.201385]  ? xlog_grant_head_check+0x9a/0x110 [xfs]

> [25012.201777] INFO: task rm:44141 blocked for more than 120 seconds.
> [25012.201868]       Tainted: G            E   4.11.0-rc5-mw #1
> [25012.201957] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.202051] rm              D    0 44141  43912 0x00000000
> [25012.202054] Call Trace:
> [25012.202059]  ? __schedule+0x2e0/0x910
> [25012.202097]  ? xfs_dir_open+0x59/0x70 [xfs]
> [25012.202102]  ? schedule+0x32/0x80

> [25012.202392] INFO: task rm:44149 blocked for more than 120 seconds.
> [25012.202483]       Tainted: G            E   4.11.0-rc5-mw #1
> [25012.202571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.202665] rm              D    0 44149  43923 0x00000000
> [25012.202668] Call Trace:
> [25012.202672]  ? __schedule+0x2e0/0x910
> [25012.202676]  ? schedule+0x32/0x80

> [25012.202958] INFO: task rsync:44358 blocked for more than 120 seconds.
> [25012.203050]       Tainted: G            E   4.11.0-rc5-mw #1
> [25012.203138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.203233] rsync           D    0 44358  44355 0x00000000
> [25012.203235] Call Trace:
> [25012.203239]  ? __schedule+0x2e0/0x910
> [25012.203242]  ? walk_component+0x4b/0x630
> [25012.203245]  ? schedule+0x32/0x80
> [25012.203284]  ? xlog_grant_head_wait+0xca/0x200 [xfs]

> [25012.203530] INFO: task rsync:44370 blocked for more than 120 seconds.
> [25012.203622]       Tainted: G            E   4.11.0-rc5-mw #1
> [25012.203710] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.203804] rsync           D    0 44370  44365 0x00000000
> [25012.203806] Call Trace:
> [25012.203811]  ? __schedule+0x2e0/0x910
> [25012.203814]  ? schedule+0x32/0x80
> [25012.203852]  ? xlog_grant_head_wait+0xca/0x200 [xfs]
> [25012.203889]  ? xlog_grant_head_check+0x9a/0x110 [xfs]

> [25012.204171] INFO: task rsync:44480 blocked for more than 120 seconds.
> [25012.204283]       Tainted: G            E   4.11.0-rc5-mw #1
> [25012.204393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.204529] rsync           D    0 44480  44477 0x00000000
> [25012.204531] Call Trace:
> [25012.204536]  ? __schedule+0x2e0/0x910
> [25012.204539]  ? walk_component+0x4b/0x630
> [25012.204542]  ? schedule+0x32/0x80

> [25012.204827] INFO: task rsync:44581 blocked for more than 120 seconds.
> [25012.204951]       Tainted: G            E   4.11.0-rc5-mw #1
> [25012.205062] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25012.205200] rsync           D    0 44581  44580 0x00000000
> [25012.205202] Call Trace:
> [25012.205207]  ? __schedule+0x2e0/0x910
> [25012.205245]  ? xfs_trans_free_items+0x75/0xa0 [xfs]
> [25012.205249]  ? schedule+0x32/0x80
> [25012.205286]  ? xlog_grant_head_wait+0xca/0x200 [xfs]

> [25133.028169] INFO: task kworker/1:0:24106 blocked for more than 120
> seconds.
> [25133.028288]       Tainted: G            E   4.11.0-rc5-mw #1
> [25133.028397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25133.028535] kworker/1:0     D    0 24106      2 0x00000000
> [25133.028595] Workqueue: xfs-eofblocks/md127 xfs_eofblocks_worker [xfs]
> [25133.028598] Call Trace:
> [25133.028606]  ? __schedule+0x2e0/0x910
> [25133.028609]  ? schedule+0x32/0x80
> [25133.028655]  ? xlog_grant_head_wait+0xca/0x200 [xfs]

> [25133.029188] INFO: task kworker/u12:0:39771 blocked for more than 120
> seconds.
> [25133.029303]       Tainted: G            E   4.11.0-rc5-mw #1
> [25133.029412] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25133.029549] kworker/u12:0   D    0 39771      2 0x00000000
> [25133.029556] Workqueue: writeback wb_workfn (flush-9:127)
> [25133.029558] Call Trace:
> [25133.029563]  ? __schedule+0x2e0/0x910
> [25133.029600]  ? xfs_buf_rele+0x5b/0x390 [xfs]
> [25133.029605]  ? schedule+0x32/0x80
> [25133.029647]  ? xlog_grant_head_wait+0xca/0x200 [xfs]

> [25133.030070] INFO: task rm:44141 blocked for more than 120 seconds.
> [25133.030182]       Tainted: G            E   4.11.0-rc5-mw #1
> [25133.030291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [25133.030428] rm              D    0 44141  43912 0x00000000
> [25133.030430] Call Trace:
> [25133.030435]  ? __schedule+0x2e0/0x910
> [25133.030472]  ? xfs_dir_open+0x59/0x70 [xfs]
> [25133.030477]  ? schedule+0x32/0x80
> 


-- 
Carlos


* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 14:04       ` Carlos Maiolino
@ 2017-04-19 14:20         ` Carlos Maiolino
  2017-04-19 16:40           ` Michael Weissenbacher
  2017-04-19 16:36         ` Michael Weissenbacher
  1 sibling, 1 reply; 19+ messages in thread
From: Carlos Maiolino @ 2017-04-19 14:20 UTC (permalink / raw)
  To: linux-xfs, mw

On Wed, Apr 19, 2017 at 04:04:46PM +0200, Carlos Maiolino wrote:
> If I am not blind, the task currently owning the cpu isn't listed there, which I
> suppose to be already running.
> 
> Next time you hit it, could you please paste the stack trace of the task owning
> the cpu? Which I suppose to be xfsaild according to your report.
> 
> A simple `cat /proc/<pid>/stack` is probably enough
> 
> thanks
> 

Btw, I'm not very good at the XFS journaling code, but looking at the
stack traces you provided I wonder whether your journal is too small;
with your workload it might be forced to flush the log too often,
keeping the disk busy and slowing down the rest of the I/O.

Other folks with more knowledge of the journaling code might be able to
confirm or refute this, but having so many tasks waiting in:

xlog_grant_head_wait()

made me wonder whether your journal is too small for the filesystem
size and your current workload.

Again, this is just a guess, but maybe somebody else can put another
pair of eyes on it.

cheers

-- 
Carlos


* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 14:04       ` Carlos Maiolino
  2017-04-19 14:20         ` Carlos Maiolino
@ 2017-04-19 16:36         ` Michael Weissenbacher
  2017-04-19 18:08           ` Brian Foster
  1 sibling, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 16:36 UTC (permalink / raw)
  To: linux-xfs

On 19.04.2017 16:04, Carlos Maiolino wrote:
> If I am not blind, the task currently owning the cpu isn't listed there, which I
> suppose to be already running.
Correct, the only process utilizing the CPU is xfsaild. It is also the
only process doing I/O, and it looks like it's doing only reads. It
(xfsaild) never shows up in dmesg as blocked.

> Next time you hit it, could you please paste the stack trace of the task owning
> the cpu? Which I suppose to be xfsaild according to your report.
> 
> A simple `cat /proc/<pid>/stack` is probably enough
Sure, will do.



* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 14:20         ` Carlos Maiolino
@ 2017-04-19 16:40           ` Michael Weissenbacher
  0 siblings, 0 replies; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 16:40 UTC (permalink / raw)
  To: linux-xfs

> 
> Btw, I'm not very good at xfs journaling code, but I wonder by the stack traces
> you provided, if you don't have a too small journal, which, with your workload,
> it might be getting to flush the logs too many times, making the disk too busy
> and slowing down the rest of the IO.
> 
A good hint. Unfortunately, I don't think there is any way to increase
the size of an internal log post-mkfs?
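
For what it's worth, if I am reading the xfs_info output above
correctly, the internal log is already

  521728 blocks * 4096 bytes/block ~= 2038 MiB ~= 2 GiB

so it does not look particularly small to begin with.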


* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 16:36         ` Michael Weissenbacher
@ 2017-04-19 18:08           ` Brian Foster
  2017-04-19 20:10             ` Michael Weissenbacher
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2017-04-19 18:08 UTC (permalink / raw)
  To: Michael Weissenbacher; +Cc: linux-xfs

On Wed, Apr 19, 2017 at 06:36:00PM +0200, Michael Weissenbacher wrote:
> On 19.04.2017 16:04, Carlos Maiolino wrote:
> > If I am not blind, the task currently owning the cpu isn't listed there, which I
> > suppose to be already running.
> Correct, the only process utilizing the CPU is xfsaild. It is also the
> only process doing I/O, and it looks like it's doing only reads. It
> (xfsaild) never shows up in dmesg as blocked.
> 

A snippet of tracepoint output from the fs might also be interesting
when you hit this state. E.g., 'trace-cmd start -e xfs:*' and save the
output of /sys/kernel/debug/tracing/trace_pipe to a file for a few
seconds or so.
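
Something along these lines should do it, assuming trace-cmd and
debugfs are available (capture length and output path are arbitrary,
adjust as needed):

  trace-cmd start -e 'xfs:*'
  timeout 10 cat /sys/kernel/debug/tracing/trace_pipe > /tmp/xfs-trace.txt
  trace-cmd stop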

> > Next time you hit it, could you please paste the stack trace of the task owning
> > the cpu? Which I suppose to be xfsaild according to your report.
> > 
> > A simple `cat /proc/<pid>/stack` is probably enough
> Sure, will do.

BTW, is the original /proc/meminfo output you posted from the system in
a normal state or when the problem occurs? If the former, could you
include it while the problem occurs as well?

Brian

> 


* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 18:08           ` Brian Foster
@ 2017-04-19 20:10             ` Michael Weissenbacher
  2017-04-19 20:55               ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 20:10 UTC (permalink / raw)
  To: linux-xfs

Hi Brian!

On 2017-04-19 20:08, Brian Foster wrote:
> On Wed, Apr 19, 2017 at 06:36:00PM +0200, Michael Weissenbacher wrote:
>> On 19.04.2017 16:04, Carlos Maiolino wrote:
>>> If I am not blind, the task currently owning the cpu isn't listed there, which I
>>> suppose to be already running.
>> Correct, the only process utilizing the CPU is xfsaild. It is also the
>> only process doing I/O, and it looks like it's doing only reads. It
>> (xfsaild) never shows up in dmesg as blocked.
>>
> 
> A snippet of tracepoint output from the fs might also be interesting
> when you hit this state. E.g., 'trace-cmd start -e xfs:*' and save the
> output of /sys/kernel/debug/tracing/trace_pipe to a file for a few
> seconds or so.
> 
See:
https://we.tl/7AAyBZxt4R

> BTW, is the original /proc/meminfo output you posted from the system in
> a normal state or when the problem occurs? If the former, could you
> include it while the problem occurs as well?
It was from a normal state.

All of the following was captured while the problem was occurring.

*** snip ***
/proc/meminfo

Wed Apr 19 20:36:50 CEST 2017
MemTotal:       65886208 kB
MemFree:         2923552 kB
MemAvailable:   43018524 kB
Buffers:              52 kB
Cached:         32416864 kB
SwapCached:          580 kB
Active:         30685760 kB
Inactive:        2051532 kB
Active(anon):     165844 kB
Inactive(anon):   155516 kB
Active(file):   30519916 kB
Inactive(file):  1896016 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      64511996 kB
SwapFree:       64506540 kB
Dirty:            202368 kB
Writeback:             0 kB
AnonPages:        319948 kB
Mapped:            18672 kB
Shmem:               868 kB
Slab:           22817492 kB
SReclaimable:   16332540 kB
SUnreclaim:      6484952 kB
KernelStack:        4048 kB
PageTables:        16516 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    97455100 kB
Committed_AS:     516532 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       88820 kB
DirectMap2M:     7155712 kB
DirectMap1G:    61865984 kB
*** snip ***

*** snip ***
Wed Apr 19 20:36:51 CEST 2017
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata
<active_slabs> <num_slabs> <sharedavail>
kcopyd_job             0      0   3240    2    2 : tunables   24   12
8 : slabdata      0      0      0
dm_uevent              0      0   2632    3    2 : tunables   24   12
8 : slabdata      0      0      0
t10_pr_reg_cache       0      0    704   11    2 : tunables   54   27
8 : slabdata      0      0      0
ext4_groupinfo_4k      8     28    144   28    1 : tunables  120   60
8 : slabdata      1      1      0
ext4_inode_cache     328    330   1072    3    1 : tunables   24   12
8 : slabdata    110    110      0
ext4_allocation_context      0      0    128   32    1 : tunables  120
60    8 : slabdata      0      0      0
ext4_io_end            0      0     64   64    1 : tunables  120   60
8 : slabdata      0      0      0
ext4_extent_status     36    198     40   99    1 : tunables  120   60
 8 : slabdata      2      2      0
jbd2_journal_head      0      0    112   36    1 : tunables  120   60
8 : slabdata      0      0      0
jbd2_revoke_table_s      0      0     16  240    1 : tunables  120   60
  8 : slabdata      0      0      0
jbd2_revoke_record_s      0      0     32  124    1 : tunables  120   60
   8 : slabdata      0      0      0
fscrypt_info           0      0     24  163    1 : tunables  120   60
8 : slabdata      0      0      0
fscrypt_ctx            0      0     48   83    1 : tunables  120   60
8 : slabdata      0      0      0
mbcache                0      0     56   71    1 : tunables  120   60
8 : slabdata      0      0      0
kvm_async_pf           0      0    136   30    1 : tunables  120   60
8 : slabdata      0      0      0
kvm_vcpu               0      0  19136    1    8 : tunables    8    4
0 : slabdata      0      0      0
kvm_mmu_page_header      0      0    168   24    1 : tunables  120   60
  8 : slabdata      0      0      0
xfs_dqtrx             14     14    528    7    1 : tunables   54   27
8 : slabdata      2      2      0
xfs_dquot            149    168    472    8    1 : tunables   54   27
8 : slabdata     21     21      0
xfs_rui_item           0      0    664    6    1 : tunables   54   27
8 : slabdata      0      0      0
xfs_rud_item           0      0    144   28    1 : tunables  120   60
8 : slabdata      0      0      0
xfs_icr                0      0    152   26    1 : tunables  120   60
8 : slabdata      0      0      0
xfs_inode         12898650 12907268    960    4    1 : tunables   54
27    8 : slabdata 3226817 3226817      0
xfs_efd_item           0      0    408   10    1 : tunables   54   27
8 : slabdata      0      0      0
xfs_da_state           0      0    480    8    1 : tunables   54   27
8 : slabdata      0      0      0
xfs_btree_cur        105    153    232   17    1 : tunables  120   60
8 : slabdata      9      9      0
bio-2                 52     96    320   12    1 : tunables   54   27
8 : slabdata      8      8      0
ip6-frags              0      0    200   20    1 : tunables  120   60
8 : slabdata      0      0      0
UDPv6                 18     28   1152    7    2 : tunables   24   12
8 : slabdata      4      4      0
tw_sock_TCPv6          0      0    240   17    1 : tunables  120   60
8 : slabdata      0      0      0
request_sock_TCPv6      0      0    304   13    1 : tunables   54   27
 8 : slabdata      0      0      0
TCPv6                  7     12   2112    3    2 : tunables   24   12
8 : slabdata      4      4      0
cfq_queue          99928  99977    240   17    1 : tunables  120   60
8 : slabdata   5881   5881      0
mqueue_inode_cache      1      4    896    4    1 : tunables   54   27
 8 : slabdata      1      1      0
userfaultfd_ctx_cache      0      0    128   32    1 : tunables  120
60    8 : slabdata      0      0      0
dio                    0      0    640    6    1 : tunables   54   27
8 : slabdata      0      0      0
pid_namespace          0      0   2232    3    2 : tunables   24   12
8 : slabdata      0      0      0
posix_timers_cache      0      0    216   18    1 : tunables  120   60
 8 : slabdata      0      0      0
ip4-frags             34    132    184   22    1 : tunables  120   60
8 : slabdata      6      6      0
PING                   0      0    896    4    1 : tunables   54   27
8 : slabdata      0      0      0
UDP                   44     44    960    4    1 : tunables   54   27
8 : slabdata     11     11      0
tw_sock_TCP            1     17    240   17    1 : tunables  120   60
8 : slabdata      1      1      0
request_sock_TCP       0      0    304   13    1 : tunables   54   27
8 : slabdata      0      0      0
TCP                   18     22   1920    2    1 : tunables   24   12
8 : slabdata     11     11      0
hugetlbfs_inode_cache      2      6    600    6    1 : tunables   54
27    8 : slabdata      1      1      0
dquot                  0      0    256   16    1 : tunables  120   60
8 : slabdata      0      0      0
request_queue          3     15   1488    5    2 : tunables   24   12
8 : slabdata      3      3      0
blkdev_ioc           100    234    104   39    1 : tunables  120   60
8 : slabdata      6      6      0
user_namespace         0      0    448    9    1 : tunables   54   27
8 : slabdata      0      0      0
dmaengine-unmap-256      1      3   2112    3    2 : tunables   24   12
  8 : slabdata      1      1      0
dmaengine-unmap-128      1      7   1088    7    2 : tunables   24   12
  8 : slabdata      1      1      0
sock_inode_cache     179    204    640    6    1 : tunables   54   27
8 : slabdata     34     34      0
file_lock_cache       83     95    208   19    1 : tunables  120   60
8 : slabdata      5      5      0
net_namespace          0      0   5888    1    2 : tunables    8    4
0 : slabdata      0      0      0
shmem_inode_cache    775    825    688   11    2 : tunables   54   27
8 : slabdata     75     75      0
taskstats              2     12    328   12    1 : tunables   54   27
8 : slabdata      1      1      0
proc_inode_cache    2024   2190    656    6    1 : tunables   54   27
8 : slabdata    343    365    200
sigqueue          12104736 12110275    160   25    1 : tunables  120
60    8 : slabdata 484411 484411      0
bdev_cache             8     20    832    4    1 : tunables   54   27
8 : slabdata      5      5      0
kernfs_node_cache  29210  29240    120   34    1 : tunables  120   60
8 : slabdata    860    860      0
mnt_cache         652927 827520    384   10    1 : tunables   54   27
8 : slabdata  82752  82752      0
inode_cache        12968  12985    584    7    1 : tunables   54   27
8 : slabdata   1854   1855      0
dentry            12741855 12742065    192   21    1 : tunables  120
60    8 : slabdata 606757 606765    480
iint_cache             0      0     72   56    1 : tunables  120   60
8 : slabdata      0      0      0
buffer_head       127140 127140    104   39    1 : tunables  120   60
8 : slabdata   3260   3260      0
fs_cache             171    378     64   63    1 : tunables  120   60
8 : slabdata      6      6      0
files_cache          158    220    704   11    2 : tunables   54   27
8 : slabdata     20     20      0
signal_cache         425    508   1024    4    1 : tunables   54   27
8 : slabdata    127    127      0
sighand_cache        261    294   2112    3    2 : tunables   24   12
8 : slabdata     98     98      0
task_struct          278    295   3648    1    1 : tunables   24   12
8 : slabdata    278    295      0
cred_jar           15411  17451    192   21    1 : tunables  120   60
8 : slabdata    831    831     60
Acpi-Operand      147223 147504     72   56    1 : tunables  120   60
8 : slabdata   2634   2634      0
Acpi-Parse            69    355     56   71    1 : tunables  120   60
8 : slabdata      5      5      0
Acpi-State             0      0     80   51    1 : tunables  120   60
8 : slabdata      0      0      0
Acpi-Namespace     13264  13365     40   99    1 : tunables  120   60
8 : slabdata    135    135      0
anon_vma_chain     11212  14336     64   64    1 : tunables  120   60
8 : slabdata    224    224     60
anon_vma            6167   7000     72   56    1 : tunables  120   60
8 : slabdata    125    125      0
pid                  312    576    128   32    1 : tunables  120   60
8 : slabdata     18     18      0
numa_policy            5    163     24  163    1 : tunables  120   60
8 : slabdata      1      1      0
trace_event_file    2418   2530     88   46    1 : tunables  120   60
8 : slabdata     55     55      0
ftrace_event_field   3691   3901     48   83    1 : tunables  120   60
 8 : slabdata     47     47      0
pool_workqueue        26     96    256   16    1 : tunables  120   60
8 : slabdata      6      6      0
radix_tree_node   1644760 1707538    576    7    1 : tunables   54   27
  8 : slabdata 243934 243934      0
task_group            43     84    576    7    1 : tunables   54   27
8 : slabdata     12     12      0
dma-kmalloc-4194304      0      0 4194304    1 1024 : tunables    1    1
   0 : slabdata      0      0      0
dma-kmalloc-2097152      0      0 2097152    1  512 : tunables    1    1
   0 : slabdata      0      0      0
dma-kmalloc-1048576      0      0 1048576    1  256 : tunables    1    1
   0 : slabdata      0      0      0
dma-kmalloc-524288      0      0 524288    1  128 : tunables    1    1
 0 : slabdata      0      0      0
dma-kmalloc-262144      0      0 262144    1   64 : tunables    1    1
 0 : slabdata      0      0      0
dma-kmalloc-131072      0      0 131072    1   32 : tunables    8    4
 0 : slabdata      0      0      0
dma-kmalloc-65536      0      0  65536    1   16 : tunables    8    4
0 : slabdata      0      0      0
dma-kmalloc-32768      0      0  32768    1    8 : tunables    8    4
0 : slabdata      0      0      0
dma-kmalloc-16384      0      0  16384    1    4 : tunables    8    4
0 : slabdata      0      0      0
dma-kmalloc-8192       0      0   8192    1    2 : tunables    8    4
0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    1    1 : tunables   24   12
8 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048    2    1 : tunables   24   12
8 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024    4    1 : tunables   54   27
8 : slabdata      0      0      0
dma-kmalloc-512        4      8    512    8    1 : tunables   54   27
8 : slabdata      1      1      0
dma-kmalloc-256        0      0    256   16    1 : tunables  120   60
8 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables  120   60
8 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables  120   60
8 : slabdata      0      0      0
dma-kmalloc-32         4    124     32  124    1 : tunables  120   60
8 : slabdata      1      1      0
dma-kmalloc-192        0      0    192   21    1 : tunables  120   60
8 : slabdata      0      0      0
dma-kmalloc-96         0      0    128   32    1 : tunables  120   60
8 : slabdata      0      0      0
kmalloc-4194304        0      0 4194304    1 1024 : tunables    1    1
 0 : slabdata      0      0      0
kmalloc-2097152        0      0 2097152    1  512 : tunables    1    1
 0 : slabdata      0      0      0
kmalloc-1048576        0      0 1048576    1  256 : tunables    1    1
 0 : slabdata      0      0      0
kmalloc-524288         1      1 524288    1  128 : tunables    1    1
0 : slabdata      1      1      0
kmalloc-262144         3      3 262144    1   64 : tunables    1    1
0 : slabdata      3      3      0
kmalloc-131072         3      3 131072    1   32 : tunables    8    4
0 : slabdata      3      3      0
kmalloc-65536         13     13  65536    1   16 : tunables    8    4
0 : slabdata     13     13      0
kmalloc-32768          4      4  32768    1    8 : tunables    8    4
0 : slabdata      4      4      0
kmalloc-16384         23     23  16384    1    4 : tunables    8    4
0 : slabdata     23     23      0
kmalloc-8192       44121  44121   8192    1    2 : tunables    8    4
0 : slabdata  44121  44121      0
kmalloc-4096       11850  11850   4096    1    1 : tunables   24   12
8 : slabdata  11850  11850     24
kmalloc-2048       18655  18718   2048    2    1 : tunables   24   12
8 : slabdata   9359   9359      0
kmalloc-1024      584913 603244   1024    4    1 : tunables   54   27
8 : slabdata 150811 150811      0
kmalloc-512       856079 1020576    512    8    1 : tunables   54   27
 8 : slabdata 127572 127572      0
kmalloc-256       114928 141968    256   16    1 : tunables  120   60
8 : slabdata   8873   8873    480
kmalloc-192       244249 272643    192   21    1 : tunables  120   60
8 : slabdata  12983  12983      0
kmalloc-96        3139278 3139328    128   32    1 : tunables  120   60
  8 : slabdata  98104  98104      0
kmalloc-64        25714408 33297024     64   64    1 : tunables  120
60    8 : slabdata 520266 520266      0
kmalloc-32        423132 631036     32  124    1 : tunables  120   60
8 : slabdata   5089   5089      0
kmalloc-128       213657 269952    128   32    1 : tunables  120   60
8 : slabdata   8436   8436      0
*** snip ***

*** snip ***
/proc/<pidof_xfsaild>/stack
Wed Apr 19 20:36:52 CEST 2017
[<ffffffff8ef3881e>] ring_buffer_unlock_commit+0x1e/0xa0
[<ffffffff8ef419a7>] trace_buffer_unlock_commit_regs+0x37/0x90
[<ffffffff8eea1e20>] wake_up_q+0x60/0x60
[<ffffffffc0719b5b>] xfs_buf_read_map+0x10b/0x180 [xfs]
[<ffffffffc0719888>] xfs_buf_submit_wait+0x78/0x210 [xfs]
[<ffffffffc074d2bd>] xfs_trans_read_buf_map+0xed/0x300 [xfs]
[<ffffffffc0719b5b>] xfs_buf_read_map+0x10b/0x180 [xfs]
[<ffffffffc074d2bd>] xfs_trans_read_buf_map+0xed/0x300 [xfs]
[<ffffffffc0702b77>] xfs_imap_to_bp+0x67/0xe0 [xfs]
[<ffffffffc072e84b>] xfs_iflush+0xdb/0x240 [xfs]
[<ffffffffc07437f9>] xfs_inode_item_push+0x69/0x140 [xfs]
[<ffffffffc0743886>] xfs_inode_item_push+0xf6/0x140 [xfs]
[<ffffffffc074bf25>] xfsaild+0x385/0x7e0 [xfs]
[<ffffffffc074bba0>] xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[<ffffffff8ee96d64>] kthread+0x104/0x140
[<ffffffff8ee96d64>] kthread+0x104/0x140
[<ffffffff8ee96c60>] kthread_park+0x80/0x80
[<ffffffff8ee7ba79>] do_group_exit+0x39/0xb0
[<ffffffff8f4370b6>] ret_from_fork+0x26/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
*** snip ***

*** snip ***
iotop --batch -o output

Total DISK READ :    1948.60 K/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:     974.30 K/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root     1948.60 K/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :    1924.56 K/s | Total DISK WRITE :       3.94 K/s
Actual DISK READ:    1956.11 K/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root     1924.56 K/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :       2.53 M/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       2.51 M/s | Actual DISK WRITE:      63.77 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root        2.53 M/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :       2.09 M/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       2.10 M/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root        2.09 M/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :      12.12 M/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:      12.10 M/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root       12.12 M/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :       2.30 M/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       2.28 M/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root        2.30 M/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :       4.86 M/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       4.88 M/s | Actual DISK WRITE:    2025.26 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root        4.86 M/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :       2.64 M/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       2.62 M/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root        2.64 M/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :    1899.36 K/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:    1899.36 K/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root     1899.36 K/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
Total DISK READ :    1837.53 K/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:    1837.53 K/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   1609 be/4 root     1837.53 K/s    0.00 B/s  0.00 %  0.00 %
[xfsaild/md127]
*** snip ***


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 20:10             ` Michael Weissenbacher
@ 2017-04-19 20:55               ` Darrick J. Wong
  2017-04-19 21:47                 ` Michael Weissenbacher
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2017-04-19 20:55 UTC (permalink / raw)
  To: Michael Weissenbacher; +Cc: linux-xfs

On Wed, Apr 19, 2017 at 10:10:55PM +0200, Michael Weissenbacher wrote:
> Hi Brian!
> 
> On 2017-04-19 20:08, Brian Foster wrote:
> > On Wed, Apr 19, 2017 at 06:36:00PM +0200, Michael Weissenbacher wrote:
> >> On 19.04.2017 16:04, Carlos Maiolino wrote:
> >>> If I am not blind, the task currently owning the cpu isn't listed there, which I
> >>> suppose to be already running.
> >> Correct, the only process utilizing the CPU is xfsaild. It is also the
> >> only process doing I/O, and it looks like it's doing only reads. It
> >> (xfsaild) never shows up in dmesg as blocked.
> >>
> > 
> > A snippet of tracepoint output from the fs might also be interesting
> > when you hit this state. E.g., 'trace-cmd start -e xfs:*' and save the
> > output of /sys/kernel/debug/tracing/trace_pipe to a file for a few
> > seconds or so.
> > 
> See:
> https://we.tl/7AAyBZxt4R
> 
> > BTW, is the original /proc/meminfo output you posted from the system in
> > a normal state or when the problem occurs? If the former, could you
> > include it while the problem occurs as well?
> It was from a normal state.
> 
> All the following was captured during the problem occurs.
> 
> *** snip ***
> /proc/meminfo
> 
> Wed Apr 19 20:36:50 CEST 2017
> MemTotal:       65886208 kB
> MemFree:         2923552 kB
> MemAvailable:   43018524 kB
> Buffers:              52 kB
> Cached:         32416864 kB
> SwapCached:          580 kB
> Active:         30685760 kB
> Inactive:        2051532 kB
> Active(anon):     165844 kB
> Inactive(anon):   155516 kB
> Active(file):   30519916 kB
> Inactive(file):  1896016 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:      64511996 kB
> SwapFree:       64506540 kB
> Dirty:            202368 kB
> Writeback:             0 kB
> AnonPages:        319948 kB
> Mapped:            18672 kB
> Shmem:               868 kB
> Slab:           22817492 kB
> SReclaimable:   16332540 kB
> SUnreclaim:      6484952 kB
> KernelStack:        4048 kB
> PageTables:        16516 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    97455100 kB
> Committed_AS:     516532 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:           0 kB
> VmallocChunk:          0 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:         0 kB
> ShmemHugePages:        0 kB
> ShmemPmdMapped:        0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:       88820 kB
> DirectMap2M:     7155712 kB
> DirectMap1G:    61865984 kB
> *** snip ***
> 
> *** snip ***
> Wed Apr 19 20:36:51 CEST 2017
> [... slabinfo output trimmed - identical to the listing in the parent message ...]
> *** snip ***
> 
> *** snip ***
> /proc/<pidof_xfsaild>/stack
> Wed Apr 19 20:36:52 CEST 2017
> [<ffffffff8ef3881e>] ring_buffer_unlock_commit+0x1e/0xa0
> [<ffffffff8ef419a7>] trace_buffer_unlock_commit_regs+0x37/0x90
> [<ffffffff8eea1e20>] wake_up_q+0x60/0x60
> [<ffffffffc0719b5b>] xfs_buf_read_map+0x10b/0x180 [xfs]
> [<ffffffffc0719888>] xfs_buf_submit_wait+0x78/0x210 [xfs]
> [<ffffffffc074d2bd>] xfs_trans_read_buf_map+0xed/0x300 [xfs]
> [<ffffffffc0719b5b>] xfs_buf_read_map+0x10b/0x180 [xfs]
> [<ffffffffc074d2bd>] xfs_trans_read_buf_map+0xed/0x300 [xfs]
> [<ffffffffc0702b77>] xfs_imap_to_bp+0x67/0xe0 [xfs]
> [<ffffffffc072e84b>] xfs_iflush+0xdb/0x240 [xfs]
> [<ffffffffc07437f9>] xfs_inode_item_push+0x69/0x140 [xfs]
> [<ffffffffc0743886>] xfs_inode_item_push+0xf6/0x140 [xfs]
> [<ffffffffc074bf25>] xfsaild+0x385/0x7e0 [xfs]
> [<ffffffffc074bba0>] xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
> [<ffffffff8ee96d64>] kthread+0x104/0x140
> [<ffffffff8ee96d64>] kthread+0x104/0x140
> [<ffffffff8ee96c60>] kthread_park+0x80/0x80
> [<ffffffff8ee7ba79>] do_group_exit+0x39/0xb0
> [<ffffffff8f4370b6>] ret_from_fork+0x26/0x40
> [<ffffffffffffffff>] 0xffffffffffffffff

So... (speculating a bit here) you're running 20 different copies of
rsnapshot and rsync, which are generating a lot of dirty data and dirty
inodes.  Each of the threads reported by the hung task timeout is
trying to reserve space in the log for metadata updates, and clearly
those log reservation requests have poked the AIL to flush itself to
clear space for new transactions.

The AIL thread is flushing inode updates to disk, which is a lot of RMW
of dirty inodes, hence the high amounts of io reads you see.  The log is
already 2GB in size on disk, so it cannot be made larger.  TBH, a
/smaller/ log might help here since at least when we have to push the
AIL to free up log space there'll be less of it that has to be freed up.

OTOH I wouldn't expect the system to stall for 5-15 minutes ... but I
guess a quarter of a 2GB log is 500MB, and if that's mostly dirty inode
records, that's a /lot/ of inode updates that the AIL has to do (single
threaded!) to clean out the space.  You /do/ seem to have 13 million
inodes in slab, so it's not implausible...

...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
~2.5m inodes / 10 minutes = ~4200 inodes per second...
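
(Back-of-the-envelope, using decimal 500MB and the ~200 bytes per dirty
inode record from above:)

  # quarter of the 2GB log worth of dirty inode records
  echo "500 * 1000 * 1000 / 200" | bc        # ~2.5 million dirty inodes
  # flush rate implied by a 10 minute (600 s) stall
  echo "500 * 1000 * 1000 / 200 / 600" | bc  # ~4166 inodes per second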

That's probably within the ballpark of what 12 drives can handle.
Considering that the inodes get allocated out in chunks of 64, we can
take advantage of /some/ locality...

IOWs, the xfs is flushing big chunks of log all at once.  You might look
at the io stats of each drive, to see if they're all busy, and how many
io operations they're managing per second.  (But since you know the
exact model of drives, you'll have to take my ballpark figures and work
out whether or not they map to what you'd expect iops-wise from each
drive.)
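
For example, something along these lines (assumes sysstat's iostat is
installed; behind the H730P you may only see the exported RAID volume,
in which case per-spindle numbers have to come from the controller's
own tools):

  # extended per-device stats every 5 seconds while the stall is happening
  iostat -x -d 5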

--D

> [... quoted iotop output and list footer trimmed - identical to the parent message ...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 20:55               ` Darrick J. Wong
@ 2017-04-19 21:47                 ` Michael Weissenbacher
  2017-04-19 23:48                   ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-19 21:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Hi Darrick!
Thanks for your input.

> 
> So... (speculating a bit here) you're running 20 different copies of
> rsnapshot and rsync, which are generating a lot of dirty data and dirty
> inodes.  Each of the threads reported by the hung task timeout are
> trying to reserve space in the log for metadata updates, and clearly
> those log reservation requests have poked the AIL to flush itself to
> clear space for new transactions.
> 
Yes, that is a pretty precise description of what the system is doing.
20 parallel rsyncs are pretty much the worst-case scenario, though (the
cron jobs are designed to start at different times); usually there are
roughly 5 rsyncs going on at the same time. The majority of the files
are not copied but hard linked during this process.

> The AIL thread is flushing inode updates to disk, which is a lot of RMW
> of dirty inodes, hence the high amounts of io reads you see.  The log is
> already 2GB in size on disk, so it cannot be made larger.  TBH, a
> /smaller/ log might help here since at least when we have to push the
> AIL to free up log space there'll be less of it that has to be freed up.
> 
That's interesting. I could try to create an external log of 500MB size
and see if that changes anything?
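
Something like this on a scratch setup, I suppose (device names and
mount point are placeholders; the log can only be resized or moved by
re-creating the filesystem):

  # hypothetical scratch test only - log size/placement is fixed at mkfs time
  mkfs.xfs -f -l logdev=/dev/sdX1,size=500m /dev/md127
  mount -o logdev=/dev/sdX1 /dev/md127 /srv/backup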

> OTOH I wouldn't expect the system to stall for 5-15 minutes ... but I
> guess a quarter of a 2GB log is 500MB, and if that's mostly dirty inode
> records, that's a /lot/ of inode updates that the AIL has to do (single
> threaded!) to clean out the space.  You /do/ seem to have 13 million
> inodes in slab, so it's not implausible...
> 
OK, do I understand you correctly that the xfsaild does all the actual
work of updating the inodes? And it's doing that single-threaded,
reading both the log and the inodes themselves?

> ...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
> ~2.5m inodes / 10 minutes = ~4200 inodes per second...
Yes, these numbers sound plausible. When everything is running smoothly
(i.e. most of the time) I do see up to 4500 combined reads+writes per
second. But when the problem arises, all I see is xfsaild doing around
120 reads/second, which is roughly the performance of a single 7200rpm
drive. Directly after this "read only phase" (which lasts for about 5-10
minutes most of the time) there is a short burst of up to 10000+
writes/sec (writing to the RAID controller cache I suppose). And then
the system goes back to the "normal phase" where it is doing both reads
and writes and all those blocked D state processes continue working.

> 
> That's probably within the ballpark of what 12 drives can handle.
> Considering that the inodes get allocated out in chunks of 64, we can
> take advantage of /some/ locality...
> 
> IOWs, the xfs is flushing big chunks of log all at once.  You might look
> at the io stats of each drive, to see if they're all busy, and how many
> io operations they're managing per second.  (But since you know the
> exact model of drives, you'll have to take my ballpark figures and work
> out whether or not they map to what you'd expect iops-wise from each
> drive.)
> 
Hm, I don't think I can query the I/O stats of the individual drives
behind the hardware RAID controller. If your theory is correct I should
only see a single drive busy during the "problem phase". Will look into
that, maybe simply by eyeballing the drive LEDs ;-)

thanks,
Michael

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 21:47                 ` Michael Weissenbacher
@ 2017-04-19 23:48                   ` Dave Chinner
  2017-04-20  7:11                     ` Michael Weissenbacher
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2017-04-19 23:48 UTC (permalink / raw)
  To: Michael Weissenbacher; +Cc: Darrick J. Wong, linux-xfs

On Wed, Apr 19, 2017 at 11:47:43PM +0200, Michael Weissenbacher wrote:
> OK, do i understand you correctly that the xfsaild does all the actual
> work of updating the inodes? And it's doing that single-threaded,
> reading both the log and the indoes themselves?

The problem is that the backing buffers that are used for flushing
inodes have been reclaimed due to memory pressure, but the inodes in
cache are still dirty. Hence to write the dirty inodes, we first
have to read the inode buffer back into memory.

Essentially, the log pins more dirty inodes than writeback
can keep up with given the memory pressure that is occurring.

> 
> > ...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
> > ~2.5m inodes / 10 minutes = ~4200 inodes per second...
> Yes, these numbers sound plausible. When everything is running smoothly
> (i.e. most of the time) i do see up to 4500 combined reads+writes per
> second. But when the problem arises, all i see i xfsaild doing around
> 120 reads/second, which is roughly the performance of a single 7200rpm
> drive. Directly after this "read only phase" (which lasts for about 5-10
> minutes most of the time) there is a short burst of up to 10000+
> writes/sec (writing to the RAID controller cache I suppose).

Yes. Inode buffer reads from the xfsaild are sequential, so they block
the xfsaild and log tail pushing while they are ongoing. It then queues
the writes on a delayed write list and they are issued all in one
go. Essentially, the system can dirty far more metadata in memory
than it can effectively cache and flush to disk in a timely manner.

> And then
> the system goes back to the "normal phase" where it is doing both reads
> and writes and all those blocked D state processes continue working.

Yup, because the xfsaild has moved the tail of the log, so there is
now log space available and ongoing transactions can run at CPU
speed again. The xfsaild will still be doing RMW inode writeback,
but nothing is blocking waiting for it to reach its target now...

I'd suggest that the only thing you can realistically do to mitigate
this stall without changing the log size is change
/proc/sys/fs/xfs/xfssyncd_centisecs down to one second so that the
CIL is flushed to disk every second, and the AIL push target is
updated immediately after the CIL flush is done. This will result in
the xfsaild trying to push metadata almost as quickly as it is
committed to the journal, hence it should be trying to flush inodes
while the inode buffer is still cached in memory and hence avoid the
read IO. This will help keep AIL pushing from getting so far behind
that inode buffers are reclaimed before the dirty inodes are
written. It may not completely avoid the stalls, but it should make
them less frequent...
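
i.e. something like:

  # one second == 100 centiseconds; apply now, persist via sysctl.d if it helps
  echo 100 > /proc/sys/fs/xfs/xfssyncd_centisecs
  # equivalently:
  sysctl -w fs.xfs.xfssyncd_centisecs=100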

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-19 23:48                   ` Dave Chinner
@ 2017-04-20  7:11                     ` Michael Weissenbacher
  2017-04-20 23:16                       ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-20  7:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On 20.04.2017 01:48, Dave Chinner wrote:
> On Wed, Apr 19, 2017 at 11:47:43PM +0200, Michael Weissenbacher wrote:
>> OK, do i understand you correctly that the xfsaild does all the actual
>> work of updating the inodes? And it's doing that single-threaded,
>> reading both the log and the indoes themselves?
> 
> The problem is that the backing buffers that are used for flushing
> inodes have been reclaimed due to memory pressure, but the inodes in
> cache are still dirty. Hence to write the dirty inodes, we first
> have to read the inode buffer back into memory.
> 
Interesting find. Is there a way to prevent those buffers from getting
reclaimed? Would adjusting vfs_cache_pressure help? Or adding more
memory to the system? In fact the best thing would be to disable file
content caching completely. Because of the use-case (backup server) it's
worthless to cache file content.
My primary objective is to avoid those stalls and reduce latency, at the
expense of throughput.

> Essentially, the log pins more dirty inodes than than writeback
> can keep up with given the memory pressure that is occurring.
> 
>>
>>> ...500MB / 200 bytes per dirty inode record = ~2.5 million inodes.
>>> ~2.5m inodes / 10 minutes = ~4200 inodes per second...
>> Yes, these numbers sound plausible. When everything is running smoothly
>> (i.e. most of the time) i do see up to 4500 combined reads+writes per
>> second. But when the problem arises, all i see i xfsaild doing around
>> 120 reads/second, which is roughly the performance of a single 7200rpm
>> drive. Directly after this "read only phase" (which lasts for about 5-10
>> minutes most of the time) there is a short burst of up to 10000+
>> writes/sec (writing to the RAID controller cache I suppose).
> 
> Yes. Inode buffer reads from the xfsaild are sequential so block the
> aild and log tail pushing while they are on-going. It then queues
> the writes on a delayed write list and they are issued all in one
> go. Essentially, the system can dirty far more metadata in memory
> than it can effectively cache and flush to disk in a timely manner.
> 
>> And then
>> the system goes back to the "normal phase" where it is doing both reads
>> and writes and all those blocked D state processes continue working.
> 
> Yup, because the xfsaild has move the tail of the log, so there is
> now log space available and ongoing transactions can run at CPU
> speed again. The xfsaild will still be doing RMW inode writeback,
> but nothing is blocking waiting for it to reach it's target now...
> 
> I'd suggest that the only thing you can realistically do to mitigate
> this stall without changing the log size is change
> /proc/sys/fs/xfs/xfssyncd_centisecs down to one second so that the
> CIL is flushed to disk every second, and the AIL push target is
> updated immediately after the CIL flush is done. This will result in
> the xfsaild trying to push metadata almost as quickly as it is
> committed to the journal, hence it should be trying to flush inodes
> while th einode buffer is still cached in memory and hence avoid the
> read IO. This will help avoid AIL pushing from getting so far behind
> that inode buffers are reclaimed before the dirty inodes are
> written. It may not completely avoid the stalls, but it should make
> them less frequent...
> 
Ok, i just did that (set xfssyncd_centisecs=100). Will see if it helps.

thanks,
Michael

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-20  7:11                     ` Michael Weissenbacher
@ 2017-04-20 23:16                       ` Dave Chinner
  2017-04-21  7:43                         ` Michael Weissenbacher
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2017-04-20 23:16 UTC (permalink / raw)
  To: Michael Weissenbacher; +Cc: Darrick J. Wong, linux-xfs

On Thu, Apr 20, 2017 at 09:11:22AM +0200, Michael Weissenbacher wrote:
> On 20.04.2017 01:48, Dave Chinner wrote:
> > On Wed, Apr 19, 2017 at 11:47:43PM +0200, Michael Weissenbacher wrote:
> >> OK, do i understand you correctly that the xfsaild does all the actual
> >> work of updating the inodes? And it's doing that single-threaded,
> >> reading both the log and the indoes themselves?
> > 
> > The problem is that the backing buffers that are used for flushing
> > inodes have been reclaimed due to memory pressure, but the inodes in
> > cache are still dirty. Hence to write the dirty inodes, we first
> > have to read the inode buffer back into memory.
> > 
> Interesting find. Is there a way to prevent those buffers from getting
> reclaimed?

Not really. It's simply a side effect of memory reclaim not being
able to reclaim inodes or the page cache because they are dirty, and
so it goes and puts lots more pressure on clean caches. The working
set in those other caches gets trashed, and this it's a downward
spiral because it means dirty inodes and pages take longer are
require blocking IO to refill on demand...

> Would adjusting vfs_cache_pressure help?

Unlikely.

> Or adding more
> memory to the system?

Unlikely - that'll just lead to bigger stalls.

> In fact the best thing would be to disable file
> content caching completely. Because of the use-case (backup server) it's
> worthless to cache file content.
> My primary objective is to avoid those stalls and reduce latency, at the
> expense of throughput.

Set up dirty page cache writeback thresholds to be low (a couple of
hundred MB instead of 10/20% of memory) so that data writeback
starts early and throttles dirty pages to a small amount of memory.
This will help keep the page cache clean and immediately
reclaimable, hence it shouldn't put as much pressure on other caches
when memory reclaim is required.
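
For example (illustrative values only; setting the *_bytes knobs
overrides the corresponding *_ratio ones):

  # throttle dirty page cache to a few hundred MB instead of 10/20% of RAM
  sysctl -w vm.dirty_background_bytes=$((256*1024*1024))
  sysctl -w vm.dirty_bytes=$((512*1024*1024))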

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-20 23:16                       ` Dave Chinner
@ 2017-04-21  7:43                         ` Michael Weissenbacher
  2017-04-21  9:18                           ` Shan Hai
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-21  7:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

Hi Dave!
On 21.04.2017 01:16, Dave Chinner wrote:
> On Thu, Apr 20, 2017 at 09:11:22AM +0200, Michael Weissenbacher wrote:
>> On 20.04.2017 01:48, Dave Chinner wrote:
>>>
>>> The problem is that the backing buffers that are used for flushing
>>> inodes have been reclaimed due to memory pressure, but the inodes in
>>> cache are still dirty. Hence to write the dirty inodes, we first
>>> have to read the inode buffer back into memory.
>>>
>> Interesting find. Is there a way to prevent those buffers from getting
>> reclaimed?
> 
> Not really. It's simply a side effect of memory reclaim not being
> able to reclaim inodes or the page cache because they are dirty, and
> so it goes and puts lots more pressure on clean caches. The working
> set in those other caches gets trashed, and this it's a downward
> spiral because it means dirty inodes and pages take longer are
> require blocking IO to refill on demand...
> 
Yesterday I found this patch for rsync:
http://insights.oetiker.ch/linux/fadvise/
It adds the option "--drop-cache" to rsync, which uses
POSIX_FADV_DONTNEED to prevent caching. I've been running with this
option since yesterday. Unfortunately, even this hasn't made a
difference for my problem.
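
(If the separately packaged 'nocache' LD_PRELOAD wrapper is available,
a similar effect can be had without patching rsync; the flags and paths
below are only illustrative:)

  # same idea without the patch: fadvise away everything rsync touches
  nocache rsync -aH --delete /source/ /backup/daily.0/
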

>> In fact the best thing would be to disable file
>> content caching completely. Because of the use-case (backup server) it's
>> worthless to cache file content.
>> My primary objective is to avoid those stalls and reduce latency, at the
>> expense of throughput.
> 
> Set up dirty page cache writeback thresholds to be low (a couple of
> hundred MB instead of 10/20% of memory) so that data writeback
> starts early and throttles dirty pages to a small amount of memory.
> This will help keep the page cache clean and immediately
> reclaimable, hence it shouldn't put as much pressure on other caches
> when memory reclaim is required.
> 
In fact I already turned those down to 1024MiB/512MiB - not much change
here. I also set xfssyncd_centisecs to 100 as you advised. Not much
change after that either.

I also noticed that unmounting the file system takes a really long time
after the problem has occurred (up to around 5 minutes!), even when
there was nothing at all going on before the unmount. Would it help to
capture the unmount with trace-cmd?

Here is another theory. Could it be that not the rsync's, but the rm's
issued by rsnapshot are causing the problem? Would it help to serialize
all "rm -Rf" calls? Those always delete the oldest backups, which can't
possibly be in the cache and because of that all those inodes need to be
read into memory during deletion. Maybe those rm's are filling up the
XFS log?

If that doesn't work out either, I guess my only chance would be to
partition the device with LVM and create a separate XFS for every
rsnapshot instance. In that scenario every file system should get its
own xfsaild, allowing them to run in parallel without blocking each other?
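
Roughly like this, I guess (names and sizes are placeholders):

  # one LV (and therefore one filesystem + one xfsaild) per rsnapshot instance
  pvcreate /dev/md127
  vgcreate backupvg /dev/md127
  lvcreate -L 4T -n client1 backupvg
  mkfs.xfs /dev/backupvg/client1
  mount /dev/backupvg/client1 /backup/client1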

cheers,
Michael

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-21  7:43                         ` Michael Weissenbacher
@ 2017-04-21  9:18                           ` Shan Hai
  2017-04-22  8:38                             ` Michael Weissenbacher
  0 siblings, 1 reply; 19+ messages in thread
From: Shan Hai @ 2017-04-21  9:18 UTC (permalink / raw)
  To: Michael Weissenbacher, Dave Chinner; +Cc: Darrick J. Wong, linux-xfs



On 2017-04-21 15:43, Michael Weissenbacher wrote:
> Hi Dave!
> On 21.04.2017 01:16, Dave Chinner wrote:
>> On Thu, Apr 20, 2017 at 09:11:22AM +0200, Michael Weissenbacher wrote:
>>> On 20.04.2017 01:48, Dave Chinner wrote:
>>>> The problem is that the backing buffers that are used for flushing
>>>> inodes have been reclaimed due to memory pressure, but the inodes in
>>>> cache are still dirty. Hence to write the dirty inodes, we first
>>>> have to read the inode buffer back into memory.
>>>>
>>> Interesting find. Is there a way to prevent those buffers from getting
>>> reclaimed?
>> Not really. It's simply a side effect of memory reclaim not being
>> able to reclaim inodes or the page cache because they are dirty, and
>> so it goes and puts lots more pressure on clean caches. The working
>> set in those other caches gets trashed, and this it's a downward
>> spiral because it means dirty inodes and pages take longer are
>> require blocking IO to refill on demand...
>>
> Yesterday i found this patch for rsync:
> http://insights.oetiker.ch/linux/fadvise/
> It adds the option "--drop-cache" to rsync which sets
> POSIX_FADV_DONTNEED to prevent caching. Been running with this option
> since yesterday. Unfortunately, even this hasn't changed my problem.
>
>>> In fact the best thing would be to disable file
>>> content caching completely. Because of the use-case (backup server) it's
>>> worthless to cache file content.
>>> My primary objective is to avoid those stalls and reduce latency, at the
>>> expense of throughput.
>> Set up dirty page cache writeback thresholds to be low (a couple of
>> hundred MB instead of 10/20% of memory) so that data writeback
>> starts early and throttles dirty pages to a small amount of memory.
>> This will help keep the page cache clean and immediately
>> reclaimable, hence it shouldn't put as much pressure on other caches
>> when memory reclaim is required.
>>
> In fact i already turned those down to 1024MiB/512MiB - not much change
> here. I also set xfssyncd_centisecs to 100 like advised by you. Also not
> much change after that.
>
> I also noticed that unmounting the file system takes a really long time
> after the problem occured (up to around 5 minutes!). Even when there was
> nothing at all going on before the unmount. Would it help to capture the
> unmount with trace-cmd?

A huge number of negative dcache entries can cause a slow umount.
Please check it with 'cat /proc/sys/fs/dentry-state': the first column
is the used (active) entries while the second is the unused
(negative) entries.

It can be dropped by 'echo 2 > /proc/sys/vm/drop_caches'.

Thanks
Shan Hai

> Here is another theory. Could it be that not the rsync's, but the rm's
> issued by rsnapshot are causing the problem? Would it help to serialize
> all "rm -Rf" calls? Those always delete the oldest backups, which can't
> possibly be in the cache and because of that all those inodes need to be
> read into memory during deletion. Maybe those rm's are filling up the
> XFS log?
>
> If that doesn't work out either i guess my only chance would be to
> partition the device with LVM and create a separate XFS for every
> rsnapshot instance. In that scenario every file system should get it's
> own xfsaild, allowing them to run in parallel and not blocking each other?
>
> cheers,
> Michael


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: xfsaild in D state seems to be blocking all other i/o sporadically
  2017-04-21  9:18                           ` Shan Hai
@ 2017-04-22  8:38                             ` Michael Weissenbacher
  0 siblings, 0 replies; 19+ messages in thread
From: Michael Weissenbacher @ 2017-04-22  8:38 UTC (permalink / raw)
  To: Shan Hai, Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On 21.04.2017 11:18, Shan Hai wrote:
> 
>>
>> I also noticed that unmounting the file system takes a really long time
>> after the problem occured (up to around 5 minutes!). Even when there was
>> nothing at all going on before the unmount. Would it help to capture the
>> unmount with trace-cmd?
> 
> A huge negative dcache entries would cause a slow umount,
> please check it by 'cat /proc/sys/fs/dentry-state', the first column
> is for used(active) entries while the second is for the unused
> (negative) entries.
> 
Does this qualify as "huge"?
# cat /proc/sys/fs/dentry-state
10689551        10539266        45      0       0       0

> It can be dropped by 'echo 2 > /proc/sys/vm/drop_caches'.
> 
Won't that drop all dentries and inodes from the cache, which is what I
should be trying to avoid?

>> Here is another theory. Could it be that not the rsync's, but the rm's
>> issued by rsnapshot are causing the problem? Would it help to serialize
>> all "rm -Rf" calls? Those always delete the oldest backups, which can't
>> possibly be in the cache and because of that all those inodes need to be
>> read into memory during deletion. Maybe those rm's are filling up the
>> XFS log?
>>
Answering myself here: after removing the "rm" calls from rsnapshot the
problem completely disappeared. I've now written a separate script which
processes all "rm -Rf *" calls sequentially. I would deduce now that
parallel deletion of big directory trees were pushing the filesystem
(and xfsaild) so hard to cause those 10-15 minute stalls.
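
(Roughly along these lines - a sketch, not the actual script: the
rsnapshot-side jobs append doomed trees to a queue file instead of
deleting them, and a single cron job works through the queue one tree
at a time.)

  #!/bin/sh
  # sketch: delete queued directory trees one at a time, never in parallel
  QUEUE=/var/spool/rsnapshot-rm.queue
  exec 9>"$QUEUE.lock"
  flock -n 9 || exit 0              # a previous run is still deleting
  [ -s "$QUEUE" ] || exit 0
  mv "$QUEUE" "$QUEUE.work"         # new entries land in a fresh queue file
  while IFS= read -r dir; do
      [ -d "$dir" ] && rm -Rf -- "$dir"
  done < "$QUEUE.work"
  rm -f "$QUEUE.work"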

Thanks to everyone who contributed to help me solve the problem.

Have a nice weekend,
Michael

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-04-22  8:38 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-19 10:58 xfsaild in D state seems to be blocking all other i/o sporadically Michael Weissenbacher
2017-04-19 12:12 ` Carlos Maiolino
2017-04-19 12:37   ` Brian Foster
2017-04-19 12:40   ` Michael Weissenbacher
2017-04-19 13:01     ` Michael Weissenbacher
2017-04-19 14:04       ` Carlos Maiolino
2017-04-19 14:20         ` Carlos Maiolino
2017-04-19 16:40           ` Michael Weissenbacher
2017-04-19 16:36         ` Michael Weissenbacher
2017-04-19 18:08           ` Brian Foster
2017-04-19 20:10             ` Michael Weissenbacher
2017-04-19 20:55               ` Darrick J. Wong
2017-04-19 21:47                 ` Michael Weissenbacher
2017-04-19 23:48                   ` Dave Chinner
2017-04-20  7:11                     ` Michael Weissenbacher
2017-04-20 23:16                       ` Dave Chinner
2017-04-21  7:43                         ` Michael Weissenbacher
2017-04-21  9:18                           ` Shan Hai
2017-04-22  8:38                             ` Michael Weissenbacher

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.