* heavy xfsaild I/O blocking process exit
From: Momtchil Momtchev @ 2021-09-08  8:15 UTC (permalink / raw)
  To: linux-xfs


Hello,


I have a puzzling problem with XFS on Debian 10. I am running 
number-crunching driven by Node.js - a process that creates about 
2 million 1MB to 5MB files per day, each with a lifespan of about 24h 
(weather forecasting). The file system is obviously heavily fragmented. 
I have absolutely no problems while running in cruise mode, but every 
time I decide to stop that process, especially when it has been running 
for a few weeks or months, it becomes a zombie (freeing all its user 
memory and file descriptors) and xfsaild/kworker then continues 
flushing the log for about 30-45 minutes before the process really 
quits. During that time it keeps its network ports bound (which is my 
main problem), but the system remains responsive and usable. The I/O 
pattern is several seconds of random reading followed by a second or 
two of sequential writing.

The kernel functions that are running in the zombie process context are 
mainly xfs_btree_lookup, xfs_log_commit_cil, xfs_next_bit, 
xfs_buf_find_isra.26

xfsaild is spending time in radix_tree_next_chunk, xfs_inode_buf_verify

kworker is in xfs_reclaim_inode, radix_tree_next_chunk
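
(Kernel stacks like these can be sampled with something along the 
lines of the following - as root, and assuming this kernel exposes 
/proc/<pid>/stack; <zombie pid> and <xfsaild pid> are placeholders:)

    cat /proc/<zombie pid>/stack      # the exiting process
    cat /proc/<xfsaild pid>/stack     # the xfsaild thread of this mount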



This is on (standard, up-to-date Debian 10):

Linux version 4.19.0-16-amd64 (debian-kernel@lists.debian.org) (gcc 
version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.181-1 (2021-03-19)

xfsprogs 4.20.0-1



File system is RAID-0, 2x2TB disks with LVM over md (512k chunks)

meta-data=/dev/mapper/vg0-home   isize=512    agcount=32, agsize=29849728 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=0
data     =                       bsize=4096   blocks=955191296, imaxpct=5
          =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=466402, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


MemTotal:       32800968 kB
MemFree:          759308 kB
MemAvailable:   27941208 kB
Buffers:           43900 kB
Cached:         26504332 kB
SwapCached:         7560 kB
Active:         16101380 kB
Inactive:       11488252 kB
Active(anon):     813424 kB
Inactive(anon):   228180 kB
Active(file):   15287956 kB
Inactive(file): 11260072 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      16777212 kB
SwapFree:       16715524 kB
Dirty:              2228 kB
Writeback:             0 kB
AnonPages:       1034280 kB
Mapped:            89660 kB
Shmem:               188 kB
Slab:            1508868 kB
SReclaimable:    1097804 kB
SUnreclaim:       411064 kB
KernelStack:        3792 kB
PageTables:         5872 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    33177696 kB
Committed_AS:    1394296 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:             7776 kB
HardwareCorrupted:     0 kB
AnonHugePages:    215040 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:    11682188 kB
DirectMap2M:    21731328 kB
DirectMap1G:     1048576 kB


-- 
Momtchil Momtchev <momtchil@momtchev.com>



* Re: heavy xfsaild I/O blocking process exit
From: Dave Chinner @ 2021-09-08 21:27 UTC (permalink / raw)
  To: Momtchil Momtchev; +Cc: linux-xfs

On Wed, Sep 08, 2021 at 10:15:59AM +0200, Momtchil Momtchev wrote:
> 
> Hello,
> 
> 
> I have a puzzling problem with XFS on Debian 10. I am running
> number-crunching driven by Node.js - I have a process that creates about 2
> million 1MB to 5MB files per day with an about 24h lifespan (weather
> forecasting). The file system is obviously heavily fragmented. I have
> absolutely no problems when running this in cruise mode, but every time I
> decide to stop that process, especially when it has been running for a few

What does "stop that process" mean? You kill it, or do you run a
stop command that tells the process to do a controlled shutdown?

> weeks or months, the process will become a zombie (freeing all its user
> memory and file descriptors) and then xfsaild/kworker will continue flushing
> the log for about 30-45 minutes before the process really quits.

The xfsaild is not flushing the log. It's doing metadata writeback.
If it is constantly busy, it means the log has run out of space and
something else wants log space. That something else will block until
the log space has been freed up by metadata writeback....
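
A rough way to watch that pressure, assuming /proc/fs/xfs/stat is
present on this kernel: the 'log' and 'push_ail' counter lines in the
global stats move while the log is filling and the AIL is being
pushed. e.g. something like:

    watch -n 1 "grep -E '^(log|push_ail)' /proc/fs/xfs/stat"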

> It will
> keep its binds to network ports (which is my main problem) but the system
> will remain responsive and usable. The I/O pattern is several seconds of
> random reading then a second or two of sequential writing.

That would be expected from final close on lots of dirty inodes or
finalising unlinks on final close. But that won't stop anything else
from functioning.

> The kernel functions that are running in the zombie process context are
> mainly xfs_btree_lookup, xfs_log_commit_cil, xfs_next_bit,
> xfs_buf_find_isra.26

Full profiles (e.g. from perf top -U -p <pid>) would be useful here,
but this sounds very much like extent removal on final close. That
will either be removing speculative preallocation beyond EOF, or the
workload has open-but-unlinked files and the unlink is being done at
process exit.
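
For example, something like this (a sketch; if the exiting process can
no longer be attached to, a short system-wide sample is a fallback):

    perf top -U -p <pid>              # kernel-only, live view
    perf record -a -g -- sleep 30     # or: sample the whole system for 30s
    perf report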

Either way, if the files are fragmented into millions of extents,
this could take minutes per file being closed. But with only 1-5MB
files, that shouldn't be occurring...
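
A quick sanity check on one of the data files would tell you whether
they are carrying preallocation beyond EOF or lots of extents (sketch;
the path is a placeholder):

    xfs_bmap -v /path/to/datafile     # extent list, incl. space past EOF
    stat -c '%s bytes, %b blocks allocated' /path/to/datafile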

> xfsaild is spending time in radix_tree_next_chunk, xfs_inode_buf_verify

xfsaild should never be doing radix tree lookups - it only works on
internal in-memory filesystem objects that it has direct references
to. IOWs, I really need to see the actual profile outputs to
determine what it is doing...

xfs_inode_buf_verify() is expected if it is writing back dirty inode
clusters. Which it will be, but at only 2 million files a day I
wouldn't expect that to show up in profiles at all. It doesn't
really show up in profiles even at half a million inodes per
_second_.

> kworker is in xfs_reclaim_inode, radix_tree_next_chunk

Which kworker is that? Likely background inode reclaim, but that
doesn't limit anything - it just indicates there are inodes
available to be reclaimed.

> This is on (standard up-to date Debian 10):
> 
> Linux version 4.19.0-16-amd64 (debian-kernel@lists.debian.org) (gcc version
> 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.181-1 (2021-03-19)
> 
> xfs_progs 4.20.0-1
> 
> 
> 
> File system is RAID-0, 2x2TB disks with LVM over md (512k chunks)

Are these SSDs or HDDs? I'll assume HDD at this point.

> meta-data=/dev/mapper/vg0-home   isize=512    agcount=32, agsize=29849728
> blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=955191296, imaxpct=5
>          =                       sunit=128    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=466402, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Ok, so you have a 1.7GB log. If those are HDDs, then you could have
hundreds of thousands of dirty inodes tracked in the log, and
metadata writeback has been falling behind for days because the log
can be filled much faster than it can be drained.

Assuming 200 write IOPS, 30 minutes would be 360,000 writes, which
pretty much matches up with having half a million dirty inodes in
the log and the process exiting needing to run a bunch of
transactions that need a chunk of log space to make progress and
having to wait on inode writeback to free up log space...
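
Spelling that estimate out (log geometry from the xfs_info above,
~200 IOPS assumed):

    echo $(( 466402 * 4096 / 1024 / 1024 ))   # log size in MiB, ~1.8GiB
    echo $(( 30 * 60 * 200 ))                 # writes possible in 30 min = 360000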

> MemTotal:       32800968 kB
> MemFree:          759308 kB
> MemAvailable:   27941208 kB
> Buffers:           43900 kB
> Cached:         26504332 kB
> SwapCached:         7560 kB
> Active:         16101380 kB
> Inactive:       11488252 kB
> Active(anon):     813424 kB
> Inactive(anon):   228180 kB
> Active(file):   15287956 kB
> Inactive(file): 11260072 kB

So all your memory is in the page cache.

> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:      16777212 kB
> SwapFree:       16715524 kB
> Dirty:              2228 kB

And almost all the page cache is clean.

> Writeback:             0 kB
> AnonPages:       1034280 kB
> Mapped:            89660 kB
> Shmem:               188 kB
> Slab:            1508868 kB
> SReclaimable:    1097804 kB
> SUnreclaim:       411064 kB

And that's enough slab cache to hold half a million cached, dirty
inodes...

More information required.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: heavy xfsaild I/O blocking process exit
From: Momtchil Momtchev @ 2021-09-09  8:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


Thank you for your reply and your time. Your assumptions are correct. 
The process is killed by systemd.

I can't use perf -p on it since it has freed all of its memory.

Metadata writeback is a very good explanation that is consistent with 
everything I have seen - the process writes lots of files and then 
deletes them at some later point.

Why does this writeback happen in the process context? Why isn't it 
done by a kworker?

What really surprises me is that this happens even if the process has 
been idle for half an hour or so (it produces its files in bursts and 
then idles a little) - doesn't that rule out speculative preallocation, 
since it is freed on file close?

Does xfssyncd_centisecs influence metadata writeback? I am currently 
trying this.
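
(Concretely, something like this - assuming the standard XFS sysctls
are exposed on this kernel; 3000 centisecs, i.e. 30 seconds, should be
the default:)

    sysctl fs.xfs.xfssyncd_centisecs             # current interval
    sysctl -w fs.xfs.xfssyncd_centisecs=1000     # try flushing every 10s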

Maybe I will reduce the journal size as a last resort.

Anyway, this is more of an annoyance than a real problem.



-- 
Momtchil Momtchev <momtchil@momtchev.com>



* Re: heavy xfsaild I/O blocking process exit
From: Momtchil Momtchev @ 2021-09-09  9:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


I just found the problem - it is indeed the speculative preallocation. 
The process is leaking file descriptors (Node.js raises the limit 
itself, so leaks tend to remain hidden) - and when it is killed it has 
to go through tens or even hundreds of thousands of descriptors.
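
(For anyone hitting the same thing, a quick way to see such a leak
while the service is running - <pid> being the Node.js process:)

    ls /proc/<pid>/fd | wc -l                # descriptors open right now
    grep 'open files' /proc/<pid>/limits     # the (raised) limit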

Thanks for all the tips.



-- 
Momtchil Momtchev <momtchil@momtchev.com>

