* [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
@ 2016-10-14 12:27 Chris Mason
  2016-10-15 22:34 ` Dave Chinner
  2019-06-14 12:58 ` Amir Goldstein
  0 siblings, 2 replies; 24+ messages in thread
From: Chris Mason @ 2016-10-14 12:27 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs


Hi Dave,

This is part of a series of patches we're growing to fix a perf
regression on a few straggler tiers that are still on v3.10.  In this
case, hadoop had to switch back to v3.10 because v4.x is as much as 15%
slower on recent kernels.

Between v3.10 and v4.x, kswapd is less effective overall.  This leads
more and more procs to get bogged down in direct reclaim, using SYNC_WAIT
in xfs_reclaim_inodes_ag().

Since slab shrinking happens very early in direct reclaim, we've seen
systems with 130GB of ram where hundreds of procs are stuck on the xfs
slab shrinker fighting to walk a slab 900MB in size.  They'd have better
luck moving on to the page cache instead.

Also, we're going into direct reclaim much more often than we should
because kswapd is getting stuck on XFS inode locks and writeback.
Dropping the SYNC_WAIT means that kswapd can move on to other things and
let the async worker threads get kicked to work on the inodes.

We're still working on the series, and this is only compile tested on
current Linus git.  I'm working out some better simulations for the
hadoop workload to stuff into Mel's tests.  Numbers from prod take
roughly 3 days to stabilize, so I haven't isolated this patch from the rest
of the series.

On unpatched v4.x, our base allocation stall rate goes up to as much as
200-300/sec, averaging 70/sec.  The series I'm finalizing gets that
number down to < 1/sec.
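
For reference, those allocation stall numbers are the direct reclaim
stall counters from /proc/vmstat.  Something like this is enough to
watch them (exact counter names vary a bit between kernel versions):

  # sample direct reclaim stall/scan counters once a second
  while sleep 1; do grep -E 'allocstall|pgscan_direct' /proc/vmstat; done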

Omar Sandoval did some digging and found you added the SYNC_WAIT in
response to a workload I sent ages ago.  I tried to make this OOM with
fsmark creating empty files, and it has been soaking in memory
constrained workloads in production for almost two weeks.
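
A representative fsmark invocation for that test, with -s 0 so the
files are empty (flags here are illustrative, not the exact run):

  fs_mark -d /mnt/test -D 64 -n 100000 -s 0 -t 16 -L 16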

Signed-off-by: Chris Mason <clm@fb.com>
---
 fs/xfs/xfs_icache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index bf2d607..63938fb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1195,7 +1195,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
 }
 
 /*
-- 
2.9.3



* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-14 12:27 [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim Chris Mason
@ 2016-10-15 22:34 ` Dave Chinner
  2016-10-17  0:24   ` Chris Mason
  2019-06-14 12:58 ` Amir Goldstein
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-10-15 22:34 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Fri, Oct 14, 2016 at 08:27:24AM -0400, Chris Mason wrote:
> 
> Hi Dave,
> 
> This is part of a series of patches we're growing to fix a perf
> regression on a few straggler tiers that are still on v3.10.  In this
> case, hadoop had to switch back to v3.10 because v4.x is as much as 15%
> slower on recent kernels.
> 
> Between v3.10 and v4.x, kswapd is less effective overall.  This leads
> more and more procs to get bogged down in direct reclaim, using SYNC_WAIT
> in xfs_reclaim_inodes_ag().
> 
> Since slab shrinking happens very early in direct reclaim, we've seen
> systems with 130GB of ram where hundreds of procs are stuck on the xfs
> slab shrinker fighting to walk a slab 900MB in size.  They'd have better
> luck moving on to the page cache instead.

We've already scanned the page cache for direct reclaim by the time
we get to running the shrinkers. Indeed, the amount of work the
shrinkers do is directly controlled by the amount of work done
scanning the page cache beforehand....

> Also, we're going into direct reclaim much more often than we should
> because kswapd is getting stuck on XFS inode locks and writeback.

Where and what locks, exactly?

> Dropping the SYNC_WAIT means that kswapd can move on to other things and
> let the async worker threads get kicked to work on the inodes.

Correct me if I'm wrong, but we introduced this shrinker behaviour
long before 3.10. I think the /explicit/ SYNC_WAIT was added some
time around 3.0 when I added the background async reclaim, but IIRC
the synchronous reclaim behaviour predates that by quite a bit.

IOWs, XFS shrinkers have the same blocking behaviour in 3.10 as they
do in current 4.x kernels. Hence if you're getting problems with
excessive blocking during reclaim on more recent 4.x kernels, then
it's more likely there is a change in memory reclaim balance in the
vmscan code that drives the shrinkers - that has definitely changed
between 3.10 and 4.x, but the XFS shrinker behaviour has not.

What XFS is doing is not wrong - the synchronous behaviour is the
primary memory reclaim feedback mechanism that prevents reclaim from
trashing the working set of clean inodes when under memory pressure.
It's also the choke point where we prevent lots of concurrent
threads from trying to do reclaim at once, contending on locks
and inodes and causing catastrophic IO breakdown because such
reclaim results in random IO patterns for inode writeback instead of
nice clean ascending offset ordered IO.

This happens to be exactly the situation you are describing -
hundreds of processes all trying to run inode reclaim concurrently,
which, if we don't block, will rapidly trash the clean inode cache,
leaving only the dirty inodes around.

Hence to remove the SYNC_WAIT simply breaks the overload prevention
and throttling feedback mechanisms the XFS inode shrinker uses to
maintain performance under severe memory pressure.

> We're still working on the series, and this is only compile tested on
> current Linus git.  I'm working out some better simulations for the
> hadoop workload to stuff into Mel's tests.  Numbers from prod take
> roughly 3 days to stabilize, so I haven't isolated this patch from the rest
> of the series.
> 
> On unpatched v4.x, our base allocation stall rate goes up to as much as
> 200-300/sec, averaging 70/sec.  The series I'm finalizing gets that
> number down to < 1 /sec.

That doesn't mean you've 'fixed' the problem. AFAICT you're just
disconnecting memory allocation from the rate at which inode reclaim
can clean and reclaim inodes. Yes, direct reclaim stalls will go
away, but that doesn't speed up inode reclaim, nor does it do
anything to prevent random inode writeback patterns. With the amount
of memory your machines have, removing SYNC_WAIT will simply hide
the problem by leaving the inodes dirty until journal pressure
forces the inode to be cleaned and then the background reclaimer can
free it. In general, that doesn't work particularly well because
journals can index more dirty objects than can be easily cached in
memory, and that's why we have to drive IO and throttle direct
reclaim.

> Omar Sandoval did some digging and found you added the SYNC_WAIT in
> response to a workload I sent ages ago. 

That commit (which I had to go find because you helpfully didn't
quote it) was a7b339f1b869 ("xfs: introduce background inode reclaim
work"), which introduced asynchronous background reclaim work and so
made the reclaim function able to handle both async and sync reclaim.
To maintain the direct reclaim throttling behaviour of the shrinker,
that function now needed to be told to be synchronous, hence the
addition of the SYNC_WAIT. We didn't introduce sync
reclaim with this commit in 2.6.39(!), we've had that behaviour
since well before that. Hence if the analysis performed stopped at
this point in history, it was flawed.

> I tried to make this OOM with
> fsmark creating empty files, and it has been soaking in memory
> constrained workloads in production for almost two weeks.

That workload reclaims mostly from the background worker because
a) the majority of allocations are GFP_NOFS and so the inode
shrinker doesn't run very often, b) inode allocation is sequential
so inode writeback optimises into large sequential writes, not slow
random writes, c) it generates significant journal pressure so
drives inode writeback through tail-pushing the journal, and d)
sequential, ascending order inode allocation and inode writeback
optimisations mean reclaim rarely comes across dirty inodes because
it reclaims inodes in the same order they are allocated and cleaned.

I'm sure there's a solution to whatever problem is occurring at FB,
but you're going to need to start by describing the problems and
documenting the analysis so we can understand what the problem is
before we start discussing potential solutions. It seems that the
problem is excessive direct reclaim concurrency and XFS throttling
concurrency back to the maximum the underlying filesystem can
optimise sanely, but nobody has explained why that is happening on
4.x and not 3.10.

i.e. We need to clearly understand the root cause of the problem
before we discuss potential solutions. Changes to long standing (and
unchanged) behaviours might end up being the fix, but it's not
something we'll do without first understanding the problem or
exploring other potential solutions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-15 22:34 ` Dave Chinner
@ 2016-10-17  0:24   ` Chris Mason
  2016-10-17  1:52     ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2016-10-17  0:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sun, Oct 16, 2016 at 09:34:54AM +1100, Dave Chinner wrote:
>On Fri, Oct 14, 2016 at 08:27:24AM -0400, Chris Mason wrote:
>>
>> Hi Dave,
>>
>> This is part of a series of patches we're growing to fix a perf
>> regression on a few straggler tiers that are still on v3.10.  In this
>> case, hadoop had to switch back to v3.10 because v4.x is as much as 15%
>> slower on recent kernels.
>>
>> Between v3.10 and v4.x, kswapd is less effective overall.  This leads
>> more and more procs to get bogged down in direct reclaim, using SYNC_WAIT
>> in xfs_reclaim_inodes_ag().
>>
>> Since slab shrinking happens very early in direct reclaim, we've seen
>> systems with 130GB of ram where hundreds of procs are stuck on the xfs
>> slab shrinker fighting to walk a slab 900MB in size.  They'd have better
>> luck moving on to the page cache instead.
>
>We've already scanned the page cache for direct reclaim by the time
>we get to running the shrinkers. Indeed, the amount of work the
>shrinkers do is directly controlled by the amount of work done
>scanning the page cache beforehand....
>
>> Also, we're going into direct reclaim much more often than we should
>> because kswapd is getting stuck on XFS inode locks and writeback.
>
>Where and what locks, exactly?

This is from v4.0, because all of my newer hosts are trying a variety of 
patched kernels.  But the traces were very similar on newer kernels:

# cat /proc/282/stack
[<ffffffff812ea2cd>] xfs_buf_submit_wait+0xbd/0x1d0
[<ffffffff812ea6e4>] xfs_bwrite+0x24/0x60
[<ffffffff812f18a4>] xfs_reclaim_inode+0x304/0x320
[<ffffffff812f1b17>] xfs_reclaim_inodes_ag+0x257/0x370
[<ffffffff812f2613>] xfs_reclaim_inodes_nr+0x33/0x40
[<ffffffff81300fb9>] xfs_fs_free_cached_objects+0x19/0x20
[<ffffffff811bb13b>] super_cache_scan+0x18b/0x190
[<ffffffff8115acc6>] shrink_slab.part.40+0x1f6/0x380
[<ffffffff8115e9da>] shrink_zone+0x30a/0x320
[<ffffffff8115f94f>] kswapd+0x51f/0x9e0
[<ffffffff810886b2>] kthread+0xd2/0xf0
[<ffffffff81770d88>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

This one hurts the most.  While kswapd is waiting for IO, all the other 
reclaim he might have been doing is backing up.
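
An easy way to see how often kswapd is stuck there is to just sample
its stack in a loop (assuming kswapd0 is the thread of interest):

  while sleep 1; do cat /proc/$(pgrep -x kswapd0)/stack; done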

The other common path is the pag->pag_ici_reclaim_lock lock in 
xfs_reclaim_inodes_ag.  It goes through the trylock loop, doesn't free 
enough, and then waits on the locks for real.

>
>> Dropping the SYNC_WAIT means that kswapd can move on to other things and
>> let the async worker threads get kicked to work on the inodes.
>
>Correct me if I'm wrong, but we introduced this shrinker behaviour
>long before 3.10. I think the /explicit/ SYNC_WAIT was added some
>time around 3.0 when I added the background async reclaim, but IIRC
>the synchronous reclaim behaviour predates that by quite a bit.

With this workload, it can take as much as a week to verify a given 
change really makes it better.  So, I'm much more focused on a high 
level view of making the MM in current kernels more efficient than I am 
in digging through each commit between v3.10 and v4.x that may be 
related.

Once I've nailed down a good list of fixes, I'll start making some tests 
to make sure they keep fixing things in the future.  I'm starting with 
obvious things:  direct reclaim is happening because kswapd isn't 
processing the lists fast enough because he's stuck in D state.

This led to two changes.  This patch, and lowering the dirty ratios to 
make sure background writeout can keep up and kswapd doesn't have to 
call pageout().
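
The dirty ratio clamping is just the usual vm knobs, e.g. (example
values only, not a recommendation):

  sysctl -w vm.dirty_background_ratio=1
  sysctl -w vm.dirty_ratio=5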

XFS is also limiting the direct reclaim speed of all the other slabs.  
We have 15 drives, each with its own filesystem.  But the end result of the 
current system is to bottleneck behind whichever FS is slowest at any 
given moment.

>
>IOWs, XFS shrinkers have the same blocking behaviour in 3.10 as they
>do in current 4.x kernels. Hence if you're getting problems with
>excessive blocking during reclaim on more recent 4.x kernels, then
>it's more likely there is a change in memory reclaim balance in the
>vmscan code that drives the shrinkers - that has definitely changed
>between 3.10 and 4.x, but the XFS shrinker behaviour has not.

Correct, I went down this path assuming that commits between v3.10 and 
4.x changed how often we're calling the xfs shrinker, even though the 
shrinker was the same.

>
>What XFS is doing is not wrong - the synchronous behaviour is the
>primary memory reclaim feedback mechanism that prevents reclaim from
>trashing the working set of clean inodes when under memory pressure.
>It's also the choke point where we prevent lots of concurrent
>threads from trying to do reclaim at once, contending on locks
>and inodes and causing catastrophic IO breakdown because such
>reclaim results in random IO patterns for inode writeback instead of
>nice clean ascending offset ordered IO.

It's also blocking kswapd (and all the other procs directly calling 
shrinkers) on IO.  Either IO it directly issues or IO run by other 
procs.  This causes the conditions that make all the threads want to do 
reclaim at once.

>That commit (which I had to go find because you helpfully didn't
>quote it) was a7b339f1b869 ("xfs: introduce background inode reclaim
>work") introduced asynchronous background reclaim work and so made
>the reclaim function able to handle both async and sync reclaim. To
>maintain the direct reclaim throttling behaviour of the shrinker,
>that function now needed to be told to be synchronous, hence the
>addition of the SYNC_WAIT. We didn't introduce sync
>reclaim with this commit in 2.6.39(!), we've had that behaviour
>since well before that. Hence if the analysis performed stopped at
>this point in history, it was flawed.

This was an RFC because I was RFCing.  We're stuffing kswapd behind 
synchronous IO, and limiting the rate at which we can reclaim pages on 
the system.  I'm happy to toss unpatched kernels into the workload and 
gather stats to help us nail down good behaviour, but really I'm asking 
what those stats might be.

What we have now is a single synchronous shrinker taking the box over.  
Nothing happens until XFS gets its inodes down the pipe, even when there 
are a considerable number of other freeable pages on the box.

There are probably a series of other optimizations we can be making in 
the MM code around when the shrinkers are called, and how well it deals 
with the constant churn from this workload.   I want to try those too,  
but we're stuck right now on this one spot.

An overall description of the hadoop workload:

Lots of java threads, spanning 15 filesystems, 4T each.

Each thread runs an unpredictable amount of disk and network IO and uses 
massive amounts of CPU.  The threads last for unpredictable amounts of 
time.  These boxes have ~130GB of ram and two sockets of CPUs (12 cores 
per-cpu, HT enabled).  The files themselves are relatively large and are 
mostly streaming reads/writes.

-chris



* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-17  0:24   ` Chris Mason
@ 2016-10-17  1:52     ` Dave Chinner
  2016-10-17 13:30       ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-10-17  1:52 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Sun, Oct 16, 2016 at 08:24:33PM -0400, Chris Mason wrote:
> On Sun, Oct 16, 2016 at 09:34:54AM +1100, Dave Chinner wrote:
> >On Fri, Oct 14, 2016 at 08:27:24AM -0400, Chris Mason wrote:
> >>
> >>Hi Dave,
> >>
> >>This is part of a series of patches we're growing to fix a perf
> >>regression on a few straggler tiers that are still on v3.10.  In this
> >>case, hadoop had to switch back to v3.10 because v4.x is as much as 15%
> >>slower on recent kernels.
> >>
> >>Between v3.10 and v4.x, kswapd is less effective overall.  This leads
> >>more and more procs to get bogged down in direct reclaim, using SYNC_WAIT
> >>in xfs_reclaim_inodes_ag().
> >>
> >>Since slab shrinking happens very early in direct reclaim, we've seen
> >>systems with 130GB of ram where hundreds of procs are stuck on the xfs
> >>slab shrinker fighting to walk a slab 900MB in size.  They'd have better
> >>luck moving on to the page cache instead.
> >
> >We've already scanned the page cache for direct reclaim by the time
> >we get to running the shrinkers. Indeed, the amount of work the
> >shrinkers do is directly controlled by the amount of work done
> >scanning the page cache beforehand....
> >
> >>Also, we're going into direct reclaim much more often than we should
> >>because kswapd is getting stuck on XFS inode locks and writeback.
> >
> >Where and what locks, exactly?
> 
> This is from v4.0, because all of my newer hosts are trying a
> variety of patched kernels.  But the traces were very similar on
> newer kernels:
> 
> # cat /proc/282/stack
> [<ffffffff812ea2cd>] xfs_buf_submit_wait+0xbd/0x1d0
> [<ffffffff812ea6e4>] xfs_bwrite+0x24/0x60
> [<ffffffff812f18a4>] xfs_reclaim_inode+0x304/0x320
> [<ffffffff812f1b17>] xfs_reclaim_inodes_ag+0x257/0x370
> [<ffffffff812f2613>] xfs_reclaim_inodes_nr+0x33/0x40
> [<ffffffff81300fb9>] xfs_fs_free_cached_objects+0x19/0x20
> [<ffffffff811bb13b>] super_cache_scan+0x18b/0x190
> [<ffffffff8115acc6>] shrink_slab.part.40+0x1f6/0x380
> [<ffffffff8115e9da>] shrink_zone+0x30a/0x320
> [<ffffffff8115f94f>] kswapd+0x51f/0x9e0
> [<ffffffff810886b2>] kthread+0xd2/0xf0
> [<ffffffff81770d88>] ret_from_fork+0x58/0x90
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This one hurts the most.  While kswapd is waiting for IO, all the
> other reclaim he might have been doing is backing up.

Which says two things: neither the journal tail pushing nor the background
inode reclaim threads are keeping up with dirty inode writeback
demand. Without knowing why that is occurring, we cannot solve the
problem.

> The other common path is the pag->pag_ici_reclaim_lock lock in
> xfs_reclaim_inodes_ag.  It goes through the trylock loop, didn't
> free enough, and then waits on the locks for real.

Which is the "prevent hundreds of threads from all issuing inode
writeback concurrently" throttling. Working as designed.

> XFS is also limiting the direct reclaim speed of all the other
> slabs.  We have 15 drives, each with its own filesystem.  But end
> result of the current system is to bottleneck behind whichever FS is
> slowest at any given moment.

So why is the filesystem slow in 4.0 and not slow at all in 3.10?

And how does a 4.8 kernel compare, given there were major changes to
the mm/ subsystem in this release? i.e. are you chasing a mm/
problem that has already been solved?

> >What XFS is doing is not wrong - the synchronous behaviour is the
> >primary memory reclaim feedback mechanism that prevents reclaim from
> >trashing the working set of clean inodes when under memory pressure.
> >It's also the choke point where we prevent lots of concurrent
> >threads from trying to do reclaim at once, contending on locks
> >and inodes and causing catastrophic IO breakdown because such
> >reclaim results in random IO patterns for inode writeback instead of
> >nice clean ascending offset ordered IO.
> 
> It's also blocking kswapd (and all the other procs directly calling
> shrinkers) on IO.  Either IO it directly issues or IO run by other
> procs.  This causes the conditions that make all the threads want to
> do reclaim at once.

Yup, that's what it's meant to do. As I keep repeating - this
behaviour is indicative of /some other problem/ occurring. If inode
writeback and reclaim is occurring efficiently and correctly, then
kswapd will not throttle like this because it will never block on
IO. And, by design, when we throttle kswapd we effectively throttle
direct reclaim, too.

> >That commit (which I had to go find because you helpfully didn't
> >quote it) was a7b339f1b869 ("xfs: introduce background inode reclaim
> >work") introduced asynchronous background reclaim work and so made
> >the reclaim function able to handle both async and sync reclaim. To
> >maintain the direct reclaim throttling behaviour of the shrinker,
> >that function now needed to be told to be synchronous, hence the
> >addition of the SYNC_WAIT. We didn't introduce sync
> >reclaim with this commit in 2.6.39(!), we've had that behaviour
> >since well before that. Hence if the analysis performed stopped at
> >this point in history, it was flawed.
> 
> This was an RFC because I was RFCing.  We're stuffing kswapd behind
> synchronous IO, and limiting the rate at which we can reclaim pages
> on the system.  I'm happy to toss unpatched kernels into the
> workload and gather stats to help us nail down good behaviour, but
> really I'm asking what those stats might be.

I can't suggest anything right now, because you haven't given me
system/workload details or concrete analysis to base any suggestions
on.

> What we have now is a single synchronous shrinker taking the box
> over.  Nothing happens until XFS gets its inodes down the pipe, even
> when there are a considerable number of other freeable pages on the
> box.
> 
> There are probably a series of other optimizations we can be making
> in the MM code around when the shrinkers are called, and how well it
> deals with the constant churn from this workload.   I want to try
> those too,  but we're stuck right now on this one spot.
> 
> An over all description of the hadoop workload:
> 
> Lots of java threads, spanning 15 filesystems, 4T each.

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Things like number of allocation groups, journal size, what the
physical storage is, IO scheduler (or lack of), etc, are all
important here.
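
Most of that is a one-liner per filesystem to collect, e.g.
(substitute the real mount points and devices):

  xfs_info /mnt/point                  # AG count/size, log size, etc
  cat /sys/block/sdX/queue/scheduler   # IO scheduler in use
  grep xfs /proc/mounts                # mount options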

> Each thread runs an unpredictable amount of disk and network IO and
> uses massive amounts of CPU.  The threads last for unpredictable
> amounts of time.  These boxes have ~130GB of ram and two sockets of
> CPUs (12 cores per-cpu, HT enabled). The files themselves are
> relatively large and are mostly streaming reads/writes.

Again, details. meminfo, iostat, etc, details of how memory usage
and IO patterns change when everything backs up on inode reclaim,
etc are /really important/ here. I need to be able to reproduce a
similar memory imbalance myself to be able to test any solution we
come up with.
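
Even periodic snapshots captured while the stalls are occurring would
help, something along these lines:

  iostat -x 5 > iostat.log &
  vmstat 5 > vmstat.log &
  while sleep 5; do cat /proc/meminfo; done > meminfo.log &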

Here's some immediate questions I have from the workload
description:

	- If the workload is mostly large files and streaming reads
	  and writes, then why are there so many inodes that need
	  writeback that reclaim is getting stuck on them?
	- Why aren't the inodes getting written back regularly via
	  periodic log work (e.g. via the xfs_ail_push_all() call
	  that occur every 30s)?
	- Is there so much data IO that metadata IO is being
	  starved?
	- Why does it take a week to manifest - when does the system
	  go out of balance and is there anything in userspace that
	  changes behaviour that might trigger it?
	- Is there a behavioural step-change in the workload of
	  some threads?
	- are you running at near ENOSPC and so maybe hitting some
	  filesystem fragmentation level that causes the seek load
	  to slowly increase until there's no IOPS left in the
	  storage?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-17  1:52     ` Dave Chinner
@ 2016-10-17 13:30       ` Chris Mason
  2016-10-17 22:30         ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2016-10-17 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



On 10/16/2016 09:52 PM, Dave Chinner wrote:
> On Sun, Oct 16, 2016 at 08:24:33PM -0400, Chris Mason wrote:
>> On Sun, Oct 16, 2016 at 09:34:54AM +1100, Dave Chinner wrote:
>>> On Fri, Oct 14, 2016 at 08:27:24AM -0400, Chris Mason wrote:
>>>>
>>>> Hi Dave,
>>>>
>>>> This is part of a series of patches we're growing to fix a perf
>>>> regression on a few straggler tiers that are still on v3.10.  In this
>>>> case, hadoop had to switch back to v3.10 because v4.x is as much as 15%
>>>> slower on recent kernels.
>>>>
>>>> Between v3.10 and v4.x, kswapd is less effective overall.  This leads
>>>> more and more procs to get bogged down in direct reclaim, using SYNC_WAIT
>>>> in xfs_reclaim_inodes_ag().
>>>>
>>>> Since slab shrinking happens very early in direct reclaim, we've seen
>>>> systems with 130GB of ram where hundreds of procs are stuck on the xfs
>>>> slab shrinker fighting to walk a slab 900MB in size.  They'd have better
>>>> luck moving on to the page cache instead.
>>>
>>> We've already scanned the page cache for direct reclaim by the time
>>> we get to running the shrinkers. Indeed, the amount of work the
>>> shrinkers do is directly controlled by the amount of work done
>>> scanning the page cache beforehand....
>>>
>>>> Also, we're going into direct reclaim much more often than we should
>>>> because kswapd is getting stuck on XFS inode locks and writeback.
>>>
>>> Where and what locks, exactly?
>>
>> This is from v4.0, because all of my newer hosts are trying a
>> variety of patched kernels.  But the traces were very similar on
>> newer kernels:
>>
>> # cat /proc/282/stack
>> [<ffffffff812ea2cd>] xfs_buf_submit_wait+0xbd/0x1d0
>> [<ffffffff812ea6e4>] xfs_bwrite+0x24/0x60
>> [<ffffffff812f18a4>] xfs_reclaim_inode+0x304/0x320
>> [<ffffffff812f1b17>] xfs_reclaim_inodes_ag+0x257/0x370
>> [<ffffffff812f2613>] xfs_reclaim_inodes_nr+0x33/0x40
>> [<ffffffff81300fb9>] xfs_fs_free_cached_objects+0x19/0x20
>> [<ffffffff811bb13b>] super_cache_scan+0x18b/0x190
>> [<ffffffff8115acc6>] shrink_slab.part.40+0x1f6/0x380
>> [<ffffffff8115e9da>] shrink_zone+0x30a/0x320
>> [<ffffffff8115f94f>] kswapd+0x51f/0x9e0
>> [<ffffffff810886b2>] kthread+0xd2/0xf0
>> [<ffffffff81770d88>] ret_from_fork+0x58/0x90
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> This one hurts the most.  While kswapd is waiting for IO, all the
>> other reclaim he might have been doing is backing up.
>
> Which says two things: neither the journal tail pushing nor the background
> inode reclaim threads are keeping up with dirty inode writeback
> demand. Without knowing why that is occurring, we cannot solve the
> problem.
>
>> The other common path is the pag->pag_ici_reclaim_lock lock in
>> xfs_reclaim_inodes_ag.  It goes through the trylock loop, didn't
>> free enough, and then waits on the locks for real.
>
> Which is the "prevent hundreds of threads from all issuing inode
> writeback concurrently" throttling. Working as designed.

Ok, I think we're just not on the same page about how kswapd is 
designed.  So instead of worrying about some crusty old kernel, let's 
talk about that for a minute.  I'm not trying to explain kswapd to you, 
just putting what I'm seeing from the shrinker in terms of how kswapd 
deals with dirty pages:

LRUs try to keep the dirty pages away from kswapd in hopes that 
background writeback will clean them instead of kswapd.

When system memory pressure gets bad enough, kswapd will call pageout(). 
  This includes a check for congested bdis where it will skip the IO 
because it doesn't want to wait on busy resources.

The main throttling mechanism is to slow down the creation of new dirty 
pages via balance_dirty_pages().

IO is avoided from inside kswapd because there's only one kswapd 
per-numa node.  It is trying to take a global view of the freeable 
memory in the node, instead of focusing on any one individual page.

Shrinkers are a little different because while individual shrinkers have 
a definition of dirty, the general concept doesn't.  kswapd calls into 
the shrinkers to ask them to be smaller.

With dirty pages, kswapd will start IO but not wait on it.
With the xfs shrinker, kswapd does synchronous IO to write a single 
inode in xfs_buf_submit_wait().

With congested BDIs, kswapd will skip the IO and wait for progress after 
running through a good chunk of pages.  With the xfs shrinker, kswapd 
will synchronously wait for progress on a single FS, even if there are 
dozens of other filesystems around.

For the xfs shrinker, the mechanism to throttle new dirty inodes on a 
single FS is stalling every process in the system in direct reclaim?

>
>> XFS is also limiting the direct reclaim speed of all the other
>> slabs.  We have 15 drives, each with its own filesystem.  But end
>> result of the current system is to bottleneck behind whichever FS is
>> slowest at any given moment.
>
> So why is the filesystem slow in 4.0 and not slow at all in 3.10?
>

It's not that v3.10 is fast.  It's just faster.  v4.x is faster in a 
bunch of other ways, but this one part of v3.10 isn't slowing down the 
system as much as this one part of v4.x.

> And how does a 4.8 kernel compare, given there were major changes to
> the mm/ subsystem in this release? i.e. are you chasing a mm/
> problem that has already been solved?

We don't think it's already solved in v4.8, but we're setting up a test 
to confirm that.  I'm working on a better simulation of the parts we're 
tripping over so I can model this outside of production.  I definitely 
agree that something is wrong in MM land too, we have to clamp down on 
the dirty ratios much more than we should to keep kswapd from calling 
pageout().

We can dive into workload specifics too, but I'd rather do that against 
simulations where I can try individual experiments more quickly.  The 
reason it takes me a week to get hard numbers is because the workload is 
very inconsistent.  The only way to get a good comparison is to put the 
test kernel on roughly 30 machines and then average major metrics over a 
period of days.  Just installing the kernel takes almost a day because I 
can only reboot one machine every 20 minutes.

-chris


* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-17 13:30       ` Chris Mason
@ 2016-10-17 22:30         ` Dave Chinner
  2016-10-17 23:20           ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-10-17 22:30 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Mon, Oct 17, 2016 at 09:30:05AM -0400, Chris Mason wrote:
> On 10/16/2016 09:52 PM, Dave Chinner wrote:
> >On Sun, Oct 16, 2016 at 08:24:33PM -0400, Chris Mason wrote:
> >>On Sun, Oct 16, 2016 at 09:34:54AM +1100, Dave Chinner wrote:
> >>This one hurts the most.  While kswapd is waiting for IO, all the
> >>other reclaim he might have been doing is backing up.
> >
> >Which says two things: neither the journal tail pushing nor the background
> >inode reclaim threads are keeping up with dirty inode writeback
> >demand. Without knowing why that is occurring, we cannot solve the
> >problem.
> >
> >>The other common path is the pag->pag_ici_reclaim_lock lock in
> >>xfs_reclaim_inodes_ag.  It goes through the trylock loop, didn't
> >>free enough, and then waits on the locks for real.
> >
> >Which is the "prevent hundreds of threads from all issuing inode
> >writeback concurrently" throttling. Working as designed.
> 
> Ok, I think we're just not on the same page about how kswapd is
> designed.

Chris, I understand perfectly well what kswapd is and how it is
supposed to work. I also understand how shrinkers work and how they
are supposed to interact with page reclaim - my dirty paws are all
over the shrinker infrastructure (context specific shrinkers,
NUMA, GFP_NOFS anti-windup mechanisms, cgroup, etc).

> The main throttling mechanism is to slow down the creation of new
> dirty pages via balance_dirty_pages().

I'm also the guy who architected the IO-less dirty page throttling
infrastructure so there's not much you can teach me about that,
either. Indeed, the XFS behaviour that you want to remove implements
a similar (but much more complex) feedback mechanism as the IO-less
dirty throttle.

> IO is avoided from inside kswapd because there's only one kswapd
> per-numa node.  It is trying to take a global view of the freeable
> memory in the node, instead of focusing on any one individual page.

kswapd is also the "get out of gaol free" thread for reclaim when
the memory pressure is entirely filesystem bound and so direct
reclaim skips all filesystem reclaim because GFP_NOFS is being
asserted. This happens a lot in XFS.....

The result of this is that only in kswapd context (i.e. GFP_KERNEL,
PF_KSWAPD shrinker context) can we do the things necessary to
/guarantee forwards reclaim progress/.  That means kswapd might
sometimes be slow, but if we don't allow the shrinker to block from
time to time then there's every chance that reclaim will not make
progress.

> Shrinkers are a little different because while individual shrinkers
> have a definition of dirty, the general concept doesn't.  kswapd
> calls into the shrinkers to ask them to be smaller.

Well, that's not exactly what the shrinkers are being asked to do.
Speaking as the guy who designed the current shrinker API, shrinkers
are being asked to /scan their subsystem/ and take actions that will
/allow/ memory to be reclaimed. Not all shrinkers sit in front of
slab caches - some sit on hardware related pools and trigger early
garbage collection of completed commands and queues. e.g. the DRI
subsystem shrinkers. Shrinkers are for more than just slab caches
these days - they are a general "memory pressure notification"
mechanism and need to be thought of as such rather than a traditional
"slab cache shrinker".

IOWs, the actions that shrinkers take may not directly free memory,
but they may result in memory becoming reclaimable in the near
future e.g. writing back dirty inodes doesn't free memory - it can
actually create memory demand - but it does then allow the inodes
and their backing buffers to be reclaimed once the inodes are clean.

To allow such generic implementations to exist, shrinkers are
allowed to block just like the page cache reclaim is allowed to
block. Blocking should be as limited as possible, but it is allowed
as it may be necessary to guarantee progress.

The difference here is that page cache reclaim has far more context
in which to make decisions on whether to block or not.
Shrinkers have a gfp mask, and nothing else. i.e. shrinkers are not
given enough context by the mm subsystem to make smart decisions on
how much they are allowed to block. e.g. GFP_NOFS means no
filesystem shrinker can run, even though the memory allocation may
be coming from a different fs and there's no possibility of a
deadlock.

I've been talking about this with the mm/ guys and what we can do
differently to pass more context to the shrinkers (e.g. task based
reclaim context structures rather than GFP_* flags passed on the
stack) so we can be smarter in the shrinkers about what we can and
can't do.

> With dirty pages, kswapd will start IO but not wait on it.
> With the xfs shrinker, kswapd does synchronous IO to write a single
> inode in xfs_buf_submit_wait().

That's because there is /no other throttling mechanism/ for shrinker
controlled slab caches. We can't throttle at allocation time because
we have no mechanism for either counting or limiting the number of
dirty objects in a slab cache, like we do for the page cache. We have
/limited/ control via the size of the filesystem journal (which is
why I've been asking for that information!), but realistically the
only solid, reliable method we have to prevent excessive dirty inode
accumulation in large memory machines with multiple filesystems and
slow disks is to throttle memory allocation to the rate at which we
can reclaim inodes.

This prevents the sorts of situations we were regularly seeing 7-8
years ago where a freeze (for a snapshot) or unmount could take
/hours/ because we have built up hundreds of thousands of dirty
inodes in cache over a period of days because the slow SATA RAID6
array can only do 50 write IOPS.....

i.e. there are many, many good reasons for why XFS treats inode
reclaim the way it does. I don't expect you to understand all the
issues that we are preventing by throttling memory allocation like
this, but I do expect you to respect the fact it's done for very
good reasons.

> With congested BDIs, kswapd will skip the IO and wait for progress
> after running through a good chunk of pages.  With the xfs shrinker,
> kswapd will synchronously wait for progress on a single FS, even if
> there are dozens of other filesystems around.

Yes, that can happen but, again, this behaviour indicates something
else is wrong and the system is out of whack in some way.  In the
normal case, kswapd will only block on IO long enough for the
other async threads to run through inode writeback and clean inode
reclaim as efficiently as possible. By the time that the shrinker
wakes up after blocking on an IO, it should either have clean inodes
to reclaim or nothing more to do.

What you are reporting is equivalent to having pageout() run and do
all the writeback (badly) instead of the bdi flusher threads doing
all the writeback (efficiently). pageout() is a /worst case/
behaviour we try very hard to avoid and when it occurs it is
generally indicative of some other problem or imbalance. Same goes
here for the inode shrinker.

> For the xfs shrinker, the mechanism to throttle new dirty inodes on
> a single FS is stalling every process in the system in direct
> reclaim?

We're not throttling dirty inodes - we are throttling memory
allocation because that's the only hammer we have to prevent
excessive buildup of dirty inodes that have already been
allocated.

We can't do that throttling when we dirty inodes because that
happens in transaction context holding locked objects and blocking
there waiting on inode writeback progress will cause journal
deadlocks....

This stuff is way more complex than just "have cache, will shrink".

> We don't think it's already solved in v4.8, but we're setting up a
> test to confirm that.  I'm working on a better simulation of the
> parts we're tripping over so I can model this outside of production.
> I definitely agree that something is wrong in MM land too, we have
> to clamp down on the dirty ratios much more than we should to keep
> kswapd from calling pageout().

Having pageout() run is pretty indicative of instantaneous memory
demand being significantly higher than the IO throughput of the
storage subsystem and the only reclaimable memory in the system
being dirty filesystem caches. i.e. background writeback is not
keeping up with memory demand and dirtying rates for some reason. The
general rule of thumb is that if pageout() is occurring then
the IO subsystem is about to die a horrible death of random IO.

If you're going to great lengths to avoid pageout() being called,
then it's no surprise to me that your workload is, instead,
triggering the equivalent "oh shit, we're in real trouble here"
behaviour in XFS inode cache reclaim.  I also wonder, after turning
down the dirty ratios, if you've done other typical writeback tuning
tweaks like speeding up XFS's periodic metadata writeback to clean
inodes faster in the absence of journal pressure.

It's detailed information like this that I've been asking for -
there's good reasons I've asked for that information, and any
further discussion will just be a waste of my time without all the
details I've already asked for.

> We can dive into workload specifics too, but I'd rather do that
> against simulations where I can try individual experiments more
> quickly. 

Yup, I think you need to come up with a reproducible test case you
can share....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-17 22:30         ` Dave Chinner
@ 2016-10-17 23:20           ` Chris Mason
  2016-10-18  2:03             ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2016-10-17 23:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



On 10/17/2016 06:30 PM, Dave Chinner wrote:
> On Mon, Oct 17, 2016 at 09:30:05AM -0400, Chris Mason wrote:
>> On 10/16/2016 09:52 PM, Dave Chinner wrote:
>>> On Sun, Oct 16, 2016 at 08:24:33PM -0400, Chris Mason wrote:
>>>> On Sun, Oct 16, 2016 at 09:34:54AM +1100, Dave Chinner wrote:
>>>> This one hurts the most.  While kswapd is waiting for IO, all the
>>>> other reclaim he might have been doing is backing up.
>>>
>>> Which says two things: neither the journal tail pushing nor the background
>>> inode reclaim threads are keeping up with dirty inode writeback
>>> demand. Without knowing why that is occurring, we cannot solve the
>>> problem.
>>>
>>>> The other common path is the pag->pag_ici_reclaim_lock lock in
>>>> xfs_reclaim_inodes_ag.  It goes through the trylock loop, didn't
>>>> free enough, and then waits on the locks for real.
>>>
>>> Which is the "prevent hundreds of threads from all issuing inode
>>> writeback concurrently" throttling. Working as designed.
>>
>> Ok, I think we're just not on the same page about how kswapd is
>> designed.
>
> Chris, I understand perfectly well what kswapd is and how it is
> supposed to work. I also understand how shrinkers work and how they
> are supposed to interact with page reclaim - my dirty paws are all
> over the shrinker infrastructure (context specific shrinkers,
> NUMA, GFP_NOFS anti-windup mechanisms, cgroup, etc).

Dave, I'm just trying to break the conversation down to a common 
ground vocabulary that I know you understand.

I'll come back with a workload in .c form.  You can run the workload and 
decide for yourself if the shrinkers are incorrectly bottlenecking the 
system.

>
>> The main throttling mechanism is to slow down the creation of new
>> dirty pages via balance_dirty_pages().
>
> I'm also the guy who architected the IO-less dirty page throttling
> infrastructure so there's not much you can teach me about that,
> either. Indeed, the XFS behaviour that you want to remove implements
> a similar (but much more complex) feedback mechanism as the IO-less
> dirty throttle.
>
>> IO is avoided from inside kswapd because there's only one kswapd
>> per-numa node.  It is trying to take a global view of the freeable
>> memory in the node, instead of focusing on any one individual page.
>
> kswapd is also the "get out of gaol free" thread for reclaim when
> the memory pressure is entirely filesystem bound and so direct
> reclaim skips all filesystem reclaim because GFP_NOFS is being
> asserted. This happens a lot in XFS.....
>
> The result of this is that only in kswapd context (i.e. GFP_KERNEL,
> PF_KSWAPD shrinker context) can we do the things necessary to
> /guarantee forwards reclaim progress/.  That means kswapd might
> sometimes be slow, but if we don't allow the shrinker to block from
> time to time then there's every chance that reclaim will not make
> progress.
>

Sure, if the shrinker only blocked from time to time, I wouldn't be sending 
this email.

>> Shrinkers are a little different because while individual shrinkers
>> have a definition of dirty, the general concept doesn't.  kswapd
>> calls into the shrinkers to ask them to be smaller.
>
> Well, that's not exactly what the shrinkers are being asked to do.
> Speaking as the guy who designed the current shrinker API, shrinkers
> are being asked to /scan their subsystem/ and take actions that will
> /allow/ memory to be reclaimed. Not all shrinkers sit in front of
> slab caches - some sit on hardware related pools and trigger early
> garbage collection of completed commands and queues. e.g. the DRI
> subsystem shrinkers. Shrinkers are for more than just slab caches
> these days - they are a general "memory pressure notification"
> mechanism and need to be thought of as such rather than a traditional
> "slab cache shrinker".

Yes, which is why it's so important that a single individual shrinker not 
block all the other shrinkers from running.

>
> IOWs, the actions that shrinkers take may not directly free memory,
> but they may result in memory becoming reclaimable in the near
> future e.g. writing back dirty inodes doesn't free memory - it can
> actually create memory demand - but it does then allow the inodes
> and their backing buffers to be reclaimed once the inodes are clean.
>
> To allow such generic implementations to exist, shrinkers are
> allowed to block just like the page cache reclaim is allowed to
> block. Blocking should be as limited as possible, but it is allowed
> as it may be necessary to guarantee progress.
>
> The difference here is that page cache reclaim has far more
> context in which to make decisions on whether to block or not.
> Shrinkers have a gfp mask, and nothing else. i.e shrinkers are not
> given enough context by the mm subsystem to make smart decisions on
> how much they are allowed to block. e.g. GFP_NOFS means no
> filesystem shrinker can run, even though the memory allocation may
> be coming from a different fs and there's no possibility of a
> deadlock.
>
> I've been talking about this with the mm/ guys and what we can do
> differently to pass more context to the shrinkers (e.g. task based
> reclaim context structures rather than GFP_* flags passed on the
> stack) so we can be smarter in the shrinkers about what we can and
> can't do.
>
>> With dirty pages, kswapd will start IO but not wait on it.
>> With the xfs shrinker, kswapd does synchronous IO to write a single
>> inode in xfs_buf_submit_wait().
>
> That's because there is /no other throttling mechanism/ for shrinker
> controlled slab caches. We can't throttle at allocation time because
> we have no mechanism for either counting or limiting the number of
> dirty objects in a slab caches, like we do the page cache. We have
> /limited/ control via the size of the filesystem journal (which is
> why I've been asking for that information!), but realistically the
> only solid, reliable method we have to prevent excessive dirty inode
> accumulation in large memory machines with multiple filesystems and
> slow disks is to throttle memory allocation to the rate at which we
> can reclaim inodes.
>
> This prevents the sorts of situations we were regularly seeing 7-8
> years ago where a freeze (for a snapshot) or unmount could take
> /hours/ because we have built up hundreds of thousands of dirty
> inodes in cache over a period of days because the slow SATA RAID6
> array can only do 50 write IOPS.....
>
> i.e. there are many, many good reasons for why XFS treats inode
> reclaim the way it does. I don't expect you to understand all the
> issues that we are preventing by throttling memory allocation like
> this, but I do expect you to respect the fact it's done for very
> good reasons.
>
>> With congested BDIs, kswapd will skip the IO and wait for progress
>> after running through a good chunk of pages.  With the xfs shrinker,
>> kswapd will synchronously wait for progress on a single FS, even if
>> there are dozens of other filesystems around.
>
> Yes, that can happen but, again, this behaviour indicates something
> else is wrong and the system is out of whack in some way.  In the
> normal case, kswapd will only block on IO long enough for the
> other async threads to run through inode writeback and clean inode
> reclaim as efficiently as possible. By the time that the shrinker
> wakes up after blocking on an IO, it should either have clean inodes
> to reclaim or nothing more to do.
>
> What you are reporting is equivalent to having pageout() run and do
> all the writeback (badly) instead of the bdi flusher threads doing
> all the writeback (efficiently). pageout() is a /worst case/
> behaviour we try very hard to avoid and when it occurs it is
> generally indicative of some other problem or imbalance. Same goes
> here for the inode shrinker.

Yes!  But the big difference is that pageout() already has a backoff for 
congestion.  The xfs shrinker doesn't.

>
>> For the xfs shrinker, the mechanism to throttle new dirty inodes on
>> a single FS is stalling every process in the system in direct
>> reclaim?
>
> We're not throttling dirty inodes - we are throttling memory
> allocation because that's the only hammer we have to prevent
> excessive buildup of dirty inodes that have already been
> allocated.
>
> We can't do that throttling when we dirty inodes because that
> happens in transaction context holding locked objects and blocking
> there waiting on inode writeback progress will cause journal
> deadlocks....
>
> This stuff is way more complex than just "have cache, will shrink".

Actually we can easily throttle after common metadata operations that 
dirty inodes.  There aren't many, and the throttling can be done after 
the transaction locks are dropped.

>
>> We don't think it's already solved in v4.8, but we're setting up a
>> test to confirm that.  I'm working on a better simulation of the
>> parts we're tripping over so I can model this outside of production.
>> I definitely agree that something is wrong in MM land too, we have
>> to clamp down on the dirty ratios much more than we should to keep
>> kswapd from calling pageout().
>
> Having pageout() run is pretty indicative of instantaneous memory
> demand being significantly higher than the IO throughput of the
> storage subsystem and the only reclaimable memory in the system
> being dirty filesystem caches. i.e. background writeback is not
> keeping up with memory demand and dirtying rates for some reason. The
> general rule of thumb is that if pageout() is occurring then
> the IO subsystem is about to die a horrible death of random IO.
>
> If you're taking great lengths to avoid pageout() from being called,
> then it's no surprise to me that your workload is, instead,
> triggering the equivalent "oh shit, we're in real trouble here"
> behaviour in XFS inode cache reclaim.  I also wonder, after turning
> down the dirty ratios, if you've done other typical writeback tuning
> tweaks like speeding up XFS's periodic metadata writeback to clean
> inodes faster in the absence of journal pressure.

No, we haven't.  I'm trying really hard to avoid the need for 50 billion 
tunables when the shrinkers are so clearly doing the wrong thing.

>
> It's detailed information like this that I've been asking for -
> there's good reasons I've asked for that information, and any
> further discussion will just be a waste of my time without all the
> details I've already asked for.

I'll send reproduction workloads.  It'll take a few days to nail them 
down, but it'll be much easier to talk about than all of hadoop.

-chris


* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-17 23:20           ` Chris Mason
@ 2016-10-18  2:03             ` Dave Chinner
  2016-11-14  1:00               ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-10-18  2:03 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Mon, Oct 17, 2016 at 07:20:56PM -0400, Chris Mason wrote:
> On 10/17/2016 06:30 PM, Dave Chinner wrote:
> >On Mon, Oct 17, 2016 at 09:30:05AM -0400, Chris Mason wrote:
> >What you are reporting is equivalent to having pageout() run and do
> >all the writeback (badly) instead of the bdi flusher threads doing
> >all the writeback (efficiently). pageout() is a /worst case/
> >behaviour we try very hard to avoid and when it occurs it is
> >generally indicative of some other problem or imbalance. Same goes
> >here for the inode shrinker.
> 
> Yes!  But the big difference is that pageout() already has a backoff
> for congestion.  The xfs shrinker doesn't.

pageout() is only ever called from kswapd context for file pages.
Hence applications hitting direct reclaim really hard will never
call pageout() directly - they'll skip over it and end up calling
congestion_wait() instead and at that point the page cache dirty
throttle and background writeback should take over.

This, however, breaks down when there are hundreds of direct
reclaimers because the pressure put on direct reclaim can exceed the
amount of cleaning work that can be done by the background threads
in the maximum congestion backoff period and there are other caches
that require IO to clean.

Direct reclaim then has no clean page cache pages to reclaim on each
LRU scan, so it effectively then transfers that excess pressure to
the shrinkers that require IO to reclaim.  If a shrinker hits a
similar "reclaim pressure > background cleaning rate" threshold,
then it will end up directly blocking on IO congestion, exactly as
you are describing.

Both direct reclaim and kswapd can get stuck in this because
shrinkers - unlike pageout() - are called from direct reclaim as
well as kswapd.  i.e. Shrinkers are exposed to unbound direct
reclaim pressure, pageout() writeback isn't. Hence shrinkers need to
handle unbound incoming concurrency without killing IO patterns,
without trashing the working set of objects they control, and they have
to - somehow - adequately throttle reclaim rates in times of
pressure overload.

Right now the XFS code uses IO submission to do that throttling.
Shrinkers have no higher layer throttling we can rely on here.
Blocking on congestion during IO submission is effectively no
different to calling congestion_wait() in the shrinker itself after
skipping a bunch of dirty inodes that we can't write because of
congestion. While we can do this, it doesn't change the fact that
shrinkers that do IO need to block callers to adequately control the
reclaim pressure being directed at them.

If the XFS background metadata writeback threads are doing their
work properly, shrinker reclaim should not be blocking on dirty
inodes.  However, for all I know right now the problem could be that
the background reclaimer is *working too well* and so leaving only
dirty inodes for the shrinkers to act on.....

IOWs, what we need to do *first* is to work out why there is so much
blocking occurring - we need to find the /root cause of the
blocking problem/ and once we've found that we can discuss potential
solutions.

[snip]

> >If you're taking great lengths to avoid pageout() from being called,
> >then it's no surprise to me that your workload is, instead,
> >triggering the equivalent "oh shit, we're in real trouble here"
> >behaviour in XFS inode cache reclaim.  I also wonder, after turning
> >down the dirty ratios, if you've done other typical writeback tuning
> >tweaks like speeding up XFS's periodic metadata writeback to clean
> >inodes faster in the absence of journal pressure.
> 
> No, we haven't.  I'm trying really hard to avoid the need for 50
> billion tunables when the shrinkers are so clearly doing the wrong
> thing.

XFS has *1* tunable that can change the behaviour of metadata
writeback. Please try it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-18  2:03             ` Dave Chinner
@ 2016-11-14  1:00               ` Chris Mason
  2016-11-14  7:27                 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2016-11-14  1:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:

[ Long stalls from xfs_reclaim_inodes_ag ]

>XFS has *1* tunable that can change the behaviour of metadata
>writeback. Please try it.

[ weeks go by, so this email is insanely long ]

Testing all of this was slow going because two of the three test boxes I 
had with the hadoop configuration started having hardware problems.  
The good news is that while I was adjusting the benchmark, we lined up 
access to a bunch of duplicate boxes, so I can now try ~20 different 
configurations in parallel.

My rough benchmark is here:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git

The command line I ended up using was:

simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i 60 -w 
300 -D 2 /hammer/*

I'll break down how I came up with that command line and the end of the 
email.

This isn't trying to be like hadoop, it's more trying to create the same 
VM pressures that hadoop does.  It's missing all of the network traffic, 
which is at least as much of the workload as disk.

The benchmark has a few parts:

Fill the drives to ~60-70% full.  This working set is reused run to run
Read things
Write things
Burn CPU -- it's important that kswapd has to compete for CPU slices
Allocate things

I'm pretending to be java here, so anything that could be done with 
one thread was done with 10 threads for good luck.

Stage one is to create a working set of files across all the drives.  We 
use 15 drives, and each drive has its own filesystem.  On my current 
boxes, the drives are 8TB each, but production can be as small as 4TB.  
The multiple FS part of the benchmark is important to reproduce what we 
see, but I doubt you need all 15.  If you want to try this out, I'm 
happy to match your config and help come up with a command line that 
reproduces for me.

While I was reproducing the problem, I was trying to avoid the one part 
of hadoop that bothers me the most.  It has one thread per filesystem 
that runs du on the whole FS every 10 minutes or so.  There are patches 
in upstream hadoop to use df instead, but ugh.  I think what's happening 
is that when the du thread kicks in, we get a lot of inodes that are 
tossed soon after.  This in turn means XFS_ICI_RECLAIM_TAG is set on a 
lot of ags, which makes it easier to trigger my stalls.

In hopes of simulating a less dumb workload, my simoop runs periodically 
stat all the files on only two of the 15 drives.  The idea is that a 
mixed workload slows the system down to the slowest drive.

We find the allocation stall counter in /proc/vmstat roughly matches 
poor performance overall.  So, the benchmark reads it from time to 
time and includes it in the output.
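
A rough sketch of one way to watch that rate by hand - on these
kernels the counter is the allocstall line in /proc/vmstat (the awk
sum also covers kernels that split it into per-zone counters):

  prev=$(awk '/^allocstall/ { s += $2 } END { print s }' /proc/vmstat)
  sleep 60
  cur=$(awk '/^allocstall/ { s += $2 } END { print s }' /proc/vmstat)
  echo "alloc stalls/sec: $(( (cur - prev) / 60 ))"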

Some example results from a plain v4.8 kernel.  This run had xfssyncd 
tuning left at the default, but I did crank it down last week without any 
real change in the numbers.  The bad news is it takes the runs a long 
time to stabilize, so I run them for at least an hour.

v4.8
___
Run time: 3623 seconds
Read latency (p50: 21,594,112) (p95: 31,621,120) (p99: 36,896,768)
Write latency (p50: 19,169,280) (p95: 31,096,832) (p99: 37,158,912)
Allocation latency (p50: 12,992,512) (p95: 28,934,144) (p99: 36,110,336)
work rate = 9.12/sec (avg 8.77/sec) (p50: 8.91) (p95: 9.11) (p99: 9.99)
alloc stall rate = 44.98/sec (avg: 41.22) (p50: 39.56) (p95: 41.20) (p99: 50.96)

In general, we care most about the p99 of the latencies.
Once you average things over the fleet, most users see the p99 often 
enough that it really hurts.  The latency times here are in usec, but 
read/write latencies are the time it took for N threads to start, do IO 
and return.

work rate is a made up number, and basically summarizes the combination 
of reading, writing, allocating and hogging the CPU.  For that, we'd use 
the avg or p50 number to measure production.

Comparing with the patch:
___
Run time: 3614 seconds
Read latency (p50: 18,644,992) (p95: 29,720,576) (p99: 34,930,688)
Write latency (p50: 12,173,312) (p95: 22,839,296) (p99: 29,851,648)
Allocation latency (p50: 9,322,496) (p95: 25,395,200) (p99: 30,834,688)
work rate = 10.82/sec (avg 11.16/sec) (p50: 10.90) (p95: 11.14) (p99: 11.30)
alloc stall rate = 28.05/sec (avg: 19.52) (p50: 23.56) (p95: 31.48) (p99: 36.12)

I also had a bpf probe running alongside the test to measure how long 
we're spending in xfs_reclaim_inodes_ag().  argdist.py comes with 
iovisor's bcc tools: https://github.com/iovisor/bcc.  I ran:

argdist.py -i 600 -H 'r::xfs_reclaim_inodes_ag():u64:$latency:$latency < 
56137438953472'

(v4.9-rc fixes a bug where the latencies go negative, so you have to 
ignore crazy numbers from v4.8 or older).  

Unpatched:

	       $latency                      : count     distribution
		   0 -> 1                    : 0        |                    |
		   2 -> 3                    : 0        |                    |
		   4 -> 7                    : 0        |                    |
		   8 -> 15                   : 0        |                    |
		  16 -> 31                   : 0        |                    |
		  32 -> 63                   : 0        |                    |
		  64 -> 127                  : 0        |                    |
		 128 -> 255                  : 0        |                    |
		 256 -> 511                  : 0        |                    |
		 512 -> 1023                 : 0        |                    |
		1024 -> 2047                 : 0        |                    |
		2048 -> 4095                 : 1        |                    |
		4096 -> 8191                 : 0        |                    |
		8192 -> 16383                : 6        |                    |
	       16384 -> 32767                : 27       |                    |
	       32768 -> 65535                : 35       |                    |
	       65536 -> 131071               : 80       |                    |
	      131072 -> 262143               : 249      |**                  |
	      262144 -> 524287               : 701      |*******             |
	      524288 -> 1048575              : 1761     |********************|
	     1048576 -> 2097151              : 634      |*******             |
	     2097152 -> 4194303              : 28       |                    |
	     4194304 -> 8388607              : 0        |                    |
	     8388608 -> 16777215             : 0        |                    |
	    16777216 -> 33554431             : 0        |                    |
	    33554432 -> 67108863             : 0        |                    |
	    67108864 -> 134217727            : 0        |                    |
	   134217728 -> 268435455            : 0        |                    |
	   268435456 -> 536870911            : 0        |                    |
	   536870912 -> 1073741823           : 43       |                    |
	  1073741824 -> 2147483647           : 23       |                    |
	  2147483648 -> 4294967295           : 23       |                    |
	  4294967296 -> 8589934591           : 25       |                    |
	  8589934592 -> 17179869183          : 24       |                    |
	 17179869184 -> 34359738367          : 2        |                    |
	 34359738368 -> 68719476735          : 6        |                    |

-i 600 means to collect data for 600 seconds and then print the 
histogram and reset.  The latencies are in nanoseconds, so v4.8 had six 
calls that lasted between 34 and 68 seconds.

Same stats for v4.8-patched:

[04:54:22]
	       $latency                      : count     distribution
		   0 -> 1                    : 0        |                    |
		   2 -> 3                    : 0        |                    |
		   4 -> 7                    : 0        |                    |
		   8 -> 15                   : 0        |                    |
		  16 -> 31                   : 0        |                    |
		  32 -> 63                   : 0        |                    |
		  64 -> 127                  : 0        |                    |
		 128 -> 255                  : 0        |                    |
		 256 -> 511                  : 0        |                    |
		 512 -> 1023                 : 0        |                    |
		1024 -> 2047                 : 1        |                    |
		2048 -> 4095                 : 8        |                    |
		4096 -> 8191                 : 9        |                    |
		8192 -> 16383                : 23       |                    |
	       16384 -> 32767                : 48       |                    |
	       32768 -> 65535                : 73       |                    |
	       65536 -> 131071               : 164      |                    |
	      131072 -> 262143               : 245      |                    |
	      262144 -> 524287               : 393      |                    |
	      524288 -> 1048575              : 857      |*                   |
	     1048576 -> 2097151              : 14980    |********************|
	     2097152 -> 4194303              : 287      |                    |
	     4194304 -> 8388607              : 1        |                    |
	     8388608 -> 16777215             : 2        |                    |
	    16777216 -> 33554431             : 11       |                    |
	    33554432 -> 67108863             : 10       |                    |
	    67108864 -> 134217727            : 8        |                    |
	   134217728 -> 268435455            : 0        |                    |
	   268435456 -> 536870911            : 0        |                    |
	   536870912 -> 1073741823           : 152      |                    |

There are consistently more calls in the histogram on the patched 
kernel, and consistently huge stalls (30 seconds plus) in the unpatched 
kernel.

I've already written a small novel here, but I'll talk about where I got 
the command line for simoop:

simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i 60 -w 
300 -D 2 /hammer/*

The box has 130GB of ram and two sockets (14 cores each).  The overall 
goals are:

95-99% user CPU
200MB-500MB/s reads and writes
Filesystems 60% full
Allocation stalls between 30 and 50/sec

Allocation stalls over 100 tend to oom the box eventually.

These are the most important knobs:

-t 512 worker threads
-m 190 is the per-thread working set ram (in MB).  It's allocated once 
per thread and stays in use the whole time.

-M 128 is the allocation we time for latencies, it's freed almost right 
away.

-C 28 threads that just burn cpus.

The rest tune how much IO is done (-R, -W) and how many subthreads do 
the IO (-T).  -w 300 gives us 300 seconds of warm up time before we 
reset the stats for the long haul.

-D 2 are the du threads.  I'm sorry.

I have a few more answers for why v3.10 in prod has so many fewer 
allocation stalls than v4.0+. It's partially because some slabs have 
gotten a little bigger.  It's also because netlink started using larger 
order allocations (this is an even better story than the du threads), 
and because our stacks jumped up to 16K contiguous.

My goal in posting all of this is to help find a way to put proper 
throttling on the slab so we don't have to do synchronous waiting and IO 
in the shrinker.

As far as I can tell, simoop roughly mirrors the benefits I get from my 
one liner in production, but this is v0.000001 so there is a lot of room 
for improvement.

 -chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-14  1:00               ` Chris Mason
@ 2016-11-14  7:27                 ` Dave Chinner
  2016-11-14 20:56                   ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-11-14  7:27 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote:
> On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:
> 
> [ Long stalls from xfs_reclaim_inodes_ag ]
> 
> >XFS has *1* tunable that can change the behaviour of metadata
> >writeback. Please try it.
> 
> [ weeks go by, so this email is insanely long ]
> 
> Testing all of this was slow going because two of the three test
> boxes I had with the hadoop configuration starting having hardware
> problems.  The good news is that while I was adjusting the
> benchmark, we lined up access to a bunch of duplicate boxes, so I
> can now try ~20 different configurations in parallel.
> 
> My rough benchmark is here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git
> 
> The command line I ended up using was:
> 
> simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i
> 60 -w 300 -D 2 /hammer/*

There's a lightly tested patch below that should do the trick.

After 5 minutes on a modified simoop cli on a single filesystem,
4.9-rc4+for-next:

$ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch
....
Run time: 300 seconds
Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168)
Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056)
Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800)
work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99)
alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12)

With the patch below:

Run time: 300 seconds
Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400)
Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752)
Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232)
work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93)
alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01)

Stall rate went to zero and stayed there at the 120s mark of the
warmup. Note the p99 difference for read and allocation latency,
too.

I'll post some graphs tomorrow from my live PCP telemetry that
demonstrate the difference in behaviour better than any words...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: reduce blocking in inode reclaim

From: Dave Chinner <dchinner@redhat.com>

When we already have maximum reclaim concurrency going on, stop the
caller from doing any more reclaim rather than blocking it.
Instead, transfer the scanning work to the next context that gets
access to the reclaim locks.

Further, make sure not to block kswapd inode reclaim on IO or locks.
This means it will always be able to move on to reclaim something else
rather than blocking and preventing reclaim from other filesystems.

Finally, change the blocking reclaim cases to follow the reclaim
cursor so that we don't restart from a position where inodes may
currently be under IO from a previous flush. This means we cycle
through all inodes in the AG before revisiting inodes that may now be
clean for reclaim.

This change does not appear to cause an overall performance regression
with pure dirty inode cache loads (such as fsmark inode creation)
and inode traversal workloads (find, rm -rf) on inode
collections much larger than can be cached in memory. The reclaim
pattern is noticeably lumpier, so there is still work to be done.

Under initial simoop testing, p99 read latencies are down by 70%,
p99 allocation latency is down by over 90% and allocation stalls
completely go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/super.c          |  8 +++++++-
 fs/xfs/xfs_icache.c | 59 +++++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index c183835566c1..9fff5630b12d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -102,8 +102,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	freed += prune_icache_sb(sb, sc);
 
 	if (fs_objects) {
+		unsigned long ret;
+
 		sc->nr_to_scan = fs_objects + 1;
-		freed += sb->s_op->free_cached_objects(sb, sc);
+		ret = sb->s_op->free_cached_objects(sb, sc);
+		if (ret == SHRINK_STOP)
+			freed = SHRINK_STOP;
+		else
+			freed += ret;
 	}
 
 	up_read(&sb->s_umount);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9c3e5c6ddf20..65c0f79f3edc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -975,7 +975,7 @@ xfs_reclaim_inode(
 	error = 0;
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
+		if (sync_mode & SYNC_TRYLOCK)
 			goto out;
 		xfs_iflock(ip);
 	}
@@ -987,7 +987,7 @@ xfs_reclaim_inode(
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
+		if (sync_mode & SYNC_TRYLOCK)
 			goto out_ifunlock;
 		xfs_iunpin_wait(ip);
 	}
@@ -1103,7 +1103,7 @@ xfs_reclaim_inode(
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
  */
-STATIC int
+STATIC long
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
@@ -1113,18 +1113,22 @@ xfs_reclaim_inodes_ag(
 	int			error = 0;
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
+	int			trylock;
 	int			skipped;
+	int			dirty_ags;
 
 restart:
+	trylock = flags & SYNC_TRYLOCK;
 	ag = 0;
 	skipped = 0;
+	dirty_ags = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
 		int		nr_found = 0;
 
 		ag = pag->pag_agno + 1;
+		dirty_ags++;
 
 		if (trylock) {
 			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
@@ -1132,10 +1136,16 @@ xfs_reclaim_inodes_ag(
 				xfs_perag_put(pag);
 				continue;
 			}
-			first_index = pag->pag_ici_reclaim_cursor;
 		} else
 			mutex_lock(&pag->pag_ici_reclaim_lock);
 
+		/*
+		 * Always start from the last scanned inode so that we don't
+		 * block on inodes that a previous iteration just flushed.
+		 * Iterate over the entire inode range before coming back to
+		 * skipped/dirty inodes.
+		 */
+		first_index = pag->pag_ici_reclaim_cursor;
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
@@ -1201,23 +1211,31 @@ xfs_reclaim_inodes_ag(
 
 		} while (nr_found && !done && *nr_to_scan > 0);
 
-		if (trylock && !done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
+		if (done)
+			first_index = 0;
+		pag->pag_ici_reclaim_cursor = first_index;
 		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
 
 	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
+	 * If we skipped all AGs because they are locked, we've reached maximum
+	 * reclaim concurrency. At this point there is no point in having more
+	 * attempts to shrink the cache from this context. Tell the shrinker to
+	 * stop and defer the reclaim work till later.
+	 */
+	if (skipped && skipped == dirty_ags)
+		return SHRINK_STOP;
+
+	/*
+	 * If we skipped any AG, and we still have scan count remaining, do
 	 * another pass this time using blocking reclaim semantics (i.e
 	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
 	 * ensure that when we get more reclaimers than AGs we block rather
 	 * than spin trying to execute reclaim.
 	 */
 	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
+		flags &= ~SYNC_TRYLOCK;
 		goto restart;
 	}
 	return last_error;
@@ -1229,8 +1247,12 @@ xfs_reclaim_inodes(
 	int		mode)
 {
 	int		nr_to_scan = INT_MAX;
+	long		ret;
 
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	ret = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	if (ret == SHRINK_STOP)
+		return 0;
+	return ret;
 }
 
 /*
@@ -1241,17 +1263,28 @@ xfs_reclaim_inodes(
  * reclaim of inodes. That means if we come across dirty inodes, we wait for
  * them to be cleaned, which we hope will not be very long due to the
  * background walker having already kicked the IO off on those dirty inodes.
+ *
+ * Also, treat kswapd specially - we really want it to run asynchronously and
+ * not block on dirty inodes, unlike direct reclaim that we can tolerate
+ * blocking and some amount of IO latency. If we start to overload the reclaim
+ * subsystem with too many direct reclaimers, it will start returning
+ * SHRINK_STOP to tell the mm subsystem to defer the work rather than continuing
+ * to call us and forcing us to block.
  */
 long
 xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
+	int			flags = SYNC_TRYLOCK;
+
 	/* kick background reclaimer and push the AIL */
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	if (!current_is_kswapd())
+		flags |= SYNC_WAIT;
+	return xfs_reclaim_inodes_ag(mp, flags, &nr_to_scan);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-14  7:27                 ` Dave Chinner
@ 2016-11-14 20:56                   ` Chris Mason
  2016-11-14 23:58                     ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2016-11-14 20:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 11/14/2016 02:27 AM, Dave Chinner wrote:
> On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote:
>> On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:
>>
>> [ Long stalls from xfs_reclaim_inodes_ag ]
>>
>>> XFS has *1* tunable that can change the behaviour of metadata
>>> writeback. Please try it.
>>
>> [ weeks go by, so this email is insanely long ]
>>
>> Testing all of this was slow going because two of the three test
>> boxes I had with the hadoop configuration starting having hardware
>> problems.  The good news is that while I was adjusting the
>> benchmark, we lined up access to a bunch of duplicate boxes, so I
>> can now try ~20 different configurations in parallel.
>>
>> My rough benchmark is here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git
>>
>> The command line I ended up using was:
>>
>> simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i
>> 60 -w 300 -D 2 /hammer/*
>
> There's a lightly tested patch below that should do the trick.
>
> After 5 minutes on a modified simoop cli on a single filesystem,
> 4.9-rc4+for-next:
>
> $ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch
> ....
> Run time: 300 seconds
> Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168)
> Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056)
> Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800)
> work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99)
> alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12)
>
> With the patch below:
>
> Run time: 300 seconds
> Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400)
> Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752)
> Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232)
> work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93)
> alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01)
>
> Stall rate went to zero and stayed there at the 120s mark of the
> warmup. Note the p99 difference for read and allocation latency,
> too.
>
> I'll post some graphs tomorrow from my live PCP telemetry that
> demonstrate the difference in behaviour better than any words...

Thanks Dave, this is definitely better.  But at least for the multi-disk 
setup, it's not quite there yet.

Your patch:
___
Run time: 15535 seconds
Read latency (p50: 22,708,224) (p95: 34,668,544) (p99: 41,746,432)
Write latency (p50: 21,200,896) (p95: 34,799,616) (p99: 41,877,504)
Allocation latency (p50: 11,550,720) (p95: 31,424,512) (p99: 39,518,208)
work rate = 7.48/sec (avg 8.41/sec) (p50: 8.69) (p95: 9.57) (p99: 9.87)
alloc stall rate = 14.08/sec (avg: 14.85) (p50: 15.74) (p95: 19.74) 
(p99: 22.04)

Original patch:
___
Run time: 15474 seconds
Read latency (p50: 20,414,464) (p95: 29,786,112) (p99: 34,275,328)
Write latency (p50: 15,155,200) (p95: 25,591,808) (p99: 31,621,120)
Allocation latency (p50: 7,675,904) (p95: 22,970,368) (p99: 29,523,968)
work rate = 8.33/sec (avg 10.54/sec) (p50: 10.54) (p95: 10.58) (p99: 10.58)
alloc stall rate = 37.08/sec (avg: 21.73) (p50: 23.16) (p95: 24.68) 
(p99: 25.00)

v4.8
___
Run time: 15492 seconds
Read latency (p50: 22,642,688) (p95: 35,848,192) (p99: 43,712,512)
Write latency (p50: 21,200,896) (p95: 35,454,976) (p99: 43,450,368)
Allocation latency (p50: 12,599,296) (p95: 34,144,256) (p99: 41,615,360)
work rate = 9.77/sec (avg 8.15/sec) (p50: 8.37) (p95: 9.29) (p99: 9.55)
alloc stall rate = 8.33/sec (avg: 33.65) (p50: 34.52) (p95: 37.40) (p99: 
37.96)

One thing that might have been too far buried in my email yesterday: 
the read/write latencies include the time to start threads, so it's not 
just IO in there.

I've had this running all day, but the differences stabilized after 5-10 
minutes.  Everyone's p99s trend higher as the day goes on, but the 
percentage difference stays about the same.

I think the difference between mine and yours is we didn't quite get the 
allocation stalls down to zero, so making tasks wait for the shrinker 
shows up in the end numbers.

For your patch, the stalls from xfs_reclaim_inodes_ag() were about the 
same as the unpatched kernel yesterday.  We still had long tails in the 
30 second+ category.

I did a trace on vmscan:mm_vmscan_direct_reclaim_begin, and 91% of the 
allocation stalls were:

order=0 may_writepage=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_ZERO 
classzone_idx=3

These are all page faults, either during read() syscalls or when simoop 
was touching all the pages in its mmap()'d area (-M from the cmdline).
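
A rough sketch of how a trace like that can be collected, assuming
trace-cmd is available (any ftrace front end works just as well):

  # record direct reclaim entries for ten minutes, then count the ones
  # that were allowed to write pages
  trace-cmd record -e vmscan:mm_vmscan_direct_reclaim_begin sleep 600
  trace-cmd report | grep -c 'may_writepage=1'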

One detail I didn't give yesterday was these are all using deadline as 
the IO scheduler.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-14 20:56                   ` Chris Mason
@ 2016-11-14 23:58                     ` Dave Chinner
  2016-11-15  3:09                       ` Chris Mason
  2016-11-15  5:54                       ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2016-11-14 23:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

[-- Attachment #1: Type: text/plain, Size: 11106 bytes --]

On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> On 11/14/2016 02:27 AM, Dave Chinner wrote:
> >On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote:
> >>On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:
> >>
> >>[ Long stalls from xfs_reclaim_inodes_ag ]
> >>
> >>>XFS has *1* tunable that can change the behaviour of metadata
> >>>writeback. Please try it.
> >>
> >>[ weeks go by, so this email is insanely long ]
> >>
> >>Testing all of this was slow going because two of the three test
> >>boxes I had with the hadoop configuration starting having hardware
> >>problems.  The good news is that while I was adjusting the
> >>benchmark, we lined up access to a bunch of duplicate boxes, so I
> >>can now try ~20 different configurations in parallel.
> >>
> >>My rough benchmark is here:
> >>
> >>git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git
> >>
> >>The command line I ended up using was:
> >>
> >>simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i
> >>60 -w 300 -D 2 /hammer/*
> >
> >There's a lightly tested patch below that should do the trick.
> >
> >After 5 minutes on a modified simoop cli on a single filesystem,
> >4.9-rc4+for-next:
> >
> >$ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch
> >....
> >Run time: 300 seconds
> >Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168)
> >Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056)
> >Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800)
> >work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99)
> >alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12)
> >
> >With the patch below:
> >
> >Run time: 300 seconds
> >Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400)
> >Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752)
> >Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232)
> >work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93)
> >alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01)
> >
> >Stall rate went to zero and stayed there at the 120s mark of the
> >warmup. Note the p99 difference for read and allocation latency,
> >too.
> >
> >I'll post some graphs tomorrow from my live PCP telemetry that
> >demonstrate the difference in behaviour better than any words...
> 
> Thanks Dave, this is definitely better.  But at least for the
> multi-disk setup, it's not quite there yet.

I can see why, now that I've run the test for half an hour to get a
decent idea of steady state behaviour. I'll come back to that....

> Your patch:
> ___
> Run time: 15535 seconds
> Read latency (p50: 22,708,224) (p95: 34,668,544) (p99: 41,746,432)
> Write latency (p50: 21,200,896) (p95: 34,799,616) (p99: 41,877,504)
> Allocation latency (p50: 11,550,720) (p95: 31,424,512) (p99: 39,518,208)
> work rate = 7.48/sec (avg 8.41/sec) (p50: 8.69) (p95: 9.57) (p99: 9.87)
> alloc stall rate = 14.08/sec (avg: 14.85) (p50: 15.74) (p95: 19.74)
> (p99: 22.04)
> 
> Original patch:
> ___
> Run time: 15474 seconds
> Read latency (p50: 20,414,464) (p95: 29,786,112) (p99: 34,275,328)
> Write latency (p50: 15,155,200) (p95: 25,591,808) (p99: 31,621,120)
> Allocation latency (p50: 7,675,904) (p95: 22,970,368) (p99: 29,523,968)
> work rate = 8.33/sec (avg 10.54/sec) (p50: 10.54) (p95: 10.58) (p99: 10.58)
> alloc stall rate = 37.08/sec (avg: 21.73) (p50: 23.16) (p95: 24.68)
> (p99: 25.00)

Ok, so stall rate is down, but the latencies are still garbage
because stalls are happening.

> I've had this running all day, but the differences stabilized after
> 5-10 minutes.  Everyone's p99s trend higher as the day goes on, but
> the percentage difference stays about the same.

Right - they stabilise after the first few stalls because they
dominate the p99 numbers. For example, with my patch, I saw the
first stall event between 1080 and 1140 seconds - you can see this
clearly in the 3rd chart:

Run time: 1080 seconds
Read latency (p50: 3,035,136) (p95: 4,284,416) (p99: 5,349,376)
Write latency (p50: 27,623,424) (p95: 31,031,296) (p99: 45,154,304)
Allocation latency (p50: 240,384) (p95: 709,632) (p99: 1,308,672)
work rate = 3.75/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
alloc stall rate = 0.65/sec (avg: 0.04) (p50: 0.00) (p95: 0.03) (p99: 0.03)
du thread is done /mnt/scratch
___
Run time: 1140 seconds
Read latency (p50: 3,035,136) (p95: 4,415,488) (p99: 6,119,424)
Write latency (p50: 27,557,888) (p95: 31,490,048) (p99: 45,285,376)
Allocation latency (p50: 247,552) (p95: 1,497,088) (p99: 19,496,960)
work rate = 3.68/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
alloc stall rate = 1.65/sec (avg: 0.12) (p50: 0.00) (p95: 0.12) (p99: 0.12)
du thread is running /mnt/scratch

So a single "stall event" blows out the p99 latencies really badly.
This is probably the single most important revelation about this
so far...

> I think the difference between mine and yours is we didn't quite get
> the allocation stalls down to zero, so making tasks wait for the
> shrinker shows up in the end numbers.

Right, but so far we haven't answered the obvious question: what
triggers the stall events?

> For your patch, the stalls from xfs_reclaim_inodes_ag() were about
> the same as the unpatched kernel yesterday.  We still had long tails
> in the 30 second+ category.

Because when a stall happens, it's doing more than just trashing the
inode cache. The 4 charts I've attached in a tarball are:

Chart 1: unpatched kernel, inode/dentry/xfs_inode cache usage
Chart 2: unpatched kernel, memory usage
Chart 3: patched kernel, inode/dentry/xfs_inode cache usage
Chart 4: patched kernel, memory usage

These are taken with a 5s sample period, so we're looking at overall
trends, not fine grained behaviour. Things to note:

	- discontinuities in the graphs indicate an unresponsive
	  machine and samples were lost. Good sign that the machine
	  is stalling on memory allocation
	- dentry/vfs inode counts come from
	  /proc/sys/fs/{dentry,inode}_state
	- memory usage chart: "other" is userspace and kernel memory
	  other than the page cache ("cached").
	- cached + other + free = 16GB (all memory)
	- "slab"  is a subset of "other", but is split out above
	  16GB so i can easily see salb cache memory usage
	- "dirty" is a subset of "cached", but is split out so I can
	  see how much of the page cache is dirty.

Now, with that in mind, let's compare the graphs. Let's start with
the inode/dentry/xfs_inode cache usage. We can see that the
unpatched kernel has rapid rises and falls in cache footprint. The
vertical falling edges are evidence of allocation stalls. You can
see that it's somewhat regular - the lines stay roughly level, then
we see a deep, sharp valley where the caches are completely freed
and then immediately repopulate.

This generally co-incides with periodic "du scan" starting - it adds
new memory pressure, tipping the kernel into memory reclaim, which
immediately stalls and trashes the cache. Given that there is now
lots of memory free, the du runs at several thousand IOPS (SSDs,
remember) and immediately refills the cache and finishes, producing
no more memory pressure. The cache then sits at the "full" level
until du starts again and we fall off the cliff again.

If we switch to the memory usage graph for the unpatched kernel
(chart 2), we can see that allocation stalls trash more than just
the slab caches - they clean all memory and free all the page cache,
too. i.e. it completely trashes any working set that we might have.
This is not caused by the shrinker - it has no effect on page cache
reclaim. It appears that the page reclaim code is, for some reason,
aggressively freeing the page cache for multiple seconds, even after
there is already noticeable free memory. This is probably an issue
with there being too much direct reclaim stalling due to pressure
(e.g. on dirty pages), and then when the cork is pulled we've got
lots of threads that then run a full direct reclaim scan and that
empties the caches...

/me has always said unbound concurrency in memory reclaim and IO
writeback is a bad thing because you cannot control and limit
behaviour in overload situations like this.

Anyway, let's look at the patched kernel inode cache behaviour -
chart 3. The first thing to note is that there is a very different
pattern - there's a very regular triangular waveform. The trough
occurs when the du thread starts, and the peak occurs when the du
thread stops. IOWs, the inode cache is no longer retaining single
use inodes in the cache for extended periods. The waveform is neat
and regular, and it's very clear where the allocation stall at
~1100s occurred - there's a reclaim cliff.

If we look at chart 4 right now - the patched kernel memory usage -
we can see there is a big dip in the page cache, too. So when an
allocation stall event occurs, the bad behaviour is still there.

However, the way the system responds to a stall event is very
different - it doesn't immediately refill the inode cache. It goes
back to the triangle waveform until the next stall occurs.

What is particularly interesting in chart 3 is the vfs.inodes.count
and xfs_inode slab metrics (red and pink lines respectively). The
active xfs_inode slab object count starts off below the vfs inode
count, then over 6 cycles the xfs_inode count slowly increases above
the vfs inode count. Then we get an allocation stall event, and it
returns the xfs_inode count to less than the vfs inode count. The
second stall event occurs at a similar point.

The xfs_inode > vfs inode count values indicate that we've got XFS
inodes in the reclaim state and that reclaim of XFS inodes is not
keeping up with the VFS inode cache reclaim rate. This is indicative
of either dirty inodes being reclaimed, or we are not putting enough
pressure on the XFS inode cache and so it's putting additional
pressure on other parts of memory reclaim. This change in balance
seems to correlate to allocation stalls being triggered, so I'm
going to try to remove that imbalance and see if that avoids the
allocation stalls.

IOWs, I'm not using latency stats to guide me - they tell me nothing
useful about what causes the latency. The charts tell me when there
is a reclaim issue - I can see it clear as anything.  If the graphs
are smooth, maintain a consistent relationship to each other and
behave nicely when there are step changes in behaviour then
everything else will take care of itself.

It's easy to get lost in creating fancy histograms with shiny new
tools like ebpf/bcc, but this is mostly wasted effort when it comes
to finding the cause of the problem. Latency histograms don't help
us understand why the latencies occur, nor do they help track down
the cause. The charts I have show that there is an obvious, one-off,
large scale memory reclaim event that is causing the stalls and bad
latencies. Finding the trigger for that large scale event is what
matters, not pretty histograms of its effect...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

[-- Attachment #2: charts.tar.gz --]
[-- Type: application/gzip, Size: 81821 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-14 23:58                     ` Dave Chinner
@ 2016-11-15  3:09                       ` Chris Mason
  2016-11-15  5:54                       ` Dave Chinner
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Mason @ 2016-11-15  3:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 11/14/2016 06:58 PM, Dave Chinner wrote:
> On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
>> On 11/14/2016 02:27 AM, Dave Chinner wrote:
>>> On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote:
>>>> On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:
>>>>
>>>> [ Long stalls from xfs_reclaim_inodes_ag ]
>>>>
>>>>> XFS has *1* tunable that can change the behaviour of metadata
>>>>> writeback. Please try it.
>>>>
>>>> [ weeks go by, so this email is insanely long ]
>>>>
>>>> Testing all of this was slow going because two of the three test
>>>> boxes I had with the hadoop configuration starting having hardware
>>>> problems.  The good news is that while I was adjusting the
>>>> benchmark, we lined up access to a bunch of duplicate boxes, so I
>>>> can now try ~20 different configurations in parallel.
>>>>
>>>> My rough benchmark is here:
>>>>
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git
>>>>
>>>> The command line I ended up using was:
>>>>
>>>> simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i
>>>> 60 -w 300 -D 2 /hammer/*
>>>
>>> There's a lightly tested patch below that should do the trick.
>>>
>>> After 5 minutes on a modified simoop cli on a single filesystem,
>>> 4.9-rc4+for-next:
>>>
>>> $ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch
>>> ....
>>> Run time: 300 seconds
>>> Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168)
>>> Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056)
>>> Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800)
>>> work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99)
>>> alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12)
>>>
>>> With the patch below:
>>>
>>> Run time: 300 seconds
>>> Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400)
>>> Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752)
>>> Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232)
>>> work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93)
>>> alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01)
>>>
>>> Stall rate went to zero and stayed there at the 120s mark of the
>>> warmup. Note the p99 difference for read and allocation latency,
>>> too.
>>>
>>> I'll post some graphs tomorrow from my live PCP telemetry that
>>> demonstrate the difference in behaviour better than any words...
>>
>> Thanks Dave, this is definitely better.  But at least for the
>> multi-disk setup, it's not quite there yet.
>
> I can see why now I've run the test for half an hour to get a decent
> idea of steady state behaviour. I'll come back to that....
>
>> Your patch:
>> ___
>> Run time: 15535 seconds
>> Read latency (p50: 22,708,224) (p95: 34,668,544) (p99: 41,746,432)
>> Write latency (p50: 21,200,896) (p95: 34,799,616) (p99: 41,877,504)
>> Allocation latency (p50: 11,550,720) (p95: 31,424,512) (p99: 39,518,208)
>> work rate = 7.48/sec (avg 8.41/sec) (p50: 8.69) (p95: 9.57) (p99: 9.87)
>> alloc stall rate = 14.08/sec (avg: 14.85) (p50: 15.74) (p95: 19.74)
>> (p99: 22.04)
>>
>> Original patch:
>> ___
>> Run time: 15474 seconds
>> Read latency (p50: 20,414,464) (p95: 29,786,112) (p99: 34,275,328)
>> Write latency (p50: 15,155,200) (p95: 25,591,808) (p99: 31,621,120)
>> Allocation latency (p50: 7,675,904) (p95: 22,970,368) (p99: 29,523,968)
>> work rate = 8.33/sec (avg 10.54/sec) (p50: 10.54) (p95: 10.58) (p99: 10.58)
>> alloc stall rate = 37.08/sec (avg: 21.73) (p50: 23.16) (p95: 24.68)
>> (p99: 25.00)
>
> Ok, so stall rate is down, but the latencies are still garbage
> because stalls are happening.
>
>> I've had this running all day, but the differences stabilized after
>> 5-10 minutes.  Everyone's p99s trend higher as the day goes on, but
>> the percentage difference stays about the same.
>
> Right - they stabilise after the first few stalls because they
> dominate the p99 numbers. For example, with my patch, I saw the
> first stall event between 1080 and 1140 seconds - you can see this
> clearly in the 3rd chart:

Yes, I went back and forth a bunch of times between zeroing out the 
p50/p95/p99 periodically or leaving them to gather for the whole run. 
In general for production we average results over a whole day, so I 
ended up letting them gather.

>
> Run time: 1080 seconds
> Read latency (p50: 3,035,136) (p95: 4,284,416) (p99: 5,349,376)
> Write latency (p50: 27,623,424) (p95: 31,031,296) (p99: 45,154,304)
> Allocation latency (p50: 240,384) (p95: 709,632) (p99: 1,308,672)
> work rate = 3.75/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
> alloc stall rate = 0.65/sec (avg: 0.04) (p50: 0.00) (p95: 0.03) (p99: 0.03)
> du thread is done /mnt/scratch
> ___
> Run time: 1140 seconds
> Read latency (p50: 3,035,136) (p95: 4,415,488) (p99: 6,119,424)
> Write latency (p50: 27,557,888) (p95: 31,490,048) (p99: 45,285,376)
> Allocation latency (p50: 247,552) (p95: 1,497,088) (p99: 19,496,960)
> work rate = 3.68/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
> alloc stall rate = 1.65/sec (avg: 0.12) (p50: 0.00) (p95: 0.12) (p99: 0.12)
> du thread is running /mnt/scratch
>
> So a single "stall event" blows out the p99 latencies really badly.
> This is probably the single most important revelation about this
> so far...

Depending on the workload, an allocation stall rate lower than 5 is 
fine, but some tiers notice anything > 0.  If we got hadoop down to 
0 stalls, they would almost certainly thank us for all the hard work, 
and then add more load.

>
>> I think the difference between mine and yours is we didn't quite get
>> the allocation stalls down to zero, so making tasks wait for the
>> shrinker shows up in the end numbers.
>
> Right, but so far we haven't answered the obvious question: what
> triggers the stall events?
>
>> For your patch, the stalls from xfs_reclaim_inodes_ag() were about
>> the same as the unpatched kernel yesterday.  We still had long tails
>> in the 30 second+ category.
>
> Because when a stall happens, it's doing more than just trashing the
> inode cache. The 4 charts I've attached in a tarball are:

We have variations for these charts in prod, but my test boxes aren't 
wired into the collection system for some of the stats yet.    We 
usually don't have the granularity of individual slabs, but it can be 
added if needed.

Many proc files do large order allocations when you open them, so they 
can skew the results.  For the next merge window, I have a patch to make 
seq_buf_alloc() fall back into vmalloc without doing direct reclaim.

The page cache on my boxes in vmstat stays relatively constant.  I think 
the huge amount of ram on these machines is soaking up the slop.  But I 
do have a more regular series of stalls (20/sec) than you do, and if I 
lightened up the workload a bit, I might see your peaks and valleys.

The size of the XFS inode cache does vary pretty dramatically, but 
that's not too surprising with the du threads (sorry).

What I found in prod is that generally the allocation stalls are tied to 
kswapd being in the D state.  Whenever they spike, it spits out a stack 
trace of being stuck <somewhere>.  Once it wasn't stuck in slab 
shrinkers, the next step was being stuck in writepage().

Johannes had a patch to change how eager kswapd was to writepage(). 
While that gets bashed out upstream, I just dropped our dirty ratios 
down to saner levels.  This brought our stalls down from ~10-100/sec to 
<1-2.
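
For reference, the knobs in question are the standard vm writeback
thresholds; the values below are only illustrative, not what we
actually run:

  # start background writeback earlier and cap dirty page cache lower,
  # so reclaim finds clean pages instead of waiting on writeback
  sysctl -w vm.dirty_background_ratio=1
  sysctl -w vm.dirty_ratio=5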

It's not perfect - our write iops and read iops both went up - but the 
application-visible performance was higher, so it's good enough for now. 
We also saw cases where kswapd was starved for CPU, or just sitting 
around and waiting for congested block devices to make progress.

The other comment I'll make is this workload is all spindles.  We 
definitely have flash based workloads hit by this too, but not to the 
extent these huge drives are.

I definitely agree about the value of the graphs in addition to the bpf 
probes.  The bpf histogram answer the question of "is it fixed yet" 
while the graphs help understand the rest of what's going on.  The 
biggest place bpf helps is being able to say: show me all the stack 
traces whenever kswapd is stuck longer than X usec

From the iovisor git repo, you can try:
  offcputime.py -m 3000 -p 365 60

where 365 is the pid of one of your kswapds.  The answer it spits out is 
a mixture of writepage() and cond_resched().   Bumping the prio of 
kswapd and limiting dirty pages cuts my stall rate down from 20/sec to much 
lower.  We'll have to see what this looks like in the morning, but going 
back to the first parts of the run with my one liner patch:

~600 seconds
alloc stall rate = 32.08/sec (avg: 22.36) (p50: 22.36) (p95: 24.12) 
(p99: 24.12)

vs same kernel but with dirty ratios lower and kswapd prio bumped:

alloc stall rate = 3.85/sec (avg: 6.50) (p50: 7.07) (p95: 9.85) (p99: 9.85)

Thanks for looking at this.  I'm happy to try patches and dig up 
analysis from specific areas where you want to see stats from real 
workloads.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-14 23:58                     ` Dave Chinner
  2016-11-15  3:09                       ` Chris Mason
@ 2016-11-15  5:54                       ` Dave Chinner
  2016-11-15 19:00                         ` Chris Mason
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-11-15  5:54 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
> On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
.....
> So a single "stall event" blows out the p99 latencies really badly.
> This is probably the single most important revelation about this
> so far...
> 
> > I think the difference between mine and yours is we didn't quite get
> > the allocation stalls down to zero, so making tasks wait for the
> > shrinker shows up in the end numbers.
> 
> Right, but so far we haven't answered the obvious question: what
> triggers the stall events?

Our prime suspicion in all this has been that blocking on dirty
inodes has been preventing the XFS inode cache shrinker from making
progress. That does not appear to be the case at all here. From
a half hour sample of my local workload:


    Inode Clustering
      xs_iflush_count.......         20119    <<<<<<
      xs_icluster_flushcnt..         20000
      xs_icluster_flushinode        440411
    Vnode Statistics
      vn_active.............        130903
      vn_alloc..............             0
      vn_get................             0
      vn_hold...............             0
      vn_rele...............       1217355
      vn_reclaim............       1217355   <<<<<<<
      vn_remove.............       1217355

There have been 1.2 million inodes reclaimed from the cache, but
there have only been 20,000 dirty inode buffer writes. Yes, that has
written back 440,000 dirty inodes - the inode write clustering is
capturing about 22 inodes per write - but the inode writeback load
is minimal at about 10 IO/s. XFS inode reclaim is not blocking
significantly on dirty inodes.
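
Those counters come from the global XFS stats file; a minimal way to
sample them, assuming the usual /proc/fs/xfs/stat layout where the
"icluster" line carries the iflush counters and the "vnodes" line
carries the vn_* counters:

  # grab a snapshot at the start and end of the run and diff the two
  grep -E '^(icluster|vnodes)' /proc/fs/xfs/stat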

.....

> What is particularly interesting in chart 3 is the vfs.inodes.count
> and xfs_inode slab metrics (red and pink lines respectively). The
> active xfs_inode slab objects starts off below the vfs inode count,
> then over time 6 cycles xfs_inode count slowly increases above th
> vfs inode count. Then we get an allocation stall event, and it
> returns the xfs_inode count to less than the vfs inode count. The
> second stall even occurs at a similar point.

Ok, this behaviour is not relevant. It's a result of inode cache
scan cycling behaviour that I added where skipped inodes are not
immediately revisited by SYNC_WAIT scans.

Ok, the more I look into this, the less I think this is an XFS inode
cache shrinker blocking problem. These gigantic "allocation stall"
are a complete and utter mess at the vmscan level.

I caught a trace in a vanilla 4.9-rc4 kernel where a stall event
trashed 8GB of page cache and ~60,000 inodes in about 2s flat. This
has nothing to do with XFS shrinkers blocking, and everything to do
with the mm subsystem going apeshit.
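
For completeness, a rough sketch of how this sort of trace can be
captured straight from the tracefs interface (paths assume debugfs
mounted at /sys/kernel/debug):

  cd /sys/kernel/debug/tracing
  echo 1 > events/vmscan/enable     # mm_vmscan_* and mm_shrink_slab_* events
  echo 1 > tracing_on
  cat trace_pipe > /tmp/vmscan.trace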

Let's start here at 8501.5s, where kswapd suddenly kicks in and
starts scanning - the first invocation of the XFS shrinker for
node3 reports:

kswapd3-861   [015]  8501.546421: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 552 gfp_flags GFP_KERNEL pgs_scanned 45 lru_pgs 198175 cache items 9197 delta 4 total_scan 556

the page cache has ~200k pages, and the inode cache 9000 objects.
kswapd didn't scan many pages, didn't add much delta to the shrinker.
kswapd keeps running, not finding any pages to reclaim and escalating
priority, running the shrinker on each escalation:

kswapd3-861   [015]  8501.548073: mm_vmscan_lru_shrink_inactive: nid=3 nr_scanned=22 nr_reclaimed=22 priority=8 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
kswapd3-861   [015]  8501.548090: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 613 gfp_flags GFP_KERNEL pgs_scanned 726 lru_pgs 197537 cache items 9197 delta 67 total_scan 680

By priority 8, we can see the pages scanned and the shrinker delta
going up, but no significant reclaim progress has been made.

Priority 7:

kswapd3-861   [015]  8501.549347: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 680 gfp_flags GFP_KERNEL pgs_scanned 1448 lru_pgs 196821 cache items 9197 delta 135 total_scan 815

Still no shrinker work, and kswapd stops here, having reached its
thresholds. Shortly after, it starts again:

kswapd3-861   [015]  8501.571629: mm_vmscan_kswapd_wake: nid=3 zid=3 order=0

Priority winds up to 8:

kswapd3-861   [015]  8501.572660: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 876 gfp_flags GFP_KERNEL pgs_scanned 718 lru_pgs 195330 cache items 9197 delta 67 total_scan 943

But still not over the shrinker threshold. At priority 7, it kicks
over the batch threshold and we free 362 cached objects:

kswapd3-861   [015]  8501.573640: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 943 gfp_flags GFP_KERNEL pgs_scanned 1430 lru_pgs 194619 cache items 9197 delta 135 total_scan 1078
kswapd3-861   [015]  8501.577927: mm_shrink_slab_end:   super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 unused scan count 943 new scan count 54 total_scan 54 last shrinker return val 362

Again, kswapd stops here, to be woken about 10ms later, winds up to
priority 7, ends with lru_pgs 192292 cache items 8779

kswapd3-861   [015]  8501.586501: mm_vmscan_kswapd_wake: nid=3 zid=3 order=0

So after another 10 wakeups and P7 reclaims, we are at:

kswapd3-861   [015]  8501.711927: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 845 gfp_flags GFP_KERNEL pgs_scanned 1220 lru_pgs 167629 cache items 8400 delta 122 total_scan 967

This pattern continues, freeing pages and winding up priority:

kswapd3-861   [015]  8501.763802: mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0 nr_requested=32 nr_scanned=32 nr_taken=32 file=1
kswapd3-861   [015]  8501.799755: mm_vmscan_lru_shrink_inactive: nid=3 nr_scanned=32 nr_reclaimed=32 priority=5 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
kswapd3-861   [015]  8501.899717: mm_vmscan_lru_shrink_inactive: nid=3 nr_scanned=32 nr_reclaimed=32 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

by the time we get to priority 4, we've lost half the page cache and
a third of the inode cache: lru_pgs 105911 cache items 6802. kswapd
keeps being started and winding up to priority 4, where after another
200ms:

kswapd3-861   [015]  8502.093850: mm_vmscan_lru_shrink_inactive: nid=3 nr_scanned=32 nr_reclaimed=32 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
....
kswapd3-861   [015]  8502.096649: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 850 gfp_flags GFP_KERNEL pgs_scanned 3104 lru_pgs 75733 cache items 4596 delta 376 total_scan 1226

This inode shrinker invocation blocks for the first time, for:

kswapd3-861   [015]  8502.423233: mm_shrink_slab_end:   super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 unused scan count 850 new scan count 309 total_scan 202 last shrinker return val 628

About 300ms, because the cache is now really small and we were
asked to scan more than a quarter of it.  At this point we've lost 60%
of the page cache and half the inode cache.

kswapd continues to be run and wind up the reclaim pressure. It's
at priority 5 next time the XFS inode cache shrinker is run:

kswapd3-861   [015]  8502.441241: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 920 gfp_flags GFP_KERNEL pgs_scanned 1754 lru_pgs 66784 cache items 4417 delta 232 total_scan 1152
 kswapd3-861   [015]  8502.706469: mm_shrink_slab_end:   super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 unused scan count 920 new scan count 437 total_scan 128 last shrinker return val 668

Again, we block for 250ms. kswapd then winds up to priority 4 but
doesn't quite trigger the shrinker. It then runs again, triggers the
shrinker at priority 6, blocks for 400ms, now down to:

	8502.815198: lru_pgs 60245 cache items 4078

kswapd continues, winds up to prio 4, triggers the shrinker again, blocks
for 100ms, and by this time we're at lru_pgs 50169 cache items 2972.
We've lost 75% of the page cache and 70% of the inode cache.

kswapd is woken again immediately, runs to prio 5, no blocking, no
changes in cache footprint. Then:

kswapd3-861   [015]  8503.318575: mm_vmscan_kswapd_wake: nid=3 zid=3 order=0
....
kswapd3-861   [015]  8503.327057: mm_vmscan_lru_shrink_inactive: nid=3 nr_scanned=32 nr_reclaimed=32 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

We ramp up to priority 3 and by this stage we've done serious damage
to the inode caches:

kswapd3-861   [015]  8503.330165: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 72 gfp_flags GFP_KERNEL pgs_scanned 3104 lru_pgs 43054 cache items 2007 delta 289 total_scan 361

And we keep scanning at prio 3 putting serious pressure on the page
cache, and then it happens:

kswapd3-861   [015]  8503.333975: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 361 gfp_flags GFP_KERNEL pgs_scanned 3104 lru_pgs 39969 cache items 2067 delta 321 total_scan 682
....
kswapd3-861   [015]  8503.337100: mm_shrink_slab_start: super_cache_scan+0x0 0xffff88023abfc4c0: nid: 3 objects to shrink 682 gfp_flags GFP_KERNEL pgs_scanned 3104 lru_pgs 36888 cache items 2661 delta 447 total_scan 1129
....
 kswapd3-861   [015]  8503.344760: mm_vmscan_lru_shrink_inactive: nid=3 nr_scanned=32 nr_reclaimed=0 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
 kswapd3-861   [015]  8503.344766: mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0 nr_requested=32 nr_scanned=32 nr_taken=32 file=1
 kswapd3-861   [015]  8503.344788: mm_vmscan_writepage:  page=0xffffea00003415a5 pfn=3413413 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
 kswapd3-861   [015]  8503.344793: mm_vmscan_writepage:  page=0xffffea00003bcd44 pfn=3919172 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

Yup, memory reclaim starts dirty page writeback. From this point,
there are ~10,000 pages written by memory reclaim over the next
700ms. This is how close we get to kicking off the OOM killer - we hit prio 1:

kswapd3-861   [015]  8503.935138: mm_vmscan_lru_shrink_inactive: nid=3 nr_scanned=32 nr_reclaimed=32 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

The XFS inode shrinker blocking plays no significant part in this
series of events. Yes, it contributes to /reclaim latency/, but it
is not the cause of the catastrophic breakdown that results in
kswapd emptying the page cache and the slab caches to accommodate the
memory demand coming from userspace. We can fix the blocking
problems with the XFS shrinker, but it's not the shrinker's job to
stop this overload situation from happening.

Indeed, I just ran the workload with my patch and captured an alloc
stall in the same manner with the same tracing. It has the same
"kswapd keeps being run and escalating reclaim until there's nothing
left to reclaim" behaviour. kswapd never blocks in the XFS inode
shrinker now, so the allocation latencies are all from direct
reclaim doing work, which is exactly as it should be.

The fact that we are seeing dirty page writeback from kswapd
indicates that the memory pressure this workload generates from
userspace is not being adequately throttled in
throttle_direct_reclaim() to allow dirty writeback to be done in an
efficient and timely manner. The memory reclaim throttling needs to
back off more in overload situations like this - we need to slow
down the incoming demand to the reclaim rate rather than just
increasing pressure and hoping that kswapd doesn't burn up in a ball
of OOM....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-15  5:54                       ` Dave Chinner
@ 2016-11-15 19:00                         ` Chris Mason
  2016-11-16  1:30                           ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2016-11-15 19:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 11/15/2016 12:54 AM, Dave Chinner wrote:
> On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
>> On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> .....
>> So a single "stall event" blows out the p99 latencies really badly.
>> This is probably the single most important revelation about this
>> so far...
>>
>>> I think the difference between mine and yours is we didn't quite get
>>> the allocation stalls down to zero, so making tasks wait for the
>>> shrinker shows up in the end numbers.
>>
>> Right, but so far we haven't answered the obvious question: what
>> triggers the stall events?
>
> Our prime suspicion in all this has been that blocking on dirty
> inodes has been preventing the XFS inode cache shrinker from making
> progress. That does not appear to be the case at all here. From
> a half hour sample of my local workload:
>
>
>     Inode Clustering
>       xs_iflush_count.......         20119    <<<<<<
>       xs_icluster_flushcnt..         20000
>       xs_icluster_flushinode        440411
>     Vnode Statistics
>       vn_active.............        130903
>       vn_alloc..............             0
>       vn_get................             0
>       vn_hold...............             0
>       vn_rele...............       1217355
>       vn_reclaim............       1217355   <<<<<<<
>       vn_remove.............       1217355
>
> There have been 1.2 million inodes reclaimed from the cache, but
> there have only been 20,000 dirty inode buffer writes. Yes, that's
> written 440,000 dirty inodes - the inode write clustering is
> capturing about 22 inodes per write - but the inode writeback load
> is minimal at about 10 IO/s. XFS inode reclaim is not blocking
> significantly on dirty inodes.

I think our machines are different enough that we're not seeing the same 
problems.  Or at least we're seeing different sides of the problem.

We have 130GB of ram and on average about 300-500MB of XFS slab, total 
across all 15 filesystems.  Your inodes are small and cuddly, and I'd 
rather have more than less.  I see more with simoop than we see in prod, 
but either way it's a reasonable percentage of system ram considering the 
horrible things being done.

Both patched (yours or mine) and unpatched, XFS inode reclaim is keeping 
up.   With my patch in place, tracing during simoop does show more 
kswapd prio=1 scanning than unpatched, so I'm clearly stretching the 
limits a little more.  But we've got 30+ days of uptime in prod on 
almost 60 machines.  The oom rate is roughly in line with v3.10, and 
miles better than v4.0.

One other difference is that I have 15 filesystems being shrunk in 
series, so my chances of hitting a long stall are much higher.  If I let 
bpf tracing on shrink_slab() latencies run for 600 seconds, I see v4.8 
called shrink_slab roughly 43,000 times.

It spent a total of 244 wall clock seconds in the shrinkers, giving us 
an average of .0056 seconds per call.  But 67 of those calls consumed 
243 of those seconds; the other second of wall time was spread over the 
other 42,000+ calls.

On your machine, you can see the VM wreck your page cache and inode 
caches during the run.  This isn't too surprising because the pages are 
really use-once.  We read a random file, we write a random file, and we 
should expect these pages to be gone before we ever use them again.  The 
real page cache working set of the load is:

number of workers * number of IO threads * 1MB or so for good streaming.

I've got simoop cranked a little harder than you, so my allocation 
stalls are in more of a steady state (20/second).  lru_pgs tends to 
shuffle between 12K and 13K per node for most of any tracing run.

[ tracing analysis ]

>
> The XFS inode shrinker blocking plays no significant part in this
> series of events. yes, it contributes to /reclaim latency/, but it
> is not the cause of the catastrophic breakdown that results in
> kswapd emptying the page cache and the slab caches to accomodate the
> memory demand coming from userspace. We can fix the blocking
> problems with the XFS shrinker, but it's not the shrinker's job to
> stop this overload situation from happening.

My bigger concern with the blocking in the shrinker was more around the 
pile-up of processes arguing about how to free a relatively small amount 
of ram.  The source of the overload for us is almost always going to be 
the users, and any little bit of capacity we give them back will get 
absorbed with added load.

>
> Indeed, I just ran the workload with my patch and captured an alloc
> stall in the same manner with the same tracing. It has the same
> "kswapd keeps being run and escalating reclaim until there's nothing
> left to reclaim" behaviour. kswapd never blocks in the XFS inode
> shrinker now, so the allocation latencies are all from direct
> reclaim doing work, which is exactly as it should be.
>
> The fact that we are seeing dirty page writeback from kswapd
> indicates that the memory pressure this workload generates from
> userspace is not being adequately throttled in
> throttle_direct_reclaim() to allow dirty writeback to be done in an
> efficient and timely manner. The memory reclaim throttling needs to
> back off more in overload situations like this - we need to slow
> down the incoming demand to the reclaim rate rather than just
> increasing pressure and hoping that kswapd doesn't burn up in a ball
> of OOM....

Johannes was addressing the dirty writeback from kswapd.  His first 
patch didn't make as big a difference as we hoped, but I've changed 
around simoop a bunch since then.  We'll try again.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-15 19:00                         ` Chris Mason
@ 2016-11-16  1:30                           ` Dave Chinner
  2016-11-16  3:03                             ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-11-16  1:30 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
> On 11/15/2016 12:54 AM, Dave Chinner wrote:
> >On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
> >>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> >There have been 1.2 million inodes reclaimed from the cache, but
> >there have only been 20,000 dirty inode buffer writes. Yes, that's
> >written 440,000 dirty inodes - the inode write clustering is
> >capturing about 22 inodes per write - but the inode writeback load
> >is minimal at about 10 IO/s. XFS inode reclaim is not blocking
> >significantly on dirty inodes.
> 
> I think our machines are different enough that we're not seeing the
> same problems.  Or at least we're seeing different sides of the
> problem.
> 
> We have 130GB of ram and on average about 300-500MB of XFS slab,
> total across all 15 filesystems.  Your inodes are small and cuddly,
> and I'd rather have more than less.  I see more with simoop than we
> see in prod, but either way its a reasonable percentage of system
> ram considering the horrible things being done.

So I'm running on 16GB RAM and have 100-150MB of XFS slab.
Percentage-wise, the inode cache is a larger portion of memory than
in your machines. I can increase the number of files to increase it
further, but I don't think that will change anything.

> Both patched (yours or mine) and unpatched, XFS inode reclaim is
> keeping up.   With my patch in place, tracing during simoop does
> show more kswapd prio=1 scanning than unpatched, so I'm clearly
> stretching the limits a little more.  But we've got 30+ days of
> uptime in prod on almost 60 machines.  The oom rate is roughly in
> line with v3.10, and miles better than v4.0.

IOWs, you have a workaround that keeps your production systems
running. That's fine for your machines that are running this load,
but it's not working well for any of the other loads I've
looked at.  That is, removing the throttling from the XFS inode
shrinker causes instability and adverse reclaim of the inode cache
in situations where maintaining a working set in memory is
required for performance.

Indeed, one of the things I noticed with the simoops workload
running the shrinker patches is that it no longer kept either the
inode cache or the XFS metadata cache in memory long enough for the
du to run without requiring IO. i.e. the caches no longer maintained
the working set of objects needed to optimise a regular operation
and the du scans took a lot longer.

That's why on the vanilla kernels the inode cache footprint went
through steep-sided valleys - reclaim would trash the inode cache,
but the metadata cache stayed intact and so all the inodes were
immediately pulled from there again and populated back into the
inode cache. With the patches to remove the XFS shrinker blocking,
the pressure was moved to other caches like the metadata cache, and
so the clean inode buffers were reclaimed instead. Hence when the
inodes were reclaimed, IO was necessary to re-read the inodes during
the du scan, and hence the cache growth was also slow.

That's why removing the blocking from the shrinker causes the
overall work rate to go down - it results in the cache not
maintaining a working set of inodes and so increases the IO load and
that then slows everything down.

There are secondary and tertiary effects all over the place here, and
from the XFS POV this is a catch-22. The shrinker blocking has been
put in place to control the impact of unbound reclaim concurrency on
the working set that caches need to maintain to sustain acceptable
performance. This blocking, however, is causing issues with latency
under your workload. If we remove the shrinker blocking
to address the FB allocation latency issue, then we screw up the
cached working set balance for every other XFS user out there and
we'll end up making things worse for many XFS users.

Quite frankly, if I have to choose between these two things, then
I'm not going to change the shrinker implementation. FB can maintain
their own fixes until such time as the underlying
reclaim problem that requires the XFS shrinker to block has been
fully addressed and then we can change the XFS shrinker to work well
in all situations.

> >The XFS inode shrinker blocking plays no significant part in this
> >series of events. yes, it contributes to /reclaim latency/, but it
> >is not the cause of the catastrophic breakdown that results in
> >kswapd emptying the page cache and the slab caches to accomodate the
> >memory demand coming from userspace. We can fix the blocking
> >problems with the XFS shrinker, but it's not the shrinker's job to
> >stop this overload situation from happening.
> 
> My bigger concern with the blocking in the shrinker was more around
> the pile up of processes arguing about how to free a relatively
> small amount of ram.

This is not a shrinker problem, though. The shrinkers should be
completely isolated from allocation demand concurrency. The fact is
that they aren't isolated from it, and we have to deal with that as
best we can.

IOWs, this is a direct reclaim architecture problem. i.e. it
presents unbound concurrency to the shrinkers and then requires them
to "behave nicely" when the mm subsystem starts saying "I don't care
that you're already dealing with 200 other concurrent calls from me
- fucking well free everything for me now!".

Controlling and limiting the unbound concurrency of reclaim and
isolating shrinkers from the incoming demand is the only way we can
sanely both keep reclaim latency to a minimum and maintain a
decent working set in the caches under extreme memory pressure.
We obviously cannot do both in a shrinker implementation, so
we really need a high-level re-architecting here...

> The source of the overload for us is almost
> always going to be the users, and any little bit of capacity we give
> them back will get absorbed with added load.

Exactly why we need to re-architect reclaim, because if we don't,
then the users will simply increase the load until reclaim breaks
down through whatever band-aid we've added to hide the last
problem...

Put simply: reclaim algorithms should not change just because there
are more processes demanding memory - increased demand should simply
mean that the processes demanding memory should /wait longer/. Right
now they end up waiting longer by adding load and concurrency to the
reclaim subsystems, and somewhere in those reclaim subsystems we end
up blocking to try to avoid catastrophic degradations.

This is exactly analogous to the IO-less dirty page throttling
situation we battled with for years. We had an architecture where we
had direct submission of IO that throttled in the block layer on
request queues. When we had tens to hundreds of processes all doing
this, the IO patterns randomised, throughput tanked completely and
applications saw extremely non-deterministic long-tail latencies
during write() operations.

We fixed this by decoupling incoming process dirty page throttling
from the mechanism for cleaning dirty pages. We now have a queue
of incoming processes that wait in turn for a number of pages to be
cleaned, and when that threshold is cleaned by the background
flusher threads, they are woken and on they go. It's efficient,
reliable, predictable and, above all, is completely workload
independent. We haven't had a "system is completely unresponsive
because I did a large write" problem since we made this
architectural change - we solved the catastrophic overload problem
once and for all.(*)

Direct memory reclaim is doing exactly what the old dirty page
throttle did - it is taking direct action and relying on the underlying
reclaim mechanisms to throttle overload situations. Just like the
request queue throttling in the old dirty page code, the memory
reclaim subsystem is unable to behave sanely when large amounts of
concurrent pressure is put on it. The throttling happens too late,
too unpredictably, and too randomly for it to be controllable and
stable. And the result of that is that applications see
non-deterministic long-tail latencies once memory reclaim starts.

We've already got background reclaim threads - kswapd - and there
are already hooks for throttling direct reclaim
(throttle_direct_reclaim()). The problem is that direct reclaim
throttling only kicks in once we are very near to low memory limits,
so it doesn't prevent concurrency and load from being presented to
the underlying reclaim mechanism until it's already too late.

IMO, direct reclaim should be replaced with a queuing mechanism and
deferral to kswapd to clean pages.  Every time kswapd completes a
batch of freeing, it can check if it's freed enough to allow the
head of the queue to make progress. If it has, then it can walk down
the queue waking processes until all the pages it just freed have
been accounted for.

If we want to be truly fair, this queuing should occur at the
allocation entry points, not the direct reclaim entry point. i.e if
we are in a reclaim situation, go sit in the queue until you're told
we have memory for you and then run allocation.
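
As a very rough sketch of that idea - this is not working kernel code,
and wait_for_kswapd_reclaim(), kswapd_account_freed() and
wake_up_kswapd() are names made up purely for illustration - the
queuing could look something like:

struct reclaim_waiter {
	struct list_head	list;
	unsigned long		nr_pages;	/* pages this allocation needs */
	bool			done;
	struct task_struct	*task;
};

static LIST_HEAD(reclaim_queue);
static DEFINE_SPINLOCK(reclaim_queue_lock);

/* allocation slow path: join the queue instead of entering direct reclaim */
static void wait_for_kswapd_reclaim(unsigned long nr_pages)
{
	struct reclaim_waiter wait = {
		.nr_pages	= nr_pages,
		.task		= current,
	};

	spin_lock(&reclaim_queue_lock);
	list_add_tail(&wait.list, &reclaim_queue);
	spin_unlock(&reclaim_queue_lock);

	wake_up_kswapd();	/* hypothetical "kick background reclaim" hook */

	set_current_state(TASK_UNINTERRUPTIBLE);
	while (!wait.done) {
		schedule();
		set_current_state(TASK_UNINTERRUPTIBLE);
	}
	__set_current_state(TASK_RUNNING);
}

/* kswapd: after each batch of freeing, pay off waiters in FIFO order */
static void kswapd_account_freed(unsigned long nr_freed)
{
	struct reclaim_waiter *w, *tmp;

	spin_lock(&reclaim_queue_lock);
	list_for_each_entry_safe(w, tmp, &reclaim_queue, list) {
		struct task_struct *task = w->task;

		if (nr_freed < w->nr_pages)
			break;
		nr_freed -= w->nr_pages;
		list_del(&w->list);
		w->done = true;
		wake_up_process(task);
	}
	spin_unlock(&reclaim_queue_lock);
}

The point of the shape is that waiters only make progress at the rate
kswapd actually frees memory - i.e. the incoming demand is slowed down
to the reclaim rate instead of piling more concurrency onto the
reclaim paths.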

Then we can design page scanning and shrinkers for maximum
efficiency, to be fully non-blocking, and to never have to directly
issue or wait for IO completion. They can all feed back reclaim
state to a central backoff mechanism which can sleep to alleviate
situations where reclaim cannot be done without blocking. This
allows us to constrain reclaim to a well controlled set of
background threads that we can scale according to observed need.

We know that this model works - IO-less dirty page throttling has
been a spectacular success. We now just take it for granted that
the throttling works because it self-tunes to the underlying storage
characteristics and rarely, if ever, does the wrong thing. The same
cannot be said about memory reclaim behaviour....

> >The fact that we are seeing dirty page writeback from kswapd
> >indicates that the memory pressure this workload generates from
> >userspace is not being adequately throttled in
> >throttle_direct_reclaim() to allow dirty writeback to be done in an
> >efficient and timely manner. The memory reclaim throttling needs to
> >back off more in overload situations like this - we need to slow
> >down the incoming demand to the reclaim rate rather than just
> >increasing pressure and hoping that kswapd doesn't burn up in a ball
> >of OOM....
> 
> Johannes was addressing the dirty writeback from kswapd.  His first
> patch didn't make as big a difference as we hoped, but I've changed
> around simoop a bunch since then.  We'll try again.

We need an architectural change - bandaids aren't going to solve the
problem...

Cheers,

Dave.

(*) Yes, I'm aware of Jens' block throttling patches - that's
fixing an IO scheduling issue to avoid long read latencies due to
background writeback being /too efficient/ at cleaning pages when
we're driving the system really hard.  IOWs, it's a good problem to
have because it's a result of things working too well under load...
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-16  1:30                           ` Dave Chinner
@ 2016-11-16  3:03                             ` Chris Mason
  2016-11-16 23:31                               ` Dave Chinner
  2016-11-17  0:47                               ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Chris Mason @ 2016-11-16  3:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 16, 2016 at 12:30:09PM +1100, Dave Chinner wrote:
>On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
>> On 11/15/2016 12:54 AM, Dave Chinner wrote:
>> >On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
>> >>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
>> >There have been 1.2 million inodes reclaimed from the cache, but
>> >there have only been 20,000 dirty inode buffer writes. Yes, that's
>> >written 440,000 dirty inodes - the inode write clustering is
>> >capturing about 22 inodes per write - but the inode writeback load
>> >is minimal at about 10 IO/s. XFS inode reclaim is not blocking
>> >significantly on dirty inodes.
>>
>> I think our machines are different enough that we're not seeing the
>> same problems.  Or at least we're seeing different sides of the
>> problem.
>>
>> We have 130GB of ram and on average about 300-500MB of XFS slab,
>> total across all 15 filesystems.  Your inodes are small and cuddly,
>> and I'd rather have more than less.  I see more with simoop than we
>> see in prod, but either way its a reasonable percentage of system
>> ram considering the horrible things being done.
>
>So I'm running on 16GB RAM and have 100-150MB of XFS slab.
>Percentage wise, the inode cache is a larger portion of memory than
>in your machines. I can increase the number of files to increase it
>further, but I don't think that will change anything.

I think the way to see what I'm seeing would be to drop the number of IO 
threads (-T) and bump both -m and -M.  Basically less inode working set 
and more memory working set.

>> Both patched (yours or mine) and unpatched, XFS inode reclaim is
>> keeping up.   With my patch in place, tracing during simoop does
>> show more kswapd prio=1 scanning than unpatched, so I'm clearly
>> stretching the limits a little more.  But we've got 30+ days of
>> uptime in prod on almost 60 machines.  The oom rate is roughly in
>> line with v3.10, and miles better than v4.0.
>
>IOWs, you have a workaround that keeps your production systems
>running. That's fine for your machines that are running this load,
>but it's not working well for any of the other other loads I've
>looked at.  That is, removing the throttling from the XFS inode
>shrinker causes instability and adverse reclaim of the inode cache
>in situations where the maintaining a working set in memory is
>required for performance.

We agree on all of this much more than not.  Josef has spent a lot of 
time recently on shrinkers (w/btrfs but the ideas are similar), and I'm 
wrapping duct tape around workloads until the overall architecture is 
less fragile.

Using slab for metadata in an FS like btrfs where dirty metadata is 
almost unbounded is a huge challenge in the current framework.  Ext4 is 
moving to dramatically bigger logs, so it would eventually have the same 
problems.

>
>Indeed, one of the things I noticed with the simoops workload
>running the shrinker patches is that it no longer kept either the
>inode cache or the XFS metadata cache in memory long enough for the
>du to run without requiring IO. i.e. the caches no longer maintained
>the working set of objects needed to optimise a regular operation
>and the du scans took a lot longer.

With simoop, du is supposed to do IO.  It's crazy to expect to be able 
to scan all the inodes on a huge FS (or 15 of them) and keep it all in 
cache along with everything else hadoop does.  I completely agree there 
are cases where having the working set in ram is valid, just simoop 
isn't one ;)

>
>That's why on the vanilla kernels the inode cache footprint went
>through steep sided valleys - reclaim would trash the inode cache,
>but the metadata cache stayed intact and so all the inodes were
>imemdiately pulled from there again and populated back into the
>inode cache. With the patches to remove the XFS shrinker blocking,
>the pressure was moved to other caches like the metadata cache, and
>so the clean inode buffers were reclaimed instead. Hence when the
>inodes were reclaimed, IO was necessary to re-read the inodes during
>the du scan, and hence the cache growth was also slow.
>
>That's what removing the blocking from the shrinker causes the
>overall work rate to go down - it results in the cache not
>maintaining a working set of inodes and so increases the IO load and
>that then slows everything down.

At least on my machines, it made the overall work rate go up.  Both 
simoop and prod are 10-15% faster.  We have one other workload (gluster) 
where I have no idea if it'll help or hurt, but it'll probably be 
January before I have benchmark numbers from them.  I think it'll help, 
they do have more of a real working set in page cache, but it still 
breaks down to random IO over time.

[ snipping out large chunks, lots to agree with in here ]

>We fixed this by decoupling incoming process dirty page throttling
>from the mechanism of cleaning of dirty pages. We now have a queue
>of incoming processes that wait in turn for a number of pages to be
>cleaned, and when that threshold is cleaned by the background
>flusher threads, they are woken and on they go. it's efficient,
>reliable, predictable and, above all, is completely workload
>independent. We haven't had a "system is completely unresponsive
>because I did a large write" problem since we made this
>architectural change - we solved the catastrophic overload problem
>one and for all.(*)

(*) Agreed - Jens' patches are pushing IO scheduling help higher up the 
stack.  It's a big win, but not directly for reclaim.

>
>Direct memory reclaim is doing exactly what the old dirty page
>throttle did - it is taking direct action and relying the underlying
>reclaim mechanisms to throttle overload situations. Just like the
>request queue throttling in the old dirty page code, the memory
>reclaim subsystem is unable to behave sanely when large amounts of
>concurrent pressure is put on it. The throttling happens too late,
>too unpredictably, and too randomly for it to be controllable and
>stable. And the result of that is that application see
>non-deterministic long-tail latencies once memory reclaim starts.
>
>We've already got background reclaim threads - kswapd - and there
>are already hooks for throttling direct reclaim
>(throttle_direct_reclaim()). The problem is that direct reclaim
>throttling only kicks in once we are very near to low memory limits,
>so it doesn't prevent concurency and load from being presented to
>the underlying reclaim mechanism until it's already too late.
>
>IMO, direct reclaim should be replaced with a queuing mechanism and
>deferral to kswapd to clean pages.  Every time kswapd completes a
>batch of freeing, it can check if it's freed enough to allow the
>head of the queue to make progress. If it has, then it can walk down
>the queue waking processes until all the pages it just freed have
>been accounted for.
>
>If we want to be truly fair, this queuing should occur at the
>allocation entry points, not the direct reclaim entry point. i.e if
>we are in a reclaim situation, go sit in the queue until you're told
>we have memory for you and then run allocation.
>
>Then we can design page scanning and shrinkers for maximum
>efficiency, to be fully non-blocking, and to never have to directly
>issue or wait for IO completion. They can all feed back reclaim
>state to a central backoff mechanism which can sleep to alleviate
>situations where reclaim cannot be done without blocking. This
>allows us to constrain reclaim to a well controlled set of
>background threads that we can scale according to observed need.
>

Can't argue here.  The middle ground today is Josef's LRU ideas so that 
slab reclaim has hopes of doing the most useful work instead of just 
writing things and hoping for the best.  It can either be a band-aid or 
a building block depending on how you look at it, but it can help either 
way.

Moving forward, I think I can manage to carry the one-line patch in code 
that hasn't measurably changed in years.  We'll get it tested in a 
variety of workloads and come back with more benchmarks for the great 
slab rework coming soon to a v5.x kernel near you.

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-16  3:03                             ` Chris Mason
@ 2016-11-16 23:31                               ` Dave Chinner
  2016-11-17  0:27                                 ` Chris Mason
  2016-11-17  0:47                               ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-11-16 23:31 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
> On Wed, Nov 16, 2016 at 12:30:09PM +1100, Dave Chinner wrote:
> >On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
> >>On 11/15/2016 12:54 AM, Dave Chinner wrote:
> >>>On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
> >>>>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> >>>There have been 1.2 million inodes reclaimed from the cache, but
> >>>there have only been 20,000 dirty inode buffer writes. Yes, that's
> >>>written 440,000 dirty inodes - the inode write clustering is
> >>>capturing about 22 inodes per write - but the inode writeback load
> >>>is minimal at about 10 IO/s. XFS inode reclaim is not blocking
> >>>significantly on dirty inodes.
> >>
> >>I think our machines are different enough that we're not seeing the
> >>same problems.  Or at least we're seeing different sides of the
> >>problem.
> >>
> >>We have 130GB of ram and on average about 300-500MB of XFS slab,
> >>total across all 15 filesystems.  Your inodes are small and cuddly,
> >>and I'd rather have more than less.  I see more with simoop than we
> >>see in prod, but either way its a reasonable percentage of system
> >>ram considering the horrible things being done.
> >
> >So I'm running on 16GB RAM and have 100-150MB of XFS slab.
> >Percentage wise, the inode cache is a larger portion of memory than
> >in your machines. I can increase the number of files to increase it
> >further, but I don't think that will change anything.
> 
> I think the way to see what I'm seeing would be to drop the number
> of IO threads (-T) and bump both -m and -M.  Basically less inode
> working set and more memory working set.

If I increase m/M by any non-trivial amount, the test OOMs within a
couple of minutes of starting even after cutting the number of IO
threads in half. I've managed to increase -m by 10% without OOM -
I'll keep trying to increase this part of the load as much as I
can as I refine the patchset I have.

> >>Both patched (yours or mine) and unpatched, XFS inode reclaim is
> >>keeping up.   With my patch in place, tracing during simoop does
> >>show more kswapd prio=1 scanning than unpatched, so I'm clearly
> >>stretching the limits a little more.  But we've got 30+ days of
> >>uptime in prod on almost 60 machines.  The oom rate is roughly in
> >>line with v3.10, and miles better than v4.0.
> >
> >IOWs, you have a workaround that keeps your production systems
> >running. That's fine for your machines that are running this load,
> >but it's not working well for any of the other other loads I've
> >looked at.  That is, removing the throttling from the XFS inode
> >shrinker causes instability and adverse reclaim of the inode cache
> >in situations where the maintaining a working set in memory is
> >required for performance.
> 
> We agree on all of this much more than not.  Josef has spent a lot
> of time recently on shrinkers (w/btrfs but the ideas are similar),
> and I'm wrapping duct tape around workloads until the overall
> architecture is less fragile.
> 
> Using slab for metadata in an FS like btrfs where dirty metadata is
> almost unbounded is a huge challenge in the current framework.  Ext4
> is moving to dramatically bigger logs, so it would eventually have
> the same problems.

Your 8TB XFS filesystems will be using 2GB logs (unless mkfs
settings were tweaked manually), so there's a huge amount of
metadata 15x8TB XFS filesystems can pin in memory, too...

> >Indeed, one of the things I noticed with the simoops workload
> >running the shrinker patches is that it no longer kept either the
> >inode cache or the XFS metadata cache in memory long enough for the
> >du to run without requiring IO. i.e. the caches no longer maintained
> >the working set of objects needed to optimise a regular operation
> >and the du scans took a lot longer.
> 
> With simoop, du is supposed to do IO.  It's crazy to expect to be
> able to scan all the inodes on a huge FS (or 15 of them) and keep it
> all in cache along with everything else hadoop does.  I completely
> agree there are cases where having the working set in ram is valid,
> just simoop isn't one ;)

Sure, I was just pointing out that even simoop was seeing significant
changes in cache residency as a result of this change....

> >That's what removing the blocking from the shrinker causes the
> >overall work rate to go down - it results in the cache not
> >maintaining a working set of inodes and so increases the IO load and
> >that then slows everything down.
> 
> At least on my machines, it made the overall work rate go up.  Both
> simoop and prod are 10-15% faster. 

Ok, I'll see if I can tune the workload here to behave more like
this....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-16 23:31                               ` Dave Chinner
@ 2016-11-17  0:27                                 ` Chris Mason
  2016-11-17  1:00                                   ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2016-11-17  0:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Nov 17, 2016 at 10:31:36AM +1100, Dave Chinner wrote:
>On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
>> On Wed, Nov 16, 2016 at 12:30:09PM +1100, Dave Chinner wrote:
>> >On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
>> >>On 11/15/2016 12:54 AM, Dave Chinner wrote:
>> >>>On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
>> >>>>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
>> >>>There have been 1.2 million inodes reclaimed from the cache, but
>> >>>there have only been 20,000 dirty inode buffer writes. Yes, that's
>> >>>written 440,000 dirty inodes - the inode write clustering is
>> >>>capturing about 22 inodes per write - but the inode writeback load
>> >>>is minimal at about 10 IO/s. XFS inode reclaim is not blocking
>> >>>significantly on dirty inodes.
>> >>
>> >>I think our machines are different enough that we're not seeing the
>> >>same problems.  Or at least we're seeing different sides of the
>> >>problem.
>> >>
>> >>We have 130GB of ram and on average about 300-500MB of XFS slab,
>> >>total across all 15 filesystems.  Your inodes are small and cuddly,
>> >>and I'd rather have more than less.  I see more with simoop than we
>> >>see in prod, but either way its a reasonable percentage of system
>> >>ram considering the horrible things being done.
>> >
>> >So I'm running on 16GB RAM and have 100-150MB of XFS slab.
>> >Percentage wise, the inode cache is a larger portion of memory than
>> >in your machines. I can increase the number of files to increase it
>> >further, but I don't think that will change anything.
>>
>> I think the way to see what I'm seeing would be to drop the number
>> of IO threads (-T) and bump both -m and -M.  Basically less inode
>> working set and more memory working set.
>
>If I increase m/M by any non-trivial amount, the test OOMs within a
>couple of minutes of starting even after cutting the number of IO
>threads in half. I've managed to increase -m by 10% without OOM -
>I'll keep trying to increase this part of the load as much as I
>can as I refine the patchset I have.

Gotcha.  -m is long-lasting, allocated once at the start of the run and 
stays around forever.  It basically soaks up ram.  -M is allocated once 
per work loop, and it should be where the stalls really hit.  I'll peel 
off a flash machine tomorrow and find a command line that matches my 
results so far.

What kind of flash are you using?  I can choose between modern NVMe or 
something more crusty.

>> With simoop, du is supposed to do IO.  It's crazy to expect to be
>> able to scan all the inodes on a huge FS (or 15 of them) and keep it
>> all in cache along with everything else hadoop does.  I completely
>> agree there are cases where having the working set in ram is valid,
>> just simoop isn't one ;)
>
>Sure, I was just pointing out that even simoop was seeing signficant
>changes in cache residency as a result of this change....

Yeah, one of the problems with simoop is it should actually go faster if 
we empty all the caches every time.  It only needs enough dirty 
pages around for efficient IO.   I should add a 
page-reuse-after-N-seconds mode so that it notices if some jerk tries a 
patch that tosses all the pages.  It won't make it any less effective 
for pretending to be hadoop, and it'll catch some mistakes I'm likely to 
make.

>
>> >That's what removing the blocking from the shrinker causes the
>> >overall work rate to go down - it results in the cache not
>> >maintaining a working set of inodes and so increases the IO load and
>> >that then slows everything down.
>>
>> At least on my machines, it made the overall work rate go up.  Both
>> simoop and prod are 10-15% faster.
>
>Ok, I'll see if I can tune the workload here to behave more like
>this....

What direction do you have in mind for your current patches?  Many tiers 
have shadows where we can put experimental code without feeling bad if 
machines crash or data is lost.  I'm happy to line up tests if you want 
data from specific workloads.

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-16  3:03                             ` Chris Mason
  2016-11-16 23:31                               ` Dave Chinner
@ 2016-11-17  0:47                               ` Dave Chinner
  2016-11-17  1:07                                 ` Chris Mason
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2016-11-17  0:47 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
> Moving forward, I think I can manage to carry the one line patch in
> code that hasn't measurably changed in years.  We'll get it tested
> in a variety of workloads and come back with more benchmarks for the
> great slab rework coming soon to a v5.x kernel near you.

FWIW, I just tested your one-liner against my simoops config here,
and by comparing the behaviour to my patchset that still allows
direct reclaim to block on dirty inodes, it would appear that all
the allocation latency I'm seeing here is from direct reclaim.

So I went looking at the direct reclaim throttle with the intent to
hack it to throttle earlier. It throttles based on watermarks, so
I figured I'd just hack them to be larger to trigger direct reclaim
throttling earlier. And then I found this recent addition:

https://patchwork.kernel.org/patch/8426381/

+=============================================================
+
+watermark_scale_factor:
+
+This factor controls the aggressiveness of kswapd. It defines the
+amount of memory left in a node/system before kswapd is woken up and
+how much memory needs to be free before kswapd goes back to sleep.
+
+The unit is in fractions of 10,000. The default value of 10 means the
+distances between watermarks are 0.1% of the available memory in the
+node/system. The maximum value is 1000, or 10% of memory.
+
+A high rate of threads entering direct reclaim (allocstall) or kswapd
+going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
+that the number of free pages kswapd maintains for latency reasons is
+too small for the allocation bursts occurring in the system. This knob
+can then be used to tune kswapd aggressiveness accordingly.
+

The /exact hack/ I was thinking of was committed about 6 months
ago and added yet another /proc file:

commit 795ae7a0de6b834a0cc202aa55c190ef81496665
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Mar 17 14:19:14 2016 -0700

    mm: scale kswapd watermarks in proportion to memory


What's painfully obvious, though, is that even when I wind it up to
its full threshold (10% memory), it does not prevent direct reclaim
from being entered and causing excessive latencies when it blocks.
This is despite the fact that simoops is now running with a big free
memory reserve (3-3.5GB of free memory on my machine as the page
cache now only consumes ~4GB instead of 7-8GB).

And, while harder to trigger, kswapd still goes on the "free fucking
everything" rampages that trigger page writeback from kswapd and
empty both the page cache and the slab caches. The only difference
now is that it does this /without triggering the allocstall
counter/....

So it seems that just upping the direct reclaim throttle point
isn't a sufficient workaround for the "too much direct reclaim"
problem here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-17  0:27                                 ` Chris Mason
@ 2016-11-17  1:00                                   ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2016-11-17  1:00 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Wed, Nov 16, 2016 at 07:27:27PM -0500, Chris Mason wrote:
> On Thu, Nov 17, 2016 at 10:31:36AM +1100, Dave Chinner wrote:
> >>>So I'm running on 16GB RAM and have 100-150MB of XFS slab.
> >>>Percentage wise, the inode cache is a larger portion of memory than
> >>>in your machines. I can increase the number of files to increase it
> >>>further, but I don't think that will change anything.
> >>
> >>I think the way to see what I'm seeing would be to drop the number
> >>of IO threads (-T) and bump both -m and -M.  Basically less inode
> >>working set and more memory working set.
> >
> >If I increase m/M by any non-trivial amount, the test OOMs within a
> >couple of minutes of starting even after cutting the number of IO
> >threads in half. I've managed to increase -m by 10% without OOM -
> >I'll keep trying to increase this part of the load as much as I
> >can as I refine the patchset I have.
> 
> Gotcha.  -m is long lasting, allocated once at the start of the run
> and stays around for ever.  It basically soaks up ram.  -M is
> allocated once per work loop, and it should be where the stalls
> really hit.  I'll peel off a flash machine tomorrow and find a
> command line that matches my results so far.
> 
> What kind of flash are you using?  I can choose between modern nvme
> or something more crusty.

Crusty old stuff - a pair of EVO 840s in HW-RAID0 behind 512MB of
BBWC. Read rates peak at ~150MB/s, write rates sustain at about
75MB/s.

I'm testing on a 200GB filesystem, configured as:

mkfs.xfs -f -dagcount=8,size=200g /dev/vdc

The backing file for /dev/vdc is fully preallocated and linear,
accessed via virtio/direct IO, so it's no different to accessing the
real block device....

> >>>That's what removing the blocking from the shrinker causes the
> >>>overall work rate to go down - it results in the cache not
> >>>maintaining a working set of inodes and so increases the IO load and
> >>>that then slows everything down.
> >>
> >>At least on my machines, it made the overall work rate go up.  Both
> >>simoop and prod are 10-15% faster.
> >
> >Ok, I'll see if I can tune the workload here to behave more like
> >this....
> 
> What direction do you have in mind for your current patches?  Many
> tiers have shadows where we can put experimental code without
> feeling bad if machines crash or data is lost.  I'm happy to line up
> tests if you want data from specific workloads.

Right now I have kswapd as fully non-blocking - even more so than
your one-line patch, because reclaim can (and does) still block on
inode locks with SYNC_TRYLOCK set. I don't see any problems with
doing this.
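
Just to illustrate the split - this is *not* the patchset, only the
minimal way of expressing "kswapd never waits, direct reclaim still
does" using the existing current_is_kswapd() helper:

long
xfs_reclaim_inodes_nr(
	struct xfs_mount	*mp,
	int			nr_to_scan)
{
	int			flags = SYNC_TRYLOCK;

	/* kick background reclaim and push the AIL, as before */
	xfs_reclaim_work_queue(mp);
	xfs_ail_push_all(mp->m_ail);

	/* only direct reclaim waits for inodes to be cleaned */
	if (!current_is_kswapd())
		flags |= SYNC_WAIT;

	return xfs_reclaim_inodes_ag(mp, flags, &nr_to_scan);
}

The actual patchset goes further than that, since even SYNC_TRYLOCK
reclaim can still block on inode locks, but the sketch shows where the
kswapd/direct reclaim distinction sits.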

I'm still trying to work out what to do with direct reclaim - it's
clearly the source of the worst allocation latency problems, and it
also seems to be the contributing factor to the obnoxious kswapd FFE
behaviour.  There are a couple more variations I want to try to see if
I can make it block less severely, but what I do in the short term
here is largely dependent on the effect on other benchmarks and
loads....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-17  0:47                               ` Dave Chinner
@ 2016-11-17  1:07                                 ` Chris Mason
  2016-11-17  3:39                                   ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2016-11-17  1:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Nov 17, 2016 at 11:47:45AM +1100, Dave Chinner wrote:
>On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
>> Moving forward, I think I can manage to carry the one line patch in
>> code that hasn't measurably changed in years.  We'll get it tested
>> in a variety of workloads and come back with more benchmarks for the
>> great slab rework coming soon to a v5.x kernel near you.
>
>FWIW, I just tested your one-liner against my simoops config here,
>and by comparing the behaviour to my patchset that still allows
>direct reclaim to block on dirty inodes, it would appear that all
>the allocation latency I'm seeing here is from direct reclaim.

Meaning that your allocation latencies are constant regardless of whether 
we're waiting in the XFS shrinker?

>
>So I went looking at the direct reclaim throttle with the intent to
>hack it to throttle earlier. It throttles based on watermarks, so
>I figured Id just hack them to be larger to trigger direct reclaim
>throttling earlier. And then I found this recent addition:
>
>https://patchwork.kernel.org/patch/8426381/
>
>+=============================================================
>+
>+watermark_scale_factor:
>+
>+This factor controls the aggressiveness of kswapd. It defines the
>+amount of memory left in a node/system before kswapd is woken up and
>+how much memory needs to be free before kswapd goes back to sleep.
>+
>+The unit is in fractions of 10,000. The default value of 10 means the
>+distances between watermarks are 0.1% of the available memory in the
>+node/system. The maximum value is 1000, or 10% of memory.
>+
>+A high rate of threads entering direct reclaim (allocstall) or kswapd
>+going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
>+that the number of free pages kswapd maintains for latency reasons is
>+too small for the allocation bursts occurring in the system. This knob
>+can then be used to tune kswapd aggressiveness accordingly.
>+
>
>The /exact hack/ I was thinking of was committed about 6 months
>ago and added "support for ever more" /proc file:

Yeah, Johannes spent a bunch of time looking at kswapd in a few places 
where it was causing trouble here.

>
>commit 795ae7a0de6b834a0cc202aa55c190ef81496665
>Author: Johannes Weiner <hannes@cmpxchg.org>
>Date:   Thu Mar 17 14:19:14 2016 -0700
>
>    mm: scale kswapd watermarks in proportion to memory
>
>
>What's painfully obvious, though, is that even when I wind it up to
>it's full threshold (10% memory), it does not prevent direct reclaim
>from being entered and causing excessive latencies when it blocks.
>This is despite the fact that simoops is now running with a big free
>memory reserve (3-3.5GB of free memory on my machine as the page
>cache now only consumes ~4GB instead of 7-8GB).

Huh, I'll try to reproduce that.  It might be NUMA imbalance or just 
that simoop is so bursty that we're blowing past that 3.5GB.

>
>And, while harder to trigger, kswapd still goes on the "free fucking
>everything" rampages that trigger page writeback from kswapd and
>empty both the page cache and the slab caches. The only difference
>now is that it does this /without triggering the allocstall
>counter/....
>
>So it's seems that just upping the direct reclaim throttle point
>isn't a sufficient workaround for the "too much direct reclaim"
>problem here...
>
>Cheers,
>
>Dave.
>-- 
>Dave Chinner
>david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-11-17  1:07                                 ` Chris Mason
@ 2016-11-17  3:39                                   ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2016-11-17  3:39 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs

On Wed, Nov 16, 2016 at 08:07:28PM -0500, Chris Mason wrote:
> On Thu, Nov 17, 2016 at 11:47:45AM +1100, Dave Chinner wrote:
> >On Tue, Nov 15, 2016 at 10:03:52PM -0500, Chris Mason wrote:
> >>Moving forward, I think I can manage to carry the one line patch in
> >>code that hasn't measurably changed in years.  We'll get it tested
> >>in a variety of workloads and come back with more benchmarks for the
> >>great slab rework coming soon to a v5.x kernel near you.
> >
> >FWIW, I just tested your one-liner against my simoops config here,
> >and by comparing the behaviour to my patchset that still allows
> >direct reclaim to block on dirty inodes, it would appear that all
> >the allocation latency I'm seeing here is from direct reclaim.
> 
> Meaning that your allocation latencies are constant regardless of if
> we're waiting in the xfs shrinker?

No, what I mean is that all the big p99 latencies are a result of
blocking in direct reclaim, not of blocking in kswapd. i.e. fully
non-blocking kswapd + blocking direct reclaim == big bad p99
latencies, vs non-blocking kswapd + non-blocking direct reclaim ==
no big latencies.

It also /appears/ that the bad FFE kswapd behaviour is closely
correlated to the long blocking latencies in direct reclaim, though
I haven't been able to confirm this hypothesis yet.

> >commit 795ae7a0de6b834a0cc202aa55c190ef81496665
> >Author: Johannes Weiner <hannes@cmpxchg.org>
> >Date:   Thu Mar 17 14:19:14 2016 -0700
> >
> >   mm: scale kswapd watermarks in proportion to memory
> >
> >
> >What's painfully obvious, though, is that even when I wind it up to
> >it's full threshold (10% memory), it does not prevent direct reclaim
> >from being entered and causing excessive latencies when it blocks.
> >This is despite the fact that simoops is now running with a big free
> >memory reserve (3-3.5GB of free memory on my machine as the page
> >cache now only consumes ~4GB instead of 7-8GB).
> 
> Huh, I'll try to reproduce that.  It might be NUMA imbalance or just
> that simoop is so bursty that we're blowing past that 3.5GB.

It's probably blowing through it, but regardless of this there are
more serious problems with this approach. I originally
turned up the watermarks a few seconds after starting simoops and
everything was fine. However, when I stopped and tried to restart
simoops, it *always* failed in a few seconds with either:

....
Creating working files
done creating working files
du thread is running /mnt/scratch
du thread is done /mnt/scratch
error 11 from pthread_create
$

or

....
Creating working files
done creating working files
du thread is running /mnt/scratch
du thread is done /mnt/scratch
mmap: Cannot allocate memory
$

I couldn't start simoops again until I backed out the watermark
tuning, and then it started straight away.

IOWs, screwing with the watermarks to try to avoid direct reclaim
appears to make userspace randomly fail with ENOMEM problems when
there are large reserves still available. So AFAICT this doesn't fix
the problems that I've been seeing and instead creates a bunch of
new ones.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
  2016-10-14 12:27 [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim Chris Mason
  2016-10-15 22:34 ` Dave Chinner
@ 2019-06-14 12:58 ` Amir Goldstein
  1 sibling, 0 replies; 24+ messages in thread
From: Amir Goldstein @ 2019-06-14 12:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: Dave Chinner, linux-xfs, Brian Foster

On Fri, Oct 14, 2016 at 3:36 PM Chris Mason <clm@fb.com> wrote:
>
>
> Hi Dave,
>
> This is part of a series of patches we're growing to fix a perf
> regression on a few straggler tiers that are still on v3.10.  In this
> case, hadoop had to switch back to v3.10 because v4.x is as much as 15%
> slower on recent kernels.
>
> Between v3.10 and v4.x, kswapd is less effective overall.  This leads
> more and more procs to get bogged down in direct reclaim Using SYNC_WAIT
> in xfs_reclaim_inodes_ag().
>
> Since slab shrinking happens very early in direct reclaim, we've seen
> systems with 130GB of ram where hundreds of procs are stuck on the xfs
> slab shrinker fighting to walk a slab 900MB in size.  They'd have better
> luck moving on to the page cache instead.
>
> Also, we're going into direct reclaim much more often than we should
> because kswapd is getting stuck on XFS inode locks and writeback.
> Dropping the SYNC_WAIT means that kswapd can move on to other things and
> let the async worker threads get kicked to work on the inodes.
>
> We're still working on the series, and this is only compile tested on
> current Linus git.  I'm working out some better simulations for the
> hadoop workload to stuff into Mel's tests.  Numbers from prod take
> roughly 3 days to stabilize, so I haven't isolated this patch from the rest
> of the series.
>
> Unpatched v4.x our base allocation stall rate goes up to as much as
> 200-300/sec, averaging 70/sec.  The series I'm finalizing gets that
> number down to < 1 /sec.
>
> Omar Sandoval did some digging and found you added the SYNC_WAIT in
> response to a workload I sent ages ago.  I tried to make this OOM with
> fsmark creating empty files, and it has been soaking in memory
> constrained workloads in production for almost two weeks.
>
> Signed-off-by: Chris Mason <clm@fb.com>
> ---
>  fs/xfs/xfs_icache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index bf2d607..63938fb 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1195,7 +1195,7 @@ xfs_reclaim_inodes_nr(
>         xfs_reclaim_work_queue(mp);
>         xfs_ail_push_all(mp->m_ail);
>
> -       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> +       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
>  }
>
>  /*
> --

Hi Chris,

We've been seeing memory allocation stalls on some v4.9.y production systems
involving direct reclaim of XFS inodes.

I saw a similar issue was reported again here:
https://bugzilla.kernel.org/show_bug.cgi?id=192981

I couldn't find any resolution to the reported issue in upstream
commits, so I wonder,
does Facebook still carry this patch? Or was there a proper fix and I missed it?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-06-14 12:59 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-14 12:27 [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim Chris Mason
2016-10-15 22:34 ` Dave Chinner
2016-10-17  0:24   ` Chris Mason
2016-10-17  1:52     ` Dave Chinner
2016-10-17 13:30       ` Chris Mason
2016-10-17 22:30         ` Dave Chinner
2016-10-17 23:20           ` Chris Mason
2016-10-18  2:03             ` Dave Chinner
2016-11-14  1:00               ` Chris Mason
2016-11-14  7:27                 ` Dave Chinner
2016-11-14 20:56                   ` Chris Mason
2016-11-14 23:58                     ` Dave Chinner
2016-11-15  3:09                       ` Chris Mason
2016-11-15  5:54                       ` Dave Chinner
2016-11-15 19:00                         ` Chris Mason
2016-11-16  1:30                           ` Dave Chinner
2016-11-16  3:03                             ` Chris Mason
2016-11-16 23:31                               ` Dave Chinner
2016-11-17  0:27                                 ` Chris Mason
2016-11-17  1:00                                   ` Dave Chinner
2016-11-17  0:47                               ` Dave Chinner
2016-11-17  1:07                                 ` Chris Mason
2016-11-17  3:39                                   ` Dave Chinner
2019-06-14 12:58 ` Amir Goldstein
