From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: ksummit-discuss@lists.linuxfoundation.org
Subject: Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
Date: Mon, 01 Aug 2016 12:06:25 -0400
Message-ID: <1470067585.18751.24.camel@HansenPartnership.com>
In-Reply-To: <20160801154639.GD7603@cmpxchg.org>

On Mon, 2016-08-01 at 11:46 -0400, Johannes Weiner wrote:
> On Thu, Jul 28, 2016 at 05:41:43PM -0400, James Bottomley wrote:
> > On Thu, 2016-07-28 at 14:55 -0400, Johannes Weiner wrote:
> > > On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> > > > Most recently I have been working on reviving swap for SSDs and
> > > > persistent memory devices (https://lwn.net/Articles/690079/) as
> > > > part of a bigger anti-thrashing effort to make the VM recover
> > > > swiftly and predictably from load spikes.
> > > 
> > > A bit of context, in case we want to discuss this at KS:
> > > 
> > > We frequently have machines hang and stop responding indefinitely
> > > after they experience memory load spikes. On closer look, we find
> > > most tasks either in page reclaim or majorfaulting parts of an 
> > > executable or library. It's a typical thrashing pattern, where 
> > > everybody cannibalizes everybody else. The problem is that with 
> > > fast storage the cache reloads can be fast enough that there are 
> > > never enough in-flight pages at a time to cause page reclaim to 
> > > fail and trigger the OOM killer. The livelock persists until 
> > > external remediation reboots the box or we get lucky and
> > > non-cache allocations eventually suck up the remaining page
> > > cache and trigger the OOM killer.
> > > 
> > > To avoid hitting this situation, we currently have to keep a 
> > > generous memory reserve for occasional spikes, which sucks for 
> > > utilization the rest of the time. Swap would be useful here, but 
> > > the swapout code is basically only triggering when memory 
> > > pressure rises - which again doesn't happen - so I've been 
> > > working on the swap code to balance cache reclaim vs. swap based 
> > > on relative thrashing between the two.
> > > 
> > > There is usually some cold/unused anonymous memory lying around 
> > > that can be unloaded into swap during workload spikes, so that 
> > > allows us to drive up the average memory utilization without 
> > > increasing the risk at least. But if we screw up and there are 
> > > not enough unused anon pages, we are back to thrashing - only now 
> > > it involves swapping too.
> > > 
> > > So how do we address this?
> > > 
> > > A pathological thrashing situation is very obvious to any user, 
> > > but it's not quite clear how to quantify it inside the kernel and
> > > have it trigger the OOM killer. It might be useful to talk about 
> > > metrics.  Could we quantify application progress? Could we 
> > > quantify the amount of time a task or the system spends 
> > > thrashing, and somehow express it as a percentage of overall 
> > > execution time? Maybe something comparable to IO wait time, 
> > > except tracking the time spent performing reclaim and waiting on
> > > IO that is refetching recently evicted pages?
> > > 
> > > This question seems to go beyond the memory subsystem and 
> > > potentially involve the scheduler and the block layer, so it 
> > > might be a good tech topic for KS.
> > 
> > Actually, I'd be interested in this.  We're starting to generate 
> > use cases in the container cloud for swap (I can't believe I'm 
> > saying this since we hitherto regarded swap as wholly evil).  The 
> > issue is that we want to load the system up into its overcommit
> > region (which means one of two things: either we're re-using
> > underused resources or, more accurately, we're reselling resources
> > sold to one customer but not being used, so we can sell them to
> > another).  From some research done within IBM, it turns out there's
> > a region where swapping is beneficial.  We define it as the region
> > where the B/W to swap doesn't exceed the B/W capacity of the disk
> > (is this the metric you're looking for?).
> 
> That's an interesting take, I haven't thought about that. But note
> that the CPU cost of evicting and refetching pages is not negligible:
> even on fairly beefy machines we've seen significant CPU load when 
> the IO device hits saturation.

Right, but we're not looking to use swap as a kind of slightly more
expensive memory.  We're looking to push the system aggressively to
find its working set while we load it up with jobs.  This means we
need infrequently referenced anonymous memory pushed out to swap.  We
use standard SSDs, so if the anon memory refault rate goes too high,
we move from region 3 to region 4 (required swap B/W exceeds available
swap B/W) and the system goes unstable (so we'd unload it a bit).
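
To make that region test concrete, here's a minimal sketch of the
check (the numbers are illustrative assumptions, not our measured
thresholds, and the 500 MB/s device figure is just a nominal SSD):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/*
 * Sketch of the region boundary: the system is stable (region 3)
 * while the bandwidth needed to service anon refaults stays below
 * what the swap device can deliver; beyond that it's region 4 and
 * we unload jobs.  Inputs are illustrative.
 */
static int swap_region(double refaults_per_sec, double dev_bw_bytes)
{
        double required_bw = refaults_per_sec * PAGE_SIZE;

        return required_bw <= dev_bw_bytes ? 3 : 4;
}

int main(void)
{
        double dev_bw = 500e6;          /* assumed: 500 MB/s SSD */
        double rates[] = { 10000, 200000 };     /* refaults/sec */

        for (int i = 0; i < 2; i++)
                printf("%.0f refaults/s -> region %d\n",
                       rates[i], swap_region(rates[i], dev_bw));
        return 0;
}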

>  With persistent memory devices you might actually run out of CPU 
> capacity while performing basic page aging before you saturate the 
> storage device (which is why Andi Kleen has been suggesting to 
> replace LRU reclaim with random replacement for these devices). So 
> storage device saturation might not be the final answer to this
> problem.

We really wouldn't want this.  All cloud jobs seem to have memory they
allocate but rarely use, so we rely on the properties of the LRU list
to get those pages onto swap so the memory can be re-used for
something else.  A random replacement algorithm would play havoc with
that.
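
As a toy illustration of the difference (a sketch with a synthetic
90/10 access pattern standing in for a real job; real reclaim is far
more involved than this):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Compare LRU vs random eviction on a skewed workload: 90% of
 * references go to a small hot set, the rest to rarely used pages,
 * roughly the "allocated but rarely used" shape described above.
 */
#define NPAGES   1000   /* distinct pages in the workload */
#define HOT      50     /* hot set, well inside the cache */
#define CACHE    100    /* resident frames */
#define ACCESSES 100000

static long simulate(int use_lru)
{
        static int frames[CACHE];
        static long stamp[CACHE];       /* last-use time, for LRU */
        long faults = 0, now = 0;

        memset(frames, -1, sizeof(frames));
        memset(stamp, 0, sizeof(stamp));
        srand(42);

        for (int a = 0; a < ACCESSES; a++) {
                int page = (rand() % 10 < 9) ? rand() % HOT
                                             : rand() % NPAGES;
                int hit = -1, victim = 0;

                now++;
                for (int i = 0; i < CACHE; i++)
                        if (frames[i] == page) { hit = i; break; }
                if (hit >= 0) {
                        stamp[hit] = now;       /* LRU bookkeeping */
                        continue;
                }
                faults++;
                if (use_lru) {          /* evict least recently used */
                        for (int i = 1; i < CACHE; i++)
                                if (stamp[i] < stamp[victim])
                                        victim = i;
                } else {                /* evict at random */
                        victim = rand() % CACHE;
                }
                frames[victim] = page;
                stamp[victim] = now;
        }
        return faults;
}

int main(void)
{
        printf("LRU faults:    %ld\n", simulate(1));
        printf("random faults: %ld\n", simulate(0));
        return 0;
}

On this toy pattern, LRU keeps the hot set resident and faults mostly
on the cold tail, while random replacement keeps evicting hot pages
and faults far more often.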

Our biggest problem is the difficulty in forcing the system to push
anonymous stuff out to swap.  Linux really likes to hang on to its
anonymous pages, and if you get too abrasive with it, it starts
dumping your file-backed pages instead, causing refaults and
instability on that side.  We haven't yet played with the swappiness
patches, but we're hoping they will go some way towards fixing this.
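
For reference, the existing knob here, as opposed to those patches
(plain vm.swappiness, 0-100 in current kernels, plus the per-cgroup
memory.swappiness in the v1 memory controller; a minimal sketch,
needs root):

#include <stdio.h>

/*
 * Bias reclaim toward anonymous pages using the existing interface.
 * 100 is the strongest stock preference for swapping anon pages;
 * this is the current knob, not the patch set discussed above.
 */
int main(void)
{
        FILE *f = fopen("/proc/sys/vm/swappiness", "w");

        if (!f) {
                perror("/proc/sys/vm/swappiness");
                return 1;
        }
        fprintf(f, "100\n");
        fclose(f);
        return 0;
}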

> > Our definition of progress is a bit different from yours above 
> > because the interactive jobs must respond as if they were near bare 
> > metal, so we penalise the soak jobs.  However, we find that the 
> > soak jobs also make reasonable progress according to your measure 
> > above ('reasonable enough' meaning the customer is happy to pay for the
> > time they've used).
> 
> We actually are in the same boat, where most of our services are 
> doing work within the context of interactive user sessions. So in 
> terms of quantifying progress, both throughput and latency 
> percentiles would be necessary to form a full picture of whether we
> are beyond capacity.

OK, so this region 3 work (where we can get the system stable with an
acceptable refault rate for the anonymous pages) is probably where you
want to be operating as well.
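
In case it's useful, here's the kind of sampling that tells us which
region we're in, sketched against the pswpin/pswpout counters in
/proc/vmstat (the 10s window and 4k page size are assumptions of the
sketch):

#include <stdio.h>
#include <unistd.h>

/*
 * Estimate the swap bandwidth the workload is demanding by sampling
 * the swap-in/swap-out counters; comparing that against the device's
 * bandwidth is the region 3 vs region 4 test described above.
 */
static void read_swap(long *in, long *out)
{
        char line[128];
        FILE *f = fopen("/proc/vmstat", "r");

        *in = *out = -1;
        if (!f)
                return;
        while (fgets(line, sizeof(line), f)) {
                sscanf(line, "pswpin %ld", in);
                sscanf(line, "pswpout %ld", out);
        }
        fclose(f);
}

int main(void)
{
        long in1, out1, in2, out2;

        read_swap(&in1, &out1);
        sleep(10);                      /* sample window */
        read_swap(&in2, &out2);
        if (in1 < 0 || in2 < 0) {
                fprintf(stderr, "no swap counters in /proc/vmstat\n");
                return 1;
        }
        printf("swap B/W: %.1f MB/s\n",
               ((in2 - in1) + (out2 - out1)) * 4096.0 / 10.0 / 1e6);
        return 0;
}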

James
