All of lore.kernel.org
 help / color / mirror / Atom feed
* [Ksummit-discuss] Self nomination
@ 2016-07-25 17:11 Johannes Weiner
  2016-07-25 18:15 ` Rik van Riel
  2016-07-28 18:55 ` [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was " Johannes Weiner
  0 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2016-07-25 17:11 UTC (permalink / raw)
  To: ksummit-discuss

Hi,

I would like to self-nominate myself for this year's kernel summit.

I co-maintain cgroups and the memory controller and have been a
long-time contributor to the memory management subsystem. At Facebook,
I'm in charge of MM scalability and reliability in our fleet. Most
recently I have been working on reviving swap for SSDs and persistent
memory devices (https://lwn.net/Articles/690079/) as part of a bigger
anti-thrashing effort to make the VM recover swiftly and predictably
from load spikes. This has been a bit of a "lock yourself in a
basement" type project, which is why I missed the mechanimal
nomination based on sign-offs this year.

Thanks
Johannes

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] Self nomination
  2016-07-25 17:11 [Ksummit-discuss] Self nomination Johannes Weiner
@ 2016-07-25 18:15 ` Rik van Riel
  2016-07-26 10:56   ` Jan Kara
  2016-07-28 18:55 ` [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was " Johannes Weiner
  1 sibling, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2016-07-25 18:15 UTC (permalink / raw)
  To: Johannes Weiner, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 871 bytes --]

On Mon, 2016-07-25 at 13:11 -0400, Johannes Weiner wrote:
> Hi,
> 
> I would like to self-nominate myself for this year's kernel summit.
> 
> I co-maintain cgroups and the memory controller and have been a
> long-time contributor to the memory management subsystem. At
> Facebook,
> I'm in charge of MM scalability and reliability in our fleet. Most
> recently I have been working on reviving swap for SSDs and persistent
> memory devices (https://lwn.net/Articles/690079/) as part of a bigger
> anti-thrashing effort to make the VM recover swiftly and predictably
> from load spikes. This has been a bit of a "lock yourself in a
> basement" type project, which is why I missed the mechanimal
> nomination based on sign-offs this year.

I am interested in discussing that, either at kernel
summit, or at next year's LSF/MM.

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] Self nomination
  2016-07-25 18:15 ` Rik van Riel
@ 2016-07-26 10:56   ` Jan Kara
  2016-07-26 13:10     ` Vlastimil Babka
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2016-07-26 10:56 UTC (permalink / raw)
  To: Rik van Riel; +Cc: ksummit-discuss

On Mon 25-07-16 14:15:18, Rik van Riel wrote:
> On Mon, 2016-07-25 at 13:11 -0400, Johannes Weiner wrote:
> > Hi,
> > 
> > I would like to self-nominate myself for this year's kernel summit.
> > 
> > I co-maintain cgroups and the memory controller and have been a
> > long-time contributor to the memory management subsystem. At
> > Facebook,
> > I'm in charge of MM scalability and reliability in our fleet. Most
> > recently I have been working on reviving swap for SSDs and persistent
> > memory devices (https://lwn.net/Articles/690079/) as part of a bigger
> > anti-thrashing effort to make the VM recover swiftly and predictably
> > from load spikes. This has been a bit of a "lock yourself in a
> > basement" type project, which is why I missed the mechanimal
> > nomination based on sign-offs this year.
> 
> I am interested in discussing that, either at kernel
> summit, or at next year's LSF/MM.

Me as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] Self nomination
  2016-07-26 10:56   ` Jan Kara
@ 2016-07-26 13:10     ` Vlastimil Babka
  0 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2016-07-26 13:10 UTC (permalink / raw)
  To: Jan Kara, Rik van Riel; +Cc: ksummit-discuss

On 07/26/2016 12:56 PM, Jan Kara wrote:
> On Mon 25-07-16 14:15:18, Rik van Riel wrote:
>> On Mon, 2016-07-25 at 13:11 -0400, Johannes Weiner wrote:
>>> Hi,
>>>
>>> I would like to self-nominate myself for this year's kernel summit.
>>>
>>> I co-maintain cgroups and the memory controller and have been a
>>> long-time contributor to the memory management subsystem. At
>>> Facebook,
>>> I'm in charge of MM scalability and reliability in our fleet. Most
>>> recently I have been working on reviving swap for SSDs and persistent
>>> memory devices (https://lwn.net/Articles/690079/) as part of a bigger
>>> anti-thrashing effort to make the VM recover swiftly and predictably
>>> from load spikes. This has been a bit of a "lock yourself in a
>>> basement" type project, which is why I missed the mechanimal
>>> nomination based on sign-offs this year.
>>
>> I am interested in discussing that, either at kernel
>> summit, or at next year's LSF/MM.
>
> Me as well.

+1

>
> 								Honza
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-25 17:11 [Ksummit-discuss] Self nomination Johannes Weiner
  2016-07-25 18:15 ` Rik van Riel
@ 2016-07-28 18:55 ` Johannes Weiner
  2016-07-28 21:41   ` James Bottomley
                     ` (3 more replies)
  1 sibling, 4 replies; 20+ messages in thread
From: Johannes Weiner @ 2016-07-28 18:55 UTC (permalink / raw)
  To: ksummit-discuss

On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> Most recently I have been working on reviving swap for SSDs and
> persistent memory devices (https://lwn.net/Articles/690079/) as part
> of a bigger anti-thrashing effort to make the VM recover swiftly and
> predictably from load spikes.

A bit of context, in case we want to discuss this at KS:

We frequently have machines hang and stop responding indefinitely
after they experience memory load spikes. On closer look, we find most
tasks either in page reclaim or majorfaulting parts of an executable
or library. It's a typical thrashing pattern, where everybody
cannibalizes everybody else. The problem is that with fast storage the
cache reloads can be fast enough that there are never enough in-flight
pages at a time to cause page reclaim to fail and trigger the OOM
killer. The livelock persists until external remediation reboots the
box or we get lucky and non-cache allocations eventually suck up the
remaining page cache and trigger the OOM killer.

To avoid hitting this situation, we currently have to keep a generous
memory reserve for occasional spikes, which sucks for utilization the
rest of the time. Swap would be useful here, but the swapout code is
basically only triggering when memory pressure rises - which again
doesn't happen - so I've been working on the swap code to balance
cache reclaim vs. swap based on relative thrashing between the two.

There is usually some cold/unused anonymous memory lying around that
can be unloaded into swap during workload spikes, so that allows us to
drive up the average memory utilization without increasing the risk at
least. But if we screw up and there are not enough unused anon pages,
we are back to thrashing - only now it involves swapping too.

So how do we address this?

A pathological thrashing situation is very obvious to any user, but
it's not quite clear how to quantify it inside the kernel and have it
trigger the OOM killer. It might be useful to talk about
metrics. Could we quantify application progress? Could we quantify the
amount of time a task or the system spends thrashing, and somehow
express it as a percentage of overall execution time? Maybe something
comparable to IO wait time, except tracking the time spent performing
reclaim and waiting on IO that is refetching recently evicted pages?

This question seems to go beyond the memory subsystem and potentially
involve the scheduler and the block layer, so it might be a good tech
topic for KS.

Thanks

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-28 18:55 ` [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was " Johannes Weiner
@ 2016-07-28 21:41   ` James Bottomley
  2016-08-01 15:46     ` Johannes Weiner
  2016-07-29  0:25   ` Rik van Riel
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: James Bottomley @ 2016-07-28 21:41 UTC (permalink / raw)
  To: Johannes Weiner, ksummit-discuss

On Thu, 2016-07-28 at 14:55 -0400, Johannes Weiner wrote:
> On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> > Most recently I have been working on reviving swap for SSDs and
> > persistent memory devices (https://lwn.net/Articles/690079/) as
> > part
> > of a bigger anti-thrashing effort to make the VM recover swiftly
> > and
> > predictably from load spikes.
> 
> A bit of context, in case we want to discuss this at KS:
> 
> We frequently have machines hang and stop responding indefinitely
> after they experience memory load spikes. On closer look, we find 
> most tasks either in page reclaim or majorfaulting parts of an 
> executable or library. It's a typical thrashing pattern, where 
> everybody cannibalizes everybody else. The problem is that with fast 
> storage the cache reloads can be fast enough that there are never 
> enough in-flight pages at a time to cause page reclaim to fail and 
> trigger the OOM killer. The livelock persists until external
> remediation reboots the
> box or we get lucky and non-cache allocations eventually suck up the
> remaining page cache and trigger the OOM killer.
> 
> To avoid hitting this situation, we currently have to keep a generous
> memory reserve for occasional spikes, which sucks for utilization the
> rest of the time. Swap would be useful here, but the swapout code is
> basically only triggering when memory pressure rises - which again
> doesn't happen - so I've been working on the swap code to balance
> cache reclaim vs. swap based on relative thrashing between the two.
> 
> There is usually some cold/unused anonymous memory lying around that
> can be unloaded into swap during workload spikes, so that allows us 
> to drive up the average memory utilization without increasing the 
> risk at least. But if we screw up and there are not enough unused 
> anon pages, we are back to thrashing - only now it involves swapping
> too.
> 
> So how do we address this?
> 
> A pathological thrashing situation is very obvious to any user, but
> it's not quite clear how to quantify it inside the kernel and have it
> trigger the OOM killer. It might be useful to talk about metrics. 
> Could we quantify application progress? Could we quantify the amount 
> of time a task or the system spends thrashing, and somehow express it 
> as a percentage of overall execution time? Maybe something comparable 
> to IO wait time, except tracking the time spent performing reclaim
> and waiting on IO that is refetching recently evicted pages?
> 
> This question seems to go beyond the memory subsystem and potentially
> involve the scheduler and the block layer, so it might be a good tech
> topic for KS.

Actually, I'd be interested in this.  We're starting to generate use
cases in the container cloud for swap (I can't believe I'm saying this
since we hitherto regarded swap as wholly evil).  The issue is that we
want to load the system up into its overcommit region (it means two
things: either we're re-using under used resources or, more correctly,
we're reselling resources we sold to one customer, but they're not
using, so we can sell them to another).  From some research done within
IBM, it turns out there's a region where swapping is beneficial.  We
define it as the region where the B/W to swap doesn't exceed the B/W
capacity of the disk (is this the metric you're looking for?). 
 Surprisingly, this is a stable region, so we can actually operate the
physical system within this region.  It also turns out to be the ideal
region for operating overcommitted systems in because what appears to
be happening is that we're forcing allocated but unused objects (dirty
anonymous memory) out to swap.  The ideal cloud to run this in is one
which has a mix of soak jobs (background, best effort jobs, usually
analytics based) and highly interactive containers (usually web servers
or something).  We find that if we tune the swappiness of the memory
cgroup of the container to 0 for the interactive jobs, they show no
loss of throughput in this region.

Our definition of progress is a bit different from yours above because
the interactive jobs must respond as if they were near bare metal, so
we penalise the soak jobs.  However, we find that the soak jobs also
make reasonable progress according to your measure above (reasonable
enough means the customer is happy to pay for the time they've used).

James

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-28 18:55 ` [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was " Johannes Weiner
  2016-07-28 21:41   ` James Bottomley
@ 2016-07-29  0:25   ` Rik van Riel
  2016-07-29 11:07   ` Mel Gorman
  2016-08-02  9:18   ` Jan Kara
  3 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2016-07-29  0:25 UTC (permalink / raw)
  To: Johannes Weiner, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 3148 bytes --]

On Thu, 2016-07-28 at 14:55 -0400, Johannes Weiner wrote:
> On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> > Most recently I have been working on reviving swap for SSDs and
> > persistent memory devices (https://lwn.net/Articles/690079/) as
> > part
> > of a bigger anti-thrashing effort to make the VM recover swiftly
> > and
> > predictably from load spikes.
> 
> A bit of context, in case we want to discuss this at KS:
> 
> We frequently have machines hang and stop responding indefinitely
> after they experience memory load spikes. On closer look, we find
> most
> tasks either in page reclaim or majorfaulting parts of an executable
> or library. It's a typical thrashing pattern, where everybody
> cannibalizes everybody else. The problem is that with fast storage
> the
> cache reloads can be fast enough that there are never enough in-
> flight
> pages at a time to cause page reclaim to fail and trigger the OOM
> killer. The livelock persists until external remediation reboots the
> box or we get lucky and non-cache allocations eventually suck up the
> remaining page cache and trigger the OOM killer.
> 
> To avoid hitting this situation, we currently have to keep a generous
> memory reserve for occasional spikes, which sucks for utilization the
> rest of the time. Swap would be useful here, but the swapout code is
> basically only triggering when memory pressure rises - which again
> doesn't happen - so I've been working on the swap code to balance
> cache reclaim vs. swap based on relative thrashing between the two.
> 
> There is usually some cold/unused anonymous memory lying around that
> can be unloaded into swap during workload spikes, so that allows us
> to
> drive up the average memory utilization without increasing the risk
> at
> least. But if we screw up and there are not enough unused anon pages,
> we are back to thrashing - only now it involves swapping too.
> 
> So how do we address this?
> 
> A pathological thrashing situation is very obvious to any user, but
> it's not quite clear how to quantify it inside the kernel and have it
> trigger the OOM killer. It might be useful to talk about
> metrics. Could we quantify application progress? Could we quantify
> the
> amount of time a task or the system spends thrashing, and somehow
> express it as a percentage of overall execution time? Maybe something
> comparable to IO wait time, except tracking the time spent performing
> reclaim and waiting on IO that is refetching recently evicted pages?
> 
> This question seems to go beyond the memory subsystem and potentially
> involve the scheduler and the block layer, so it might be a good tech
> topic for KS.

I would like to discuss this topic, as well.

This is a very fundamental issue that used to be hard
coded in the BSDs (in the 1980s & 1990s), but where
hard coding is totally inappropriate with today's memory
sizes, and variation in I/O subsystem speeds.

Solving this, even if only on the detection side, could
make a real difference in having systems survive load
spikes.

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-28 18:55 ` [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was " Johannes Weiner
  2016-07-28 21:41   ` James Bottomley
  2016-07-29  0:25   ` Rik van Riel
@ 2016-07-29 11:07   ` Mel Gorman
  2016-07-29 16:26     ` Luck, Tony
  2016-08-01 16:55     ` Johannes Weiner
  2016-08-02  9:18   ` Jan Kara
  3 siblings, 2 replies; 20+ messages in thread
From: Mel Gorman @ 2016-07-29 11:07 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: ksummit-discuss

On Thu, Jul 28, 2016 at 02:55:23PM -0400, Johannes Weiner wrote:
> On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> > Most recently I have been working on reviving swap for SSDs and
> > persistent memory devices (https://lwn.net/Articles/690079/) as part
> > of a bigger anti-thrashing effort to make the VM recover swiftly and
> > predictably from load spikes.
> 
> A bit of context, in case we want to discuss this at KS:
> 

Even if it's not a dedicated topic, I'm interested in talking about
this.

> We frequently have machines hang and stop responding indefinitely
> after they experience memory load spikes. On closer look, we find most
> tasks either in page reclaim or majorfaulting parts of an executable
> or library. It's a typical thrashing pattern, where everybody
> cannibalizes everybody else. The problem is that with fast storage the
> cache reloads can be fast enough that there are never enough in-flight
> pages at a time to cause page reclaim to fail and trigger the OOM
> killer. The livelock persists until external remediation reboots the
> box or we get lucky and non-cache allocations eventually suck up the
> remaining page cache and trigger the OOM killer.
> 

This is fundamental to how we current track (or not track) pressure.
Unreclaimable is defined as excessive scanning without a page being
reclaimed which is useless with fast storage. This triggers when there is
so many dirty/writeback pages that reclaim is impossible which indirectly
depends on storage being slow.

Sure, it can still happen if every page is being activated before reaching
the end of the inactive list but that is close to impossible with large
memory sizes.

> To avoid hitting this situation, we currently have to keep a generous
> memory reserve for occasional spikes, which sucks for utilization the
> rest of the time. Swap would be useful here, but the swapout code is
> basically only triggering when memory pressure rises - which again
> doesn't happen - so I've been working on the swap code to balance
> cache reclaim vs. swap based on relative thrashing between the two.
> 

While we have active and inactive lists, they have no concept of time.
Inactive may be "has not been used in hours" or "deactivated recently due to
memory pressure". If we continually aged pages at a very slow rate (e.g. 1%
of a node per minute) in the absense of memory pressure we could create a
"unused" list without reclaiming it in the absense of pressure. We'd
also have to scan 1% part of the unused list at the same time and
reactivate pages if necessary.

Minimally, we'd have a very rough estimate of the true WSS as a bonus.
If we could forcible pageout/swap the unused list, potentially ignoring
swappiness for anon pages. With monitoring, an admin would be able to
estimate how large a spike a system can handle without impact. A crucial
aspect would be knowing the average age of the unused list though and I've
no good idea right now how to calculate that.

We could side-step the time issue slightly by only adding pages to the
unused list during the "continual background aging" scan and never when
reclaiming. Continual background aging should also not happen if any process
is reclaiming. If we tagged the time the unused list gets its first page
and the time of the most recently added page, that would at least give
us a *very* approximate age of the list. That is flawed unfortunately if
the first page added gets reactivated but there a few different ways we
could approximate the age (e.g. unused 1 minute, unused 5 minutes, unused
30 minutes lists).

> There is usually some cold/unused anonymous memory lying around that
> can be unloaded into swap during workload spikes, so that allows us to
> drive up the average memory utilization without increasing the risk at
> least. But if we screw up and there are not enough unused anon pages,
> we are back to thrashing - only now it involves swapping too.
> 
> So how do we address this?
> 
> A pathological thrashing situation is very obvious to any user, but
> it's not quite clear how to quantify it inside the kernel and have it
> trigger the OOM killer.

The OOM killer is at the extreme end of the spectrum. One unloved piece of
code is vmpressure.c which we never put that much effort into.  Ideally, that
would at least be able to notify user space that the system is under pressure
but I have anecdotal evidence that it gives bad advice on large systems.

Essentially, we have four bits of information related to memory pressure --
allocations, scans, steals and refaults. A 1:1:1 ratio of allocations, scans
and steals could just be a streaming workload. The refaults distinguish
between streaming and thrashing workloads but we don't use this for
vmpressure calculations or OOM detection.

> It might be useful to talk about
> metrics. Could we quantify application progress?

We can at least calculate if it's stalling on reclaim or refaults. High
amounts of both would indicate that the application is struggling.

> Could we quantify the
> amount of time a task or the system spends thrashing, and somehow
> express it as a percentage of overall execution time?

Potentially if time spent refaulting or direct reclaiming was accounted
for. What complicates this significantly is kswapd.

> Maybe something
> comparable to IO wait time, except tracking the time spent performing
> reclaim and waiting on IO that is refetching recently evicted pages?
> 

Ideally, yes.

> This question seems to go beyond the memory subsystem and potentially
> involve the scheduler and the block layer, so it might be a good tech
> topic for KS.
> 

I'm on board anyway.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-29 11:07   ` Mel Gorman
@ 2016-07-29 16:26     ` Luck, Tony
  2016-08-01 15:17       ` Rik van Riel
  2016-08-01 16:55     ` Johannes Weiner
  1 sibling, 1 reply; 20+ messages in thread
From: Luck, Tony @ 2016-07-29 16:26 UTC (permalink / raw)
  To: Mel Gorman; +Cc: ksummit-discuss

On Fri, Jul 29, 2016 at 12:07:24PM +0100, Mel Gorman wrote:
> On Thu, Jul 28, 2016 at 02:55:23PM -0400, Johannes Weiner wrote:
> > On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:

> > It might be useful to talk about
> > metrics. Could we quantify application progress?

The most reliable way to do that would be to have an actual
user mode program that runs, accessing some configurable number
of pages, periodically touching some file in /proc/sys/vm to
let the kernel know that some quantum of work had been completed.

Then the kernel would get accurate data on application progress
(at the cost of cpu time and memory consumed by this process, and
increased power usage when the system could otherwise be idle).

-Tony

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-29 16:26     ` Luck, Tony
@ 2016-08-01 15:17       ` Rik van Riel
  0 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2016-08-01 15:17 UTC (permalink / raw)
  To: Luck, Tony, Mel Gorman; +Cc: ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 1204 bytes --]

On Fri, 2016-07-29 at 09:26 -0700, Luck, Tony wrote:
> On Fri, Jul 29, 2016 at 12:07:24PM +0100, Mel Gorman wrote:
> > On Thu, Jul 28, 2016 at 02:55:23PM -0400, Johannes Weiner wrote:
> > > On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> 
> > > It might be useful to talk about
> > > metrics. Could we quantify application progress?
> 
> The most reliable way to do that would be to have an actual
> user mode program that runs, accessing some configurable number
> of pages, periodically touching some file in /proc/sys/vm to
> let the kernel know that some quantum of work had been completed.

I don't think there is a need for that.

We already keep track of how much user time and how
much system time a program uses, and how much time
it is stalled on IO.

If user time is low, a program is stalled on IO a
lot of the time, and a lot of the faults are refaults
(previously accessed memory), then we are thrashing.

If the program is not stalled on IO much, or is
accessing pages it has not previously accessed
before, it is not thrashing.

We probably have the right statistics already, unless
I am overlooking something.

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-28 21:41   ` James Bottomley
@ 2016-08-01 15:46     ` Johannes Weiner
  2016-08-01 16:06       ` James Bottomley
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2016-08-01 15:46 UTC (permalink / raw)
  To: James Bottomley; +Cc: ksummit-discuss

On Thu, Jul 28, 2016 at 05:41:43PM -0400, James Bottomley wrote:
> On Thu, 2016-07-28 at 14:55 -0400, Johannes Weiner wrote:
> > On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> > > Most recently I have been working on reviving swap for SSDs and
> > > persistent memory devices (https://lwn.net/Articles/690079/) as
> > > part
> > > of a bigger anti-thrashing effort to make the VM recover swiftly
> > > and
> > > predictably from load spikes.
> > 
> > A bit of context, in case we want to discuss this at KS:
> > 
> > We frequently have machines hang and stop responding indefinitely
> > after they experience memory load spikes. On closer look, we find 
> > most tasks either in page reclaim or majorfaulting parts of an 
> > executable or library. It's a typical thrashing pattern, where 
> > everybody cannibalizes everybody else. The problem is that with fast 
> > storage the cache reloads can be fast enough that there are never 
> > enough in-flight pages at a time to cause page reclaim to fail and 
> > trigger the OOM killer. The livelock persists until external
> > remediation reboots the
> > box or we get lucky and non-cache allocations eventually suck up the
> > remaining page cache and trigger the OOM killer.
> > 
> > To avoid hitting this situation, we currently have to keep a generous
> > memory reserve for occasional spikes, which sucks for utilization the
> > rest of the time. Swap would be useful here, but the swapout code is
> > basically only triggering when memory pressure rises - which again
> > doesn't happen - so I've been working on the swap code to balance
> > cache reclaim vs. swap based on relative thrashing between the two.
> > 
> > There is usually some cold/unused anonymous memory lying around that
> > can be unloaded into swap during workload spikes, so that allows us 
> > to drive up the average memory utilization without increasing the 
> > risk at least. But if we screw up and there are not enough unused 
> > anon pages, we are back to thrashing - only now it involves swapping
> > too.
> > 
> > So how do we address this?
> > 
> > A pathological thrashing situation is very obvious to any user, but
> > it's not quite clear how to quantify it inside the kernel and have it
> > trigger the OOM killer. It might be useful to talk about metrics. 
> > Could we quantify application progress? Could we quantify the amount 
> > of time a task or the system spends thrashing, and somehow express it 
> > as a percentage of overall execution time? Maybe something comparable 
> > to IO wait time, except tracking the time spent performing reclaim
> > and waiting on IO that is refetching recently evicted pages?
> > 
> > This question seems to go beyond the memory subsystem and potentially
> > involve the scheduler and the block layer, so it might be a good tech
> > topic for KS.
> 
> Actually, I'd be interested in this.  We're starting to generate use
> cases in the container cloud for swap (I can't believe I'm saying this
> since we hitherto regarded swap as wholly evil).  The issue is that we
> want to load the system up into its overcommit region (it means two
> things: either we're re-using under used resources or, more correctly,
> we're reselling resources we sold to one customer, but they're not
> using, so we can sell them to another).  From some research done within
> IBM, it turns out there's a region where swapping is beneficial.  We
> define it as the region where the B/W to swap doesn't exceed the B/W
> capacity of the disk (is this the metric you're looking for?).

That's an interesting take, I haven't thought about that. But note
that the CPU cost of evicting and refetching pages is not negligible:
even on fairly beefy machines we've seen significant CPU load when the
IO device hits saturation. With persistent memory devices you might
actually run out of CPU capacity while performing basic page aging
before you saturate the storage device (which is why Andi Kleen has
been suggesting to replace LRU reclaim with random replacement for
these devices). So storage device saturation might not be the final
answer to this problem.

> Our definition of progress is a bit different from yours above because
> the interactive jobs must respond as if they were near bare metal, so
> we penalise the soak jobs.  However, we find that the soak jobs also
> make reasonable progress according to your measure above (reasonable
> enough means the customer is happy to pay for the time they've used).

We actually are in the same boat, where most of our services are doing
work within the context of interactive user sessions. So in terms of
quantifying progress, both throughput and latency percentiles would be
necessary to form a full picture of whether we are beyond capacity.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-08-01 15:46     ` Johannes Weiner
@ 2016-08-01 16:06       ` James Bottomley
  2016-08-01 16:11         ` Dave Hansen
  0 siblings, 1 reply; 20+ messages in thread
From: James Bottomley @ 2016-08-01 16:06 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: ksummit-discuss

On Mon, 2016-08-01 at 11:46 -0400, Johannes Weiner wrote:
> On Thu, Jul 28, 2016 at 05:41:43PM -0400, James Bottomley wrote:
> > On Thu, 2016-07-28 at 14:55 -0400, Johannes Weiner wrote:
> > > On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> > > > Most recently I have been working on reviving swap for SSDs and
> > > > persistent memory devices (https://lwn.net/Articles/690079/) as
> > > > part of a bigger anti-thrashing effort to make the VM recover
> > > > swiftly and predictably from load spikes.
> > > 
> > > A bit of context, in case we want to discuss this at KS:
> > > 
> > > We frequently have machines hang and stop responding indefinitely
> > > after they experience memory load spikes. On closer look, we find
> > > most tasks either in page reclaim or majorfaulting parts of an 
> > > executable or library. It's a typical thrashing pattern, where 
> > > everybody cannibalizes everybody else. The problem is that with 
> > > fast storage the cache reloads can be fast enough that there are 
> > > never enough in-flight pages at a time to cause page reclaim to 
> > > fail and trigger the OOM killer. The livelock persists until 
> > > external remediation reboots the box or we get lucky and non
> > > -cache allocations eventually suck up the remaining page cache
> > > and trigger the OOM killer.
> > > 
> > > To avoid hitting this situation, we currently have to keep a 
> > > generous memory reserve for occasional spikes, which sucks for 
> > > utilization the rest of the time. Swap would be useful here, but 
> > > the swapout code is basically only triggering when memory 
> > > pressure rises - which again doesn't happen - so I've been 
> > > working on the swap code to balance cache reclaim vs. swap based 
> > > on relative thrashing between the two.
> > > 
> > > There is usually some cold/unused anonymous memory lying around 
> > > that can be unloaded into swap during workload spikes, so that 
> > > allows us to drive up the average memory utilization without 
> > > increasing the risk at least. But if we screw up and there are 
> > > not enough unused anon pages, we are back to thrashing - only now 
> > > it involves swapping too.
> > > 
> > > So how do we address this?
> > > 
> > > A pathological thrashing situation is very obvious to any user, 
> > > but it's not quite clear how to quantify it inside the kernel and
> > > have it trigger the OOM killer. It might be useful to talk about 
> > > metrics.  Could we quantify application progress? Could we 
> > > quantify the amount of time a task or the system spends 
> > > thrashing, and somehow express it as a percentage of overall 
> > > execution time? Maybe something comparable to IO wait time, 
> > > except tracking the time spent performing reclaim and waiting on
> > > IO that is refetching recently evicted pages?
> > > 
> > > This question seems to go beyond the memory subsystem and 
> > > potentially involve the scheduler and the block layer, so it 
> > > might be a good tech topic for KS.
> > 
> > Actually, I'd be interested in this.  We're starting to generate 
> > use cases in the container cloud for swap (I can't believe I'm 
> > saying this since we hitherto regarded swap as wholly evil).  The 
> > issue is that we want to load the system up into its overcommit 
> > region (it means two things: either we're re-using under used 
> > resources or, more correctly, we're reselling resources we sold to 
> > one customer, but they're not using, so we can sell them to 
> > another).  From some research done within IBM, it turns out there's 
> > a region where swapping is beneficial.   We define it as the region 
> > where the B/W to swap doesn't exceed the B/W capacity of the disk
> > (is this the metric you're looking for?).
> 
> That's an interesting take, I haven't thought about that. But note
> that the CPU cost of evicting and refetching pages is not negligible:
> even on fairly beefy machines we've seen significant CPU load when 
> the IO device hits saturation.

Right, but we're not looking to use swap as a kind of slightly more
expensive memory.  We're looking to push the system aggressively to
find its working set while we load it up with jobs.  This means we need
not very often referenced anonymous memory out on swap.  We use
standard SSDs, so if the anon memory refault rate goes too high, we
move from region 3 to region 4 (required swap B/W exceeds available
swap B/W)  and the system goes unstable (so we'd unload it a bit).

>  With persistent memory devices you might actually run out of CPU 
> capacity while performing basic page aging before you saturate the 
> storage device (which is why Andi Kleen has been suggesting to 
> replace LRU reclaim with random replacement for these devices). So 
> storage device saturation might not be the final answer to this
> problem.

We really wouldn't want this.  All cloud jobs seem to have memory they
allocate but rarely use, so we want the properties of the LRU list to
get this on swap so we can re-use the memory pages for something else. 
 A random replacement algorithm would play havoc with that.

Our biggest problem is the difficulty in forcing the system to push
anonymous stuff out to swap.  Linux really likes to hang on to its
anonymous pages and if you get too abrasive with it, it starts dumping
your file backed pages and causing refaults leading to instability
there instead.  We haven't yet played with the swappiness patches, but
we're hoping they will go some way towards fixing this.

> > Our definition of progress is a bit different from yours above 
> > because the interactive jobs must respond as if they were near bare 
> > metal, so we penalise the soak jobs.  However, we find that the 
> > soak jobs also make reasonable progress according to your measure 
> > above (reasonable enough means the customer is happy to pay for the 
> > time they've used).
> 
> We actually are in the same boat, where most of our services are 
> doing work within the context of interactive user sessions. So in 
> terms of quantifying progress, both throughput and latency 
> percentiles would be necessary to form a full picture of whether we
> are beyond capacity.

OK, so this region 3 work (where we can get the system stable with an
acceptable refault rate for the anonymous pages) is probably where you
want to be operating as well.

James

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re: Self nomination
  2016-08-01 16:06       ` James Bottomley
@ 2016-08-01 16:11         ` Dave Hansen
  2016-08-01 16:33           ` James Bottomley
  2016-08-01 17:08           ` Johannes Weiner
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Hansen @ 2016-08-01 16:11 UTC (permalink / raw)
  To: James Bottomley, Johannes Weiner; +Cc: Kleen, Andi, ksummit-discuss

On 08/01/2016 09:06 AM, James Bottomley wrote:
>>  With persistent memory devices you might actually run out of CPU 
>> > capacity while performing basic page aging before you saturate the 
>> > storage device (which is why Andi Kleen has been suggesting to 
>> > replace LRU reclaim with random replacement for these devices). So 
>> > storage device saturation might not be the final answer to this
>> > problem.
> We really wouldn't want this.  All cloud jobs seem to have memory they
> allocate but rarely use, so we want the properties of the LRU list to
> get this on swap so we can re-use the memory pages for something else. 
>  A random replacement algorithm would play havoc with that.

I don't want to put words in Andi's mouth, but what we want isn't
necessarily something that is random, but it's something that uses less
CPU to swap out a given page.

All the LRU scanning is expensive and doesn't scale particularly well,
and there are some situations where we should be willing to give up some
of the precision of the current LRU in order to increase the throughput
of reclaim in general.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re: Self nomination
  2016-08-01 16:11         ` Dave Hansen
@ 2016-08-01 16:33           ` James Bottomley
  2016-08-01 18:13             ` Rik van Riel
  2016-08-01 19:51             ` Dave Hansen
  2016-08-01 17:08           ` Johannes Weiner
  1 sibling, 2 replies; 20+ messages in thread
From: James Bottomley @ 2016-08-01 16:33 UTC (permalink / raw)
  To: Dave Hansen, Johannes Weiner; +Cc: Kleen, Andi, ksummit-discuss

On Mon, 2016-08-01 at 09:11 -0700, Dave Hansen wrote:
> On 08/01/2016 09:06 AM, James Bottomley wrote:
> > >  With persistent memory devices you might actually run out of CPU
> > > > capacity while performing basic page aging before you saturate
> > > > the 
> > > > storage device (which is why Andi Kleen has been suggesting to 
> > > > replace LRU reclaim with random replacement for these devices).
> > > > So 
> > > > storage device saturation might not be the final answer to this
> > > > problem.
> > We really wouldn't want this.  All cloud jobs seem to have memory 
> > they allocate but rarely use, so we want the properties of the LRU 
> > list to get this on swap so we can re-use the memory pages for 
> > something else.  A random replacement algorithm would play havoc
> > with that.
> 
> I don't want to put words in Andi's mouth, but what we want isn't
> necessarily something that is random, but it's something that uses 
> less CPU to swap out a given page.

OK, if it's more deterministic, I'll wait to see the proposal.

> All the LRU scanning is expensive and doesn't scale particularly
> well, and there are some situations where we should be willing to
> give up some of the precision of the current LRU in order to increase
> the throughput of reclaim in general.

Would some type of hinting mechanism work (say via madvise)? 
 MADV_DONTNEED may be good enough, but we could really do with
MADV_SWAP_OUT_NOW to indicate objects we really don't want.  I suppose
I can lose all my credibility by saying this would be the JVM: it knows
roughly the expected lifetime and access patterns and is well qualified
to mark objects as infrequently enough accessed to reside on swap.

I suppose another question is do we still want all of this to be page
based?  We moved to extents in filesystems a while ago, wouldn't some
extent based LRU mechanism be cheaper ... unfortunately it means
something has to try to come up with an idea of what an extent means (I
suspect it would be a bunch of virtually contiguous pages which have
the same expected LRU properties, but I'm thinking from the application
centric viewpoint).

James

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-29 11:07   ` Mel Gorman
  2016-07-29 16:26     ` Luck, Tony
@ 2016-08-01 16:55     ` Johannes Weiner
  1 sibling, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2016-08-01 16:55 UTC (permalink / raw)
  To: Mel Gorman; +Cc: ksummit-discuss

On Fri, Jul 29, 2016 at 12:07:24PM +0100, Mel Gorman wrote:
> On Thu, Jul 28, 2016 at 02:55:23PM -0400, Johannes Weiner wrote:
> > To avoid hitting this situation, we currently have to keep a generous
> > memory reserve for occasional spikes, which sucks for utilization the
> > rest of the time. Swap would be useful here, but the swapout code is
> > basically only triggering when memory pressure rises - which again
> > doesn't happen - so I've been working on the swap code to balance
> > cache reclaim vs. swap based on relative thrashing between the two.
> 
> While we have active and inactive lists, they have no concept of time.
> Inactive may be "has not been used in hours" or "deactivated recently due to
> memory pressure". If we continually aged pages at a very slow rate (e.g. 1%
> of a node per minute) in the absense of memory pressure we could create a
> "unused" list without reclaiming it in the absense of pressure. We'd
> also have to scan 1% part of the unused list at the same time and
> reactivate pages if necessary.
>
> Minimally, we'd have a very rough estimate of the true WSS as a bonus.

I fear that something like this would get into the "hardcoded"
territory that Rik mentioned. 1% per minute might be plenty to
distinguish hot and cold for some workloads, and too coarse for
others.

For WSS estimates to be meaningful, they need to be based on a
sampling interval that is connected to the time it takes to evict a
page and the time it takes to refetch it. Because if the access
frequencies of a workload are fairly spread out, kicking out the
colder pages and refetching them later to make room for hotter pages
in the meantime might be a good trade-off to make - especially when
stacking multiple (containerized) workloads onto a single machine.

The WSS of a workload over its lifetime might be several times the
available memory, but what you really care about is how much time you
are actually losing due to memory being underprovisioned for that
workload. If the frequency spectrum is compressed, you might be making
almost no progress at all. If it's spread out, the available memory
might still be mostly underutilized.

We don't have a concept of time in page aging right now, but AFAICS
introducing one would be the central part in making WSS estimation and
subsequent resource allocation work without costly trial and error.

> > There is usually some cold/unused anonymous memory lying around that
> > can be unloaded into swap during workload spikes, so that allows us to
> > drive up the average memory utilization without increasing the risk at
> > least. But if we screw up and there are not enough unused anon pages,
> > we are back to thrashing - only now it involves swapping too.
> > 
> > So how do we address this?
> > 
> > A pathological thrashing situation is very obvious to any user, but
> > it's not quite clear how to quantify it inside the kernel and have it
> > trigger the OOM killer.
> 
> The OOM killer is at the extreme end of the spectrum. One unloved piece of
> code is vmpressure.c which we never put that much effort into.  Ideally, that
> would at least be able to notify user space that the system is under pressure
> but I have anecdotal evidence that it gives bad advice on large systems.

Bringing in the OOM killer doesn't preclude advance notification. But
severe thrashing *is* an OOM situation that can only be handled by
reducing the number of concurrent page references going on. If the
user can help out, that's great, but the OOM killer should still be
the last line of defense to bring the system back into a stable state.

> Essentially, we have four bits of information related to memory pressure --
> allocations, scans, steals and refaults. A 1:1:1 ratio of allocations, scans
> and steals could just be a streaming workload. The refaults distinguish
> between streaming and thrashing workloads but we don't use this for
> vmpressure calculations or OOM detection.

The information we have right now can tell us whether the workingset
is stable or not, and thus whether we should challenge the currently
protected pages or not. What we can't do is tell whether the thrashing
is an acceptable transition between too workingsets or a sustained
instability. The answer to that lies on a subjective spectrum.

Consider a workload that is accessing two datasets alternatingly, like
a database user that is switching back and forth between two tables to
process their data. If evicting one table and loading the other from
storage takes up 1% of the task's time, and processing the data the
other 99%, then we can likely provision memory such that it can hold
one table at a time. If evicting and reloading takes up 10% of the
time, it might still be fine; they might only care about latency while
the active table is loaded, or they might prioritize another job over
this one. If evicting and refetching consumes 95% of the task's time,
we might want to look into giving it more RAM.

So yes, with mm/workingset.c we finally have all the information to
unambiguously identify which VM events are due to memory being
underprovisioned. But we need a concept of time to put the impact of
these events into perspective. And I'm arguing that that perspective
is overall execution time of the tasks in the system (or container),
to calculate the percentage of time lost due to underprovisioning.

> > It might be useful to talk about
> > metrics. Could we quantify application progress?
> 
> We can at least calculate if it's stalling on reclaim or refaults. High
> amounts of both would indicate that the application is struggling.

Again: or transitioning.

> > Could we quantify the
> > amount of time a task or the system spends thrashing, and somehow
> > express it as a percentage of overall execution time?
> 
> Potentially if time spent refaulting or direct reclaiming was accounted
> for. What complicates this significantly is kswapd.

Kswapd is a shared resource, but memory is as well. Whatever concept
of time we can come up with that works for memory should be on the
same scope as kswapd. E.g. potentially available time slices in the
system (or container).

> > This question seems to go beyond the memory subsystem and potentially
> > involve the scheduler and the block layer, so it might be a good tech
> > topic for KS.
> 
> I'm on board anyway.

Great!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re: Self nomination
  2016-08-01 16:11         ` Dave Hansen
  2016-08-01 16:33           ` James Bottomley
@ 2016-08-01 17:08           ` Johannes Weiner
  2016-08-01 18:19             ` Johannes Weiner
  1 sibling, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2016-08-01 17:08 UTC (permalink / raw)
  To: Dave Hansen; +Cc: James Bottomley, Kleen, Andi, ksummit-discuss

On Mon, Aug 01, 2016 at 09:11:32AM -0700, Dave Hansen wrote:
> On 08/01/2016 09:06 AM, James Bottomley wrote:
> >>  With persistent memory devices you might actually run out of CPU 
> >> > capacity while performing basic page aging before you saturate the 
> >> > storage device (which is why Andi Kleen has been suggesting to 
> >> > replace LRU reclaim with random replacement for these devices). So 
> >> > storage device saturation might not be the final answer to this
> >> > problem.
> > We really wouldn't want this.  All cloud jobs seem to have memory they
> > allocate but rarely use, so we want the properties of the LRU list to
> > get this on swap so we can re-use the memory pages for something else. 
> >  A random replacement algorithm would play havoc with that.
> 
> I don't want to put words in Andi's mouth, but what we want isn't
> necessarily something that is random, but it's something that uses less
> CPU to swap out a given page.

Random eviction doesn't mean random outcome of what stabilizes in
memory and swap. The idea is to apply pressure on all pages equally
but in no particular order, and then the in-memory set forms based on
reference frequencies and refaults/swapins.

Our anon LRU approximation can be so inaccurate as to be doing that
already anyway, only with all the overhead of having an LRU list.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re: Self nomination
  2016-08-01 16:33           ` James Bottomley
@ 2016-08-01 18:13             ` Rik van Riel
  2016-08-01 19:51             ` Dave Hansen
  1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2016-08-01 18:13 UTC (permalink / raw)
  To: James Bottomley, Dave Hansen, Johannes Weiner
  Cc: Kleen, Andi, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 2417 bytes --]

On Mon, 2016-08-01 at 12:33 -0400, James Bottomley wrote:
> On Mon, 2016-08-01 at 09:11 -0700, Dave Hansen wrote:
> > On 08/01/2016 09:06 AM, James Bottomley wrote:
> > > >  With persistent memory devices you might actually run out of
> > > > CPU
> > > > > capacity while performing basic page aging before you
> > > > > saturate
> > > > > the 
> > > > > storage device (which is why Andi Kleen has been suggesting
> > > > > to 
> > > > > replace LRU reclaim with random replacement for these
> > > > > devices).
> > > > > So 
> > > > > storage device saturation might not be the final answer to
> > > > > this
> > > > > problem.
> > > We really wouldn't want this.  All cloud jobs seem to have
> > > memory 
> > > they allocate but rarely use, so we want the properties of the
> > > LRU 
> > > list to get this on swap so we can re-use the memory pages for 
> > > something else.  A random replacement algorithm would play havoc
> > > with that.
> > 
> > I don't want to put words in Andi's mouth, but what we want isn't
> > necessarily something that is random, but it's something that uses 
> > less CPU to swap out a given page.
> 
> OK, if it's more deterministic, I'll wait to see the proposal.
> 
> > All the LRU scanning is expensive and doesn't scale particularly
> > well, and there are some situations where we should be willing to
> > give up some of the precision of the current LRU in order to
> > increase
> > the throughput of reclaim in general.
> 
> Would some type of hinting mechanism work (say via madvise)? 

I suspect that might introduce overhead in other ways.

> I suppose another question is do we still want all of this to be page
> based?  We moved to extents in filesystems a while ago, wouldn't some
> extent based LRU mechanism be cheaper ... unfortunately it means
> something has to try to come up with an idea of what an extent means
> (I
> suspect it would be a bunch of virtually contiguous pages which have
> the same expected LRU properties, but I'm thinking from the
> application
> centric viewpoint).
> 
On sufficiently fast swap, we could just swap 2MB pages,
or whatever size THP is on the architecture in question,
in and out of memory.

Working with blocks 512x the size of a 4kB page might
be enough of a scalability gain to match the faster IO
speeds of new storage.

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re: Self nomination
  2016-08-01 17:08           ` Johannes Weiner
@ 2016-08-01 18:19             ` Johannes Weiner
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2016-08-01 18:19 UTC (permalink / raw)
  To: Dave Hansen; +Cc: James Bottomley, Kleen, Andi, ksummit-discuss

On Mon, Aug 01, 2016 at 01:08:46PM -0400, Johannes Weiner wrote:
> On Mon, Aug 01, 2016 at 09:11:32AM -0700, Dave Hansen wrote:
> > On 08/01/2016 09:06 AM, James Bottomley wrote:
> > >>  With persistent memory devices you might actually run out of CPU 
> > >> > capacity while performing basic page aging before you saturate the 
> > >> > storage device (which is why Andi Kleen has been suggesting to 
> > >> > replace LRU reclaim with random replacement for these devices). So 
> > >> > storage device saturation might not be the final answer to this
> > >> > problem.
> > > We really wouldn't want this.  All cloud jobs seem to have memory they
> > > allocate but rarely use, so we want the properties of the LRU list to
> > > get this on swap so we can re-use the memory pages for something else. 
> > >  A random replacement algorithm would play havoc with that.
> > 
> > I don't want to put words in Andi's mouth, but what we want isn't
> > necessarily something that is random, but it's something that uses less
> > CPU to swap out a given page.
> 
> Random eviction doesn't mean random outcome of what stabilizes in
> memory and swap. The idea is to apply pressure on all pages equally
> but in no particular order, and then the in-memory set forms based on
> reference frequencies and refaults/swapins.

Anyway, this is getting a little off-topic.

I only brought up CPU cost to make the point that, while sustained
swap-in rate might be a good signal to unload a machine or reschedule
a job elsewhere, it might not be a generic answer to the question of
how much a system's overall progress is actually impeded due to
somebody swapping; or whether the system is actually in a livelock
state that requires intervention by the OOM killer.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re: Self nomination
  2016-08-01 16:33           ` James Bottomley
  2016-08-01 18:13             ` Rik van Riel
@ 2016-08-01 19:51             ` Dave Hansen
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Hansen @ 2016-08-01 19:51 UTC (permalink / raw)
  To: James Bottomley, Johannes Weiner; +Cc: Kleen, Andi, ksummit-discuss

On 08/01/2016 09:33 AM, James Bottomley wrote:
>> All the LRU scanning is expensive and doesn't scale particularly
>> well, and there are some situations where we should be willing to
>> give up some of the precision of the current LRU in order to increase
>> the throughput of reclaim in general.
> 
> Would some type of hinting mechanism work (say via madvise)? 
>  MADV_DONTNEED may be good enough, but we could really do with
> MADV_SWAP_OUT_NOW to indicate objects we really don't want.  I suppose
> I can lose all my credibility by saying this would be the JVM: it knows
> roughly the expected lifetime and access patterns and is well qualified
> to mark objects as infrequently enough accessed to reside on swap.

I don't think MADV_DONTNEED is a good fit because it is destructive.  It
does seem like we are missing a true companion to MADV_WILLNEED which
would give memory a push in the direction of being swapped out.

But I don't think it's too crazy to expect apps to participate.  They
certainly have the potential to know more about their data than the
kernel does, and things like GPUs are already pretty actively optimizing
by moving memory around.

> I suppose another question is do we still want all of this to be page
> based?  We moved to extents in filesystems a while ago, wouldn't some
> extent based LRU mechanism be cheaper ... unfortunately it means
> something has to try to come up with an idea of what an extent means (I
> suspect it would be a bunch of virtually contiguous pages which have
> the same expected LRU properties, but I'm thinking from the application
> centric viewpoint).

One part of this (certainly not the _only_ one) is expanding where
transparent huge pages can be used.  That's one extent definition that's
relatively easy to agree on.

Past that, there are lots of things we can try (including something like
you've suggested), but I don't think anybody knows what will work yet.
There is no shortage of ideas.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was Re:  Self nomination
  2016-07-28 18:55 ` [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was " Johannes Weiner
                     ` (2 preceding siblings ...)
  2016-07-29 11:07   ` Mel Gorman
@ 2016-08-02  9:18   ` Jan Kara
  3 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2016-08-02  9:18 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: ksummit-discuss

On Thu 28-07-16 14:55:23, Johannes Weiner wrote:
> On Mon, Jul 25, 2016 at 01:11:42PM -0400, Johannes Weiner wrote:
> > Most recently I have been working on reviving swap for SSDs and
> > persistent memory devices (https://lwn.net/Articles/690079/) as part
> > of a bigger anti-thrashing effort to make the VM recover swiftly and
> > predictably from load spikes.
> 
> A bit of context, in case we want to discuss this at KS:
> 
> We frequently have machines hang and stop responding indefinitely
> after they experience memory load spikes. On closer look, we find most
> tasks either in page reclaim or majorfaulting parts of an executable
> or library. It's a typical thrashing pattern, where everybody
> cannibalizes everybody else. The problem is that with fast storage the
> cache reloads can be fast enough that there are never enough in-flight
> pages at a time to cause page reclaim to fail and trigger the OOM
> killer. The livelock persists until external remediation reboots the
> box or we get lucky and non-cache allocations eventually suck up the
> remaining page cache and trigger the OOM killer.
> 
> To avoid hitting this situation, we currently have to keep a generous
> memory reserve for occasional spikes, which sucks for utilization the
> rest of the time. Swap would be useful here, but the swapout code is
> basically only triggering when memory pressure rises - which again
> doesn't happen - so I've been working on the swap code to balance
> cache reclaim vs. swap based on relative thrashing between the two.
> 
> There is usually some cold/unused anonymous memory lying around that
> can be unloaded into swap during workload spikes, so that allows us to
> drive up the average memory utilization without increasing the risk at
> least. But if we screw up and there are not enough unused anon pages,
> we are back to thrashing - only now it involves swapping too.
> 
> So how do we address this?
> 
> A pathological thrashing situation is very obvious to any user, but
> it's not quite clear how to quantify it inside the kernel and have it
> trigger the OOM killer. It might be useful to talk about
> metrics. Could we quantify application progress? Could we quantify the
> amount of time a task or the system spends thrashing, and somehow
> express it as a percentage of overall execution time? Maybe something
> comparable to IO wait time, except tracking the time spent performing
> reclaim and waiting on IO that is refetching recently evicted pages?
> 
> This question seems to go beyond the memory subsystem and potentially
> involve the scheduler and the block layer, so it might be a good tech
> topic for KS.

I'd be interested to join this discussion.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-08-02  9:18 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-25 17:11 [Ksummit-discuss] Self nomination Johannes Weiner
2016-07-25 18:15 ` Rik van Riel
2016-07-26 10:56   ` Jan Kara
2016-07-26 13:10     ` Vlastimil Babka
2016-07-28 18:55 ` [Ksummit-discuss] [TECH TOPIC] Memory thrashing, was " Johannes Weiner
2016-07-28 21:41   ` James Bottomley
2016-08-01 15:46     ` Johannes Weiner
2016-08-01 16:06       ` James Bottomley
2016-08-01 16:11         ` Dave Hansen
2016-08-01 16:33           ` James Bottomley
2016-08-01 18:13             ` Rik van Riel
2016-08-01 19:51             ` Dave Hansen
2016-08-01 17:08           ` Johannes Weiner
2016-08-01 18:19             ` Johannes Weiner
2016-07-29  0:25   ` Rik van Riel
2016-07-29 11:07   ` Mel Gorman
2016-07-29 16:26     ` Luck, Tony
2016-08-01 15:17       ` Rik van Riel
2016-08-01 16:55     ` Johannes Weiner
2016-08-02  9:18   ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.