linux-kernel.vger.kernel.org archive mirror
* Detecting page cache trashing state
@ 2017-09-15  0:16 Taras Kondratiuk
  2017-09-15 11:55 ` Zdenek Kabelac
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Taras Kondratiuk @ 2017-09-15  0:16 UTC (permalink / raw)
  To: linux-mm; +Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

Hi

In our devices under low memory conditions we often get into a trashing
state where the system spends most of its time re-reading pages of .text
sections from a file system (squashfs in our case). The working set doesn't
fit into the available page cache, so this is expected. The issue is that
the OOM killer doesn't get triggered because there is still memory left to
reclaim. The system may stay stuck in this state for quite some time and
usually dies because of watchdogs.

We are trying to detect such a trashing state early in order to take some
preventive actions. It should be a pretty common issue, but so far we
haven't found any existing VM/IO statistics that can reliably detect such
a state.

Most metrics provide absolute values: number/rate of page faults, rate of
IO operations, number of stolen pages, etc. For a specific device
configuration we can determine threshold values for those parameters that
detect the trashing state, but that is not feasible for hundreds of device
configurations.

We are looking for a relative metric like "percent of CPU time spent
handling major page faults". With such a relative metric we could use a
common threshold across all devices. For now we have added such a metric
to /proc/stat in our kernel, but we would like to find some mechanism
available in the upstream kernel.
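
To illustrate the kind of metric we have in mind, below is a rough
userspace sketch of how such a value could be consumed. The extra time
field it expects at the end of the "cpu" line of /proc/stat is our
local, hypothetical extension and does not exist upstream; only the
sampling/threshold approach is the point here.

/*
 * Illustrative sketch only: samples a HYPOTHETICAL per-CPU time field
 * (appended as an 11th value to the "cpu" line of /proc/stat by our
 * local patch) and computes the share of CPU time spent handling
 * major page faults over a one-second window.  Upstream kernels do
 * not export such a field.
 */
#include <stdio.h>
#include <unistd.h>

static int read_cpu_times(unsigned long long *total, unsigned long long *majfault)
{
	unsigned long long v[11] = { 0 };
	FILE *f = fopen("/proc/stat", "r");
	int n, i;

	if (!f)
		return -1;
	n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
		   &v[6], &v[7], &v[8], &v[9], &v[10]);
	fclose(f);
	if (n < 11)
		return -1;	/* extra field not present: stock kernel */
	*total = 0;
	for (i = 0; i < 11; i++)
		*total += v[i];
	*majfault = v[10];	/* the hypothetical extra field */
	return 0;
}

int main(void)
{
	unsigned long long t0, m0, t1, m1;

	if (read_cpu_times(&t0, &m0))
		return 1;
	sleep(1);
	if (read_cpu_times(&t1, &m1))
		return 1;
	printf("majfault time: %.1f%%\n",
	       t1 > t0 ? 100.0 * (m1 - m0) / (t1 - t0) : 0.0);
	return 0;
}

With the result expressed as a percentage of CPU time, a single
threshold (say, 30%) can be shared across very different device
configurations.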

Has somebody faced a similar issue? How are you solving it?


* Re: Detecting page cache trashing state
  2017-09-15  0:16 Detecting page cache trashing state Taras Kondratiuk
@ 2017-09-15 11:55 ` Zdenek Kabelac
  2017-09-15 14:22 ` Daniel Walker
  2017-09-15 14:36 ` Michal Hocko
  2 siblings, 0 replies; 21+ messages in thread
From: Zdenek Kabelac @ 2017-09-15 11:55 UTC (permalink / raw)
  To: Taras Kondratiuk, linux-mm
  Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

Dne 15.9.2017 v 02:16 Taras Kondratiuk napsal(a):
> Hi
> 
> In our devices under low memory conditions we often get into a trashing
> state when system spends most of the time re-reading pages of .text
> sections from a file system (squashfs in our case). Working set doesn't
> fit into available page cache, so it is expected. The issue is that
> OOM killer doesn't get triggered because there is still memory for
> reclaiming. System may stuck in this state for a quite some time and
> usually dies because of watchdogs.
> 
> We are trying to detect such trashing state early to take some
> preventive actions. It should be a pretty common issue, but for now we
> haven't find any existing VM/IO statistics that can reliably detect such
> state.
> 
> Most of metrics provide absolute values: number/rate of page faults,
> rate of IO operations, number of stolen pages, etc. For a specific
> device configuration we can determine threshold values for those
> parameters that will detect trashing state, but it is not feasible for
> hundreds of device configurations.
> 
> We are looking for some relative metric like "percent of CPU time spent
> handling major page faults". With such relative metric we could use a
> common threshold across all devices. For now we have added such metric
> to /proc/stat in our kernel, but we would like to find some mechanism
> available in upstream kernel.
> 
> Has somebody faced similar issue? How are you solving it?
> 
Hi

Well, I witness this when running Firefox & Thunderbird on my desktop for a 
while on a machine with just 4G of RAM, until these two apps eat up all the 
free RAM...

It gets to the point (when I open a new tab) where the mouse hardly moves - 
kswapd eats CPU (I have no swap, in fact - so it is likely just page caching).

The only 'quick' solution for me as a desktop user is to manually invoke the 
OOM killer with the SysRq+F key - and I'm also wondering why the system is 
not reacting better. In most cases it kills one of those two apps - but 
sometimes it kills the whole Xsession...


Regards

Zdenek


* Re: Detecting page cache trashing state
  2017-09-15  0:16 Detecting page cache trashing state Taras Kondratiuk
  2017-09-15 11:55 ` Zdenek Kabelac
@ 2017-09-15 14:22 ` Daniel Walker
  2017-09-15 16:38   ` Taras Kondratiuk
  2017-09-15 14:36 ` Michal Hocko
  2 siblings, 1 reply; 21+ messages in thread
From: Daniel Walker @ 2017-09-15 14:22 UTC (permalink / raw)
  To: Taras Kondratiuk, linux-mm
  Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

On 09/14/2017 05:16 PM, Taras Kondratiuk wrote:
> Hi
>
> In our devices under low memory conditions we often get into a trashing
> state when system spends most of the time re-reading pages of .text
> sections from a file system (squashfs in our case). Working set doesn't
> fit into available page cache, so it is expected. The issue is that
> OOM killer doesn't get triggered because there is still memory for
> reclaiming. System may stuck in this state for a quite some time and
> usually dies because of watchdogs.
>
> We are trying to detect such trashing state early to take some
> preventive actions. It should be a pretty common issue, but for now we
> haven't find any existing VM/IO statistics that can reliably detect such
> state.
>
> Most of metrics provide absolute values: number/rate of page faults,
> rate of IO operations, number of stolen pages, etc. For a specific
> device configuration we can determine threshold values for those
> parameters that will detect trashing state, but it is not feasible for
> hundreds of device configurations.
>
> We are looking for some relative metric like "percent of CPU time spent
> handling major page faults". With such relative metric we could use a
> common threshold across all devices. For now we have added such metric
> to /proc/stat in our kernel, but we would like to find some mechanism
> available in upstream kernel.
>
> Has somebody faced similar issue? How are you solving it?


Did you make any attempt to tune swappiness?

Documentation/sysctl/vm.txt

swappiness

This control is used to define how aggressively the kernel will swap
memory pages.  Higher values will increase aggressiveness, lower values
decrease the amount of swap.

The default value is 60.
=======================================================

Since you're using squashfs, I would guess that's going to act like swap. 
The default value of 60 is most likely tuned for x86 servers, which may not 
be a good value for some other devices.
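
If you want to experiment with it, the knob can be changed at runtime;
a minimal sketch (the value 10 is just an example - whether a lower or
higher setting helps depends entirely on the workload):

/* Sketch: set vm.swappiness at runtime, equivalent to
 * "sysctl vm.swappiness=10".  Requires root. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/swappiness", "w");

	if (!f) {
		perror("/proc/sys/vm/swappiness");
		return 1;
	}
	fprintf(f, "%d\n", 10);	/* example value only */
	fclose(f);
	return 0;
}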


Daniel


* Re: Detecting page cache trashing state
  2017-09-15  0:16 Detecting page cache trashing state Taras Kondratiuk
  2017-09-15 11:55 ` Zdenek Kabelac
  2017-09-15 14:22 ` Daniel Walker
@ 2017-09-15 14:36 ` Michal Hocko
  2017-09-15 17:28   ` Taras Kondratiuk
  2017-09-15 21:20   ` vcaputo
  2 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2017-09-15 14:36 UTC (permalink / raw)
  To: Taras Kondratiuk
  Cc: linux-mm, xe-linux-external, Ruslan Ruslichenko, linux-kernel

On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> Hi
> 
> In our devices under low memory conditions we often get into a trashing
> state when system spends most of the time re-reading pages of .text
> sections from a file system (squashfs in our case). Working set doesn't
> fit into available page cache, so it is expected. The issue is that
> OOM killer doesn't get triggered because there is still memory for
> reclaiming. System may stuck in this state for a quite some time and
> usually dies because of watchdogs.
> 
> We are trying to detect such trashing state early to take some
> preventive actions. It should be a pretty common issue, but for now we
> haven't find any existing VM/IO statistics that can reliably detect such
> state.
> 
> Most of metrics provide absolute values: number/rate of page faults,
> rate of IO operations, number of stolen pages, etc. For a specific
> device configuration we can determine threshold values for those
> parameters that will detect trashing state, but it is not feasible for
> hundreds of device configurations.
> 
> We are looking for some relative metric like "percent of CPU time spent
> handling major page faults". With such relative metric we could use a
> common threshold across all devices. For now we have added such metric
> to /proc/stat in our kernel, but we would like to find some mechanism
> available in upstream kernel.
> 
> Has somebody faced similar issue? How are you solving it?

Yes, this has been a pain point for a _long_ time, and we still do not
have a good answer upstream. Johannes has been playing in this area [1].
The main problem is that our OOM detection logic is based on the ability
to reclaim memory in order to allocate new memory, and that ability is
pretty much always there for the page cache when you are trashing. So we
do not know that basically the whole time is spent refaulting memory back
and forth. We do have some refault stats for the page cache, but they are
not integrated into the OOM detection logic, because this is really a
non-trivial problem to solve without triggering early OOM killer
invocations.

[1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
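
(For completeness: the refault stats mentioned above are exported via
/proc/vmstat as workingset_refault / workingset_activate. A rough
sketch of sampling their rate from userspace follows - note the result
is still an absolute event count rather than a measure of lost time,
which is exactly why it is hard to act on.)

/* Rough sketch: sample the page cache refault counter from
 * /proc/vmstat once per second.  These are the stats referred to
 * above; they count events, not time lost to refaults. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long vmstat_read(const char *name)
{
	char key[64];
	unsigned long long val = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, name))
			break;
		val = 0;
	}
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long long prev = vmstat_read("workingset_refault");

	for (;;) {
		unsigned long long cur;

		sleep(1);
		cur = vmstat_read("workingset_refault");
		printf("refaults/s: %llu\n", cur - prev);
		prev = cur;
	}
	return 0;
}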
-- 
Michal Hocko
SUSE Labs


* Re: Detecting page cache trashing state
  2017-09-15 14:22 ` Daniel Walker
@ 2017-09-15 16:38   ` Taras Kondratiuk
  2017-09-15 17:31     ` Daniel Walker
  0 siblings, 1 reply; 21+ messages in thread
From: Taras Kondratiuk @ 2017-09-15 16:38 UTC (permalink / raw)
  To: Daniel Walker, linux-mm
  Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

Quoting Daniel Walker (2017-09-15 07:22:27)
> On 09/14/2017 05:16 PM, Taras Kondratiuk wrote:
> > Hi
> >
> > In our devices under low memory conditions we often get into a trashing
> > state when system spends most of the time re-reading pages of .text
> > sections from a file system (squashfs in our case). Working set doesn't
> > fit into available page cache, so it is expected. The issue is that
> > OOM killer doesn't get triggered because there is still memory for
> > reclaiming. System may stuck in this state for a quite some time and
> > usually dies because of watchdogs.
> >
> > We are trying to detect such trashing state early to take some
> > preventive actions. It should be a pretty common issue, but for now we
> > haven't find any existing VM/IO statistics that can reliably detect such
> > state.
> >
> > Most of metrics provide absolute values: number/rate of page faults,
> > rate of IO operations, number of stolen pages, etc. For a specific
> > device configuration we can determine threshold values for those
> > parameters that will detect trashing state, but it is not feasible for
> > hundreds of device configurations.
> >
> > We are looking for some relative metric like "percent of CPU time spent
> > handling major page faults". With such relative metric we could use a
> > common threshold across all devices. For now we have added such metric
> > to /proc/stat in our kernel, but we would like to find some mechanism
> > available in upstream kernel.
> >
> > Has somebody faced similar issue? How are you solving it?
> 
> 
> Did you make any attempt to tune swappiness ?
> 
> Documentation/sysctl/vm.txt
> 
> swappiness
> 
> This control is used to define how aggressive the kernel will swap
> memory pages.  Higher values will increase agressiveness, lower values
> decrease the amount of swap.
> 
> The default value is 60.
> =======================================================
> 
> Since your using squashfs I would guess that's going to act like swap. 
> The default tune of 60 is most likely for x86 servers which may not be a 
> good value for some other device.

Swap is disabled in our systems, so anonymous pages can't be evicted.
As far as I understand, the swappiness tunable is irrelevant in that case.

Even with swap enabled, tuning swappiness can't help much in this case. If
the working set doesn't fit into the available page cache, we will hit the
same trashing state.


* Re: Detecting page cache trashing state
  2017-09-15 14:36 ` Michal Hocko
@ 2017-09-15 17:28   ` Taras Kondratiuk
  2017-09-18 16:34     ` Johannes Weiner
  2017-09-15 21:20   ` vcaputo
  1 sibling, 1 reply; 21+ messages in thread
From: Taras Kondratiuk @ 2017-09-15 17:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, xe-linux-external, Ruslan Ruslichenko, linux-kernel

Quoting Michal Hocko (2017-09-15 07:36:19)
> On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > Hi
> > 
> > In our devices under low memory conditions we often get into a trashing
> > state when system spends most of the time re-reading pages of .text
> > sections from a file system (squashfs in our case). Working set doesn't
> > fit into available page cache, so it is expected. The issue is that
> > OOM killer doesn't get triggered because there is still memory for
> > reclaiming. System may stuck in this state for a quite some time and
> > usually dies because of watchdogs.
> > 
> > We are trying to detect such trashing state early to take some
> > preventive actions. It should be a pretty common issue, but for now we
> > haven't find any existing VM/IO statistics that can reliably detect such
> > state.
> > 
> > Most of metrics provide absolute values: number/rate of page faults,
> > rate of IO operations, number of stolen pages, etc. For a specific
> > device configuration we can determine threshold values for those
> > parameters that will detect trashing state, but it is not feasible for
> > hundreds of device configurations.
> > 
> > We are looking for some relative metric like "percent of CPU time spent
> > handling major page faults". With such relative metric we could use a
> > common threshold across all devices. For now we have added such metric
> > to /proc/stat in our kernel, but we would like to find some mechanism
> > available in upstream kernel.
> > 
> > Has somebody faced similar issue? How are you solving it?
> 
> Yes this is a pain point for a _long_ time. And we still do not have a
> good answer upstream. Johannes has been playing in this area [1].
> The main problem is that our OOM detection logic is based on the ability
> to reclaim memory to allocate new memory. And that is pretty much true
> for the pagecache when you are trashing. So we do not know that
> basically whole time is spent refaulting the memory back and forth.
> We do have some refault stats for the page cache but that is not
> integrated to the oom detection logic because this is really a
> non-trivial problem to solve without triggering early oom killer
> invocations.
> 
> [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org

Thanks Michal. memdelay looks promising. We will check it.


* Re: Detecting page cache trashing state
  2017-09-15 16:38   ` Taras Kondratiuk
@ 2017-09-15 17:31     ` Daniel Walker
  0 siblings, 0 replies; 21+ messages in thread
From: Daniel Walker @ 2017-09-15 17:31 UTC (permalink / raw)
  To: Taras Kondratiuk, linux-mm
  Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

On 09/15/2017 09:38 AM, Taras Kondratiuk wrote:
> Quoting Daniel Walker (2017-09-15 07:22:27)
>> On 09/14/2017 05:16 PM, Taras Kondratiuk wrote:
>>> Hi
>>>
>>> In our devices under low memory conditions we often get into a trashing
>>> state when system spends most of the time re-reading pages of .text
>>> sections from a file system (squashfs in our case). Working set doesn't
>>> fit into available page cache, so it is expected. The issue is that
>>> OOM killer doesn't get triggered because there is still memory for
>>> reclaiming. System may stuck in this state for a quite some time and
>>> usually dies because of watchdogs.
>>>
>>> We are trying to detect such trashing state early to take some
>>> preventive actions. It should be a pretty common issue, but for now we
>>> haven't find any existing VM/IO statistics that can reliably detect such
>>> state.
>>>
>>> Most of metrics provide absolute values: number/rate of page faults,
>>> rate of IO operations, number of stolen pages, etc. For a specific
>>> device configuration we can determine threshold values for those
>>> parameters that will detect trashing state, but it is not feasible for
>>> hundreds of device configurations.
>>>
>>> We are looking for some relative metric like "percent of CPU time spent
>>> handling major page faults". With such relative metric we could use a
>>> common threshold across all devices. For now we have added such metric
>>> to /proc/stat in our kernel, but we would like to find some mechanism
>>> available in upstream kernel.
>>>
>>> Has somebody faced similar issue? How are you solving it?
>>
>> Did you make any attempt to tune swappiness ?
>>
>> Documentation/sysctl/vm.txt
>>
>> swappiness
>>
>> This control is used to define how aggressive the kernel will swap
>> memory pages.  Higher values will increase agressiveness, lower values
>> decrease the amount of swap.
>>
>> The default value is 60.
>> =======================================================
>>
>> Since your using squashfs I would guess that's going to act like swap.
>> The default tune of 60 is most likely for x86 servers which may not be a
>> good value for some other device.
> Swap is disabled in our systems, so anonymous pages can't be evicted.
> As per my understanding swappiness tune is irrelevant.
>
> Even with enabled swap swappiness tune can't help much in this case. If
> working set doesn't fit into available page cache we will hit the same
> trashing state.


I think it's our lack of understanding of how the VM works. If the 
system has no swap, then the system shouldn't start evicting pages 
unless you have 100% memory utilization; at that point the only place 
for those pages to go is back to the backing store, squashfs in this 
case.

What you're suggesting is that there is still free memory, which means 
something must be evicting pages more aggressively rather than waiting 
until 100% utilization. Maybe someone more knowledgeable about the VM 
subsystem can clear this up.

Daniel


* Re: Detecting page cache trashing state
  2017-09-15 14:36 ` Michal Hocko
  2017-09-15 17:28   ` Taras Kondratiuk
@ 2017-09-15 21:20   ` vcaputo
  2017-09-15 23:40     ` Taras Kondratiuk
  2017-09-18  5:55     ` Michal Hocko
  1 sibling, 2 replies; 21+ messages in thread
From: vcaputo @ 2017-09-15 21:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Taras Kondratiuk, linux-mm, xe-linux-external,
	Ruslan Ruslichenko, linux-kernel

On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote:
> On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > Hi
> > 
> > In our devices under low memory conditions we often get into a trashing
> > state when system spends most of the time re-reading pages of .text
> > sections from a file system (squashfs in our case). Working set doesn't
> > fit into available page cache, so it is expected. The issue is that
> > OOM killer doesn't get triggered because there is still memory for
> > reclaiming. System may stuck in this state for a quite some time and
> > usually dies because of watchdogs.
> > 
> > We are trying to detect such trashing state early to take some
> > preventive actions. It should be a pretty common issue, but for now we
> > haven't find any existing VM/IO statistics that can reliably detect such
> > state.
> > 
> > Most of metrics provide absolute values: number/rate of page faults,
> > rate of IO operations, number of stolen pages, etc. For a specific
> > device configuration we can determine threshold values for those
> > parameters that will detect trashing state, but it is not feasible for
> > hundreds of device configurations.
> > 
> > We are looking for some relative metric like "percent of CPU time spent
> > handling major page faults". With such relative metric we could use a
> > common threshold across all devices. For now we have added such metric
> > to /proc/stat in our kernel, but we would like to find some mechanism
> > available in upstream kernel.
> > 
> > Has somebody faced similar issue? How are you solving it?
> 
> Yes this is a pain point for a _long_ time. And we still do not have a
> good answer upstream. Johannes has been playing in this area [1].
> The main problem is that our OOM detection logic is based on the ability
> to reclaim memory to allocate new memory. And that is pretty much true
> for the pagecache when you are trashing. So we do not know that
> basically whole time is spent refaulting the memory back and forth.
> We do have some refault stats for the page cache but that is not
> integrated to the oom detection logic because this is really a
> non-trivial problem to solve without triggering early oom killer
> invocations.
> 
> [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org

For desktop users running without swap, couldn't we just provide a kernel
setting which marks all executable pages as unevictable when first faulted
in?  Then at least thrashing within the space occupied by executables and
shared libraries before eventual OOM would be avoided, and only the
remaining file-backed non-executable pages would be thrashable.
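
(A crude userspace approximation of that idea is possible today: a
process can pin its own executable mappings with mlock(). A sketch,
clearly not a substitute for the kernel-wide setting proposed above:)

/* Crude userspace approximation of "don't evict my text": walk
 * /proc/self/maps and mlock() every executable file-backed mapping of
 * the current process.  A kernel-wide knob would cover all processes;
 * this only covers the caller, and it pins the whole mapped text
 * whether it is ever used or not. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	char line[512];
	FILE *maps = fopen("/proc/self/maps", "r");

	if (!maps) {
		perror("/proc/self/maps");
		return 1;
	}
	while (fgets(line, sizeof(line), maps)) {
		unsigned long start, end;
		char perms[8], path[256] = "";

		if (sscanf(line, "%lx-%lx %7s %*s %*s %*s %255s",
			   &start, &end, perms, path) < 3)
			continue;
		/* executable, file-backed mappings only */
		if (strchr(perms, 'x') && path[0] == '/') {
			if (mlock((void *)start, end - start))
				perror(path);
		}
	}
	fclose(maps);
	/* ... run the actual workload here ... */
	return 0;
}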

On my swapless laptops I'd much rather have the OOM killer kick in
immediately than wait for a few minutes of thrashing to pass while the
bogged-down system crawls through depleting what's left of technically
reclaimable memory.  It's much improved on modern SSDs, but still annoying.

Regards,
Vito Caputo


* Re: Detecting page cache trashing state
  2017-09-15 21:20   ` vcaputo
@ 2017-09-15 23:40     ` Taras Kondratiuk
  2017-09-18  5:55     ` Michal Hocko
  1 sibling, 0 replies; 21+ messages in thread
From: Taras Kondratiuk @ 2017-09-15 23:40 UTC (permalink / raw)
  To: Michal Hocko, vcaputo
  Cc: linux-mm, xe-linux-external, Ruslan Ruslichenko, linux-kernel

Quoting vcaputo@pengaru.com (2017-09-15 14:20:28)
> On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote:
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Hi
> > > 
> > > In our devices under low memory conditions we often get into a trashing
> > > state when system spends most of the time re-reading pages of .text
> > > sections from a file system (squashfs in our case). Working set doesn't
> > > fit into available page cache, so it is expected. The issue is that
> > > OOM killer doesn't get triggered because there is still memory for
> > > reclaiming. System may stuck in this state for a quite some time and
> > > usually dies because of watchdogs.
> > > 
> > > We are trying to detect such trashing state early to take some
> > > preventive actions. It should be a pretty common issue, but for now we
> > > haven't find any existing VM/IO statistics that can reliably detect such
> > > state.
> > > 
> > > Most of metrics provide absolute values: number/rate of page faults,
> > > rate of IO operations, number of stolen pages, etc. For a specific
> > > device configuration we can determine threshold values for those
> > > parameters that will detect trashing state, but it is not feasible for
> > > hundreds of device configurations.
> > > 
> > > We are looking for some relative metric like "percent of CPU time spent
> > > handling major page faults". With such relative metric we could use a
> > > common threshold across all devices. For now we have added such metric
> > > to /proc/stat in our kernel, but we would like to find some mechanism
> > > available in upstream kernel.
> > > 
> > > Has somebody faced similar issue? How are you solving it?
> > 
> > Yes this is a pain point for a _long_ time. And we still do not have a
> > good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the ability
> > to reclaim memory to allocate new memory. And that is pretty much true
> > for the pagecache when you are trashing. So we do not know that
> > basically whole time is spent refaulting the memory back and forth.
> > We do have some refault stats for the page cache but that is not
> > integrated to the oom detection logic because this is really a
> > non-trivial problem to solve without triggering early oom killer
> > invocations.
> > 
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
> 
> For desktop users running without swap, couldn't we just provide a kernel
> setting which marks all executable pages as unevictable when first faulted
> in?  Then at least thrashing within the space occupied by executables and
> shared libraries before eventual OOM would be avoided, and only the
> remaining file-backed non-executable pages would be thrashable.
> 
> On my swapless laptops I'd much rather have OOM killer kick in immediately
> rather than wait for a few minutes of thrashing to pass while the bogged
> down system crawls through depleting what's left of technically reclaimable
> memory.  It's much improved on modern SSDs, but still annoying.

Usually a significant part of an executable is used rarely or only once
during initialization. Pinning all executable pages forever would waste
a lot of memory.
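
This is easy to see on a live system: /proc/<pid>/smaps reports both
the mapped size and the resident portion of each executable mapping.
A quick illustrative sketch that sums them up:

/* Illustration: compare how much executable text a process has mapped
 * (Size:) with how much is actually resident (Rss:), by summing the
 * r-x mappings in /proc/<pid>/smaps.  Typically only a fraction of the
 * mapped text is ever paged in. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[512];
	unsigned long size_kb = 0, rss_kb = 0, v;
	int exec_mapping = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char perms[8];

		/* mapping header line, e.g. "7f..-7f.. r-xp ..." */
		if (sscanf(line, "%*lx-%*lx %7s", perms) == 1)
			exec_mapping = strchr(perms, 'x') != NULL;
		else if (exec_mapping && sscanf(line, "Size: %lu kB", &v) == 1)
			size_kb += v;
		else if (exec_mapping && sscanf(line, "Rss: %lu kB", &v) == 1)
			rss_kb += v;
	}
	fclose(f);
	printf("executable mappings: %lu kB mapped, %lu kB resident\n",
	       size_kb, rss_kb);
	return 0;
}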


* Re: Detecting page cache trashing state
  2017-09-15 21:20   ` vcaputo
  2017-09-15 23:40     ` Taras Kondratiuk
@ 2017-09-18  5:55     ` Michal Hocko
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-09-18  5:55 UTC (permalink / raw)
  To: vcaputo
  Cc: Taras Kondratiuk, linux-mm, xe-linux-external,
	Ruslan Ruslichenko, linux-kernel

On Fri 15-09-17 14:20:28, vcaputo@pengaru.com wrote:
> On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote:
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Hi
> > > 
> > > In our devices under low memory conditions we often get into a trashing
> > > state when system spends most of the time re-reading pages of .text
> > > sections from a file system (squashfs in our case). Working set doesn't
> > > fit into available page cache, so it is expected. The issue is that
> > > OOM killer doesn't get triggered because there is still memory for
> > > reclaiming. System may stuck in this state for a quite some time and
> > > usually dies because of watchdogs.
> > > 
> > > We are trying to detect such trashing state early to take some
> > > preventive actions. It should be a pretty common issue, but for now we
> > > haven't find any existing VM/IO statistics that can reliably detect such
> > > state.
> > > 
> > > Most of metrics provide absolute values: number/rate of page faults,
> > > rate of IO operations, number of stolen pages, etc. For a specific
> > > device configuration we can determine threshold values for those
> > > parameters that will detect trashing state, but it is not feasible for
> > > hundreds of device configurations.
> > > 
> > > We are looking for some relative metric like "percent of CPU time spent
> > > handling major page faults". With such relative metric we could use a
> > > common threshold across all devices. For now we have added such metric
> > > to /proc/stat in our kernel, but we would like to find some mechanism
> > > available in upstream kernel.
> > > 
> > > Has somebody faced similar issue? How are you solving it?
> > 
> > Yes this is a pain point for a _long_ time. And we still do not have a
> > good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the ability
> > to reclaim memory to allocate new memory. And that is pretty much true
> > for the pagecache when you are trashing. So we do not know that
> > basically whole time is spent refaulting the memory back and forth.
> > We do have some refault stats for the page cache but that is not
> > integrated to the oom detection logic because this is really a
> > non-trivial problem to solve without triggering early oom killer
> > invocations.
> > 
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
> 
> For desktop users running without swap, couldn't we just provide a kernel
> setting which marks all executable pages as unevictable when first faulted
> in?

This could turn into an immediate DoS vector, and you could see trashing
elsewhere. In fact we already protect executable pages and reclaim
them later (see page_check_references).

I am afraid that the only way to resolve the trashing behavior is to
release a larger amount of memory, because shifting the reclaim priority
will just push the suboptimal behavior somewhere else. In order to do
that, we really have to detect that the working set doesn't fit into
memory and that refaults dominate system activity.
-- 
Michal Hocko
SUSE Labs


* Re: Detecting page cache trashing state
  2017-09-15 17:28   ` Taras Kondratiuk
@ 2017-09-18 16:34     ` Johannes Weiner
  2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Johannes Weiner @ 2017-09-18 16:34 UTC (permalink / raw)
  To: Taras Kondratiuk
  Cc: Michal Hocko, linux-mm, xe-linux-external, Ruslan Ruslichenko,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1421 bytes --]

Hi Taras,

On Fri, Sep 15, 2017 at 10:28:30AM -0700, Taras Kondratiuk wrote:
> Quoting Michal Hocko (2017-09-15 07:36:19)
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Has somebody faced similar issue? How are you solving it?
> > 
> > Yes this is a pain point for a _long_ time. And we still do not have a
> > good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the ability
> > to reclaim memory to allocate new memory. And that is pretty much true
> > for the pagecache when you are trashing. So we do not know that
> > basically whole time is spent refaulting the memory back and forth.
> > We do have some refault stats for the page cache but that is not
> > integrated to the oom detection logic because this is really a
> > non-trivial problem to solve without triggering early oom killer
> > invocations.
> > 
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
> 
> Thanks Michal. memdelay looks promising. We will check it.

Great, I'm obviously interested in more users of it :) Please find
attached the latest version of the patch series based on v4.13.

It needs a bit more refactoring in the scheduler bits before
resubmission, but it already contains a couple of fixes and
improvements since the first version I sent out.

Let me know if you need help rebasing to a different kernel version.

[-- Attachment #2: 0001-sched-loadavg-consolidate-LOAD_INT-LOAD_FRAC-macros.patch --]
[-- Type: text/x-diff, Size: 3932 bytes --]

From d5ffeb4d9d65fcff1b7e50dbde8264b4c32824a5 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 14 Jun 2017 11:12:05 -0400
Subject: [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros

There are several identical definitions of those macros in places that
mess with fixed-point load averages. Provide an official version.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/powerpc/platforms/cell/spufs/sched.c | 3 ---
 arch/s390/appldata/appldata_os.c          | 4 ----
 drivers/cpuidle/governors/menu.c          | 4 ----
 fs/proc/loadavg.c                         | 3 ---
 include/linux/sched/loadavg.h             | 3 +++
 kernel/debug/kdb/kdb_main.c               | 7 +------
 6 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 1fbb5da17dd2..de544070def3 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
 	}
 }
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int show_spu_loadavg(struct seq_file *s, void *private)
 {
 	int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 45b3178200ab..a8aac17e1e82 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -24,10 +24,6 @@
 
 #include "appldata.h"
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 /*
  * OS data
  *
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 61b64c2b2cb8..e215a2c10a61 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -132,10 +132,6 @@ struct menu_device {
 	int		interval_ptr;
 };
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static inline int get_loadavg(unsigned long load)
 {
 	return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 983fce5c2418..111a25e4b088 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -9,9 +9,6 @@
 #include <linux/seqlock.h>
 #include <linux/time.h>
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
 	unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 4264bc6b2c27..745483bb5cca 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -26,6 +26,9 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
 	load += n*(FIXED_1-exp); \
 	load >>= FSHIFT;
 
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+
 extern void calc_global_load(unsigned long ticks);
 
 #endif /* _LINUX_SCHED_LOADAVG_H */
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index c8146d53ca67..2dddd25ccd7a 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2571,16 +2571,11 @@ static int kdb_summary(int argc, const char **argv)
 	}
 	kdb_printf("%02ld:%02ld\n", val.uptime/(60*60), (val.uptime/60)%60);
 
-	/* lifted from fs/proc/proc_misc.c::loadavg_read_proc() */
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 	kdb_printf("load avg   %ld.%02ld %ld.%02ld %ld.%02ld\n",
 		LOAD_INT(val.loads[0]), LOAD_FRAC(val.loads[0]),
 		LOAD_INT(val.loads[1]), LOAD_FRAC(val.loads[1]),
 		LOAD_INT(val.loads[2]), LOAD_FRAC(val.loads[2]));
-#undef LOAD_INT
-#undef LOAD_FRAC
+
 	/* Display in kilobytes */
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	kdb_printf("\nMemTotal:       %8lu kB\nMemFree:        %8lu kB\n"
-- 
2.14.1
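
For readers unfamiliar with the fixed-point format, here is a
standalone illustration of what LOAD_INT/LOAD_FRAC compute, using the
same FSHIFT/FIXED_1 values as include/linux/sched/loadavg.h - plain
userspace arithmetic, not kernel code:

/* Standalone illustration of the fixed-point load average macros being
 * consolidated above.  FSHIFT/FIXED_1 mirror the kernel's definitions. */
#include <stdio.h>

#define FSHIFT		11			/* bits of fractional precision */
#define FIXED_1		(1 << FSHIFT)		/* 1.0 in fixed-point */

#define LOAD_INT(x)	((x) >> FSHIFT)
#define LOAD_FRAC(x)	LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

int main(void)
{
	unsigned long avenrun = 3 * FIXED_1 + FIXED_1 / 4;	/* 3.25 */

	printf("load avg %lu.%02lu\n", LOAD_INT(avenrun), LOAD_FRAC(avenrun));
	return 0;
}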


[-- Attachment #3: 0002-mm-workingset-tell-cache-transitions-from-workingset.patch --]
[-- Type: text/x-diff, Size: 14639 bytes --]

From 4ccc6444efbdcc30680eff6b8f345511c306f3d7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 2 Mar 2017 09:58:03 -0500
Subject: [PATCH 2/3] mm: workingset: tell cache transitions from workingset
 thrashing

Refaults happen during transitions between workingsets as well as
during in-place thrashing. Knowing the difference between the two has
a range of applications, including measuring the impact of memory
shortage on system performance, as well as the ability to balance
pressure more intelligently between the filesystem cache and the
swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h         |  1 +
 include/linux/page-flags.h     |  5 ++-
 include/linux/swap.h           |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   |  9 ++--
 mm/huge_memory.c               |  1 +
 mm/memcontrol.c                |  2 +
 mm/migrate.c                   |  2 +
 mm/swap_state.c                |  1 +
 mm/vmscan.c                    |  1 +
 mm/vmstat.c                    |  1 +
 mm/workingset.c                | 96 +++++++++++++++++++++++++++---------------
 12 files changed, 79 insertions(+), 43 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b3f6ce..b8726b501166 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -156,6 +156,7 @@ enum node_stat_item {
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d33e3280c8ad..f889af1a6aed 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,13 +73,14 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
-	PG_error,
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_workingset,
 	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
+	PG_error,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
@@ -272,6 +273,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d83d28e53e62..914a173beee1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,7 +252,7 @@ struct swap_info_struct {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 void workingset_update_node(struct radix_tree_node *node, void *private);
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 8e50d01c645f..aac9eb272754 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -90,6 +90,7 @@
 	{1UL << PG_dirty,		"dirty"		},		\
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
+	{1UL << PG_workingset,		"workingset"	},		\
 	{1UL << PG_slab,		"slab"		},		\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 65b4b6e7f7bd..da55a5693da9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -823,12 +823,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * data from the working set, only to cache data that will
 		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (!(gfp_mask & __GFP_WRITE) &&
-		    shadow && workingset_refault(shadow)) {
-			SetPageActive(page);
-			workingset_activation(page);
-		} else
-			ClearPageActive(page);
+		WARN_ON_ONCE(PageActive(page));
+		if (!(gfp_mask & __GFP_WRITE) && shadow)
+			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
 	return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 90731e3b7e58..b18ac8084c2a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2239,6 +2239,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
+			 (1L << PG_workingset) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e09741af816f..93b2eb063afd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5274,6 +5274,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 		   stat[WORKINGSET_REFAULT]);
 	seq_printf(m, "workingset_activate %lu\n",
 		   stat[WORKINGSET_ACTIVATE]);
+	seq_printf(m, "workingset_restore %lu\n",
+		   stat[WORKINGSET_RESTORE]);
 	seq_printf(m, "workingset_nodereclaim %lu\n",
 		   stat[WORKINGSET_NODERECLAIM]);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index e84eeb4e4356..48f4a79869ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -624,6 +624,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 		SetPageActive(newpage);
 	} else if (TestClearPageUnevictable(page))
 		SetPageUnevictable(newpage);
+	if (PageWorkingset(page))
+		SetPageWorkingset(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b68c93014f50..b39b3969be07 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -387,6 +387,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * Initiate read into locked page and return.
 			 */
+			SetPageWorkingset(new_page);
 			lru_cache_add_anon(new_page);
 			*new_page_allocated = true;
 			return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a1af041930a6..60357cd84c67 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2022,6 +2022,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9a4441bbeef2..87ce53498828 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -956,6 +956,7 @@ const char * const vmstat_text[] = {
 	"nr_isolated_file",
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index 7119cd745ace..264f0498f2bc 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -120,7 +120,7 @@
  * the only thing eating into inactive list space is active pages.
  *
  *
- *		Activating refaulting pages
+ *		Refaulting inactive pages
  *
  * All that is known about the active list is that the pages have been
  * accessed more than once in the past.  This means that at any given
@@ -133,6 +133,10 @@
  * used less frequently than the refaulting page - or even not used at
  * all anymore.
  *
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
  * If this is wrong and demotion kicks in, the pages which are truly
  * used more frequently will be reactivated while the less frequently
  * used once will be evicted from memory.
@@ -140,6 +144,14 @@
  * But if this is right, the stale pages will be pushed out of memory
  * and the used pages get to stay in cache.
  *
+ *		Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
  *
  *		Implementation
  *
@@ -155,8 +167,7 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 NODES_SHIFT +	\
-			 MEM_CGROUP_ID_SHIFT)
+			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -169,23 +180,28 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+			 bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << 1) | workingset;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	bool workingset;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	workingset = entry & 1;
+	entry >>= 1;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -194,6 +210,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*workingsetp = workingset;
 }
 
 /**
@@ -206,8 +223,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
 	struct pglist_data *pgdat = page_pgdat(page);
+	struct mem_cgroup *memcg = page_memcg(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -219,30 +236,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, pgdat, eviction);
+	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
 /**
  * workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
  * evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
  */
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
 {
 	unsigned long refault_distance;
+	struct pglist_data *pgdat;
 	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct pglist_data *pgdat;
+	bool workingset;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
@@ -262,41 +279,50 @@ bool workingset_refault(void *shadow)
 	 * configurations instead.
 	 */
 	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg) {
-		rcu_read_unlock();
-		return false;
-	}
+	if (!mem_cgroup_disabled() && !memcg)
+		goto out;
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
 
 	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
+	 * Calculate the refault distance
 	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * inactive_age to lap a shadow entry in the field, which can
+	 * then can result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
 
-	if (refault_distance <= active_file) {
-		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
-		rcu_read_unlock();
-		return true;
-	}
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't act on pages that couldn't stay resident even if all
+	 * the memory was available to the page cache.
+	 */
+	if (refault_distance > active_file)
+		goto out;
+
+	SetPageActive(page);
+	SetPageWorkingset(page);
+	atomic_long_inc(&lruvec->inactive_age);
+	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+
+	/* Page was active prior to eviction */
+	if (workingset)
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+out:
 	rcu_read_unlock();
-	return false;
 }
 
 /**
-- 
2.14.1
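
To make the shadow entry change above easier to follow, here is a
simplified standalone model of the packing: the new workingset bit is
shifted in next to the node and memcg IDs. The field widths below are
illustrative only; the kernel derives them from MEM_CGROUP_ID_SHIFT,
NODES_SHIFT and the radix tree exception bits.

/* Simplified model of the shadow entry packing: eviction counter,
 * memcg ID, node ID and the new "workingset" bit share one unsigned
 * long.  Widths are made up for illustration. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MEMCG_BITS	16
#define NODE_BITS	6

static unsigned long pack_shadow(int memcgid, int nid,
				 unsigned long eviction, bool workingset)
{
	unsigned long entry = eviction;

	entry = (entry << MEMCG_BITS) | memcgid;
	entry = (entry << NODE_BITS) | nid;
	entry = (entry << 1) | workingset;
	return entry;
}

static void unpack_shadow(unsigned long entry, int *memcgid, int *nid,
			  unsigned long *eviction, bool *workingset)
{
	*workingset = entry & 1;
	entry >>= 1;
	*nid = entry & ((1UL << NODE_BITS) - 1);
	entry >>= NODE_BITS;
	*memcgid = entry & ((1UL << MEMCG_BITS) - 1);
	entry >>= MEMCG_BITS;
	*eviction = entry;
}

int main(void)
{
	int memcgid, nid;
	unsigned long eviction;
	bool workingset;

	unpack_shadow(pack_shadow(42, 1, 123456, true),
		      &memcgid, &nid, &eviction, &workingset);
	assert(memcgid == 42 && nid == 1 && eviction == 123456 && workingset);
	printf("memcg=%d node=%d eviction=%lu workingset=%d\n",
	       memcgid, nid, eviction, workingset);
	return 0;
}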


[-- Attachment #4: 0003-mm-sched-memdelay-memory-health-interface-for-system.patch --]
[-- Type: text/x-diff, Size: 35403 bytes --]

From c3e97f5daf99bcd54383eaab466c477dbb743dd9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 5 Jun 2017 16:07:22 -0400
Subject: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems
 and workloads

Linux doesn't have a useful metric to describe the memory health of a
system, a cgroup container, or individual tasks.

When workloads are bigger than available memory, they spend a certain
amount of their time inside page reclaim, waiting on thrashing cache,
and swapping in. This has an impact on latency and, depending on the CPU
capacity in the system, can also translate into a decrease in throughput.

While Linux exports some stats and counters for these events, it does
not quantify the true impact they have on throughput and latency. How
much of the execution time is spent unproductively? This is important
to know when sizing workloads to systems and containers. It also comes
in handy when evaluating the effectiveness and efficiency of the
kernel's memory management policies and heuristics.

This patch implements a metric that quantifies memory pressure in a
unit that matters most to applications and does not rely on hardware
aspects to be meaningful: wallclock time lost while waiting on memory.

Whenever a task is blocked on refaults, swapins, or direct reclaim,
the time it spends is accounted on the task level and aggregated into
a domain state along with other tasks on the system and cgroup level.

Each task has a /proc/<pid>/memdelay file that lists the microseconds
the task has been delayed since it was forked. That file can be
sampled periodically for recent delays, or before and after certain
operations to measure their memory-related latencies.

On the system and cgroup-level, there are /proc/memdelay and
memory.memdelay, respectively, and their format is as such:

$ cat /proc/memdelay
2489084
41.61 47.28 29.66
0.00 0.00 0.00

The first line shows the cumulative delay times of all tasks in the
domain - in this case, all tasks in the system cumulatively lost 2.49
seconds due to memory delays.

The second and third line show percentages spent in aggregate states
for the domain - system or cgroup - in a load average type format as
decaying averages over the last 1m, 5m, and 15m:

The second line indicates the share of wall-time the domain spends in
a state where SOME tasks are delayed by memory while others are still
productive (runnable or iowait). This indicates a latency problem for
individual tasks, but since the CPU/IO capacity is still used, adding
more memory might not necessarily improve the domain's throughput.

The third line indicates the share of wall-time the domain spends in a
state where ALL non-idle tasks are delayed by memory. In this state,
the domain is entirely unproductive due to a lack of memory.
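
As a usage illustration (the file only exists with this patch applied),
a monitoring daemon could sample the averages like this; the 10%
threshold is just an example:

/* Usage sketch for the interface introduced by this patch: read
 * /proc/memdelay and warn when the share of wall-time in which ALL
 * non-idle tasks were memory delayed ("full", 1-minute average)
 * exceeds a threshold. */
#include <stdio.h>

int main(void)
{
	unsigned long long total_us;
	double some[3], full[3];
	FILE *f = fopen("/proc/memdelay", "r");

	if (!f) {
		perror("/proc/memdelay");
		return 1;
	}
	if (fscanf(f, "%llu %lf %lf %lf %lf %lf %lf",
		   &total_us, &some[0], &some[1], &some[2],
		   &full[0], &full[1], &full[2]) != 7) {
		fclose(f);
		return 1;
	}
	fclose(f);

	printf("total delay: %.2fs, some: %.2f%%, full: %.2f%%\n",
	       total_us / 1e6, some[0], full[0]);
	if (full[0] > 10.0)	/* example threshold */
		fprintf(stderr, "memory pressure: system mostly unproductive\n");
	return 0;
}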

v2:
- fix active-delay condition when only other runnables, no iowait
- drop private lock from sched path, we can use the rq lock
- fix refault vs. simple lockwait detection
- drop ktime, we can use cpu_clock()

XXX:
- eliminate redundant cgroup hierarchy walks in the scheduler

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/proc/array.c            |   8 ++
 fs/proc/base.c             |   2 +
 fs/proc/internal.h         |   2 +
 include/linux/memcontrol.h |  14 +++
 include/linux/memdelay.h   | 182 +++++++++++++++++++++++++++++
 include/linux/sched.h      |   8 ++
 kernel/cgroup/cgroup.c     |   3 +-
 kernel/fork.c              |   4 +
 kernel/sched/Makefile      |   2 +-
 kernel/sched/core.c        |  27 +++++
 kernel/sched/memdelay.c    | 118 +++++++++++++++++++
 mm/Makefile                |   2 +-
 mm/compaction.c            |   4 +
 mm/filemap.c               |  11 ++
 mm/memcontrol.c            |  25 ++++
 mm/memdelay.c              | 285 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |  11 +-
 mm/vmscan.c                |   9 ++
 18 files changed, 712 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/memdelay.h
 create mode 100644 kernel/sched/memdelay.c
 create mode 100644 mm/memdelay.c

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 88c355574aa0..00e0e9aa3e70 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -611,6 +611,14 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 	return 0;
 }
 
+int proc_pid_memdelay(struct seq_file *m, struct pid_namespace *ns,
+		      struct pid *pid, struct task_struct *task)
+{
+	seq_put_decimal_ull(m, "", task->memdelay_total);
+	seq_putc(m, '\n');
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CHILDREN
 static struct pid *
 get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 719c2e943ea1..19f194940c80 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2916,6 +2916,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("cmdline",    S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
 	ONE("statm",      S_IRUGO, proc_pid_statm),
+	ONE("memdelay",   S_IRUGO, proc_pid_memdelay),
 	REG("maps",       S_IRUGO, proc_pid_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
@@ -3307,6 +3308,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
+	ONE("memdelay",  S_IRUGO, proc_pid_memdelay),
 	REG("maps",      S_IRUGO, proc_tid_maps_operations),
 #ifdef CONFIG_PROC_CHILDREN
 	REG("children",  S_IRUGO, proc_tid_children_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa2b89071630..7ab706c316b8 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -146,6 +146,8 @@ extern int proc_pid_status(struct seq_file *, struct pid_namespace *,
 			   struct pid *, struct task_struct *);
 extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
 			  struct pid *, struct task_struct *);
+extern int proc_pid_memdelay(struct seq_file *, struct pid_namespace *,
+			     struct pid *, struct task_struct *);
 
 /*
  * base.c
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9b15a4bcfa77..1f720d3090f7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -30,6 +30,7 @@
 #include <linux/vmstat.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/memdelay.h>
 
 struct mem_cgroup;
 struct page;
@@ -183,6 +184,9 @@ struct mem_cgroup {
 
 	unsigned long soft_limit;
 
+	/* Memory delay measurement domain */
+	struct memdelay_domain *memdelay_domain;
+
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
@@ -728,6 +732,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->lruvec;
 }
 
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+	return NULL;
+}
+
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
@@ -740,6 +749,11 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *task)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
diff --git a/include/linux/memdelay.h b/include/linux/memdelay.h
new file mode 100644
index 000000000000..08ed4e4baedf
--- /dev/null
+++ b/include/linux/memdelay.h
@@ -0,0 +1,182 @@
+#ifndef _LINUX_MEMDELAY_H
+#define _LINUX_MEMDELAY_H
+
+#include <linux/spinlock_types.h>
+#include <linux/sched.h>
+
+struct seq_file;
+struct css_set;
+
+/*
+ * Task productivity states tracked by the scheduler
+ */
+enum memdelay_task_state {
+	MTS_NONE,		/* Idle/unqueued/untracked */
+	MTS_IOWAIT,		/* Waiting for IO, not memory delayed */
+	MTS_RUNNABLE,		/* On the runqueue, not memory delayed */
+	MTS_DELAYED,		/* Memory delayed, not running */
+	MTS_DELAYED_ACTIVE,	/* Memory delayed, actively running */
+	NR_MEMDELAY_TASK_STATES,
+};
+
+/*
+ * System/cgroup delay state tracked by the VM, composed of the
+ * productivity states of all tasks inside the domain.
+ */
+enum memdelay_domain_state {
+	MDS_NONE,		/* No delayed tasks */
+	MDS_SOME,		/* Delayed tasks, working tasks */
+	MDS_FULL,		/* Delayed tasks, no working tasks */
+	NR_MEMDELAY_DOMAIN_STATES,
+};
+
+struct memdelay_domain_cpu {
+	/* Task states of the domain on this CPU */
+	int tasks[NR_MEMDELAY_TASK_STATES];
+
+	/* Delay state of the domain on this CPU */
+	enum memdelay_domain_state state;
+
+	/* Time of last state change */
+	u64 state_start;
+};
+
+struct memdelay_domain {
+	/* Aggregate delayed time of all domain tasks */
+	unsigned long aggregate;
+
+	/* Per-CPU delay states in the domain */
+	struct memdelay_domain_cpu __percpu *mdcs;
+
+	/* Cumulative state times from all CPUs */
+	unsigned long times[NR_MEMDELAY_DOMAIN_STATES];
+
+	/* Decaying state time averages over 1m, 5m, 15m */
+	unsigned long period_expires;
+	unsigned long avg_full[3];
+	unsigned long avg_some[3];
+};
+
+/* mm/memdelay.c */
+extern struct memdelay_domain memdelay_global_domain;
+void memdelay_init(void);
+void memdelay_task_change(struct task_struct *task,
+			  enum memdelay_task_state old,
+			  enum memdelay_task_state new);
+struct memdelay_domain *memdelay_domain_alloc(void);
+void memdelay_domain_free(struct memdelay_domain *md);
+int memdelay_domain_show(struct seq_file *s, struct memdelay_domain *md);
+
+/* kernel/sched/memdelay.c */
+void memdelay_enter(unsigned long *flags);
+void memdelay_leave(unsigned long *flags);
+
+/**
+ * memdelay_schedule - note a context switch
+ * @prev: task scheduling out
+ * @next: task scheduling in
+ *
+ * A task switch doesn't affect the balance between delayed and
+ * productive tasks, but we have to update whether the delay is
+ * actively using the CPU or not.
+ */
+static inline void memdelay_schedule(struct task_struct *prev,
+				     struct task_struct *next)
+{
+	if (prev->flags & PF_MEMDELAY)
+		memdelay_task_change(prev, MTS_DELAYED_ACTIVE, MTS_DELAYED);
+
+	if (next->flags & PF_MEMDELAY)
+		memdelay_task_change(next, MTS_DELAYED, MTS_DELAYED_ACTIVE);
+}
+
+/**
+ * memdelay_wakeup - note a task waking up
+ * @task: the task
+ *
+ * Notes an idle task becoming productive. Delayed tasks remain
+ * delayed even when they become runnable.
+ */
+static inline void memdelay_wakeup(struct task_struct *task)
+{
+	if (task->flags & PF_MEMDELAY)
+		return;
+
+	if (task->in_iowait)
+		memdelay_task_change(task, MTS_IOWAIT, MTS_RUNNABLE);
+	else
+		memdelay_task_change(task, MTS_NONE, MTS_RUNNABLE);
+}
+
+/**
+ * memdelay_sleep - note a task going to sleep
+ * @task: the task
+ *
+ * Notes a working task becoming unproductive. Delayed tasks remain
+ * delayed.
+ */
+static inline void memdelay_sleep(struct task_struct *task)
+{
+	if (task->flags & PF_MEMDELAY)
+		return;
+
+	if (task->in_iowait)
+		memdelay_task_change(task, MTS_RUNNABLE, MTS_IOWAIT);
+	else
+		memdelay_task_change(task, MTS_RUNNABLE, MTS_NONE);
+}
+
+/**
+ * memdelay_del_add - track task movement between runqueues
+ * @task: the task
+ * @runnable: a runnable task is moved if %true, unqueued otherwise
+ * @add: task is being added if %true, removed otherwise
+ *
+ * Update the memdelay domain per-cpu states as tasks are being moved
+ * around the runqueues.
+ */
+static inline void memdelay_del_add(struct task_struct *task,
+				    bool runnable, bool add)
+{
+	int state;
+
+	if (task->flags & PF_MEMDELAY)
+		state = MTS_DELAYED;
+	else if (runnable)
+		state = MTS_RUNNABLE;
+	else if (task->in_iowait)
+		state = MTS_IOWAIT;
+	else
+		return; /* already MTS_NONE */
+
+	if (add)
+		memdelay_task_change(task, MTS_NONE, state);
+	else
+		memdelay_task_change(task, state, MTS_NONE);
+}
+
+static inline void memdelay_del_runnable(struct task_struct *task)
+{
+	memdelay_del_add(task, true, false);
+}
+
+static inline void memdelay_add_runnable(struct task_struct *task)
+{
+	memdelay_del_add(task, true, true);
+}
+
+static inline void memdelay_del_sleeping(struct task_struct *task)
+{
+	memdelay_del_add(task, false, false);
+}
+
+static inline void memdelay_add_sleeping(struct task_struct *task)
+{
+	memdelay_del_add(task, false, true);
+}
+
+#ifdef CONFIG_CGROUPS
+void cgroup_move_task(struct task_struct *task, struct css_set *to);
+#endif
+
+#endif /* _LINUX_MEMDELAY_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c05ac5f5aa03..de15e3c8c43a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -651,6 +651,7 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
+	unsigned			memdelay_migrate_enqueue:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
@@ -871,6 +872,12 @@ struct task_struct {
 
 	struct io_context		*io_context;
 
+	u64				memdelay_start;
+	unsigned long			memdelay_total;
+#ifdef CONFIG_DEBUG_VM
+	int				memdelay_state;
+#endif
+
 	/* Ptrace state: */
 	unsigned long			ptrace_message;
 	siginfo_t			*last_siginfo;
@@ -1274,6 +1281,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_MEMDELAY		0x01000000	/* Delayed due to lack of memory */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index df2e0f14a95d..930aaef50396 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -699,7 +699,8 @@ static void css_set_move_task(struct task_struct *task,
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
 
-		rcu_assign_pointer(task->cgroups, to_cset);
+		cgroup_move_task(task, to_cset);
+
 		list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
 							     &to_cset->tasks);
 	}
diff --git a/kernel/fork.c b/kernel/fork.c
index b7e9e57b71ea..96dd35393be9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1208,6 +1208,10 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	int retval;
 
 	tsk->min_flt = tsk->maj_flt = 0;
+	tsk->memdelay_total = 0;
+#ifdef CONFIG_DEBUG_VM
+	tsk->memdelay_state = 0;
+#endif
 	tsk->nvcsw = tsk->nivcsw = 0;
 #ifdef CONFIG_DETECT_HUNG_TASK
 	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 53f0164ed362..84390fc42f60 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -17,7 +17,7 @@ endif
 
 obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle_task.o fair.o rt.o deadline.o
-obj-y += wait.o wait_bit.o swait.o completion.o idle.o
+obj-y += wait.o wait_bit.o swait.o completion.o idle.o memdelay.o
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20fba81..bf105c870da6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -26,6 +26,7 @@
 #include <linux/profile.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/memdelay.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -759,6 +760,14 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
 
+	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->memdelay_migrate_enqueue);
+	if (!(flags & ENQUEUE_WAKEUP) || p->memdelay_migrate_enqueue) {
+		memdelay_add_runnable(p);
+		p->memdelay_migrate_enqueue = 0;
+	} else {
+		memdelay_wakeup(p);
+	}
+
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -770,6 +779,11 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & DEQUEUE_SAVE))
 		sched_info_dequeued(rq, p);
 
+	if (!(flags & DEQUEUE_SLEEP))
+		memdelay_del_runnable(p);
+	else
+		memdelay_sleep(p);
+
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -2044,7 +2058,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
+		struct rq_flags rf;
+		struct rq *rq;
+
 		wake_flags |= WF_MIGRATED;
+
+		rq = __task_rq_lock(p, &rf);
+		memdelay_del_sleeping(p);
+		__task_rq_unlock(rq, &rf);
+		p->memdelay_migrate_enqueue = 1;
+
 		set_task_cpu(p, cpu);
 	}
 
@@ -3326,6 +3349,8 @@ static void __sched notrace __schedule(bool preempt)
 		rq->curr = next;
 		++*switch_count;
 
+		memdelay_schedule(prev, next);
+
 		trace_sched_switch(preempt, prev, next);
 
 		/* Also unlocks the rq: */
@@ -5919,6 +5944,8 @@ void __init sched_init(void)
 
 	init_schedstats();
 
+	memdelay_init();
+
 	scheduler_running = 1;
 }
 
diff --git a/kernel/sched/memdelay.c b/kernel/sched/memdelay.c
new file mode 100644
index 000000000000..1d4813cd018a
--- /dev/null
+++ b/kernel/sched/memdelay.c
@@ -0,0 +1,118 @@
+/*
+ * Memory delay metric
+ *
+ * Copyright (c) 2017 Facebook, Johannes Weiner
+ *
+ * This code quantifies and reports to userspace the wall-time impact
+ * of memory pressure on the system and memory-controlled cgroups.
+ */
+
+#include <linux/memdelay.h>
+#include <linux/cgroup.h>
+#include <linux/sched.h>
+
+#include "sched.h"
+
+/**
+ * memdelay_enter - mark the beginning of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as being delayed due to a lack of memory,
+ * such as waiting for a workingset refault or performing reclaim.
+ */
+void memdelay_enter(unsigned long *flags)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	*flags = current->flags & PF_MEMDELAY;
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMDELAY & accounting needs to be atomic wrt changes to
+	 * the task's scheduling state and its domain association.
+	 * Otherwise we could race with CPU or cgroup migration and
+	 * misaccount.
+	 */
+	local_irq_disable();
+	rq = this_rq();
+	rq_lock(rq, &rf);
+
+	current->flags |= PF_MEMDELAY;
+	memdelay_task_change(current, MTS_RUNNABLE, MTS_DELAYED_ACTIVE);
+
+	rq_unlock(rq, &rf);
+	local_irq_enable();
+}
+
+/**
+ * memdelay_leave - mark the end of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as no longer delayed due to memory.
+ */
+void memdelay_leave(unsigned long *flags)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMDELAY & accounting needs to be atomic wrt changes to
+	 * the task's scheduling state and its domain association.
+	 * Otherwise we could race with CPU or cgroup migration and
+	 * misaccount.
+	 */
+	local_irq_disable();
+	rq = this_rq();
+	rq_lock(rq, &rf);
+
+	current->flags &= ~PF_MEMDELAY;
+	memdelay_task_change(current, MTS_DELAYED_ACTIVE, MTS_RUNNABLE);
+
+	rq_unlock(rq, &rf);
+	local_irq_enable();
+}
+
+#ifdef CONFIG_CGROUPS
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated
+ * delayed/working state between the different domains.
+ *
+ * This function acquires the task's rq lock to lock out concurrent
+ * changes to the task's scheduling state and - in case the task is
+ * running - concurrent changes to its delay state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	int state;
+
+	rq = task_rq_lock(task, &rf);
+
+	if (task->flags & PF_MEMDELAY)
+		state = MTS_DELAYED + task_current(rq, task);
+	else if (task_on_rq_queued(task))
+		state = MTS_RUNNABLE;
+	else if (task->in_iowait)
+		state = MTS_IOWAIT;
+	else
+		state = MTS_NONE;
+
+	/*
+	 * Lame to do this here, but the scheduler cannot be locked
+	 * from the outside, so we move cgroups from inside sched/.
+	 */
+	memdelay_task_change(task, state, MTS_NONE);
+	rcu_assign_pointer(task->cgroups, to);
+	memdelay_task_change(task, MTS_NONE, state);
+
+	task_rq_unlock(rq, task, &rf);
+}
+#endif /* CONFIG_CGROUPS */
diff --git a/mm/Makefile b/mm/Makefile
index 411bd24d4a7c..c9bdbc5627e5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   memdelay.o debug.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/compaction.c b/mm/compaction.c
index fb548e4c7bd4..adf67de23fee 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2040,11 +2040,15 @@ static int kcompactd(void *p)
 	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
+		unsigned long mdflags;
+
 		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
+		memdelay_enter(&mdflags);
 		kcompactd_do_work(pgdat);
+		memdelay_leave(&mdflags);
 	}
 
 	return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index da55a5693da9..648418694405 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
+#include <linux/memdelay.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -961,8 +962,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
+	unsigned long mdflags;
+	bool refault = false;
 	int ret = 0;
 
+	if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) {
+		memdelay_enter(&mdflags);
+		refault = true;
+	}
+
 	init_wait(wait);
 	wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
 	wait->func = wake_page_function;
@@ -1001,6 +1009,9 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 	finish_wait(q, wait);
 
+	if (refault)
+		memdelay_leave(&mdflags);
+
 	/*
 	 * A signal could leave PageWaiters set. Clearing it here if
 	 * !waitqueue_active would be possible (by open-coding finish_wait),
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 93b2eb063afd..102f0f4d3f5c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,6 +65,7 @@
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/tracehook.h>
+#include <linux/memdelay.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3926,6 +3927,8 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	return ret;
 }
 
+static int memory_memdelay_show(struct seq_file *m, void *v);
+
 static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3993,6 +3996,10 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "pressure_level",
 	},
+	{
+		.name = "memdelay",
+		.seq_show = memory_memdelay_show,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
@@ -4170,6 +4177,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
+	memdelay_domain_free(memcg->memdelay_domain);
 	free_percpu(memcg->stat);
 	kfree(memcg);
 }
@@ -4275,10 +4283,15 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
+		memcg->memdelay_domain = &memdelay_global_domain;
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
 
+	memcg->memdelay_domain = memdelay_domain_alloc();
+	if (!memcg->memdelay_domain)
+		goto fail;
+
 	error = memcg_online_kmem(memcg);
 	if (error)
 		goto fail;
@@ -5282,6 +5295,13 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_memdelay_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	return memdelay_domain_show(m, memcg->memdelay_domain);
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -5317,6 +5337,11 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "memdelay",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_memdelay_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/memdelay.c b/mm/memdelay.c
new file mode 100644
index 000000000000..c43d6f7ba22a
--- /dev/null
+++ b/mm/memdelay.c
@@ -0,0 +1,285 @@
+/*
+ * Memory delay metric
+ *
+ * Copyright (c) 2017 Facebook, Johannes Weiner
+ *
+ * This code quantifies and reports to userspace the wall-time impact
+ * of memory pressure on the system and memory-controlled cgroups.
+ */
+
+#include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
+#include <linux/memcontrol.h>
+#include <linux/memdelay.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+
+static DEFINE_PER_CPU(struct memdelay_domain_cpu, global_domain_cpus);
+
+/* System-level keeping of memory delay statistics */
+struct memdelay_domain memdelay_global_domain = {
+	.mdcs = &global_domain_cpus,
+};
+
+static void domain_init(struct memdelay_domain *md)
+{
+	md->period_expires = jiffies + LOAD_FREQ;
+}
+
+/**
+ * memdelay_init - initialize the memdelay subsystem
+ *
+ * This needs to run before the scheduler starts queuing and
+ * scheduling tasks.
+ */
+void __init memdelay_init(void)
+{
+	domain_init(&memdelay_global_domain);
+}
+
+static void domain_move_clock(struct memdelay_domain *md)
+{
+	unsigned long expires = READ_ONCE(md->period_expires);
+	unsigned long none, some, full;
+	int missed_periods;
+	unsigned long next;
+	int i;
+
+	if (time_before(jiffies, expires))
+		return;
+
+	missed_periods = 1 + (jiffies - expires) / LOAD_FREQ;
+	next = expires + (missed_periods * LOAD_FREQ);
+
+	if (cmpxchg(&md->period_expires, expires, next) != expires)
+		return;
+
+	none = xchg(&md->times[MDS_NONE], 0);
+	some = xchg(&md->times[MDS_SOME], 0);
+	full = xchg(&md->times[MDS_FULL], 0);
+
+	for (i = 0; i < missed_periods; i++) {
+		unsigned long pct;
+
+		pct = some * 100 / max(none + some + full, 1UL);
+		pct *= FIXED_1;
+		CALC_LOAD(md->avg_some[0], EXP_1, pct);
+		CALC_LOAD(md->avg_some[1], EXP_5, pct);
+		CALC_LOAD(md->avg_some[2], EXP_15, pct);
+
+		pct = full * 100 / max(none + some + full, 1UL);
+		pct *= FIXED_1;
+		CALC_LOAD(md->avg_full[0], EXP_1, pct);
+		CALC_LOAD(md->avg_full[1], EXP_5, pct);
+		CALC_LOAD(md->avg_full[2], EXP_15, pct);
+
+		none = some = full = 0;
+	}
+}
+
+static void domain_cpu_update(struct memdelay_domain *md, int cpu,
+			      enum memdelay_task_state old,
+			      enum memdelay_task_state new)
+{
+	enum memdelay_domain_state state;
+	struct memdelay_domain_cpu *mdc;
+	unsigned long delta;
+	u64 now;
+
+	mdc = per_cpu_ptr(md->mdcs, cpu);
+
+	if (old) {
+		WARN_ONCE(!mdc->tasks[old], "cpu=%d old=%d new=%d counter=%d\n",
+			  cpu, old, new, mdc->tasks[old]);
+		mdc->tasks[old] -= 1;
+	}
+	if (new)
+		mdc->tasks[new] += 1;
+
+	/*
+	 * The domain is somewhat delayed when a number of tasks are
+	 * delayed but there are still others running the workload.
+	 *
+	 * The domain is fully delayed when all non-idle tasks on the
+	 * CPU are delayed, or when a delayed task is actively running
+	 * and preventing productive tasks from making headway.
+	 *
+	 * The state times then add up over all CPUs in the domain: if
+	 * the domain is fully blocked on one CPU and there is another
+	 * one running the workload, the domain is considered fully
+	 * blocked 50% of the time.
+	 */
+	if (mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_IOWAIT])
+		state = MDS_FULL;
+	else if (mdc->tasks[MTS_DELAYED])
+		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
+			MDS_SOME : MDS_FULL;
+	else
+		state = MDS_NONE;
+
+	if (mdc->state == state)
+		return;
+
+	now = cpu_clock(cpu);
+	delta = (now - mdc->state_start) / NSEC_PER_USEC;
+
+	domain_move_clock(md);
+	md->times[mdc->state] += delta;
+
+	mdc->state = state;
+	mdc->state_start = now;
+}
+
+static struct memdelay_domain *memcg_domain(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled())
+		return memcg->memdelay_domain;
+#endif
+	return &memdelay_global_domain;
+}
+
+/**
+ * memdelay_task_change - note a task changing its delay/work state
+ * @task: the task changing state
+ * @old: old task state
+ * @new: new task state
+ *
+ * Updates the task's domain counters to reflect a change in the
+ * task's delayed/working state.
+ */
+void memdelay_task_change(struct task_struct *task,
+			  enum memdelay_task_state old,
+			  enum memdelay_task_state new)
+{
+	int cpu = task_cpu(task);
+	struct mem_cgroup *memcg;
+	unsigned long delay = 0;
+
+#ifdef CONFIG_DEBUG_VM
+	WARN_ONCE(task->memdelay_state != old,
+		  "cpu=%d task=%p state=%d (in_iowait=%d PF_MEMDELAYED=%d) old=%d new=%d\n",
+		  cpu, task, task->memdelay_state, task->in_iowait,
+		  !!(task->flags & PF_MEMDELAY), old, new);
+	task->memdelay_state = new;
+#endif
+
+	/* Account when tasks are entering and leaving delays */
+	if (old < MTS_DELAYED && new >= MTS_DELAYED) {
+		task->memdelay_start = cpu_clock(cpu);
+	} else if (old >= MTS_DELAYED && new < MTS_DELAYED) {
+		delay = (cpu_clock(cpu) - task->memdelay_start) / NSEC_PER_USEC;
+		task->memdelay_total += delay;
+	}
+
+	/* Account domain state changes */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(task);
+	do {
+		struct memdelay_domain *md;
+
+		md = memcg_domain(memcg);
+		md->aggregate += delay;
+		domain_cpu_update(md, cpu, old, new);
+	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+	rcu_read_unlock();
+};
+
+/**
+ * memdelay_domain_alloc - allocate a cgroup memory delay domain
+ */
+struct memdelay_domain *memdelay_domain_alloc(void)
+{
+	struct memdelay_domain *md;
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	if (!md)
+		return NULL;
+	md->mdcs = alloc_percpu(struct memdelay_domain_cpu);
+	if (!md->mdcs) {
+		kfree(md);
+		return NULL;
+	}
+	domain_init(md);
+	return md;
+}
+
+/**
+ * memdelay_domain_free - free a cgroup memory delay domain
+ */
+void memdelay_domain_free(struct memdelay_domain *md)
+{
+	if (md) {
+		free_percpu(md->mdcs);
+		kfree(md);
+	}
+}
+
+/**
+ * memdelay_domain_show - format memory delay domain stats to a seq_file
+ * @s: the seq_file
+ * @md: the memory domain
+ */
+int memdelay_domain_show(struct seq_file *s, struct memdelay_domain *md)
+{
+	domain_move_clock(md);
+
+	seq_printf(s, "%lu\n", md->aggregate);
+
+	seq_printf(s, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+		   LOAD_INT(md->avg_some[0]), LOAD_FRAC(md->avg_some[0]),
+		   LOAD_INT(md->avg_some[1]), LOAD_FRAC(md->avg_some[1]),
+		   LOAD_INT(md->avg_some[2]), LOAD_FRAC(md->avg_some[2]));
+
+	seq_printf(s, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+		   LOAD_INT(md->avg_full[0]), LOAD_FRAC(md->avg_full[0]),
+		   LOAD_INT(md->avg_full[1]), LOAD_FRAC(md->avg_full[1]),
+		   LOAD_INT(md->avg_full[2]), LOAD_FRAC(md->avg_full[2]));
+
+#ifdef CONFIG_DEBUG_VM
+	{
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct memdelay_domain_cpu *mdc;
+
+			mdc = per_cpu_ptr(md->mdcs, cpu);
+			seq_printf(s, "%d %d %d %d\n",
+				   mdc->tasks[MTS_IOWAIT],
+				   mdc->tasks[MTS_RUNNABLE],
+				   mdc->tasks[MTS_DELAYED],
+				   mdc->tasks[MTS_DELAYED_ACTIVE]);
+		}
+	}
+#endif
+
+	return 0;
+}
+
+static int memdelay_show(struct seq_file *m, void *v)
+{
+	return memdelay_domain_show(m, &memdelay_global_domain);
+}
+
+static int memdelay_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, memdelay_show, NULL);
+}
+
+static const struct file_operations memdelay_fops = {
+	.open           = memdelay_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static int __init memdelay_proc_init(void)
+{
+	proc_create("memdelay", 0, NULL, &memdelay_fops);
+	return 0;
+}
+module_init(memdelay_proc_init);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1423da8dd16f..d8d01e9df982 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
 #include <linux/nmi.h>
+#include <linux/memdelay.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -3364,16 +3365,19 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
-	struct page *page;
 	unsigned int noreclaim_flag;
+	unsigned long mdflags;
+	struct page *page;
 
 	if (!order)
 		return NULL;
 
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
@@ -3519,13 +3523,15 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 					const struct alloc_context *ac)
 {
 	struct reclaim_state reclaim_state;
-	int progress;
 	unsigned int noreclaim_flag;
+	unsigned long mdflags;
+	int progress;
 
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	lockdep_set_current_reclaim_state(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
@@ -3537,6 +3543,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	current->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	cond_resched();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 60357cd84c67..1029305b9b3a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,7 @@
 #include <linux/prefetch.h>
 #include <linux/printk.h>
 #include <linux/dax.h>
+#include <linux/memdelay.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3098,6 +3099,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
+	unsigned long mdflags;
 	int nid;
 	unsigned int noreclaim_flag;
 	struct scan_control sc = {
@@ -3126,9 +3128,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3550,6 +3554,7 @@ static int kswapd(void *p)
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
+		unsigned long mdflags;
 		bool ret;
 
 		alloc_order = reclaim_order = pgdat->kswapd_order;
@@ -3586,7 +3591,11 @@ static int kswapd(void *p)
 		 */
 		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
 						alloc_order);
+
+		memdelay_enter(&mdflags);
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+		memdelay_leave(&mdflags);
+
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
 	}
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros
  2017-09-18 16:34     ` Johannes Weiner
@ 2017-09-19 10:55       ` kbuild test robot
  2017-09-19 11:02       ` kbuild test robot
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2 siblings, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2017-09-19 10:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, Ruslan Ruslichenko, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1091 bytes --]

Hi Johannes,

[auto build test ERROR on v4.13]
[cannot apply to mmotm/master linus/master tip/sched/core v4.14-rc1 next-20170919]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/sched-loadavg-consolidate-LOAD_INT-LOAD_FRAC-macros/20170919-161057
config: blackfin-TCM-BF537_defconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=blackfin 

All errors (new ones prefixed by >>):

   mm/memdelay.o: In function `memdelay_task_change':
>> mm/memdelay.c:(.text+0x270): undefined reference to `__udivdi3'
   mm/memdelay.c:(.text+0x2e4): undefined reference to `__udivdi3'
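
The `__udivdi3' reference comes from an open-coded 64-bit division on a
32-bit architecture; the kernel does not link against libgcc's helper.
The customary fix for this class of build error is div_u64() from
<linux/math64.h>. A rough sketch of how the two microsecond conversions
in mm/memdelay.c might look with that change - an illustration only, not
what was actually posted:

	#include <linux/math64.h>

	/* memdelay_task_change(): microseconds spent in the delay */
	delay = div_u64(cpu_clock(cpu) - task->memdelay_start, NSEC_PER_USEC);

	/* domain_cpu_update(): time accounted to the previous domain state */
	delta = div_u64(now - mdc->state_start, NSEC_PER_USEC);

On 64-bit builds div_u64() compiles down to a plain division, so those
configurations are unaffected.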

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11020 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros
  2017-09-18 16:34     ` Johannes Weiner
  2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
@ 2017-09-19 11:02       ` kbuild test robot
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2 siblings, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2017-09-19 11:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, Ruslan Ruslichenko, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1081 bytes --]

Hi Johannes,

[auto build test ERROR on v4.13]
[cannot apply to mmotm/master linus/master tip/sched/core v4.14-rc1 next-20170919]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/sched-loadavg-consolidate-LOAD_INT-LOAD_FRAC-macros/20170919-161057
config: c6x-evmc6678_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=c6x 

All errors (new ones prefixed by >>):

   mm/memdelay.o: In function `memdelay_task_change':
>> memdelay.c:(.text+0x2bc): undefined reference to `__c6xabi_divull'
   memdelay.c:(.text+0x438): undefined reference to `__c6xabi_divull'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 5431 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-18 16:34     ` Johannes Weiner
  2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
  2017-09-19 11:02       ` kbuild test robot
@ 2017-09-28 15:49       ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-10-25 16:53         ` Daniel Walker
                           ` (2 more replies)
  2 siblings, 3 replies; 21+ messages in thread
From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) @ 2017-09-28 15:49 UTC (permalink / raw)
  To: Johannes Weiner, Taras Kondratiuk
  Cc: Michal Hocko, linux-mm, xe-linux-external, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3085 bytes --]

Hi Johannes,

Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
version supported by us right now) and test it a bit.
The overall idea definitely looks promising, although I have one
question on usage.
Will it be able to account for the time which processes spend handling
major page faults (including fs and iowait time) of a refaulting page?

We have one big application whose code occupies a large amount of the
page cache. When the system under heavy memory usage reclaims some of
it, the application starts thrashing constantly. Since its code is
placed on squashfs, it spends the whole CPU time decompressing the
pages, and it seems the memdelay counters are not detecting this
situation. Here are some counters to indicate this:

19:02:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
19:02:45        all      0.00      0.00    100.00      0.00      0.00      0.00

19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
19:02:45     15284.00      0.00    428.00    352.00  19990.00      0.00      0.00  15802.00      0.00

And since nobody is actively allocating memory anymore, it looks like
the memdelay counters are not being actively incremented:

[:~]$ cat /proc/memdelay
268035776
6.13 5.43 3.58
1.90 1.89 1.26

Just in case, I have attached the rebased v4.9.26 patch.

I have also attached the patch with our current solution. The current
implementation will mostly fit the squashfs-only thrashing situation;
in the general case iowait time would be a major part of page fault
handling, so it needs to be accounted for as well.

Thanks,
Ruslan

On 09/18/2017 07:34 PM, Johannes Weiner wrote:
> Hi Taras,
>
> On Fri, Sep 15, 2017 at 10:28:30AM -0700, Taras Kondratiuk wrote:
>> Quoting Michal Hocko (2017-09-15 07:36:19)
>>> On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
>>>> Has somebody faced similar issue? How are you solving it?
>>> Yes this is a pain point for a _long_ time. And we still do not have a
>>> good answer upstream. Johannes has been playing in this area [1].
>>> The main problem is that our OOM detection logic is based on the ability
>>> to reclaim memory to allocate new memory. And that is pretty much true
>>> for the pagecache when you are trashing. So we do not know that
>>> basically whole time is spent refaulting the memory back and forth.
>>> We do have some refault stats for the page cache but that is not
>>> integrated to the oom detection logic because this is really a
>>> non-trivial problem to solve without triggering early oom killer
>>> invocations.
>>>
>>> [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
>> Thanks Michal. memdelay looks promising. We will check it.
> Great, I'm obviously interested in more users of it :) Please find
> attached the latest version of the patch series based on v4.13.
>
> It needs a bit more refactoring in the scheduler bits before
> resubmission, but it already contains a couple of fixes and
> improvements since the first version I sent out.
>
> Let me know if you need help rebasing to a different kernel version.


[-- Attachment #2: 0002-mm-sched-memdelay-memory-health-interface-for-system.patch --]
[-- Type: text/x-patch, Size: 36379 bytes --]

>From 708131b315b5a5da1beed167bca80ba067aa77a1 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 27 Sep 2017 20:01:41 +0300
Subject: [PATCH 2/2] mm/sched: memdelay: memory health interface for systems
 and workloads

Linux doesn't have a useful metric to describe the memory health of a
system, a cgroup container, or individual tasks.

When workloads are bigger than available memory, they spend a certain
amount of their time inside page reclaim, waiting on thrashing cache,
and swapping in. This has impact on latency, and depending on the CPU
capacity in the system can also translate to a decrease in throughput.

While Linux exports some stats and counters for these events, it does
not quantify the true impact they have on throughput and latency. How
much of the execution time is spent unproductively? This is important
to know when sizing workloads to systems and containers. It also comes
in handy when evaluating the effectiveness and efficiency of the
kernel's memory management policies and heuristics.

This patch implements a metric that quantifies memory pressure in a
unit that matters most to applications and does not rely on hardware
aspects to be meaningful: wallclock time lost while waiting on memory.

Whenever a task is blocked on refaults, swapins, or direct reclaim,
the time it spends is accounted on the task level and aggregated into
a domain state along with other tasks on the system and cgroup level.

Each task has a /proc/<pid>/memdelay file that lists the microseconds
the task has been delayed since it's been forked. That file can be
sampled periodically for recent delays, or before and after certain
operations to measure their memory-related latencies.

On the system and cgroup-level, there are /proc/memdelay and
memory.memdelay, respectively, and their format is as such:

$ cat /proc/memdelay
2489084
41.61 47.28 29.66
0.00 0.00 0.00

The first line shows the cumulative delay times of all tasks in the
domain - in this case, all tasks in the system cumulatively lost 2.49
seconds due to memory delays.

The second and third line show percentages spent in aggregate states
for the domain - system or cgroup - in a load average type format as
decaying averages over the last 1m, 5m, and 15m:

The second line indicates the share of wall-time the domain spends in
a state where SOME tasks are delayed by memory while others are still
productive (runnable or iowait). This indicates a latency problem for
individual tasks, but since the CPU/IO capacity is still used, adding
more memory might not necessarily improve the domain's throughput.

The third line indicates the share of wall-time the domain spends in a
state where ALL non-idle tasks are delayed by memory. In this state,
the domain is entirely unproductive due to a lack of memory.
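
For reference, those percentage lines use the same fixed-point scheme
as the classic load average. A small userspace sketch of the
arithmetic, with constants mirrored from <linux/sched/loadavg.h>, one
iteration per ~5s LOAD_FREQ period, and, purely as an assumption for
illustration, a domain that spends 60% of every period in the SOME
state:

/* memdelay_avg_demo.c - sketch of the decaying-average arithmetic. */
#include <stdio.h>

#define FSHIFT	11			/* bits of fixed-point precision */
#define FIXED_1	(1UL << FSHIFT)		/* 1.0 in fixed-point */
#define EXP_1	1884			/* 1/exp(5s/1min) */
#define EXP_5	2014			/* 1/exp(5s/5min) */
#define EXP_15	2037			/* 1/exp(5s/15min) */

#define CALC_LOAD(load, exp, n) \
	do { load *= (exp); load += (n) * (FIXED_1 - (exp)); load >>= FSHIFT; } while (0)

#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

int main(void)
{
	unsigned long avg[3] = { 0, 0, 0 };
	unsigned long pct = 60 * FIXED_1;	/* 60% SOME, pre-scaled like the kernel does */
	int period;

	for (period = 1; period <= 180; period++) {	/* 180 * 5s = 15 minutes */
		CALC_LOAD(avg[0], EXP_1,  pct);
		CALC_LOAD(avg[1], EXP_5,  pct);
		CALC_LOAD(avg[2], EXP_15, pct);
		if (period % 12 == 0)			/* print once per simulated minute */
			printf("t=%3dm  some: %lu.%02lu %lu.%02lu %lu.%02lu\n",
			       period / 12,
			       LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
			       LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
			       LOAD_INT(avg[2]), LOAD_FRAC(avg[2]));
	}
	return 0;
}

In the kernel this update happens in domain_move_clock(), which feeds
in the measured share of SOME/FULL time for each elapsed period rather
than a fixed 60%.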

v2:
- fix active-delay condition when only other runnables, no iowait
- drop private lock from sched path, we can use the rq lock
- fix refault vs. simple lockwait detection
- drop ktime, we can use cpu_clock()

XXX:
- eliminate redundant cgroup hierarchy walks in the scheduler

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Conflicts:
	include/linux/sched.h
	kernel/cgroup/cgroup.c
	kernel/sched/Makefile
	kernel/sched/core.c
	mm/filemap.c
	mm/page_alloc.c
	mm/vmscan.c

Change-Id: I48f7910eb5d6a029fbb884b64deb8d90ec12ddcb
---
 fs/proc/array.c            |   8 ++
 fs/proc/base.c             |   2 +
 fs/proc/internal.h         |   2 +
 include/linux/memcontrol.h |  14 +++
 include/linux/memdelay.h   | 182 +++++++++++++++++++++++++++++
 include/linux/sched.h      |   8 ++
 kernel/cgroup.c            |   3 +-
 kernel/fork.c              |   4 +
 kernel/sched/Makefile      |   2 +-
 kernel/sched/core.c        |  27 +++++
 kernel/sched/memdelay.c    | 116 ++++++++++++++++++
 mm/Makefile                |   2 +-
 mm/compaction.c            |   4 +
 mm/filemap.c               |  40 ++++++-
 mm/memcontrol.c            |  25 ++++
 mm/memdelay.c              | 286 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   7 ++
 mm/vmscan.c                |   9 ++
 18 files changed, 736 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/memdelay.h
 create mode 100644 kernel/sched/memdelay.c
 create mode 100644 mm/memdelay.c

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 81818ad..1725b51 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -606,6 +606,14 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 	return 0;
 }
 
+int proc_pid_memdelay(struct seq_file *m, struct pid_namespace *ns,
+		      struct pid *pid, struct task_struct *task)
+{
+	seq_put_decimal_ull(m, "", task->memdelay_total);
+	seq_putc(m, '\n');
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CHILDREN
 static struct pid *
 get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0ccab57..e7f9bcc 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3004,6 +3004,7 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 	REG("cmdline",    S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
 	ONE("statm",      S_IRUGO, proc_pid_statm),
+	ONE("memdelay",   S_IRUGO, proc_pid_memdelay),
 	REG("maps",       S_IRUGO, proc_pid_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
@@ -3397,6 +3398,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
 	REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
+	ONE("memdelay",  S_IRUGO, proc_pid_memdelay),
 	REG("maps",      S_IRUGO, proc_tid_maps_operations),
 #ifdef CONFIG_PROC_CHILDREN
 	REG("children",  S_IRUGO, proc_tid_children_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 5378441..c5bbbee 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -155,6 +155,8 @@ extern int proc_pid_status(struct seq_file *, struct pid_namespace *,
 			   struct pid *, struct task_struct *);
 extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
 			  struct pid *, struct task_struct *);
+extern int proc_pid_memdelay(struct seq_file *, struct pid_namespace *,
+			     struct pid *, struct task_struct *);
 
 /*
  * base.c
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8b35bdb..239d3ad 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@
 #include <linux/mmzone.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/memdelay.h>
 
 struct mem_cgroup;
 struct page;
@@ -194,6 +195,9 @@ struct mem_cgroup {
 
 	unsigned long soft_limit;
 
+	/* Memory delay measurement domain */
+	struct memdelay_domain *memdelay_domain;
+
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
@@ -632,6 +636,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->lruvec;
 }
 
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+	return NULL;
+}
+
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
@@ -644,6 +653,11 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *task)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
diff --git a/include/linux/memdelay.h b/include/linux/memdelay.h
new file mode 100644
index 0000000..08ed4e4
--- /dev/null
+++ b/include/linux/memdelay.h
@@ -0,0 +1,182 @@
+#ifndef _LINUX_MEMDELAY_H
+#define _LINUX_MEMDELAY_H
+
+#include <linux/spinlock_types.h>
+#include <linux/sched.h>
+
+struct seq_file;
+struct css_set;
+
+/*
+ * Task productivity states tracked by the scheduler
+ */
+enum memdelay_task_state {
+	MTS_NONE,		/* Idle/unqueued/untracked */
+	MTS_IOWAIT,		/* Waiting for IO, not memory delayed */
+	MTS_RUNNABLE,		/* On the runqueue, not memory delayed */
+	MTS_DELAYED,		/* Memory delayed, not running */
+	MTS_DELAYED_ACTIVE,	/* Memory delayed, actively running */
+	NR_MEMDELAY_TASK_STATES,
+};
+
+/*
+ * System/cgroup delay state tracked by the VM, composed of the
+ * productivity states of all tasks inside the domain.
+ */
+enum memdelay_domain_state {
+	MDS_NONE,		/* No delayed tasks */
+	MDS_SOME,		/* Delayed tasks, working tasks */
+	MDS_FULL,		/* Delayed tasks, no working tasks */
+	NR_MEMDELAY_DOMAIN_STATES,
+};
+
+struct memdelay_domain_cpu {
+	/* Task states of the domain on this CPU */
+	int tasks[NR_MEMDELAY_TASK_STATES];
+
+	/* Delay state of the domain on this CPU */
+	enum memdelay_domain_state state;
+
+	/* Time of last state change */
+	u64 state_start;
+};
+
+struct memdelay_domain {
+	/* Aggregate delayed time of all domain tasks */
+	unsigned long aggregate;
+
+	/* Per-CPU delay states in the domain */
+	struct memdelay_domain_cpu __percpu *mdcs;
+
+	/* Cumulative state times from all CPUs */
+	unsigned long times[NR_MEMDELAY_DOMAIN_STATES];
+
+	/* Decaying state time averages over 1m, 5m, 15m */
+	unsigned long period_expires;
+	unsigned long avg_full[3];
+	unsigned long avg_some[3];
+};
+
+/* mm/memdelay.c */
+extern struct memdelay_domain memdelay_global_domain;
+void memdelay_init(void);
+void memdelay_task_change(struct task_struct *task,
+			  enum memdelay_task_state old,
+			  enum memdelay_task_state new);
+struct memdelay_domain *memdelay_domain_alloc(void);
+void memdelay_domain_free(struct memdelay_domain *md);
+int memdelay_domain_show(struct seq_file *s, struct memdelay_domain *md);
+
+/* kernel/sched/memdelay.c */
+void memdelay_enter(unsigned long *flags);
+void memdelay_leave(unsigned long *flags);
+
+/**
+ * memdelay_schedule - note a context switch
+ * @prev: task scheduling out
+ * @next: task scheduling in
+ *
+ * A task switch doesn't affect the balance between delayed and
+ * productive tasks, but we have to update whether the delay is
+ * actively using the CPU or not.
+ */
+static inline void memdelay_schedule(struct task_struct *prev,
+				     struct task_struct *next)
+{
+	if (prev->flags & PF_MEMDELAY)
+		memdelay_task_change(prev, MTS_DELAYED_ACTIVE, MTS_DELAYED);
+
+	if (next->flags & PF_MEMDELAY)
+		memdelay_task_change(next, MTS_DELAYED, MTS_DELAYED_ACTIVE);
+}
+
+/**
+ * memdelay_wakeup - note a task waking up
+ * @task: the task
+ *
+ * Notes an idle task becoming productive. Delayed tasks remain
+ * delayed even when they become runnable.
+ */
+static inline void memdelay_wakeup(struct task_struct *task)
+{
+	if (task->flags & PF_MEMDELAY)
+		return;
+
+	if (task->in_iowait)
+		memdelay_task_change(task, MTS_IOWAIT, MTS_RUNNABLE);
+	else
+		memdelay_task_change(task, MTS_NONE, MTS_RUNNABLE);
+}
+
+/**
+ * memdelay_sleep - note a task going to sleep
+ * @task: the task
+ *
+ * Notes a working task becoming unproductive. Delayed tasks remain
+ * delayed.
+ */
+static inline void memdelay_sleep(struct task_struct *task)
+{
+	if (task->flags & PF_MEMDELAY)
+		return;
+
+	if (task->in_iowait)
+		memdelay_task_change(task, MTS_RUNNABLE, MTS_IOWAIT);
+	else
+		memdelay_task_change(task, MTS_RUNNABLE, MTS_NONE);
+}
+
+/**
+ * memdelay_del_add - track task movement between runqueues
+ * @task: the task
+ * @runnable: a runnable task is moved if %true, unqueued otherwise
+ * @add: task is being added if %true, removed otherwise
+ *
+ * Update the memdelay domain per-cpu states as tasks are being moved
+ * around the runqueues.
+ */
+static inline void memdelay_del_add(struct task_struct *task,
+				    bool runnable, bool add)
+{
+	int state;
+
+	if (task->flags & PF_MEMDELAY)
+		state = MTS_DELAYED;
+	else if (runnable)
+		state = MTS_RUNNABLE;
+	else if (task->in_iowait)
+		state = MTS_IOWAIT;
+	else
+		return; /* already MTS_NONE */
+
+	if (add)
+		memdelay_task_change(task, MTS_NONE, state);
+	else
+		memdelay_task_change(task, state, MTS_NONE);
+}
+
+static inline void memdelay_del_runnable(struct task_struct *task)
+{
+	memdelay_del_add(task, true, false);
+}
+
+static inline void memdelay_add_runnable(struct task_struct *task)
+{
+	memdelay_del_add(task, true, true);
+}
+
+static inline void memdelay_del_sleeping(struct task_struct *task)
+{
+	memdelay_del_add(task, false, false);
+}
+
+static inline void memdelay_add_sleeping(struct task_struct *task)
+{
+	memdelay_del_add(task, false, true);
+}
+
+#ifdef CONFIG_CGROUPS
+void cgroup_move_task(struct task_struct *task, struct css_set *to);
+#endif
+
+#endif /* _LINUX_MEMDELAY_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9fcfa40..072c244 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1589,6 +1589,7 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned no_cgroup_migration:1;
 #endif
+	unsigned			memdelay_migrate_enqueue:1;
 
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
@@ -1783,6 +1784,12 @@ struct task_struct {
 
 	struct io_context *io_context;
 
+	u64				memdelay_start;
+	unsigned long			memdelay_total;
+#ifdef CONFIG_DEBUG_VM
+	int				memdelay_state;
+#endif
+
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
 	struct task_io_accounting ioac;
@@ -2295,6 +2302,7 @@ static inline cputime_t task_gtime(struct task_struct *t)
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
+#define PF_MEMDELAY		0x01000000	/* Delayed due to lack of memory */
 #define PF_NO_SETAFFINITY 0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a3d2aad..d427077 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -781,7 +781,8 @@ static void css_set_move_task(struct task_struct *task,
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
 
-		rcu_assign_pointer(task->cgroups, to_cset);
+		cgroup_move_task(task, to_cset);
+
 		list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
 							     &to_cset->tasks);
 	}
diff --git a/kernel/fork.c b/kernel/fork.c
index 98897eb..4535bf5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1159,6 +1159,10 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	int retval;
 
 	tsk->min_flt = tsk->maj_flt = 0;
+	tsk->memdelay_total = 0;
+#ifdef CONFIG_DEBUG_VM
+	tsk->memdelay_state = 0;
+#endif
 	tsk->nvcsw = tsk->nivcsw = 0;
 #ifdef CONFIG_DETECT_HUNG_TASK
 	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5e59b83..8d405e9 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -17,7 +17,7 @@ endif
 
 obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
-obj-y += wait.o swait.o completion.o idle.o
+obj-y += wait.o swait.o completion.o idle.o memdelay.o
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d63376e..d3e6b06 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,7 @@
 #include <linux/sysctl.h>
 #include <linux/syscalls.h>
 #include <linux/times.h>
+#include <linux/memdelay.h>
 #include <linux/tsacct_kern.h>
 #include <linux/kprobes.h>
 #include <linux/delayacct.h>
@@ -757,6 +758,14 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	update_rq_clock(rq);
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
+	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->memdelay_migrate_enqueue);
+	if (!(flags & ENQUEUE_WAKEUP) || p->memdelay_migrate_enqueue) {
+		memdelay_add_runnable(p);
+		p->memdelay_migrate_enqueue = 0;
+	} else {
+		memdelay_wakeup(p);
+	}
+
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -765,6 +774,11 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	update_rq_clock(rq);
 	if (!(flags & DEQUEUE_SAVE))
 		sched_info_dequeued(rq, p);
+	if (!(flags & DEQUEUE_SLEEP))
+		memdelay_del_runnable(p);
+	else
+		memdelay_sleep(p);
+
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -2087,7 +2101,16 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
+		struct rq_flags rf;
+		struct rq *rq;
+
 		wake_flags |= WF_MIGRATED;
+
+		rq = __task_rq_lock(p, &rf);
+		memdelay_del_sleeping(p);
+		__task_rq_unlock(rq, &rf);
+		p->memdelay_migrate_enqueue = 1;
+
 		set_task_cpu(p, cpu);
 	}
 #endif /* CONFIG_SMP */
@@ -3398,6 +3421,8 @@ static void __sched notrace __schedule(bool preempt)
 		rq->curr = next;
 		++*switch_count;
 
+		memdelay_schedule(prev, next);
+
 		trace_sched_switch(preempt, prev, next);
 		rq = context_switch(rq, prev, next, cookie); /* unlocks the rq */
 	} else {
@@ -7692,6 +7717,8 @@ void __init sched_init(void)
 
 	init_schedstats();
 
+	memdelay_init();
+
 	scheduler_running = 1;
 }
 
diff --git a/kernel/sched/memdelay.c b/kernel/sched/memdelay.c
new file mode 100644
index 0000000..2cc274b
--- /dev/null
+++ b/kernel/sched/memdelay.c
@@ -0,0 +1,116 @@
+/*
+ * Memory delay metric
+ *
+ * Copyright (c) 2017 Facebook, Johannes Weiner
+ *
+ * This code quantifies and reports to userspace the wall-time impact
+ * of memory pressure on the system and memory-controlled cgroups.
+ */
+
+#include <linux/memdelay.h>
+#include <linux/cgroup.h>
+#include <linux/sched.h>
+
+#include "sched.h"
+
+/**
+ * memdelay_enter - mark the beginning of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as being delayed due to a lack of memory,
+ * such as waiting for a workingset refault or performing reclaim.
+ */
+void memdelay_enter(unsigned long *flags)
+{
+	struct rq *rq;
+
+	*flags = current->flags & PF_MEMDELAY;
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMDELAY & accounting needs to be atomic wrt changes to
+	 * the task's scheduling state and its domain association.
+	 * Otherwise we could race with CPU or cgroup migration and
+	 * misaccount.
+	 */
+	local_irq_disable();
+	rq = this_rq();
+	raw_spin_lock(&rq->lock);
+
+	current->flags |= PF_MEMDELAY;
+	memdelay_task_change(current, MTS_RUNNABLE, MTS_DELAYED_ACTIVE);
+
+	raw_spin_unlock(&rq->lock);
+	local_irq_enable();
+}
+
+/**
+ * memdelay_leave - mark the end of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as no longer delayed due to memory.
+ */
+void memdelay_leave(unsigned long *flags)
+{
+	struct rq *rq;
+
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMDELAY & accounting needs to be atomic wrt changes to
+	 * the task's scheduling state and its domain association.
+	 * Otherwise we could race with CPU or cgroup migration and
+	 * misaccount.
+	 */
+	local_irq_disable();
+	rq = this_rq();
+	raw_spin_lock(&rq->lock);
+
+	current->flags &= ~PF_MEMDELAY;
+	memdelay_task_change(current, MTS_DELAYED_ACTIVE, MTS_RUNNABLE);
+
+	raw_spin_unlock(&rq->lock);
+	local_irq_enable();
+}
+
+#ifdef CONFIG_CGROUPS
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated
+ * delayed/working state between the different domains.
+ *
+ * This function acquires the task's rq lock to lock out concurrent
+ * changes to the task's scheduling state and - in case the task is
+ * running - concurrent changes to its delay state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	int state;
+
+	rq = task_rq_lock(task, &rf);
+
+	if (task->flags & PF_MEMDELAY)
+		state = MTS_DELAYED + task_current(rq, task);
+	else if (task_on_rq_queued(task))
+		state = MTS_RUNNABLE;
+	else if (task->in_iowait)
+		state = MTS_IOWAIT;
+	else
+		state = MTS_NONE;
+
+	/*
+	 * Lame to do this here, but the scheduler cannot be locked
+	 * from the outside, so we move cgroups from inside sched/.
+	 */
+	memdelay_task_change(task, state, MTS_NONE);
+	rcu_assign_pointer(task->cgroups, to);
+	memdelay_task_change(task, MTS_NONE, state);
+
+	task_rq_unlock(rq, task, &rf);
+}
+#endif /* CONFIG_CGROUPS */
diff --git a/mm/Makefile b/mm/Makefile
index 4d9c99b..c00ab42 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,7 +37,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   memdelay.o debug.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 70e6bec..e00941f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1973,11 +1973,15 @@ static int kcompactd(void *p)
 	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
+		unsigned long mdflags;
+
 		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
+		memdelay_enter(&mdflags);
 		kcompactd_do_work(pgdat);
+		memdelay_leave(&mdflags);
 	}
 
 	return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 59fe831..aa98d3f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -794,34 +794,70 @@ wait_queue_head_t *page_waitqueue(struct page *page)
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	unsigned long mdflags;
+	bool refault = false;
+
+	if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) {
+		memdelay_enter(&mdflags);
+		refault = true;
+	}
 
 	if (test_bit(bit_nr, &page->flags))
 		__wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
 							TASK_UNINTERRUPTIBLE);
+	if (refault)
+		memdelay_leave(&mdflags);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	int res;
+	unsigned long mdflags;
+	bool refault = false;
+
+	if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) {
+		memdelay_enter(&mdflags);
+		refault = true;
+	}
 
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
+	res = __wait_on_bit(page_waitqueue(page), &wait,
 			     bit_wait_io, TASK_KILLABLE);
+
+	if (refault)
+		memdelay_leave(&mdflags);
+
+	return res;
 }
 
 int wait_on_page_bit_killable_timeout(struct page *page,
 				       int bit_nr, unsigned long timeout)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	int res;
+	unsigned long mdflags;
+	bool refault = false;
+
+	if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) {
+		memdelay_enter(&mdflags);
+		refault = true;
+	}
 
 	wait.key.timeout = jiffies + timeout;
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
-	return __wait_on_bit(page_waitqueue(page), &wait,
+	res = __wait_on_bit(page_waitqueue(page), &wait,
 			     bit_wait_io_timeout, TASK_KILLABLE);
+
+	if (refault)
+		memdelay_leave(&mdflags);
+
+	return res;
+
 }
 EXPORT_SYMBOL_GPL(wait_on_page_bit_killable_timeout);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d636bd6..ee4617b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/tracehook.h>
+#include <linux/memdelay.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3912,6 +3913,8 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	return ret;
 }
 
+static int memory_memdelay_show(struct seq_file *m, void *v);
+
 static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3979,6 +3982,10 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	{
 		.name = "pressure_level",
 	},
+	{
+		.name = "memdelay",
+		.seq_show = memory_memdelay_show,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
@@ -4147,6 +4154,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
+	memdelay_domain_free(memcg->memdelay_domain);
 	free_percpu(memcg->stat);
 	kfree(memcg);
 }
@@ -4252,10 +4260,15 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
+		memcg->memdelay_domain = &memdelay_global_domain;
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
 
+	memcg->memdelay_domain = memdelay_domain_alloc();
+	if (!memcg->memdelay_domain)
+		goto fail;
+
 	error = memcg_online_kmem(memcg);
 	if (error)
 		goto fail;
@@ -5245,6 +5258,13 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_memdelay_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	return memdelay_domain_show(m, memcg->memdelay_domain);
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -5280,6 +5300,11 @@ static int memory_stat_show(struct seq_file *m, void *v)
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "memdelay",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_memdelay_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/memdelay.c b/mm/memdelay.c
new file mode 100644
index 0000000..66e2b63
--- /dev/null
+++ b/mm/memdelay.c
@@ -0,0 +1,286 @@
+/*
+ * Memory delay metric
+ *
+ * Copyright (c) 2017 Facebook, Johannes Weiner
+ *
+ * This code quantifies and reports to userspace the wall-time impact
+ * of memory pressure on the system and memory-controlled cgroups.
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/memdelay.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+
+static DEFINE_PER_CPU(struct memdelay_domain_cpu, global_domain_cpus);
+
+/* System-level keeping of memory delay statistics */
+struct memdelay_domain memdelay_global_domain = {
+	.mdcs = &global_domain_cpus,
+};
+
+static void domain_init(struct memdelay_domain *md)
+{
+	md->period_expires = jiffies + LOAD_FREQ;
+}
+
+/**
+ * memdelay_init - initialize the memdelay subsystem
+ *
+ * This needs to run before the scheduler starts queuing and
+ * scheduling tasks.
+ */
+void __init memdelay_init(void)
+{
+	domain_init(&memdelay_global_domain);
+}
+
+static void domain_move_clock(struct memdelay_domain *md)
+{
+	unsigned long expires = READ_ONCE(md->period_expires);
+	unsigned long none, some, full;
+	int missed_periods;
+	unsigned long next;
+	int i;
+
+	if (time_before(jiffies, expires))
+		return;
+
+	missed_periods = 1 + (jiffies - expires) / LOAD_FREQ;
+	next = expires + (missed_periods * LOAD_FREQ);
+
+	if (cmpxchg(&md->period_expires, expires, next) != expires)
+		return;
+
+	none = xchg(&md->times[MDS_NONE], 0);
+	some = xchg(&md->times[MDS_SOME], 0);
+	full = xchg(&md->times[MDS_FULL], 0);
+
+	for (i = 0; i < missed_periods; i++) {
+		unsigned long pct;
+
+		pct = some * 100 / max(none + some + full, 1UL);
+		pct *= FIXED_1;
+		CALC_LOAD(md->avg_some[0], EXP_1, pct);
+		CALC_LOAD(md->avg_some[1], EXP_5, pct);
+		CALC_LOAD(md->avg_some[2], EXP_15, pct);
+
+		pct = full * 100 / max(none + some + full, 1UL);
+		pct *= FIXED_1;
+		CALC_LOAD(md->avg_full[0], EXP_1, pct);
+		CALC_LOAD(md->avg_full[1], EXP_5, pct);
+		CALC_LOAD(md->avg_full[2], EXP_15, pct);
+
+		none = some = full = 0;
+	}
+}
+
+static void domain_cpu_update(struct memdelay_domain *md, int cpu,
+			      enum memdelay_task_state old,
+			      enum memdelay_task_state new)
+{
+	enum memdelay_domain_state state;
+	struct memdelay_domain_cpu *mdc;
+	unsigned long delta;
+	u64 now;
+
+	mdc = per_cpu_ptr(md->mdcs, cpu);
+
+	if (old) {
+		WARN_ONCE(!mdc->tasks[old], "cpu=%d old=%d new=%d counter=%d\n",
+			  cpu, old, new, mdc->tasks[old]);
+		mdc->tasks[old] -= 1;
+	}
+	if (new)
+		mdc->tasks[new] += 1;
+
+	/*
+	 * The domain is somewhat delayed when a number of tasks are
+	 * delayed but there are still others running the workload.
+	 *
+	 * The domain is fully delayed when all non-idle tasks on the
+	 * CPU are delayed, or when a delayed task is actively running
+	 * and preventing productive tasks from making headway.
+	 *
+	 * The state times then add up over all CPUs in the domain: if
+	 * the domain is fully blocked on one CPU and there is another
+	 * one running the workload, the domain is considered fully
+	 * blocked 50% of the time.
+	 */
+	if (mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_IOWAIT])
+		state = MDS_FULL;
+	else if (mdc->tasks[MTS_DELAYED])
+		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
+			MDS_SOME : MDS_FULL;
+	else
+		state = MDS_NONE;
+
+	if (mdc->state == state)
+		return;
+
+	now = cpu_clock(cpu);
+	delta = (now - mdc->state_start) / NSEC_PER_USEC;
+
+	domain_move_clock(md);
+	md->times[mdc->state] += delta;
+
+	mdc->state = state;
+	mdc->state_start = now;
+}
+
+static struct memdelay_domain *memcg_domain(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled())
+		return memcg->memdelay_domain;
+#endif
+	return &memdelay_global_domain;
+}
+
+/**
+ * memdelay_task_change - note a task changing its delay/work state
+ * @task: the task changing state
+ * @old: old task state
+ * @new: new task state
+ *
+ * Updates the task's domain counters to reflect a change in the
+ * task's delayed/working state.
+ */
+void memdelay_task_change(struct task_struct *task,
+			  enum memdelay_task_state old,
+			  enum memdelay_task_state new)
+{
+	int cpu = task_cpu(task);
+	struct mem_cgroup *memcg;
+	unsigned long delay = 0;
+
+#ifdef CONFIG_DEBUG_VM
+	WARN_ONCE(task->memdelay_state != old,
+		  "cpu=%d task=%p state=%d (in_iowait=%d PF_MEMDELAYED=%d) old=%d new=%d\n",
+		  cpu, task, task->memdelay_state, task->in_iowait,
+		  !!(task->flags & PF_MEMDELAY), old, new);
+	task->memdelay_state = new;
+#endif
+
+	/* Account when tasks are entering and leaving delays */
+	if (old < MTS_DELAYED && new >= MTS_DELAYED) {
+		task->memdelay_start = cpu_clock(cpu);
+	} else if (old >= MTS_DELAYED && new < MTS_DELAYED) {
+		delay = (cpu_clock(cpu) - task->memdelay_start) / NSEC_PER_USEC;
+		task->memdelay_total += delay;
+	}
+
+	/* Account domain state changes */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(task);
+	do {
+		struct memdelay_domain *md;
+
+		md = memcg_domain(memcg);
+		md->aggregate += delay;
+		domain_cpu_update(md, cpu, old, new);
+	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+	rcu_read_unlock();
+};
+
+/**
+ * memdelay_domain_alloc - allocate a cgroup memory delay domain
+ */
+struct memdelay_domain *memdelay_domain_alloc(void)
+{
+	struct memdelay_domain *md;
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	if (!md)
+		return NULL;
+	md->mdcs = alloc_percpu(struct memdelay_domain_cpu);
+	if (!md->mdcs) {
+		kfree(md);
+		return NULL;
+	}
+	domain_init(md);
+	return md;
+}
+
+/**
+ * memdelay_domain_free - free a cgroup memory delay domain
+ */
+void memdelay_domain_free(struct memdelay_domain *md)
+{
+	if (md) {
+		free_percpu(md->mdcs);
+		kfree(md);
+	}
+}
+
+/**
+ * memdelay_domain_show - format memory delay domain stats to a seq_file
+ * @s: the seq_file
+ * @md: the memory domain
+ */
+int memdelay_domain_show(struct seq_file *s, struct memdelay_domain *md)
+{
+	domain_move_clock(md);
+
+	seq_printf(s, "%lu\n", md->aggregate);
+
+	seq_printf(s, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+		   LOAD_INT(md->avg_some[0]), LOAD_FRAC(md->avg_some[0]),
+		   LOAD_INT(md->avg_some[1]), LOAD_FRAC(md->avg_some[1]),
+		   LOAD_INT(md->avg_some[2]), LOAD_FRAC(md->avg_some[2]));
+
+	seq_printf(s, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+		   LOAD_INT(md->avg_full[0]), LOAD_FRAC(md->avg_full[0]),
+		   LOAD_INT(md->avg_full[1]), LOAD_FRAC(md->avg_full[1]),
+		   LOAD_INT(md->avg_full[2]), LOAD_FRAC(md->avg_full[2]));
+
+#ifdef CONFIG_DEBUG_VM
+	{
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct memdelay_domain_cpu *mdc;
+
+			mdc = per_cpu_ptr(md->mdcs, cpu);
+			seq_printf(s, "%d %d %d %d\n",
+				   mdc->tasks[MTS_IOWAIT],
+				   mdc->tasks[MTS_RUNNABLE],
+				   mdc->tasks[MTS_DELAYED],
+				   mdc->tasks[MTS_DELAYED_ACTIVE]);
+		}
+	}
+#endif
+
+	return 0;
+}
+
+static int memdelay_show(struct seq_file *m, void *v)
+{
+	return memdelay_domain_show(m, &memdelay_global_domain);
+}
+
+static int memdelay_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, memdelay_show, NULL);
+}
+
+static const struct file_operations memdelay_fops = {
+	.open           = memdelay_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static int __init memdelay_proc_init(void)
+{
+	proc_create("memdelay", 0, NULL, &memdelay_fops);
+	return 0;
+}
+module_init(memdelay_proc_init);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5b159b..103cd4a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -64,6 +64,7 @@
 #include <linux/page_owner.h>
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
+#include <linux/memdelay.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -3124,15 +3125,18 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 		unsigned int alloc_flags, const struct alloc_context *ac,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
+	unsigned long mdflags;
 	struct page *page;
 
 	if (!order)
 		return NULL;
 
+	memdelay_enter(&mdflags);
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
 	current->flags &= ~PF_MEMALLOC;
+	memdelay_leave(&mdflags);
 
 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
@@ -3268,12 +3272,14 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 					const struct alloc_context *ac)
 {
 	struct reclaim_state reclaim_state;
+	unsigned long mdflags;
 	int progress;
 
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
+	memdelay_enter(&mdflags);
 	current->flags |= PF_MEMALLOC;
 	lockdep_set_current_reclaim_state(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
@@ -3285,6 +3291,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	current->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
 	current->flags &= ~PF_MEMALLOC;
+	memdelay_leave(&mdflags);
 
 	cond_resched();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6361461..7024ba6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -47,6 +47,7 @@
 #include <linux/prefetch.h>
 #include <linux/printk.h>
 #include <linux/dax.h>
+#include <linux/memdelay.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3031,6 +3032,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
+	unsigned long mdflags;
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
@@ -3058,9 +3060,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	memdelay_enter(&mdflags);
 	current->flags |= PF_MEMALLOC;
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 	current->flags &= ~PF_MEMALLOC;
+	memdelay_leave(&mdflags);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3445,6 +3449,7 @@ static int kswapd(void *p)
 	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
 	pgdat->kswapd_classzone_idx = classzone_idx = 0;
 	for ( ; ; ) {
+		unsigned long mdflags;
 		bool ret;
 
 kswapd_try_sleep:
@@ -3478,7 +3483,11 @@ static int kswapd(void *p)
 		 */
 		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
 						alloc_order);
+
+		memdelay_enter(&mdflags);
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+		memdelay_leave(&mdflags);
+
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
 
-- 
1.9.1


[-- Attachment #3: 0001-mm-workingset-tell-cache-transitions-from-workingset.patch --]
[-- Type: text/x-patch, Size: 14452 bytes --]

>From d644440a4550cf2369c83a6ff6d700fad94e0dfa Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 27 Sep 2017 19:28:49 +0300
Subject: [PATCH 1/2] mm: workingset: tell cache transitions from workingset
 thrashing

Refaults happen during transitions between workingsets as well as
during in-place thrashing. Knowing the difference between the two has
a range of applications, including measuring the impact of memory
shortage on system performance, as well as the ability to balance
pressure more intelligently between the filesystem cache and the
swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Conflicts:
	include/linux/page-flags.h
	mm/memcontrol.c
	mm/workingset.c

Change-Id: Iee21f70d3b20aa5b9bf6b9e40693e170f785c813
---
 include/linux/mmzone.h         |  1 +
 include/linux/page-flags.h     |  5 ++-
 include/linux/swap.h           |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   |  9 ++--
 mm/huge_memory.c               |  1 +
 mm/memcontrol.c                |  2 +
 mm/migrate.c                   |  2 +
 mm/swap_state.c                |  1 +
 mm/vmscan.c                    |  1 +
 mm/vmstat.c                    |  1 +
 mm/workingset.c                | 97 +++++++++++++++++++++++++++---------------
 12 files changed, 80 insertions(+), 43 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7e273e2..a2d1517 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -152,6 +152,7 @@ enum node_stat_item {
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..e607142 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,12 +73,13 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
-	PG_error,
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_workingset,
+	PG_error,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
@@ -262,6 +263,8 @@ static __always_inline int PageCompound(struct page *page)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 55ff559..cbe483a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -245,7 +245,7 @@ struct swap_info_struct {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 extern struct list_lru workingset_shadow_nodes;
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab4..5076e11 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -87,6 +87,7 @@
 	{1UL << PG_dirty,		"dirty"		},		\
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
+	{1UL << PG_workingset,		"workingset"	},		\
 	{1UL << PG_slab,		"slab"		},		\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index edfb90e..59fe831 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -745,12 +745,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * data from the working set, only to cache data that will
 		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (!(gfp_mask & __GFP_WRITE) &&
-		    shadow && workingset_refault(shadow)) {
-			SetPageActive(page);
-			workingset_activation(page);
-		} else
-			ClearPageActive(page);
+		WARN_ON_ONCE(PageActive(page));
+		if (!(gfp_mask & __GFP_WRITE) && shadow)
+			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
 	return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5b2b75..22d1961 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1869,6 +1869,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
+			 (1L << PG_workingset) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 47559cc..d636bd6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5239,6 +5239,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 		   events[MEM_CGROUP_EVENTS_PGFAULT]);
 	seq_printf(m, "pgmajfault %lu\n",
 		   events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
+	seq_printf(m, "workingset_restore %lu\n",
+		   stat[WORKINGSET_RESTORE]);
 
 	return 0;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 6850f62..9348fd4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -636,6 +636,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 		SetPageActive(newpage);
 	} else if (TestClearPageUnevictable(page))
 		SetPageUnevictable(newpage);
+	if (PageWorkingset(page))
+		SetPageWorkingset(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 35d7e0e..10fec1f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -368,6 +368,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * Initiate read into locked page and return.
 			 */
+			SetPageWorkingset(new_page);
 			lru_cache_add_anon(new_page);
 			*new_page_allocated = true;
 			return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30a88b9..6361461 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1991,6 +1991,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 604f26a..3adf071 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -957,6 +957,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 	"nr_pages_scanned",
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index 4c4f056..b058fd3d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -118,7 +118,7 @@
  * the only thing eating into inactive list space is active pages.
  *
  *
- *		Activating refaulting pages
+ *		Refaulting inactive pages
  *
  * All that is known about the active list is that the pages have been
  * accessed more than once in the past.  This means that at any given
@@ -131,6 +131,10 @@
  * used less frequently than the refaulting page - or even not used at
  * all anymore.
  *
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
  * If this is wrong and demotion kicks in, the pages which are truly
  * used more frequently will be reactivated while the less frequently
  * used once will be evicted from memory.
@@ -138,6 +142,14 @@
  * But if this is right, the stale pages will be pushed out of memory
  * and the used pages get to stay in cache.
  *
+ *		Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
  *
  *		Implementation
  *
@@ -153,8 +165,7 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 NODES_SHIFT +	\
-			 MEM_CGROUP_ID_SHIFT)
+			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -167,23 +178,28 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+			 bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << 1) | workingset;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	bool workingset;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	workingset = entry & 1;
+	entry >>= 1;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -192,6 +208,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*workingsetp = workingset;
 }
 
 /**
@@ -204,8 +221,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
 	struct pglist_data *pgdat = page_pgdat(page);
+	struct mem_cgroup *memcg = page_memcg(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -217,30 +234,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, pgdat, eviction);
+	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
 /**
  * workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
  * evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
  */
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
 {
 	unsigned long refault_distance;
+	struct pglist_data *pgdat;
 	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct pglist_data *pgdat;
+	bool workingset;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
@@ -260,40 +277,50 @@ bool workingset_refault(void *shadow)
 	 * configurations instead.
 	 */
 	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg) {
-		rcu_read_unlock();
-		return false;
-	}
+	if (!mem_cgroup_disabled() && !memcg)
+		goto out;
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
-	rcu_read_unlock();
 
 	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
+	 * Calculate the refault distance
 	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * inactive_age to lap a shadow entry in the field, which can
+	 * then can result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	inc_node_state(pgdat, WORKINGSET_REFAULT);
 
-	if (refault_distance <= active_file) {
-		inc_node_state(pgdat, WORKINGSET_ACTIVATE);
-		return true;
-	}
-	return false;
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't act on pages that couldn't stay resident even if all
+	 * the memory was available to the page cache.
+	 */
+	if (refault_distance > active_file)
+		goto out;
+
+	SetPageActive(page);
+	SetPageWorkingset(page);
+	atomic_long_inc(&lruvec->inactive_age);
+	inc_node_state(pgdat, WORKINGSET_ACTIVATE);
+
+	/* Page was active prior to eviction */
+	if (workingset)
+		inc_node_state(pgdat, WORKINGSET_RESTORE);
+out:
+	rcu_read_unlock();
 }
 
 /**
-- 
1.9.1


[-- Attachment #4: 0001-proc-stat-add-major-page-faults-time-accounting.patch --]
[-- Type: text/x-patch, Size: 4082 bytes --]

>From f3c2b4fb2fa45eb9c087e8a60cefeb80cce50df1 Mon Sep 17 00:00:00 2001
From: Ruslan Ruslichenko <rruslich@cisco.com>
Date: Mon, 14 Aug 2017 12:07:30 +0300
Subject: [PATCH] /proc/stat: add major page faults time accounting

Add accounting of the time spent by CPUs handling major
page faults (those which caused disk I/O).
This may be needed to detect a system thrashing situation,
when the system does not have enough memory for the working set
and has to constantly page in/out pages from the page cache
with no ability to do anything useful for the rest of the time.

Signed-off-by: Ruslan Ruslichenko <rruslich@cisco.com>
---
 fs/proc/stat.c              | 8 ++++++--
 include/linux/kernel_stat.h | 1 +
 mm/memory.c                 | 5 +++++
 3 files changed, 12 insertions(+), 2 deletions(-)
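
(Not part of the patch: purely as an illustration of how the new column
could be consumed, the sketch below reads it back from userspace. It only
assumes the value is appended as the 11th field after guest_nice, as done
by this patch; everything else about it is hypothetical.)

#include <stdio.h>

int main(void)
{
	unsigned long long v[11] = { 0 };
	char tag[16];
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;
	/* Aggregate "cpu" line: user nice system idle iowait irq softirq
	 * steal guest guest_nice pgmajflt (the field added by this patch). */
	if (fscanf(f, "%15s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   tag, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6],
		   &v[7], &v[8], &v[9], &v[10]) == 12)
		printf("major fault time: %llu ticks\n", v[10]);
	fclose(f);
	return 0;
}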

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 510413eb..d183aa9 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -81,7 +81,7 @@ static int show_stat(struct seq_file *p, void *v)
 {
 	int i, j;
 	unsigned long jif;
-	u64 user, nice, system, idle, iowait, irq, softirq, steal;
+	u64 user, nice, system, idle, iowait, irq, softirq, steal, pgmajflts;
 	u64 guest, guest_nice;
 	u64 sum = 0;
 	u64 sum_softirq = 0;
@@ -89,7 +89,7 @@ static int show_stat(struct seq_file *p, void *v)
 	struct timespec boottime;
 
 	user = nice = system = idle = iowait =
-		irq = softirq = steal = 0;
+		irq = softirq = steal = pgmajflts = 0;
 	guest = guest_nice = 0;
 	getboottime(&boottime);
 	jif = boottime.tv_sec;
@@ -105,6 +105,7 @@ static int show_stat(struct seq_file *p, void *v)
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+		pgmajflts += kcpustat_cpu(i).cpustat[CPUTIME_PGMAJFAULT];
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +129,7 @@ static int show_stat(struct seq_file *p, void *v)
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(pgmajflts));
 	seq_putc(p, '\n');
 
 	for_each_online_cpu(i) {
@@ -142,6 +144,7 @@ static int show_stat(struct seq_file *p, void *v)
 		steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+		pgmajflts = kcpustat_cpu(i).cpustat[CPUTIME_PGMAJFAULT];
 		seq_printf(p, "cpu%d", i);
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +156,7 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(pgmajflts));
 		seq_putc(p, '\n');
 	}
 	seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 25a822f..1503706 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
 	CPUTIME_STEAL,
 	CPUTIME_GUEST,
 	CPUTIME_GUEST_NICE,
+	CPUTIME_PGMAJFAULT,
 	NR_STATS,
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 692cef8..8a40645 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3437,6 +3437,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		    unsigned long address, unsigned int flags)
 {
 	int ret;
+	unsigned long start_time = current->stime;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -3467,6 +3469,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                         mem_cgroup_oom_synchronize(false);
 	}
 
+	if (ret & VM_FAULT_MAJOR)
+		cpustat[CPUTIME_PGMAJFAULT] += current->stime - start_time;
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
@ 2017-10-25 16:53         ` Daniel Walker
  2017-10-25 17:54         ` Johannes Weiner
  2017-10-26  3:53         ` vinayak menon
  2 siblings, 0 replies; 21+ messages in thread
From: Daniel Walker @ 2017-10-25 16:53 UTC (permalink / raw)
  To: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco),
	Johannes Weiner, Taras Kondratiuk
  Cc: Michal Hocko, linux-mm, xe-linux-external, linux-kernel

On 09/28/2017 08:49 AM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC 
INC at Cisco) wrote:
> Hi Johannes,
>
> Hopefully I was able to rebase the patch on top v4.9.26 (latest 
> supported version by us right now)
> and test a bit.
> The overall idea definitely looks promising, although I have one 
> question on usage.
> Will it be able to account the time which processes spend on handling 
> major page faults
> (including fs and iowait time) of refaulting page?

Johannes, did you get a chance to review the changes from Ruslan?

Daniel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-10-25 16:53         ` Daniel Walker
@ 2017-10-25 17:54         ` Johannes Weiner
  2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-10-26  3:53         ` vinayak menon
  2 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2017-10-25 17:54 UTC (permalink / raw)
  To: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  Cc: Taras Kondratiuk, Michal Hocko, linux-mm, xe-linux-external,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2201 bytes --]

Hi Ruslan,

sorry about the delayed response, I missed the new activity in this
older thread.

On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
> 
> Hopefully I was able to rebase the patch on top v4.9.26 (latest supported
> version by us right now)
> and test a bit.
> The overall idea definitely looks promising, although I have one question on
> usage.
> Will it be able to account the time which processes spend on handling major
> page faults
> (including fs and iowait time) of refaulting page?

That's the main thing it should measure! :)

The lock_page() and wait_on_page_locked() calls are where iowaits
happen on a cache miss. If those are refaults, they'll be counted.

> As we have one big application which code space occupies big amount of place
> in page cache,
> when the system under heavy memory usage will reclaim some of it, the
> application will
> start constantly thrashing. Since it code is placed on squashfs it spends
> whole CPU time
> decompressing the pages and seem memdelay counters are not detecting this
> situation.
> Here are some counters to indicate this:
> 
> 19:02:44        CPU     %user     %nice   %system   %iowait %steal     %idle
> 19:02:45        all      0.00      0.00    100.00      0.00 0.00      0.00
> 
> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s
> pgscand/s pgsteal/s    %vmeff
> 19:02:45     15284.00      0.00    428.00    352.00  19990.00 0.00      0.00
> 15802.00      0.00
> 
> And as nobody actively allocating memory anymore looks like memdelay
> counters are not
> actively incremented:
> 
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26

How does it correlate with /proc/vmstat::workingset_activate during
that time? It only counts thrashing time of refaults it can actively
detect.
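
(For reference, a minimal user-space sampler along these lines makes the
correlation easy to see. This is only an illustrative sketch, not part of
the patches; it assumes nothing beyond the /proc/memdelay and /proc/vmstat
text formats quoted elsewhere in this thread.)

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Print the lines from /proc/vmstat that start with "workingset_". */
static void dump_workingset(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "workingset_", 11))
			fputs(line, stdout);
	fclose(f);
}

/* Print the aggregate delay and the some/full averages. */
static void dump_memdelay(void)
{
	char line[128];
	FILE *f = fopen("/proc/memdelay", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	for (;;) {
		dump_memdelay();
		dump_workingset();
		puts("---");
		sleep(1);
	}
	return 0;
}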

Btw, how many CPUs does this system have? There is a bug in this
version in how idle time is aggregated across multiple CPUs. The error
compounds with the number of CPUs in the system.

I'm attaching 3 bugfixes that go on top of what you have. There might
be some conflicts, but they should be minor variable naming issues.


[-- Attachment #2: 0001-mm-memdelay-fix-task-flags-race-condition.patch --]
[-- Type: text/x-diff, Size: 2548 bytes --]

>From 7318c963a582833d4556c51fc2e1658e00c14e3e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 5 Oct 2017 12:32:47 -0400
Subject: [PATCH 1/3] mm: memdelay: fix task flags race condition

WARNING: CPU: 35 PID: 2263 at ../include/linux/memcontrol.h:466

This is memcg warning that current->memcg_may_oom is set when it
doesn't expect it to.

The warning came in new with the memdelay patches. They add another
task flag in the same int as memcg_may_oom, but modify it from
try_to_wake_up, from a task that isn't current. This isn't safe.

Move the flag to the other int holding task flags, whose modifications
are serialized through the scheduler locks.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/sched.h | 2 +-
 kernel/sched/core.c   | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index de15e3c8c43a..d1aa8f4c19ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,7 @@ struct task_struct {
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
 	unsigned			sched_remote_wakeup:1;
+	unsigned			sched_memdelay_requeue:1;
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
 
@@ -651,7 +652,6 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
-	unsigned			memdelay_migrate_enqueue:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf105c870da6..b4fa806bf153 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -760,10 +760,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
 
-	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->memdelay_migrate_enqueue);
-	if (!(flags & ENQUEUE_WAKEUP) || p->memdelay_migrate_enqueue) {
+	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->sched_memdelay_requeue);
+	if (!(flags & ENQUEUE_WAKEUP) || p->sched_memdelay_requeue) {
 		memdelay_add_runnable(p);
-		p->memdelay_migrate_enqueue = 0;
+		p->sched_memdelay_requeue = 0;
 	} else {
 		memdelay_wakeup(p);
 	}
@@ -2065,8 +2065,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 		rq = __task_rq_lock(p, &rf);
 		memdelay_del_sleeping(p);
+		p->sched_memdelay_requeue = 1;
 		__task_rq_unlock(rq, &rf);
-		p->memdelay_migrate_enqueue = 1;
 
 		set_task_cpu(p, cpu);
 	}
-- 
2.14.2


[-- Attachment #3: 0002-mm-memdelay-idle-time-is-not-productive-time.patch --]
[-- Type: text/x-diff, Size: 1796 bytes --]

>From 7157c70aed93990f59942d39d1c0d8948164cfe2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 5 Oct 2017 12:34:49 -0400
Subject: [PATCH 2/3] mm: memdelay: idle time is not productive time

There is an error in the multi-core logic, where memory delay numbers
drop as the number of CPU cores increases. Idle CPUs, even though they
don't host delayed processes, shouldn't contribute to the "not
delayed" bucket, because that's the baseline for productivity and
idle CPUs aren't productive.

Do not consider idle CPU time in the productivity baseline.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memdelay.h | 3 ++-
 mm/memdelay.c            | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/memdelay.h b/include/linux/memdelay.h
index d85d01692610..c49f65338c1f 100644
--- a/include/linux/memdelay.h
+++ b/include/linux/memdelay.h
@@ -24,7 +24,8 @@ enum memdelay_task_state {
  * productivity states of all tasks inside the domain.
  */
 enum memdelay_domain_state {
-	MDS_NONE,		/* No delayed tasks */
+	MDS_IDLE,		/* No tasks */
+	MDS_NONE,		/* Working tasks */
 	MDS_PART,		/* Delayed tasks, working tasks */
 	MDS_FULL,		/* Delayed tasks, no working tasks */
 	NR_MEMDELAY_DOMAIN_STATES,
diff --git a/mm/memdelay.c b/mm/memdelay.c
index c7c32dbb67ac..ea5ede79f044 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -118,8 +118,10 @@ static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	else if (mdc->tasks[MTS_DELAYED])
 		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
 			MDS_PART : MDS_FULL;
-	else
+	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
 		state = MDS_NONE;
+	else
+		state = MDS_IDLE;
 
 	if (mdc->state == state)
 		return;
-- 
2.14.2


[-- Attachment #4: 0003-mm-memdelay-drop-IO-as-productive-time.patch --]
[-- Type: text/x-diff, Size: 1451 bytes --]

>From ea663e42b24871d370b6ddbfbf47c1775a2f9f09 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <jweiner@fb.com>
Date: Thu, 28 Sep 2017 10:36:39 -0700
Subject: [PATCH 3/3] mm: memdelay: drop IO as productive time

Counting IO as productive time distorts the sense of pressure with
workloads that are naturally IO-bound. Memory pressure can cause IO,
and thus cause "productive" IO to slow down - yet we don't attribute
this slowdown properly to a shortage of memory.

Disregard IO time altogether, and use CPU time alone as the baseline.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memdelay.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memdelay.c b/mm/memdelay.c
index ea5ede79f044..fbce1d4ba142 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -113,12 +113,11 @@ static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	 * one running the workload, the domain is considered fully
 	 * blocked 50% of the time.
 	 */
-	if (mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_IOWAIT])
+	if (mdc->tasks[MTS_DELAYED_ACTIVE])
 		state = MDS_FULL;
 	else if (mdc->tasks[MTS_DELAYED])
-		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
-			MDS_PART : MDS_FULL;
-	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
+		state = mdc->tasks[MTS_RUNNABLE] ? MDS_PART : MDS_FULL;
+	else if (mdc->tasks[MTS_RUNNABLE])
 		state = MDS_NONE;
 	else
 		state = MDS_IDLE;
-- 
2.14.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-10-25 16:53         ` Daniel Walker
  2017-10-25 17:54         ` Johannes Weiner
@ 2017-10-26  3:53         ` vinayak menon
  2017-10-27 20:29           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2 siblings, 1 reply; 21+ messages in thread
From: vinayak menon @ 2017-10-26  3:53 UTC (permalink / raw)
  To: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  Cc: Johannes Weiner, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, linux-kernel

On Thu, Sep 28, 2017 at 9:19 PM, Ruslan Ruslichenko -X (rruslich -
GLOBALLOGIC INC at Cisco) <rruslich@cisco.com> wrote:
> Hi Johannes,
>
> Hopefully I was able to rebase the patch on top v4.9.26 (latest supported
> version by us right now)
> and test a bit.
> The overall idea definitely looks promising, although I have one question on
> usage.
> Will it be able to account the time which processes spend on handling major
> page faults
> (including fs and iowait time) of refaulting page?
>
> As we have one big application which code space occupies big amount of place
> in page cache,
> when the system under heavy memory usage will reclaim some of it, the
> application will
> start constantly thrashing. Since it code is placed on squashfs it spends
> whole CPU time
> decompressing the pages and seem memdelay counters are not detecting this
> situation.
> Here are some counters to indicate this:
>
> 19:02:44        CPU     %user     %nice   %system   %iowait %steal     %idle
> 19:02:45        all      0.00      0.00    100.00      0.00 0.00      0.00
>
> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s
> pgscand/s pgsteal/s    %vmeff
> 19:02:45     15284.00      0.00    428.00    352.00  19990.00 0.00      0.00
> 15802.00      0.00
>
> And as nobody actively allocating memory anymore looks like memdelay
> counters are not
> actively incremented:
>
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26
>
> Just in case, I have attached the v4.9.26 rebased patched.
>
Looks like this 4.9 version does not contain the accounting in lock_page.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-10-25 17:54         ` Johannes Weiner
@ 2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  0 siblings, 1 reply; 21+ messages in thread
From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) @ 2017-10-27 20:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Taras Kondratiuk, Michal Hocko, linux-mm, xe-linux-external,
	linux-kernel

Hi Johannes,

On 10/25/2017 08:54 PM, Johannes Weiner wrote:
> Hi Ruslan,
>
> sorry about the delayed response, I missed the new activity in this
> older thread.
>
> On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
>> Hi Johannes,
>>
>> Hopefully I was able to rebase the patch on top v4.9.26 (latest supported
>> version by us right now)
>> and test a bit.
>> The overall idea definitely looks promising, although I have one question on
>> usage.
>> Will it be able to account the time which processes spend on handling major
>> page faults
>> (including fs and iowait time) of refaulting page?
> That's the main thing it should measure! :)
>
> The lock_page() and wait_on_page_locked() calls are where iowaits
> happen on a cache miss. If those are refaults, they'll be counted.
>
>> As we have one big application which code space occupies big amount of place
>> in page cache,
>> when the system under heavy memory usage will reclaim some of it, the
>> application will
>> start constantly thrashing. Since it code is placed on squashfs it spends
>> whole CPU time
>> decompressing the pages and seem memdelay counters are not detecting this
>> situation.
>> Here are some counters to indicate this:
>>
>> 19:02:44        CPU     %user     %nice   %system   %iowait %steal     %idle
>> 19:02:45        all      0.00      0.00    100.00      0.00 0.00      0.00
>>
>> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s
>> pgscand/s pgsteal/s    %vmeff
>> 19:02:45     15284.00      0.00    428.00    352.00  19990.00 0.00      0.00
>> 15802.00      0.00
>>
>> And as nobody actively allocating memory anymore looks like memdelay
>> counters are not
>> actively incremented:
>>
>> [:~]$ cat /proc/memdelay
>> 268035776
>> 6.13 5.43 3.58
>> 1.90 1.89 1.26
> How does it correlate with /proc/vmstat::workingset_activate during
> that time? It only counts thrashing time of refaults it can actively
> detect.
The workingset counters are growing quite actively too. Here are
some numbers per second:

workingset_refault   8201
workingset_activate   389
workingset_restore   187
workingset_nodereclaim   313

> Btw, how many CPUs does this system have? There is a bug in this
> version on how idle time is aggregated across multiple CPUs. The error
> compounds with the number of CPUs in the system.
The system has 2 CPU cores.
> I'm attaching 3 bugfixes that go on top of what you have. There might
> be some conflicts, but they should be minor variable naming issues.
>
I will test with your patches and get back to you.

Thanks,
Ruslan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-10-26  3:53         ` vinayak menon
@ 2017-10-27 20:29           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  0 siblings, 0 replies; 21+ messages in thread
From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) @ 2017-10-27 20:29 UTC (permalink / raw)
  To: vinayak menon
  Cc: Johannes Weiner, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, linux-kernel

On 10/26/2017 06:53 AM, vinayak menon wrote:
> On Thu, Sep 28, 2017 at 9:19 PM, Ruslan Ruslichenko -X (rruslich -
> GLOBALLOGIC INC at Cisco) <rruslich@cisco.com> wrote:
>> Hi Johannes,
>>
>> Hopefully I was able to rebase the patch on top v4.9.26 (latest supported
>> version by us right now)
>> and test a bit.
>> The overall idea definitely looks promising, although I have one question on
>> usage.
>> Will it be able to account the time which processes spend on handling major
>> page faults
>> (including fs and iowait time) of refaulting page?
>>
>> As we have one big application which code space occupies big amount of place
>> in page cache,
>> when the system under heavy memory usage will reclaim some of it, the
>> application will
>> start constantly thrashing. Since it code is placed on squashfs it spends
>> whole CPU time
>> decompressing the pages and seem memdelay counters are not detecting this
>> situation.
>> Here are some counters to indicate this:
>>
>> 19:02:44        CPU     %user     %nice   %system   %iowait %steal     %idle
>> 19:02:45        all      0.00      0.00    100.00      0.00 0.00      0.00
>>
>> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s
>> pgscand/s pgsteal/s    %vmeff
>> 19:02:45     15284.00      0.00    428.00    352.00  19990.00 0.00      0.00
>> 15802.00      0.00
>>
>> And as nobody actively allocating memory anymore looks like memdelay
>> counters are not
>> actively incremented:
>>
>> [:~]$ cat /proc/memdelay
>> 268035776
>> 6.13 5.43 3.58
>> 1.90 1.89 1.26
>>
>> Just in case, I have attached the v4.9.26 rebased patched.
>>
> Looks like this 4.9 version does not contain the accounting in lock_page.

In v4.9 there is no wait_on_page_bit_common(), so the accounting moved to
wait_on_page_bit(_killable|_killable_timeout).
The related functionality around lock_page_or_retry() seems to be mostly the
same in v4.9.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
@ 2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-11-27  2:18               ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) @ 2017-11-20 19:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Taras Kondratiuk, Michal Hocko, linux-mm, xe-linux-external,
	linux-kernel

Hi Johannes,

I tested with your patches but the situation is still mostly the same.

I spent some time debugging and found that the problem is squashfs
specific (probably some other filesystems too).
The point is that for squashfs reads the iowait happens inside the
squashfs readpage() callback.
Here is a backtrace of the page fault handling to illustrate this:

  1)               |  handle_mm_fault() {
  1)               |    filemap_fault() {
  1)               |      __do_page_cache_readahead()
  1)               |        add_to_page_cache_lru()
  1)               |        squashfs_readpage() {
  1)               |          squashfs_readpage_block() {
  1)               |            squashfs_get_datablock() {
  1)               |              squashfs_cache_get() {
  1)               |                squashfs_read_data() {
  1)               |                  ll_rw_block() {
  1)               |                    submit_bh_wbc.isra.42()
  1)               |                  __wait_on_buffer() {
  1)               |                    io_schedule() {
  ------------------------------------------
  0)   kworker-79   =>    <idle>-0
  ------------------------------------------
  0)   0.382 us    |  blk_complete_request();
  0)               |  blk_done_softirq() {
  0)               |    blk_update_request() {
  0)               |      end_buffer_read_sync()
  0) + 38.559 us   |    }
  0) + 48.367 us   |  }
  ------------------------------------------
  0)   kworker-79   =>  memhog-781
  ------------------------------------------
  0) ! 278.848 us  |                    }
  0) ! 279.612 us  |                  }
  0)               |                  squashfs_decompress() {
  0) # 4919.082 us |                    squashfs_xz_uncompress();
  0) # 4919.864 us |                  }
  0) # 5479.212 us |                } /* squashfs_read_data */
  0) # 5479.749 us |              } /* squashfs_cache_get */
  0) # 5480.177 us |            } /* squashfs_get_datablock */
  0)               |            squashfs_copy_cache() {
  0)   0.057 us    |              unlock_page();
  0) ! 142.773 us  |            }
  0) # 5624.113 us |          } /* squashfs_readpage_block */
  0) # 5628.814 us |        } /* squashfs_readpage */
  0) # 5665.097 us |      } /* __do_page_cache_readahead */
  0) # 5667.437 us |    } /* filemap_fault */
  0) # 5672.880 us |  } /* handle_mm_fault */

As you can see, squashfs_read_data() schedules the IO via ll_rw_block() and
then waits for it to finish inside wait_on_buffer().
After that the read buffer is decompressed and the page is unlocked inside
the squashfs_readpage() handler.

Thus by the time filemap_fault() calls lock_page_or_retry() the page is
already uptodate and unlocked,
wait_on_page_bit() is not called at all, and the time spent on
read/decompress is not accounted.

I tried to apply a quick workaround for testing:

diff --git a/mm/readahead.c b/mm/readahead.c
index c4ca702..5e2be2b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -126,9 +126,21 @@ static int read_pages(struct address_space *mapping, struct file *filp,

      for (page_idx = 0; page_idx < nr_pages; page_idx++) {
          struct page *page = lru_to_page(pages);
+        bool refault = false;
+        unsigned long mdflags;
+
          list_del(&page->lru);
-        if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
+        if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
+            if (!PageUptodate(page) && PageWorkingset(page)) {
+                memdelay_enter(&mdflags);
+                refault = true;
+            }
+
              mapping->a_ops->readpage(filp, page);
+
+            if (refault)
+                memdelay_leave(&mdflags);
+        }
          put_page(page);

But I found that the situation is not much different.
The reason is that, at least in my synthetic tests, I'm exhausting the whole
memory, leaving almost no room for the page cache:

Active(anon):   15901788 kB
Inactive(anon):    44844 kB
Active(file):        488 kB
Inactive(file):      612 kB

As a result the refault distance is always higher than the LRU_ACTIVE_FILE
size and the Workingset flag is not set for the refaulting page, even if it
was active during its lifecycle before eviction:

         workingset_refault   7773
        workingset_activate   250
         workingset_restore   233
     workingset_nodereclaim   49
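
For context, the check in workingset_refault() that gates this compares the
refault distance against the size of the active file list, roughly like the
following (simplified sketch; the exact code differs between kernel versions
and memdelay patch revisions):

/*
 * Simplified: if more pages were activated since this page was evicted
 * than fit on the active file list, the refault is treated as outside the
 * workingset and we bail out before the page is activated or marked as
 * workingset - which is what always happens here, since the file LRUs
 * together hold only ~1MB.
 */
refault_distance = (refault - eviction) & EVICTION_MASK;

if (refault_distance > active_file)
	goto out;		/* no activation, no PG_workingset */

SetPageActive(page);
SetPageWorkingset(page);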

Tried to apply the following workaround:

diff --git a/mm/workingset.c b/mm/workingset.c
index 264f049..8035ef6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -305,6 +305,11 @@ void workingset_refault(struct page *page, void *shadow)

      inc_lruvec_state(lruvec, WORKINGSET_REFAULT);

+    /* Page was active prior to eviction */
+    if (workingset) {
+        SetPageWorkingset(page);
+        inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+    }
      /*
       * Compare the distance to the existing workingset size. We
       * don't act on pages that couldn't stay resident even if all
@@ -314,13 +319,9 @@ void workingset_refault(struct page *page, void *shadow)
          goto out;

      SetPageActive(page);
-    SetPageWorkingset(page);
      atomic_long_inc(&lruvec->inactive_age);
      inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);

-    /* Page was active prior to eviction */
-    if (workingset)
-        inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
  out:
      rcu_read_unlock();
  }

Now I see that refaults for pages are indeed accounted:

         workingset_refault   4987
        workingset_activate   590
         workingset_restore   4358
     workingset_nodereclaim   944

And the memdelay counters are actively incrementing too, indicating the
thrashing state:

[:~]$ cat /proc/memdelay
7539897381
63.22 63.19 44.58
14.36 15.11 11.80

So, do you know what the proper way to fix both issues would be?

--
Thanks,
Ruslan

On 10/27/2017 11:19 PM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC 
INC at Cisco) wrote:
> Hi Johannes,
>
> On 10/25/2017 08:54 PM, Johannes Weiner wrote:
>> Hi Ruslan,
>>
>> sorry about the delayed response, I missed the new activity in this
>> older thread.
>>
>> On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X 
>> (rruslich - GLOBALLOGIC INC at Cisco) wrote:
>>> Hi Johannes,
>>>
>>> Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
>>> version supported by us right now) and test a bit.
>>> The overall idea definitely looks promising, although I have one
>>> question on usage.
>>> Will it be able to account the time which processes spend handling
>>> major page faults (including fs and iowait time) of a refaulting page?
>> That's the main thing it should measure! :)
>>
>> The lock_page() and wait_on_page_locked() calls are where iowaits
>> happen on a cache miss. If those are refaults, they'll be counted.
>>
>>> We have one big application whose code occupies a large amount of the
>>> page cache; when the system under heavy memory usage reclaims some of
>>> it, the application starts constantly thrashing. Since its code lives
>>> on squashfs it spends the whole CPU time decompressing the pages, yet
>>> the memdelay counters do not seem to detect this situation.
>>> Here are some counters to indicate this:
>>>
>>> 19:02:44      CPU   %user   %nice  %system  %iowait   %steal    %idle
>>> 19:02:45      all    0.00    0.00   100.00     0.00     0.00     0.00
>>>
>>> 19:02:44   pgpgin/s pgpgout/s  fault/s majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s   %vmeff
>>> 19:02:45   15284.00      0.00   428.00   352.00  19990.00      0.00      0.00  15802.00     0.00
>>>
>>> And as nobody is actively allocating memory anymore, it looks like the
>>> memdelay counters are not actively incremented:
>>>
>>> [:~]$ cat /proc/memdelay
>>> 268035776
>>> 6.13 5.43 3.58
>>> 1.90 1.89 1.26
>> How does it correlate with /proc/vmstat::workingset_activate during
>> that time? It only counts thrashing time of refaults it can actively
>> detect.
> The workingset counters are growing quite actively too. Here are
> some numbers per second:
>
> workingset_refault   8201
> workingset_activate   389
> workingset_restore   187
> workingset_nodereclaim   313
>
>> Btw, how many CPUs does this system have? There is a bug in this
>> version on how idle time is aggregated across multiple CPUs. The error
>> compounds with the number of CPUs in the system.
> The system has 2 CPU cores.
>> I'm attaching 3 bugfixes that go on top of what you have. There might
>> be some conflicts, but they should be minor variable naming issues.
>>
> I will test with your patches and get back to you.
>
> Thanks,
> Ruslan

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
@ 2017-11-27  2:18               ` Minchan Kim
  0 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2017-11-27  2:18 UTC (permalink / raw)
  To: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  Cc: Johannes Weiner, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, linux-kernel

Hello,

On Mon, Nov 20, 2017 at 09:40:56PM +0200, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
> 
> I tested with your patches, but the situation is still mostly the same.
> 
> I spent some time debugging and found that the problem is squashfs-specific
> (probably affects some other filesystems too).
> The point is that the iowait for squashfs reads is waited for inside the
> squashfs readpage() callback.
> Here is a backtrace of the page fault handling to illustrate this:
> 
>  1)               |  handle_mm_fault() {
>  1)               |    filemap_fault() {
>  1)               |      __do_page_cache_readahead()
>  1)               |        add_to_page_cache_lru()
>  1)               |        squashfs_readpage() {
>  1)               |          squashfs_readpage_block() {
>  1)               |            squashfs_get_datablock() {
>  1)               |              squashfs_cache_get() {
>  1)               |                squashfs_read_data() {
>  1)               |                  ll_rw_block() {
>  1)               |                    submit_bh_wbc.isra.42()
>  1)               |                  __wait_on_buffer() {
>  1)               |                    io_schedule() {
>  ------------------------------------------
>  0)   kworker-79   =>    <idle>-0
>  ------------------------------------------
>  0)   0.382 us    |  blk_complete_request();
>  0)               |  blk_done_softirq() {
>  0)               |    blk_update_request() {
>  0)               |      end_buffer_read_sync()
>  0) + 38.559 us   |    }
>  0) + 48.367 us   |  }
>  ------------------------------------------
>  0)   kworker-79   =>  memhog-781
>  ------------------------------------------
>  0) ! 278.848 us  |                    }
>  0) ! 279.612 us  |                  }
>  0)               |                  squashfs_decompress() {
>  0) # 4919.082 us |                    squashfs_xz_uncompress();
>  0) # 4919.864 us |                  }
>  0) # 5479.212 us |                } /* squashfs_read_data */
>  0) # 5479.749 us |              } /* squashfs_cache_get */
>  0) # 5480.177 us |            } /* squashfs_get_datablock */
>  0)               |            squashfs_copy_cache() {
>  0)   0.057 us    |              unlock_page();
>  0) ! 142.773 us  |            }
>  0) # 5624.113 us |          } /* squashfs_readpage_block */
>  0) # 5628.814 us |        } /* squashfs_readpage */
>  0) # 5665.097 us |      } /* __do_page_cache_readahead */
>  0) # 5667.437 us |    } /* filemap_fault */
>  0) # 5672.880 us |  } /* handle_mm_fault */
> 
> As you can see, squashfs_read_data() schedules the IO via ll_rw_block() and
> then waits for it to finish inside wait_on_buffer().
> After that the read buffer is decompressed and the page is unlocked inside
> the squashfs_readpage() handler.
> 
> Thus by the time filemap_fault() calls lock_page_or_retry() the page is
> already uptodate and unlocked, wait_on_page_bit() is not called at all, and
> the time spent on read/decompress is not accounted.

A weakness in the current approach is that it relies on the page lock.
That means it cannot work with synchronous devices like DAX, zram and
so on, I think.

Johannes, can we add memdelay_enter() to every fault handler's prologue,
and then check in the epilogue whether the faulted page belongs to the
workingset? If it did, we could accumulate the time spent. That would work
with synchronous devices, especially zram, without hacking individual
filesystems like squashfs.
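
A very rough sketch of that idea (illustrative only: memdelay_enter()/
memdelay_leave() are from Johannes' patch set, but the VM_FAULT_WORKINGSET
bit and memdelay_cancel() below are hypothetical, invented just to show
where the information would be needed):

int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
		    unsigned int flags)
{
	unsigned long mdflags;
	int ret;

	memdelay_enter(&mdflags);		/* prologue */

	ret = __handle_mm_fault(vma, address, flags);

	/*
	 * Epilogue: only account the delay if the fault turned out to be a
	 * workingset refault.  How that is reported back (the hypothetical
	 * VM_FAULT_WORKINGSET bit) and how the non-refault case is
	 * cancelled are the open questions with this approach.
	 */
	if (ret & VM_FAULT_WORKINGSET)
		memdelay_leave(&mdflags);
	else
		memdelay_cancel(&mdflags);	/* hypothetical */

	return ret;
}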

I think the page fault handler/kswapd/direct reclaim would cover most
cases of *real* memory pressure, while the [un]lock_page friends would
account superfluously - for example, filesystems can call them easily
without any memory pressure.

> 
> Tried to apply a quick workaround for testing:
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index c4ca702..5e2be2b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -126,9 +126,21 @@ static int read_pages(struct address_space *mapping, struct file *filp,
> 
>      for (page_idx = 0; page_idx < nr_pages; page_idx++) {
>          struct page *page = lru_to_page(pages);
> +        bool refault = false;
> +        unsigned long mdflags;
> +
>          list_del(&page->lru);
> -        if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> +        if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
> +            if (!PageUptodate(page) && PageWorkingset(page)) {
> +                memdelay_enter(&mdflags);
> +                refault = true;
> +            }
> +
>              mapping->a_ops->readpage(filp, page);
> +
> +            if (refault)
> +                memdelay_leave(&mdflags);
> +        }
>          put_page(page);

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-11-27  2:18 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-15  0:16 Detecting page cache trashing state Taras Kondratiuk
2017-09-15 11:55 ` Zdenek Kabelac
2017-09-15 14:22 ` Daniel Walker
2017-09-15 16:38   ` Taras Kondratiuk
2017-09-15 17:31     ` Daniel Walker
2017-09-15 14:36 ` Michal Hocko
2017-09-15 17:28   ` Taras Kondratiuk
2017-09-18 16:34     ` Johannes Weiner
2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
2017-09-19 11:02       ` kbuild test robot
2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-10-25 16:53         ` Daniel Walker
2017-10-25 17:54         ` Johannes Weiner
2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-11-27  2:18               ` Minchan Kim
2017-10-26  3:53         ` vinayak menon
2017-10-27 20:29           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-09-15 21:20   ` vcaputo
2017-09-15 23:40     ` Taras Kondratiuk
2017-09-18  5:55     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).