linux-mm.kvack.org archive mirror
* Detecting page cache trashing state
@ 2017-09-15  0:16 Taras Kondratiuk
  2017-09-15 11:55 ` Zdenek Kabelac
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Taras Kondratiuk @ 2017-09-15  0:16 UTC (permalink / raw)
  To: linux-mm; +Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

Hi

On our devices, under low memory conditions we often get into a thrashing
state where the system spends most of its time re-reading pages of .text
sections from a file system (squashfs in our case). The working set doesn't
fit into the available page cache, so this is expected. The issue is that
the OOM killer doesn't get triggered because there is still memory to
reclaim. The system may stay stuck in this state for quite some time and
usually dies because of watchdogs.

We are trying to detect such a thrashing state early so we can take some
preventive actions. It should be a pretty common issue, but so far we
haven't found any existing VM/IO statistics that can reliably detect this
state.

Most metrics provide absolute values: number/rate of page faults, rate of
IO operations, number of stolen pages, etc. For a specific device
configuration we can determine threshold values for those parameters that
detect the thrashing state, but that is not feasible for hundreds of
device configurations.

We are looking for a relative metric like "percent of CPU time spent
handling major page faults". With such a relative metric we could use a
common threshold across all devices. For now we have added such a metric
to /proc/stat in our kernel, but we would like to find some mechanism
available in the upstream kernel.

Has somebody faced a similar issue? How are you solving it?
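
For illustration, a minimal userspace sampler for such a relative metric
might look like the sketch below. It assumes a hypothetical cumulative
"majflt_time" field (in USER_HZ ticks) exported via /proc/stat, similar to
the one added locally; the field name and format are placeholders, not an
upstream interface.

#include <stdio.h>
#include <unistd.h>

static unsigned long long read_majflt_ticks(void)
{
	char line[256];
	unsigned long long val = 0;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		/* hypothetical field: "majflt_time <cumulative ticks>" */
		if (sscanf(line, "majflt_time %llu", &val) == 1)
			break;
	}
	fclose(f);
	return val;
}

int main(void)
{
	long hz = sysconf(_SC_CLK_TCK);
	long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
	unsigned long long prev = read_majflt_ticks();

	for (;;) {
		sleep(1);
		unsigned long long cur = read_majflt_ticks();
		/* share of total CPU time spent handling major faults */
		double pct = 100.0 * (double)(cur - prev) / (double)(hz * ncpu);
		printf("time in major faults: %.1f%%\n", pct);
		prev = cur;
	}
	return 0;
}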


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15  0:16 Detecting page cache trashing state Taras Kondratiuk
@ 2017-09-15 11:55 ` Zdenek Kabelac
  2017-09-15 14:22 ` Daniel Walker
  2017-09-15 14:36 ` Michal Hocko
  2 siblings, 0 replies; 21+ messages in thread
From: Zdenek Kabelac @ 2017-09-15 11:55 UTC (permalink / raw)
  To: Taras Kondratiuk, linux-mm
  Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

On 15.9.2017 at 02:16, Taras Kondratiuk wrote:
> Hi
> 
> In our devices under low memory conditions we often get into a trashing
> state when system spends most of the time re-reading pages of .text
> sections from a file system (squashfs in our case). Working set doesn't
> fit into available page cache, so it is expected. The issue is that
> OOM killer doesn't get triggered because there is still memory for
> reclaiming. System may stuck in this state for a quite some time and
> usually dies because of watchdogs.
> 
> We are trying to detect such trashing state early to take some
> preventive actions. It should be a pretty common issue, but for now we
> haven't find any existing VM/IO statistics that can reliably detect such
> state.
> 
> Most of metrics provide absolute values: number/rate of page faults,
> rate of IO operations, number of stolen pages, etc. For a specific
> device configuration we can determine threshold values for those
> parameters that will detect trashing state, but it is not feasible for
> hundreds of device configurations.
> 
> We are looking for some relative metric like "percent of CPU time spent
> handling major page faults". With such relative metric we could use a
> common threshold across all devices. For now we have added such metric
> to /proc/stat in our kernel, but we would like to find some mechanism
> available in upstream kernel.
> 
> Has somebody faced similar issue? How are you solving it?
> 
Hi

Well, I witness this when running Firefox & Thunderbird on my desktop for a
while on a machine with just 4G of RAM, until these two apps eat all free RAM...

It gets to the point (when I open a new tab) where the mouse hardly moves -
kswapd eats CPU (I have no swap in fact, so it is likely just page-cache churn).

The only 'quick' solution for me as a desktop user is to manually invoke the OOM
killer with the SysRq+F key - and I'm also wondering why the system is not
reacting better.  In most cases it kills one of those two apps - but sometimes
it kills the whole X session...


Regards

Zdenek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15  0:16 Detecting page cache trashing state Taras Kondratiuk
  2017-09-15 11:55 ` Zdenek Kabelac
@ 2017-09-15 14:22 ` Daniel Walker
  2017-09-15 16:38   ` Taras Kondratiuk
  2017-09-15 14:36 ` Michal Hocko
  2 siblings, 1 reply; 21+ messages in thread
From: Daniel Walker @ 2017-09-15 14:22 UTC (permalink / raw)
  To: Taras Kondratiuk, linux-mm
  Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

On 09/14/2017 05:16 PM, Taras Kondratiuk wrote:
> Hi
>
> In our devices under low memory conditions we often get into a trashing
> state when system spends most of the time re-reading pages of .text
> sections from a file system (squashfs in our case). Working set doesn't
> fit into available page cache, so it is expected. The issue is that
> OOM killer doesn't get triggered because there is still memory for
> reclaiming. System may stuck in this state for a quite some time and
> usually dies because of watchdogs.
>
> We are trying to detect such trashing state early to take some
> preventive actions. It should be a pretty common issue, but for now we
> haven't find any existing VM/IO statistics that can reliably detect such
> state.
>
> Most of metrics provide absolute values: number/rate of page faults,
> rate of IO operations, number of stolen pages, etc. For a specific
> device configuration we can determine threshold values for those
> parameters that will detect trashing state, but it is not feasible for
> hundreds of device configurations.
>
> We are looking for some relative metric like "percent of CPU time spent
> handling major page faults". With such relative metric we could use a
> common threshold across all devices. For now we have added such metric
> to /proc/stat in our kernel, but we would like to find some mechanism
> available in upstream kernel.
>
> Has somebody faced similar issue? How are you solving it?


Did you make any attempt to tune swappiness?

Documentation/sysctl/vm.txt

swappiness

This control is used to define how aggressive the kernel will swap
memory pages.  Higher values will increase agressiveness, lower values
decrease the amount of swap.

The default value is 60.
=======================================================

Since you're using squashfs, I would guess that's going to act like swap.
The default of 60 is most likely tuned for x86 servers, which may not be a
good value for some other device.
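
The knob can be inspected or changed at runtime through
/proc/sys/vm/swappiness (equivalent to "sysctl vm.swappiness=<n>"); a
minimal sketch in C, though as noted in the reply below it makes little
difference on a swapless system:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	const char *path = "/proc/sys/vm/swappiness";
	FILE *f = fopen(path, "r");
	int val;

	if (!f || fscanf(f, "%d", &val) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("current vm.swappiness = %d\n", val);

	if (argc > 1) {			/* e.g. ./swappiness 10 (needs root) */
		f = fopen(path, "w");
		if (!f || fprintf(f, "%d\n", atoi(argv[1])) < 0) {
			perror(path);
			return 1;
		}
		fclose(f);
	}
	return 0;
}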


Daniel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15  0:16 Detecting page cache trashing state Taras Kondratiuk
  2017-09-15 11:55 ` Zdenek Kabelac
  2017-09-15 14:22 ` Daniel Walker
@ 2017-09-15 14:36 ` Michal Hocko
  2017-09-15 17:28   ` Taras Kondratiuk
  2017-09-15 21:20   ` vcaputo
  2 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2017-09-15 14:36 UTC (permalink / raw)
  To: Taras Kondratiuk
  Cc: linux-mm, xe-linux-external, Ruslan Ruslichenko, linux-kernel

On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> Hi
> 
> In our devices under low memory conditions we often get into a trashing
> state when system spends most of the time re-reading pages of .text
> sections from a file system (squashfs in our case). Working set doesn't
> fit into available page cache, so it is expected. The issue is that
> OOM killer doesn't get triggered because there is still memory for
> reclaiming. System may stuck in this state for a quite some time and
> usually dies because of watchdogs.
> 
> We are trying to detect such trashing state early to take some
> preventive actions. It should be a pretty common issue, but for now we
> haven't find any existing VM/IO statistics that can reliably detect such
> state.
> 
> Most of metrics provide absolute values: number/rate of page faults,
> rate of IO operations, number of stolen pages, etc. For a specific
> device configuration we can determine threshold values for those
> parameters that will detect trashing state, but it is not feasible for
> hundreds of device configurations.
> 
> We are looking for some relative metric like "percent of CPU time spent
> handling major page faults". With such relative metric we could use a
> common threshold across all devices. For now we have added such metric
> to /proc/stat in our kernel, but we would like to find some mechanism
> available in upstream kernel.
> 
> Has somebody faced similar issue? How are you solving it?

Yes, this has been a pain point for a _long_ time. And we still do not have a
good answer upstream. Johannes has been playing in this area [1].
The main problem is that our OOM detection logic is based on the ability
to reclaim memory in order to allocate new memory. And reclaim pretty much
still succeeds for the pagecache when you are thrashing. So we do not know
that basically the whole time is spent refaulting the memory back and forth.
We do have some refault stats for the page cache, but they are not
integrated into the OOM detection logic because this is really a
non-trivial problem to solve without triggering early OOM killer
invocations.

[1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
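
As a rough illustration of those refault stats, the sketch below samples
the cumulative workingset_refault and pgsteal_* counters from /proc/vmstat
and prints the refault rate and a refault/steal ratio. Picking a meaningful
threshold from these numbers is exactly the per-device problem described
above, so treat this as a crude signal only, not the integrated detection
discussed in the thread.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long vmstat_sum(const char *prefix)
{
	char name[64];
	unsigned long long v, sum = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", name, &v) == 2)
		if (!strncmp(name, prefix, strlen(prefix)))
			sum += v;
	fclose(f);
	return sum;
}

int main(void)
{
	unsigned long long refault = vmstat_sum("workingset_refault");
	unsigned long long steal = vmstat_sum("pgsteal_");

	for (;;) {
		sleep(5);
		unsigned long long r = vmstat_sum("workingset_refault");
		unsigned long long s = vmstat_sum("pgsteal_");
		double ratio = (s - steal) ?
			(double)(r - refault) / (double)(s - steal) : 0.0;

		/* refaults per second and share of reclaimed pages refaulted */
		printf("refaults/s %.0f  refault/steal ratio %.2f\n",
		       (r - refault) / 5.0, ratio);
		refault = r;
		steal = s;
	}
	return 0;
}
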
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15 14:22 ` Daniel Walker
@ 2017-09-15 16:38   ` Taras Kondratiuk
  2017-09-15 17:31     ` Daniel Walker
  0 siblings, 1 reply; 21+ messages in thread
From: Taras Kondratiuk @ 2017-09-15 16:38 UTC (permalink / raw)
  To: Daniel Walker, linux-mm
  Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

Quoting Daniel Walker (2017-09-15 07:22:27)
> On 09/14/2017 05:16 PM, Taras Kondratiuk wrote:
> > Hi
> >
> > In our devices under low memory conditions we often get into a trashing
> > state when system spends most of the time re-reading pages of .text
> > sections from a file system (squashfs in our case). Working set doesn't
> > fit into available page cache, so it is expected. The issue is that
> > OOM killer doesn't get triggered because there is still memory for
> > reclaiming. System may stuck in this state for a quite some time and
> > usually dies because of watchdogs.
> >
> > We are trying to detect such trashing state early to take some
> > preventive actions. It should be a pretty common issue, but for now we
> > haven't find any existing VM/IO statistics that can reliably detect such
> > state.
> >
> > Most of metrics provide absolute values: number/rate of page faults,
> > rate of IO operations, number of stolen pages, etc. For a specific
> > device configuration we can determine threshold values for those
> > parameters that will detect trashing state, but it is not feasible for
> > hundreds of device configurations.
> >
> > We are looking for some relative metric like "percent of CPU time spent
> > handling major page faults". With such relative metric we could use a
> > common threshold across all devices. For now we have added such metric
> > to /proc/stat in our kernel, but we would like to find some mechanism
> > available in upstream kernel.
> >
> > Has somebody faced similar issue? How are you solving it?
> 
> 
> Did you make any attempt to tune swappiness ?
> 
> Documentation/sysctl/vm.txt
> 
> swappiness
> 
> This control is used to define how aggressive the kernel will swap
> memory pages.  Higher values will increase agressiveness, lower values
> decrease the amount of swap.
> 
> The default value is 60.
> =======================================================
> 
> Since your using squashfs I would guess that's going to act like swap. 
> The default tune of 60 is most likely for x86 servers which may not be a 
> good value for some other device.

Swap is disabled on our systems, so anonymous pages can't be evicted.
As far as I understand, the swappiness tunable is therefore irrelevant.

Even with swap enabled, tuning swappiness couldn't help much in this case.
If the working set doesn't fit into the available page cache, we will hit
the same thrashing state.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15 14:36 ` Michal Hocko
@ 2017-09-15 17:28   ` Taras Kondratiuk
  2017-09-18 16:34     ` Johannes Weiner
  2017-09-15 21:20   ` vcaputo
  1 sibling, 1 reply; 21+ messages in thread
From: Taras Kondratiuk @ 2017-09-15 17:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, xe-linux-external, Ruslan Ruslichenko, linux-kernel

Quoting Michal Hocko (2017-09-15 07:36:19)
> On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > Hi
> > 
> > In our devices under low memory conditions we often get into a trashing
> > state when system spends most of the time re-reading pages of .text
> > sections from a file system (squashfs in our case). Working set doesn't
> > fit into available page cache, so it is expected. The issue is that
> > OOM killer doesn't get triggered because there is still memory for
> > reclaiming. System may stuck in this state for a quite some time and
> > usually dies because of watchdogs.
> > 
> > We are trying to detect such trashing state early to take some
> > preventive actions. It should be a pretty common issue, but for now we
> > haven't find any existing VM/IO statistics that can reliably detect such
> > state.
> > 
> > Most of metrics provide absolute values: number/rate of page faults,
> > rate of IO operations, number of stolen pages, etc. For a specific
> > device configuration we can determine threshold values for those
> > parameters that will detect trashing state, but it is not feasible for
> > hundreds of device configurations.
> > 
> > We are looking for some relative metric like "percent of CPU time spent
> > handling major page faults". With such relative metric we could use a
> > common threshold across all devices. For now we have added such metric
> > to /proc/stat in our kernel, but we would like to find some mechanism
> > available in upstream kernel.
> > 
> > Has somebody faced similar issue? How are you solving it?
> 
> Yes this is a pain point for a _long_ time. And we still do not have a
> good answer upstream. Johannes has been playing in this area [1].
> The main problem is that our OOM detection logic is based on the ability
> to reclaim memory to allocate new memory. And that is pretty much true
> for the pagecache when you are trashing. So we do not know that
> basically whole time is spent refaulting the memory back and forth.
> We do have some refault stats for the page cache but that is not
> integrated to the oom detection logic because this is really a
> non-trivial problem to solve without triggering early oom killer
> invocations.
> 
> [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org

Thanks Michal. memdelay looks promising. We will check it.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15 16:38   ` Taras Kondratiuk
@ 2017-09-15 17:31     ` Daniel Walker
  0 siblings, 0 replies; 21+ messages in thread
From: Daniel Walker @ 2017-09-15 17:31 UTC (permalink / raw)
  To: Taras Kondratiuk, linux-mm
  Cc: xe-linux-external, Ruslan Ruslichenko, linux-kernel

On 09/15/2017 09:38 AM, Taras Kondratiuk wrote:
> Quoting Daniel Walker (2017-09-15 07:22:27)
>> On 09/14/2017 05:16 PM, Taras Kondratiuk wrote:
>>> Hi
>>>
>>> In our devices under low memory conditions we often get into a trashing
>>> state when system spends most of the time re-reading pages of .text
>>> sections from a file system (squashfs in our case). Working set doesn't
>>> fit into available page cache, so it is expected. The issue is that
>>> OOM killer doesn't get triggered because there is still memory for
>>> reclaiming. System may stuck in this state for a quite some time and
>>> usually dies because of watchdogs.
>>>
>>> We are trying to detect such trashing state early to take some
>>> preventive actions. It should be a pretty common issue, but for now we
>>> haven't find any existing VM/IO statistics that can reliably detect such
>>> state.
>>>
>>> Most of metrics provide absolute values: number/rate of page faults,
>>> rate of IO operations, number of stolen pages, etc. For a specific
>>> device configuration we can determine threshold values for those
>>> parameters that will detect trashing state, but it is not feasible for
>>> hundreds of device configurations.
>>>
>>> We are looking for some relative metric like "percent of CPU time spent
>>> handling major page faults". With such relative metric we could use a
>>> common threshold across all devices. For now we have added such metric
>>> to /proc/stat in our kernel, but we would like to find some mechanism
>>> available in upstream kernel.
>>>
>>> Has somebody faced similar issue? How are you solving it?
>>
>> Did you make any attempt to tune swappiness ?
>>
>> Documentation/sysctl/vm.txt
>>
>> swappiness
>>
>> This control is used to define how aggressive the kernel will swap
>> memory pages.  Higher values will increase agressiveness, lower values
>> decrease the amount of swap.
>>
>> The default value is 60.
>> =======================================================
>>
>> Since your using squashfs I would guess that's going to act like swap.
>> The default tune of 60 is most likely for x86 servers which may not be a
>> good value for some other device.
> Swap is disabled in our systems, so anonymous pages can't be evicted.
> As per my understanding swappiness tune is irrelevant.
>
> Even with enabled swap swappiness tune can't help much in this case. If
> working set doesn't fit into available page cache we will hit the same
> trashing state.


I think it's our lack of understanding of how the VM works. If the
system has no swap, then the system shouldn't start evicting pages
unless you have 100% memory utilization; at that point the only place for
those pages to go is back into the backing store, squashfs in this case.

What you're suggesting is that there is still free memory, which means
something must be evicting pages more aggressively rather than waiting
until 100% utilization. Maybe someone more knowledgeable about the VM
subsystem can clear this up.

Daniel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15 14:36 ` Michal Hocko
  2017-09-15 17:28   ` Taras Kondratiuk
@ 2017-09-15 21:20   ` vcaputo
  2017-09-15 23:40     ` Taras Kondratiuk
  2017-09-18  5:55     ` Michal Hocko
  1 sibling, 2 replies; 21+ messages in thread
From: vcaputo @ 2017-09-15 21:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Taras Kondratiuk, linux-mm, xe-linux-external,
	Ruslan Ruslichenko, linux-kernel

On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote:
> On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > Hi
> > 
> > In our devices under low memory conditions we often get into a trashing
> > state when system spends most of the time re-reading pages of .text
> > sections from a file system (squashfs in our case). Working set doesn't
> > fit into available page cache, so it is expected. The issue is that
> > OOM killer doesn't get triggered because there is still memory for
> > reclaiming. System may stuck in this state for a quite some time and
> > usually dies because of watchdogs.
> > 
> > We are trying to detect such trashing state early to take some
> > preventive actions. It should be a pretty common issue, but for now we
> > haven't find any existing VM/IO statistics that can reliably detect such
> > state.
> > 
> > Most of metrics provide absolute values: number/rate of page faults,
> > rate of IO operations, number of stolen pages, etc. For a specific
> > device configuration we can determine threshold values for those
> > parameters that will detect trashing state, but it is not feasible for
> > hundreds of device configurations.
> > 
> > We are looking for some relative metric like "percent of CPU time spent
> > handling major page faults". With such relative metric we could use a
> > common threshold across all devices. For now we have added such metric
> > to /proc/stat in our kernel, but we would like to find some mechanism
> > available in upstream kernel.
> > 
> > Has somebody faced similar issue? How are you solving it?
> 
> Yes this is a pain point for a _long_ time. And we still do not have a
> good answer upstream. Johannes has been playing in this area [1].
> The main problem is that our OOM detection logic is based on the ability
> to reclaim memory to allocate new memory. And that is pretty much true
> for the pagecache when you are trashing. So we do not know that
> basically whole time is spent refaulting the memory back and forth.
> We do have some refault stats for the page cache but that is not
> integrated to the oom detection logic because this is really a
> non-trivial problem to solve without triggering early oom killer
> invocations.
> 
> [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org

For desktop users running without swap, couldn't we just provide a kernel
setting which marks all executable pages as unevictable when first faulted
in?  Then at least thrashing within the space occupied by executables and
shared libraries before eventual OOM would be avoided, and only the
remaining file-backed non-executable pages would be thrashable.

On my swapless laptops I'd much rather have the OOM killer kick in immediately
than wait for a few minutes of thrashing to pass while the bogged-down
system crawls through depleting what's left of technically reclaimable
memory.  It's much improved on modern SSDs, but still annoying.
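
A rough userspace approximation of this idea, for a single process rather
than a system-wide setting, would be to mlock() the executable mappings
listed in /proc/self/maps. A minimal sketch is below; it needs CAP_IPC_LOCK
or a raised RLIMIT_MEMLOCK, and the caveats raised in the replies (wasted
memory, DoS risk) apply just as much here.

#include <stdio.h>
#include <sys/mman.h>

static void mlock_own_text(void)
{
	char line[512];
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;
		char perms[5];

		if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
			continue;
		if (perms[2] != 'x')	/* pin executable mappings only */
			continue;
		if (mlock((void *)start, end - start))
			perror("mlock");
	}
	fclose(f);
}

int main(void)
{
	mlock_own_text();
	/* ... application continues; its text can no longer be reclaimed ... */
	return 0;
}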

Regards,
Vito Caputo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15 21:20   ` vcaputo
@ 2017-09-15 23:40     ` Taras Kondratiuk
  2017-09-18  5:55     ` Michal Hocko
  1 sibling, 0 replies; 21+ messages in thread
From: Taras Kondratiuk @ 2017-09-15 23:40 UTC (permalink / raw)
  To: Michal Hocko, vcaputo
  Cc: linux-mm, xe-linux-external, Ruslan Ruslichenko, linux-kernel

Quoting vcaputo@pengaru.com (2017-09-15 14:20:28)
> On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote:
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Hi
> > > 
> > > In our devices under low memory conditions we often get into a trashing
> > > state when system spends most of the time re-reading pages of .text
> > > sections from a file system (squashfs in our case). Working set doesn't
> > > fit into available page cache, so it is expected. The issue is that
> > > OOM killer doesn't get triggered because there is still memory for
> > > reclaiming. System may stuck in this state for a quite some time and
> > > usually dies because of watchdogs.
> > > 
> > > We are trying to detect such trashing state early to take some
> > > preventive actions. It should be a pretty common issue, but for now we
> > > haven't find any existing VM/IO statistics that can reliably detect such
> > > state.
> > > 
> > > Most of metrics provide absolute values: number/rate of page faults,
> > > rate of IO operations, number of stolen pages, etc. For a specific
> > > device configuration we can determine threshold values for those
> > > parameters that will detect trashing state, but it is not feasible for
> > > hundreds of device configurations.
> > > 
> > > We are looking for some relative metric like "percent of CPU time spent
> > > handling major page faults". With such relative metric we could use a
> > > common threshold across all devices. For now we have added such metric
> > > to /proc/stat in our kernel, but we would like to find some mechanism
> > > available in upstream kernel.
> > > 
> > > Has somebody faced similar issue? How are you solving it?
> > 
> > Yes this is a pain point for a _long_ time. And we still do not have a
> > good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the ability
> > to reclaim memory to allocate new memory. And that is pretty much true
> > for the pagecache when you are trashing. So we do not know that
> > basically whole time is spent refaulting the memory back and forth.
> > We do have some refault stats for the page cache but that is not
> > integrated to the oom detection logic because this is really a
> > non-trivial problem to solve without triggering early oom killer
> > invocations.
> > 
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
> 
> For desktop users running without swap, couldn't we just provide a kernel
> setting which marks all executable pages as unevictable when first faulted
> in?  Then at least thrashing within the space occupied by executables and
> shared libraries before eventual OOM would be avoided, and only the
> remaining file-backed non-executable pages would be thrashable.
> 
> On my swapless laptops I'd much rather have OOM killer kick in immediately
> rather than wait for a few minutes of thrashing to pass while the bogged
> down system crawls through depleting what's left of technically reclaimable
> memory.  It's much improved on modern SSDs, but still annoying.

Usually a significant part of an executable is used rarely or only once
during initialization. Pinning all executable pages forever would waste
a lot of memory.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15 21:20   ` vcaputo
  2017-09-15 23:40     ` Taras Kondratiuk
@ 2017-09-18  5:55     ` Michal Hocko
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-09-18  5:55 UTC (permalink / raw)
  To: vcaputo
  Cc: Taras Kondratiuk, linux-mm, xe-linux-external,
	Ruslan Ruslichenko, linux-kernel

On Fri 15-09-17 14:20:28, vcaputo@pengaru.com wrote:
> On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote:
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Hi
> > > 
> > > In our devices under low memory conditions we often get into a trashing
> > > state when system spends most of the time re-reading pages of .text
> > > sections from a file system (squashfs in our case). Working set doesn't
> > > fit into available page cache, so it is expected. The issue is that
> > > OOM killer doesn't get triggered because there is still memory for
> > > reclaiming. System may stuck in this state for a quite some time and
> > > usually dies because of watchdogs.
> > > 
> > > We are trying to detect such trashing state early to take some
> > > preventive actions. It should be a pretty common issue, but for now we
> > > haven't find any existing VM/IO statistics that can reliably detect such
> > > state.
> > > 
> > > Most of metrics provide absolute values: number/rate of page faults,
> > > rate of IO operations, number of stolen pages, etc. For a specific
> > > device configuration we can determine threshold values for those
> > > parameters that will detect trashing state, but it is not feasible for
> > > hundreds of device configurations.
> > > 
> > > We are looking for some relative metric like "percent of CPU time spent
> > > handling major page faults". With such relative metric we could use a
> > > common threshold across all devices. For now we have added such metric
> > > to /proc/stat in our kernel, but we would like to find some mechanism
> > > available in upstream kernel.
> > > 
> > > Has somebody faced similar issue? How are you solving it?
> > 
> > Yes this is a pain point for a _long_ time. And we still do not have a
> > good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the ability
> > to reclaim memory to allocate new memory. And that is pretty much true
> > for the pagecache when you are trashing. So we do not know that
> > basically whole time is spent refaulting the memory back and forth.
> > We do have some refault stats for the page cache but that is not
> > integrated to the oom detection logic because this is really a
> > non-trivial problem to solve without triggering early oom killer
> > invocations.
> > 
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
> 
> For desktop users running without swap, couldn't we just provide a kernel
> setting which marks all executable pages as unevictable when first faulted
> in?

This could turn into an immediate DoS vector, and you could see thrashing
elsewhere. In fact we already do protect executable pages and reclaim
them later (see page_check_references).

I am afraid that the only way to resolve the thrashing behavior is to
release a larger amount of memory, because shifting the reclaim priority
will just push the suboptimal behavior somewhere else. In order to do
that we really have to detect that the working set doesn't fit into
memory and that refaults dominate system activity.
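
For illustration, one possible shape of such a detect-and-act loop in
userspace is sketched below. read_pressure() is only a placeholder for
whatever thrashing signal ends up being available (refault counters, the
memdelay averages discussed in this thread, etc.), and the threshold and
timing are arbitrary. Writing 'f' to /proc/sysrq-trigger forces an OOM
kill, the same action Zdenek triggers manually with SysRq+F.

#include <stdio.h>
#include <unistd.h>

static double read_pressure(void)
{
	return 0.0;	/* placeholder: plug in a refault-based metric here */
}

static void trigger_oom_kill(void)
{
	/* writing to /proc/sysrq-trigger requires root */
	FILE *f = fopen("/proc/sysrq-trigger", "w");

	if (f) {
		fputc('f', f);	/* manual OOM kill, like SysRq+F */
		fclose(f);
	}
}

int main(void)
{
	int high = 0;

	for (;;) {
		sleep(1);
		if (read_pressure() > 50.0)	/* e.g. >50% of time stalled */
			high++;
		else
			high = 0;
		if (high >= 10) {		/* sustained for ~10 seconds */
			trigger_oom_kill();
			high = 0;
		}
	}
	return 0;
}
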
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-15 17:28   ` Taras Kondratiuk
@ 2017-09-18 16:34     ` Johannes Weiner
  2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Johannes Weiner @ 2017-09-18 16:34 UTC (permalink / raw)
  To: Taras Kondratiuk
  Cc: Michal Hocko, linux-mm, xe-linux-external, Ruslan Ruslichenko,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1421 bytes --]

Hi Taras,

On Fri, Sep 15, 2017 at 10:28:30AM -0700, Taras Kondratiuk wrote:
> Quoting Michal Hocko (2017-09-15 07:36:19)
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Has somebody faced similar issue? How are you solving it?
> > 
> > Yes this is a pain point for a _long_ time. And we still do not have a
> > good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the ability
> > to reclaim memory to allocate new memory. And that is pretty much true
> > for the pagecache when you are trashing. So we do not know that
> > basically whole time is spent refaulting the memory back and forth.
> > We do have some refault stats for the page cache but that is not
> > integrated to the oom detection logic because this is really a
> > non-trivial problem to solve without triggering early oom killer
> > invocations.
> > 
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
> 
> Thanks Michal. memdelay looks promising. We will check it.

Great, I'm obviously interested in more users of it :) Please find
attached the latest version of the patch series based on v4.13.

It needs a bit more refactoring in the scheduler bits before
resubmission, but it already contains a couple of fixes and
improvements since the first version I sent out.

Let me know if you need help rebasing to a different kernel version.

[-- Attachment #2: 0001-sched-loadavg-consolidate-LOAD_INT-LOAD_FRAC-macros.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros
  2017-09-18 16:34     ` Johannes Weiner
@ 2017-09-19 10:55       ` kbuild test robot
  2017-09-19 11:02       ` kbuild test robot
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2 siblings, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2017-09-19 10:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, Ruslan Ruslichenko, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1091 bytes --]

Hi Johannes,

[auto build test ERROR on v4.13]
[cannot apply to mmotm/master linus/master tip/sched/core v4.14-rc1 next-20170919]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/sched-loadavg-consolidate-LOAD_INT-LOAD_FRAC-macros/20170919-161057
config: blackfin-TCM-BF537_defconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=blackfin 

All errors (new ones prefixed by >>):

   mm/memdelay.o: In function `memdelay_task_change':
>> mm/memdelay.c:(.text+0x270): undefined reference to `__udivdi3'
   mm/memdelay.c:(.text+0x2e4): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11020 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros
  2017-09-18 16:34     ` Johannes Weiner
  2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
@ 2017-09-19 11:02       ` kbuild test robot
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2 siblings, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2017-09-19 11:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, Ruslan Ruslichenko, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1081 bytes --]

Hi Johannes,

[auto build test ERROR on v4.13]
[cannot apply to mmotm/master linus/master tip/sched/core v4.14-rc1 next-20170919]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/sched-loadavg-consolidate-LOAD_INT-LOAD_FRAC-macros/20170919-161057
config: c6x-evmc6678_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=c6x 

All errors (new ones prefixed by >>):

   mm/memdelay.o: In function `memdelay_task_change':
>> memdelay.c:(.text+0x2bc): undefined reference to `__c6xabi_divull'
   memdelay.c:(.text+0x438): undefined reference to `__c6xabi_divull'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 5431 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-18 16:34     ` Johannes Weiner
  2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
  2017-09-19 11:02       ` kbuild test robot
@ 2017-09-28 15:49       ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-10-25 16:53         ` Daniel Walker
                           ` (2 more replies)
  2 siblings, 3 replies; 21+ messages in thread
From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) @ 2017-09-28 15:49 UTC (permalink / raw)
  To: Johannes Weiner, Taras Kondratiuk
  Cc: Michal Hocko, linux-mm, xe-linux-external, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3085 bytes --]

Hi Johannes,

Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
version we currently support) and test it a bit.
The overall idea definitely looks promising, although I have one
question on usage.
Will it be able to account for the time which processes spend handling
major page faults (including fs and iowait time) of a refaulting page?

We have one big application whose code occupies a large amount of space
in the page cache. When the system reclaims some of it under heavy memory
usage, the application starts thrashing constantly. Since its code is
placed on squashfs, it spends all of its CPU time decompressing the pages,
and the memdelay counters do not seem to detect this situation.
Here are some counters to illustrate this:

19:02:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
19:02:45        all      0.00      0.00    100.00      0.00      0.00      0.00

19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
19:02:45     15284.00      0.00    428.00    352.00  19990.00      0.00      0.00  15802.00      0.00

And since nobody is actively allocating memory anymore, it looks like the
memdelay counters are not being incremented much:

[:~]$ cat /proc/memdelay
268035776
6.13 5.43 3.58
1.90 1.89 1.26

Just in case, I have attached the rebased v4.9.26 patches.

I have also attached the patch with our current solution. The current
implementation mostly fits the squashfs-only thrashing situation; in the
general case iowait time would be the major part of page fault handling,
so it needs to be accounted for too.

Thanks,
Ruslan

On 09/18/2017 07:34 PM, Johannes Weiner wrote:
> Hi Taras,
>
> On Fri, Sep 15, 2017 at 10:28:30AM -0700, Taras Kondratiuk wrote:
>> Quoting Michal Hocko (2017-09-15 07:36:19)
>>> On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
>>>> Has somebody faced similar issue? How are you solving it?
>>> Yes this is a pain point for a _long_ time. And we still do not have a
>>> good answer upstream. Johannes has been playing in this area [1].
>>> The main problem is that our OOM detection logic is based on the ability
>>> to reclaim memory to allocate new memory. And that is pretty much true
>>> for the pagecache when you are trashing. So we do not know that
>>> basically whole time is spent refaulting the memory back and forth.
>>> We do have some refault stats for the page cache but that is not
>>> integrated to the oom detection logic because this is really a
>>> non-trivial problem to solve without triggering early oom killer
>>> invocations.
>>>
>>> [1] http://lkml.kernel.org/r/20170727153010.23347-1-hannes@cmpxchg.org
>> Thanks Michal. memdelay looks promising. We will check it.
> Great, I'm obviously interested in more users of it :) Please find
> attached the latest version of the patch series based on v4.13.
>
> It needs a bit more refactoring in the scheduler bits before
> resubmission, but it already contains a couple of fixes and
> improvements since the first version I sent out.
>
> Let me know if you need help rebasing to a different kernel version.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0002-mm-sched-memdelay-memory-health-interface-for-system.patch --]
[-- Type: text/x-patch; name="0002-mm-sched-memdelay-memory-health-interface-for-system.patch", Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
@ 2017-10-25 16:53         ` Daniel Walker
  2017-10-25 17:54         ` Johannes Weiner
  2017-10-26  3:53         ` vinayak menon
  2 siblings, 0 replies; 21+ messages in thread
From: Daniel Walker @ 2017-10-25 16:53 UTC (permalink / raw)
  To: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco),
	Johannes Weiner, Taras Kondratiuk
  Cc: Michal Hocko, linux-mm, xe-linux-external, linux-kernel

On 09/28/2017 08:49 AM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC 
INC at Cisco) wrote:
> Hi Johannes,
>
> Hopefully I was able to rebase the patch on top v4.9.26 (latest 
> supported version by us right now)
> and test a bit.
> The overall idea definitely looks promising, although I have one 
> question on usage.
> Will it be able to account the time which processes spend on handling 
> major page faults
> (including fs and iowait time) of refaulting page?

Johannes, did you get a chance to review the changes from Ruslan?

Daniel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-10-25 16:53         ` Daniel Walker
@ 2017-10-25 17:54         ` Johannes Weiner
  2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-10-26  3:53         ` vinayak menon
  2 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2017-10-25 17:54 UTC (permalink / raw)
  To: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  Cc: Taras Kondratiuk, Michal Hocko, linux-mm, xe-linux-external,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2201 bytes --]

Hi Ruslan,

sorry about the delayed response, I missed the new activity in this
older thread.

On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
> 
> Hopefully I was able to rebase the patch on top v4.9.26 (latest supported
> version by us right now)
> and test a bit.
> The overall idea definitely looks promising, although I have one question on
> usage.
> Will it be able to account the time which processes spend on handling major
> page faults
> (including fs and iowait time) of refaulting page?

That's the main thing it should measure! :)

The lock_page() and wait_on_page_locked() calls are where iowaits
happen on a cache miss. If those are refaults, they'll be counted.

> As we have one big application which code space occupies big amount of place
> in page cache,
> when the system under heavy memory usage will reclaim some of it, the
> application will
> start constantly thrashing. Since it code is placed on squashfs it spends
> whole CPU time
> decompressing the pages and seem memdelay counters are not detecting this
> situation.
> Here are some counters to indicate this:
> 
> 19:02:44        CPU     %user     %nice   %system   %iowait %steal     %idle
> 19:02:45        all      0.00      0.00    100.00      0.00 0.00      0.00
> 
> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s
> pgscand/s pgsteal/s    %vmeff
> 19:02:45     15284.00      0.00    428.00    352.00  19990.00 0.00      0.00
> 15802.00      0.00
> 
> And as nobody actively allocating memory anymore looks like memdelay
> counters are not
> actively incremented:
> 
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26

How does it correlate with /proc/vmstat::workingset_activate during
that time? It only counts thrashing time of refaults it can actively
detect.

Btw, how many CPUs does this system have? There is a bug in this
version on how idle time is aggregated across multiple CPUs. The error
compounds with the number of CPUs in the system.

I'm attaching 3 bugfixes that go on top of what you have. There might
be some conflicts, but they should be minor variable naming issues.


[-- Attachment #2: 0001-mm-memdelay-fix-task-flags-race-condition.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-10-25 16:53         ` Daniel Walker
  2017-10-25 17:54         ` Johannes Weiner
@ 2017-10-26  3:53         ` vinayak menon
  2017-10-27 20:29           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2 siblings, 1 reply; 21+ messages in thread
From: vinayak menon @ 2017-10-26  3:53 UTC (permalink / raw)
  To: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  Cc: Johannes Weiner, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, linux-kernel

On Thu, Sep 28, 2017 at 9:19 PM, Ruslan Ruslichenko -X (rruslich -
GLOBALLOGIC INC at Cisco) <rruslich@cisco.com> wrote:
> Hi Johannes,
>
> Hopefully I was able to rebase the patch on top v4.9.26 (latest supported
> version by us right now)
> and test a bit.
> The overall idea definitely looks promising, although I have one question on
> usage.
> Will it be able to account the time which processes spend on handling major
> page faults
> (including fs and iowait time) of refaulting page?
>
> As we have one big application which code space occupies big amount of place
> in page cache,
> when the system under heavy memory usage will reclaim some of it, the
> application will
> start constantly thrashing. Since it code is placed on squashfs it spends
> whole CPU time
> decompressing the pages and seem memdelay counters are not detecting this
> situation.
> Here are some counters to indicate this:
>
> 19:02:44        CPU     %user     %nice   %system   %iowait %steal     %idle
> 19:02:45        all      0.00      0.00    100.00      0.00 0.00      0.00
>
> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s
> pgscand/s pgsteal/s    %vmeff
> 19:02:45     15284.00      0.00    428.00    352.00  19990.00 0.00      0.00
> 15802.00      0.00
>
> And as nobody actively allocating memory anymore looks like memdelay
> counters are not
> actively incremented:
>
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26
>
> Just in case, I have attached the v4.9.26 rebased patched.
>
Looks like this 4.9 version does not contain the accounting in lock_page.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-10-25 17:54         ` Johannes Weiner
@ 2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  0 siblings, 1 reply; 21+ messages in thread
From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) @ 2017-10-27 20:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Taras Kondratiuk, Michal Hocko, linux-mm, xe-linux-external,
	linux-kernel

Hi Johannes,

On 10/25/2017 08:54 PM, Johannes Weiner wrote:
> Hi Ruslan,
>
> sorry about the delayed response, I missed the new activity in this
> older thread.
>
> On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
>> Hi Johannes,
>>
>> Hopefully I was able to rebase the patch on top v4.9.26 (latest supported
>> version by us right now)
>> and test a bit.
>> The overall idea definitely looks promising, although I have one question on
>> usage.
>> Will it be able to account the time which processes spend on handling major
>> page faults
>> (including fs and iowait time) of refaulting page?
> That's the main thing it should measure! :)
>
> The lock_page() and wait_on_page_locked() calls are where iowaits
> happen on a cache miss. If those are refaults, they'll be counted.
>
>> As we have one big application which code space occupies big amount of place
>> in page cache,
>> when the system under heavy memory usage will reclaim some of it, the
>> application will
>> start constantly thrashing. Since it code is placed on squashfs it spends
>> whole CPU time
>> decompressing the pages and seem memdelay counters are not detecting this
>> situation.
>> Here are some counters to indicate this:
>>
>> 19:02:44        CPU     %user     %nice   %system   %iowait %steal     %idle
>> 19:02:45        all      0.00      0.00    100.00      0.00 0.00      0.00
>>
>> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s
>> pgscand/s pgsteal/s    %vmeff
>> 19:02:45     15284.00      0.00    428.00    352.00  19990.00 0.00      0.00
>> 15802.00      0.00
>>
>> And as nobody actively allocating memory anymore looks like memdelay
>> counters are not
>> actively incremented:
>>
>> [:~]$ cat /proc/memdelay
>> 268035776
>> 6.13 5.43 3.58
>> 1.90 1.89 1.26
> How does it correlate with /proc/vmstat::workingset_activate during
> that time? It only counts thrashing time of refaults it can actively
> detect.
The workingset counters are growing quite actively too. Here are
some numbers per second:

workingset_refault   8201
workingset_activate   389
workingset_restore   187
workingset_nodereclaim   313

> Btw, how many CPUs does this system have? There is a bug in this
> version on how idle time is aggregated across multiple CPUs. The error
> compounds with the number of CPUs in the system.
The system has 2 CPU cores.
> I'm attaching 3 bugfixes that go on top of what you have. There might
> be some conflicts, but they should be minor variable naming issues.
>
I will test with your patches and get back to you.

Thanks,
Ruslan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-10-26  3:53         ` vinayak menon
@ 2017-10-27 20:29           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  0 siblings, 0 replies; 21+ messages in thread
From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) @ 2017-10-27 20:29 UTC (permalink / raw)
  To: vinayak menon
  Cc: Johannes Weiner, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, linux-kernel

On 10/26/2017 06:53 AM, vinayak menon wrote:
> On Thu, Sep 28, 2017 at 9:19 PM, Ruslan Ruslichenko -X (rruslich -
> GLOBALLOGIC INC at Cisco) <rruslich@cisco.com> wrote:
>> Hi Johannes,
>>
>> Hopefully I was able to rebase the patch on top v4.9.26 (latest supported
>> version by us right now)
>> and test a bit.
>> The overall idea definitely looks promising, although I have one question on
>> usage.
>> Will it be able to account the time which processes spend on handling major
>> page faults
>> (including fs and iowait time) of refaulting page?
>>
>> As we have one big application which code space occupies big amount of place
>> in page cache,
>> when the system under heavy memory usage will reclaim some of it, the
>> application will
>> start constantly thrashing. Since it code is placed on squashfs it spends
>> whole CPU time
>> decompressing the pages and seem memdelay counters are not detecting this
>> situation.
>> Here are some counters to indicate this:
>>
>> 19:02:44        CPU     %user     %nice   %system   %iowait %steal     %idle
>> 19:02:45        all      0.00      0.00    100.00      0.00 0.00      0.00
>>
>> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s
>> pgscand/s pgsteal/s    %vmeff
>> 19:02:45     15284.00      0.00    428.00    352.00  19990.00 0.00      0.00
>> 15802.00      0.00
>>
>> And as nobody actively allocating memory anymore looks like memdelay
>> counters are not
>> actively incremented:
>>
>> [:~]$ cat /proc/memdelay
>> 268035776
>> 6.13 5.43 3.58
>> 1.90 1.89 1.26
>>
>> Just in case, I have attached the v4.9.26 rebased patched.
>>
> Looks like this 4.9 version does not contain the accounting in lock_page.

In v4.9 there is no wait_on_page_bit_common(), so the accounting moved to
wait_on_page_bit(_killable|_killable_timeout).
The related functionality around lock_page_or_retry() seems to be mostly
the same in v4.9.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
@ 2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  2017-11-27  2:18               ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) @ 2017-11-20 19:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Taras Kondratiuk, Michal Hocko, linux-mm, xe-linux-external,
	linux-kernel

Hi Johannes,

I tested with your patches, but the situation is still mostly the same.

I spent some time debugging and found that the problem is squashfs
specific (and probably affects some other filesystems too).
The point is that the iowait for squashfs reads happens inside the
squashfs readpage() callback.
Here is a backtrace of page fault handling to illustrate this:

  1)               |  handle_mm_fault() {
  1)               |    filemap_fault() {
  1)               |      __do_page_cache_readahead()
  1)               |        add_to_page_cache_lru()
  1)               |        squashfs_readpage() {
  1)               |          squashfs_readpage_block() {
  1)               |            squashfs_get_datablock() {
  1)               |              squashfs_cache_get() {
  1)               |                squashfs_read_data() {
  1)               |                  ll_rw_block() {
  1)               |                    submit_bh_wbc.isra.42()
  1)               |                  __wait_on_buffer() {
  1)               |                    io_schedule() {
  ------------------------------------------
  0)   kworker-79   =>    <idle>-0
  ------------------------------------------
  0)   0.382 us    |  blk_complete_request();
  0)               |  blk_done_softirq() {
  0)               |    blk_update_request() {
  0)               |      end_buffer_read_sync()
  0) + 38.559 us   |    }
  0) + 48.367 us   |  }
  ------------------------------------------
  0)   kworker-79   =>  memhog-781
  ------------------------------------------
  0) ! 278.848 us  |                    }
  0) ! 279.612 us  |                  }
  0)               |                  squashfs_decompress() {
  0) # 4919.082 us |                    squashfs_xz_uncompress();
  0) # 4919.864 us |                  }
  0) # 5479.212 us |                } /* squashfs_read_data */
  0) # 5479.749 us |              } /* squashfs_cache_get */
  0) # 5480.177 us |            } /* squashfs_get_datablock */
  0)               |            squashfs_copy_cache() {
  0)   0.057 us    |              unlock_page();
  0) ! 142.773 us  |            }
  0) # 5624.113 us |          } /* squashfs_readpage_block */
  0) # 5628.814 us |        } /* squashfs_readpage */
  0) # 5665.097 us |      } /* __do_page_cache_readahead */
  0) # 5667.437 us |    } /* filemap_fault */
  0) # 5672.880 us |  } /* handle_mm_fault */

As you can see, squashfs_read_data() submits the IO via ll_rw_block() and
then waits for it to finish inside wait_on_buffer().
After that the read buffer is decompressed and the page is unlocked inside
the squashfs_readpage() handler.

Thus, by the time filemap_fault() calls lock_page_or_retry(), the page is
already uptodate and unlocked, wait_on_page_bit() is not called at all,
and the time spent on read/decompress is not accounted.

I tried to apply a quick workaround for testing:

diff --git a/mm/readahead.c b/mm/readahead.c
index c4ca702..5e2be2b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -126,9 +126,21 @@ static int read_pages(struct address_space *mapping, struct file *filp,

      for (page_idx = 0; page_idx < nr_pages; page_idx++) {
          struct page *page = lru_to_page(pages);
+        bool refault = false;
+        unsigned long mdflags;
+
          list_del(&page->lru);
-        if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
+        if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
+            if (!PageUptodate(page) && PageWorkingset(page)) {
+                memdelay_enter(&mdflags);
+                refault = true;
+            }
+
              mapping->a_ops->readpage(filp, page);
+
+            if (refault)
+                memdelay_leave(&mdflags);
+        }
          put_page(page);

But I found that the situation is not much different.
The reason is that, at least in my synthetic tests, I'm exhausting the
whole memory, leaving almost no room for the page cache:

Active(anon):   15901788 kB
Inactive(anon):    44844 kB
Active(file):        488 kB
Inactive(file):      612 kB

As a result the refault distance is always higher than the LRU_ACTIVE_FILE
size, and the Workingset flag is not set for a refaulting page even if it
was active during its lifecycle before eviction:

         workingset_refault   7773
        workingset_activate   250
         workingset_restore   233
     workingset_nodereclaim   49
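
(For context, the gate being hit is roughly the following, a simplified
paraphrase of the workingset_refault() logic rather than verbatim kernel
code; with only ~1 MB of file LRU left, refault_distance practically
always exceeds the active list size, so the activation branch is never
reached:)

	/* How many other pages were evicted after this one was evicted. */
	refault_distance = (refault - eviction) & EVICTION_MASK;

	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);

	/*
	 * active_file_size is the current size of the active file LRU --
	 * only a few hundred kB in the numbers above.  Pages evicted
	 * "further back" than that are treated as cold and get neither
	 * PG_active nor PG_workingset.
	 */
	if (refault_distance > active_file_size)
		goto out;

	SetPageActive(page);
	SetPageWorkingset(page);
	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);

	/* Page was active prior to eviction */
	if (workingset)
		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
out: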

I tried to apply the following workaround:

diff --git a/mm/workingset.c b/mm/workingset.c
index 264f049..8035ef6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -305,6 +305,11 @@ void workingset_refault(struct page *page, void *shadow)

      inc_lruvec_state(lruvec, WORKINGSET_REFAULT);

+    /* Page was active prior to eviction */
+    if (workingset) {
+        SetPageWorkingset(page);
+        inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+    }
      /*
       * Compare the distance to the existing workingset size. We
       * don't act on pages that couldn't stay resident even if all
@@ -314,13 +319,9 @@ void workingset_refault(struct page *page, void *shadow)
          goto out;

      SetPageActive(page);
-    SetPageWorkingset(page);
      atomic_long_inc(&lruvec->inactive_age);
      inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);

-    /* Page was active prior to eviction */
-    if (workingset)
-        inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
  out:
      rcu_read_unlock();
  }

Now I see that refaults for such pages are indeed accounted:

         workingset_refault   4987
        workingset_activate   590
         workingset_restore   4358
     workingset_nodereclaim   944

And the memdelay counters are actively incrementing too, indicating the
thrashing state:

[:~]$ cat /proc/memdelay
7539897381
63.22 63.19 44.58
14.36 15.11 11.80

So do you know what is the proper way to fix both issues?

--
Thanks,
Ruslan

On 10/27/2017 11:19 PM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC 
INC at Cisco) wrote:
> Hi Johannes,
>
> On 10/25/2017 08:54 PM, Johannes Weiner wrote:
>> Hi Ruslan,
>>
>> sorry about the delayed response, I missed the new activity in this
>> older thread.
>>
>> On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X 
>> (rruslich - GLOBALLOGIC INC at Cisco) wrote:
>>> Hi Johannes,
>>>
>>> Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
>>> version supported by us right now) and test a bit.
>>> The overall idea definitely looks promising, although I have one
>>> question on usage.
>>> Will it be able to account the time which processes spend handling
>>> major page faults (including fs and iowait time) of a refaulting page?
>> That's the main thing it should measure! :)
>>
>> The lock_page() and wait_on_page_locked() calls are where iowaits
>> happen on a cache miss. If those are refaults, they'll be counted.
>>
>>> We have one big application whose code occupies a large amount of the
>>> page cache, and when the system, under heavy memory usage, reclaims
>>> some of it, the application starts constantly thrashing. Since its
>>> code is placed on squashfs, it spends the whole CPU time decompressing
>>> the pages, and it seems the memdelay counters are not detecting this
>>> situation.
>>> Here are some counters to indicate this:
>>>
>>> 19:02:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
>>> 19:02:45        all      0.00      0.00    100.00      0.00      0.00      0.00
>>>
>>> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
>>> 19:02:45     15284.00      0.00    428.00    352.00  19990.00      0.00      0.00  15802.00      0.00
>>>
>>> And as nobody is actively allocating memory anymore, it looks like the
>>> memdelay counters are not actively incremented:
>>>
>>> [:~]$ cat /proc/memdelay
>>> 268035776
>>> 6.13 5.43 3.58
>>> 1.90 1.89 1.26
>> How does it correlate with /proc/vmstat::workingset_activate during
>> that time? It only counts thrashing time of refaults it can actively
>> detect.
> The workingset counters are growing quite actively too. Here are
> some numbers per second:
>
> workingset_refault   8201
> workingset_activate   389
> workingset_restore   187
> workingset_nodereclaim   313
>
>> Btw, how many CPUs does this system have? There is a bug in this
>> version on how idle time is aggregated across multiple CPUs. The error
>> compounds with the number of CPUs in the system.
> The system has 2 CPU cores.
>> I'm attaching 3 bugfixes that go on top of what you have. There might
>> be some conflicts, but they should be minor variable naming issues.
>>
> I will test with your patches and get back to you.
>
> Thanks,
> Ruslan


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Detecting page cache trashing state
  2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
@ 2017-11-27  2:18               ` Minchan Kim
  0 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2017-11-27  2:18 UTC (permalink / raw)
  To: Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
  Cc: Johannes Weiner, Taras Kondratiuk, Michal Hocko, linux-mm,
	xe-linux-external, linux-kernel

Hello,

On Mon, Nov 20, 2017 at 09:40:56PM +0200, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
> 
> I tested with your patches, but the situation is still mostly the same.
> 
> I spent some time debugging and found that the problem is squashfs
> specific (and probably affects some other filesystems too).
> The point is that the I/O wait for squashfs reads happens inside the
> squashfs readpage() callback.
> Here is a backtrace of the page fault handling to illustrate this:
> 
>  1)               |  handle_mm_fault() {
>  1)               |    filemap_fault() {
>  1)               |      __do_page_cache_readahead()
>  1)               |        add_to_page_cache_lru()
>  1)               |        squashfs_readpage() {
>  1)               |          squashfs_readpage_block() {
>  1)               |            squashfs_get_datablock() {
>  1)               |              squashfs_cache_get() {
>  1)               |                squashfs_read_data() {
>  1)               |                  ll_rw_block() {
>  1)               |                    submit_bh_wbc.isra.42()
>  1)               |                  __wait_on_buffer() {
>  1)               |                    io_schedule() {
>  ------------------------------------------
>  0)   kworker-79   =>    <idle>-0
>  ------------------------------------------
>  0)   0.382 us    |  blk_complete_request();
>  0)               |  blk_done_softirq() {
>  0)               |    blk_update_request() {
>  0)               |      end_buffer_read_sync()
>  0) + 38.559 us   |    }
>  0) + 48.367 us   |  }
>  ------------------------------------------
>  0)   kworker-79   =>  memhog-781
>  ------------------------------------------
>  0) ! 278.848 us  |                    }
>  0) ! 279.612 us  |                  }
>  0)               |                  squashfs_decompress() {
>  0) # 4919.082 us |                    squashfs_xz_uncompress();
>  0) # 4919.864 us |                  }
>  0) # 5479.212 us |                } /* squashfs_read_data */
>  0) # 5479.749 us |              } /* squashfs_cache_get */
>  0) # 5480.177 us |            } /* squashfs_get_datablock */
>  0)               |            squashfs_copy_cache() {
>  0)   0.057 us    |              unlock_page();
>  0) ! 142.773 us  |            }
>  0) # 5624.113 us |          } /* squashfs_readpage_block */
>  0) # 5628.814 us |        } /* squashfs_readpage */
>  0) # 5665.097 us |      } /* __do_page_cache_readahead */
>  0) # 5667.437 us |    } /* filemap_fault */
>  0) # 5672.880 us |  } /* handle_mm_fault */
> 
> As you can see, squashfs_read_data() submits the IO via ll_rw_block() and
> then waits for it to finish inside wait_on_buffer().
> After that the read buffer is decompressed and the page is unlocked inside
> the squashfs_readpage() handler.
> 
> Thus, by the time filemap_fault() calls lock_page_or_retry(), the page is
> already uptodate and unlocked, wait_on_page_bit() is not called at all,
> and the time spent on read/decompress is not accounted.

A weakness of the current approach is that it relies on the page lock.
It means it cannot work with synchronous devices like DAX, zram and
so on, I think.

Johannes, can we add memdelay_enter() to every fault handler's prologue,
and then check in the epilogue whether the faulted page is a workingset
page? If it was, we can accumulate the time spent.
That would work with synchronous devices, especially zram, without
hacking some FSes like squashfs.
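
Roughly something like the sketch below (it reuses the
memdelay_enter()/memdelay_leave() pair from Johannes' patches; the
VM_FAULT_WORKINGSET bit and memdelay_cancel() helper are hypothetical
names made up here just to show where the epilogue check would sit):

/* Sketch only, not against any real tree. */
static int faultin_page_accounted(struct vm_area_struct *vma,
				  unsigned long address, unsigned int flags)
{
	unsigned long mdflags;
	int ret;

	memdelay_enter(&mdflags);		/* prologue: start the clock */

	ret = handle_mm_fault(vma, address, flags);

	/*
	 * Epilogue: charge the elapsed time only if the fault path
	 * reported that it refaulted a workingset page (hypothetical
	 * VM_FAULT_WORKINGSET bit); otherwise drop the interval with a
	 * hypothetical memdelay_cancel().
	 */
	if (ret & VM_FAULT_WORKINGSET)
		memdelay_leave(&mdflags);
	else
		memdelay_cancel(&mdflags);

	return ret;
}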

I think the page fault handler/kswapd/direct reclaim would cover most
cases of *real* memory pressure, while the un[lock]_page friends would
also count superfluous cases; for example, FSes can call them easily
without any memory pressure.

> 
> I tried to apply a quick workaround for testing:
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index c4ca702..5e2be2b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -126,9 +126,21 @@ static int read_pages(struct address_space *mapping, struct file *filp,
> 
>      for (page_idx = 0; page_idx < nr_pages; page_idx++) {
>          struct page *page = lru_to_page(pages);
> +        bool refault = false;
> +        unsigned long mdflags;
> +
>          list_del(&page->lru);
> -        if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> +        if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
> +            if (!PageUptodate(page) && PageWorkingset(page)) {
> +                memdelay_enter(&mdflags);
> +                refault = true;
> +            }
> +
>              mapping->a_ops->readpage(filp, page);
> +
> +            if (refault)
> +                memdelay_leave(&mdflags);
> +        }
>          put_page(page);


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-11-27  2:18 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-15  0:16 Detecting page cache trashing state Taras Kondratiuk
2017-09-15 11:55 ` Zdenek Kabelac
2017-09-15 14:22 ` Daniel Walker
2017-09-15 16:38   ` Taras Kondratiuk
2017-09-15 17:31     ` Daniel Walker
2017-09-15 14:36 ` Michal Hocko
2017-09-15 17:28   ` Taras Kondratiuk
2017-09-18 16:34     ` Johannes Weiner
2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
2017-09-19 11:02       ` kbuild test robot
2017-09-28 15:49       ` Detecting page cache trashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-10-25 16:53         ` Daniel Walker
2017-10-25 17:54         ` Johannes Weiner
2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-11-27  2:18               ` Minchan Kim
2017-10-26  3:53         ` vinayak menon
2017-10-27 20:29           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-09-15 21:20   ` vcaputo
2017-09-15 23:40     ` Taras Kondratiuk
2017-09-18  5:55     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).