[PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
@ 2016-02-16  9:37 Xishi Qiu
  2016-02-16 17:38 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 10+ messages in thread
From: Xishi Qiu @ 2016-02-16  9:37 UTC (permalink / raw)
  To: Greg Kroah-Hartman, arve, riandrews, devel, zhong jiang; +Cc: LKML, Linux MM

Currently tasksize in lowmem_scan() only calculate rss, and not include swap.
But usually smart phones enable zram, so swap space actually use ram.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
---
 drivers/staging/android/lowmemorykiller.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 8b5a4a8..718ab8e 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -139,7 +139,10 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 			task_unlock(p);
 			continue;
 		}
-		tasksize = get_mm_rss(p->mm);
+		tasksize = get_mm_rss(p->mm) +
+			   get_mm_counter(p->mm, MM_SWAPENTS) +
+			   atomic_long_read(&p->mm->nr_ptes) +
+			   mm_nr_pmds(p->mm);
 		task_unlock(p);
 		if (tasksize <= 0)
 			continue;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
  2016-02-16  9:37 [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan() Xishi Qiu
@ 2016-02-16 17:38 ` Greg Kroah-Hartman
  2016-02-17  0:35   ` David Rientjes
  0 siblings, 1 reply; 10+ messages in thread
From: Greg Kroah-Hartman @ 2016-02-16 17:38 UTC (permalink / raw)
  To: Xishi Qiu; +Cc: arve, riandrews, devel, zhong jiang, LKML, Linux MM

On Tue, Feb 16, 2016 at 05:37:05PM +0800, Xishi Qiu wrote:
> Currently tasksize in lowmem_scan() only calculate rss, and not include swap.
> But usually smart phones enable zram, so swap space actually use ram.

Yes, but does that matter for this type of calculation?  I need an ack
from the android team before I could ever take such a core change to
this code...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
  2016-02-16 17:38 ` Greg Kroah-Hartman
@ 2016-02-17  0:35   ` David Rientjes
  2016-02-17  8:30     ` Xishi Qiu
                       ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: David Rientjes @ 2016-02-17  0:35 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Xishi Qiu, arve, riandrews, devel, zhong jiang, LKML, Linux MM

On Tue, 16 Feb 2016, Greg Kroah-Hartman wrote:

> On Tue, Feb 16, 2016 at 05:37:05PM +0800, Xishi Qiu wrote:
> > Currently tasksize in lowmem_scan() only calculate rss, and not include swap.
> > But usually smart phones enable zram, so swap space actually use ram.
> 
> Yes, but does that matter for this type of calculation?  I need an ack
> from the android team before I could ever take such a core change to
> this code...
> 

The calculation proposed in this patch is the same as the generic oom 
killer, it's an estimate of the amount of memory that will be freed if it 
is killed and can exit.  This is better than simply get_mm_rss().

However, I think we seriously need to re-consider the implementation of 
the lowmem killer entirely.  It currently abuses the use of TIF_MEMDIE, 
which should ideally only be set for one thread on the system since it 
allows unbounded access to global memory reserves.

It also abuses the user-visible /proc/self/oom_score_adj tunable: this 
tunable is used by the generic oom killer to bias or discount a proportion 
of memory from a process's usage.  This is the only supported semantic of 
the tunable.  The lowmem killer uses it as a strict prioritization, so any 
process with oom_score_adj higher than another process is preferred for 
kill, REGARDLESS of memory usage.  This leads to priority inversion, the 
user is unable to always define the same process to be killed by the 
generic oom killer and the lowmem killer.  This is what happens when a 
tunable with a very clear and defined purpose is used for other reasons.

I'd seriously consider not accepting any additional hacks on top of this 
code until the implementation is rewritten.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
  2016-02-17  0:35   ` David Rientjes
@ 2016-02-17  8:30     ` Xishi Qiu
  2016-02-17 22:42       ` David Rientjes
  2016-02-17 18:10     ` Michal Hocko
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Xishi Qiu @ 2016-02-17  8:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: Greg Kroah-Hartman, arve, riandrews, devel, zhong jiang, LKML, Linux MM

On 2016/2/17 8:35, David Rientjes wrote:

> On Tue, 16 Feb 2016, Greg Kroah-Hartman wrote:
> 
>> On Tue, Feb 16, 2016 at 05:37:05PM +0800, Xishi Qiu wrote:
>>> Currently tasksize in lowmem_scan() only calculate rss, and not include swap.
>>> But usually smart phones enable zram, so swap space actually use ram.
>>
>> Yes, but does that matter for this type of calculation?  I need an ack
>> from the android team before I could ever take such a core change to
>> this code...
>>
> 
> The calculation proposed in this patch is the same as the generic oom 
> killer, it's an estimate of the amount of memory that will be freed if it 
> is killed and can exit.  This is better than simply get_mm_rss().
> 
> However, I think we seriously need to re-consider the implementation of 
> the lowmem killer entirely.  It currently abuses the use of TIF_MEMDIE, 
> which should ideally only be set for one thread on the system since it 
> allows unbounded access to global memory reserves.
> 
> It also abuses the user-visible /proc/self/oom_score_adj tunable: this 
> tunable is used by the generic oom killer to bias or discount a proportion 
> of memory from a process's usage.  This is the only supported semantic of 
> the tunable.  The lowmem killer uses it as a strict prioritization, so any 
> process with oom_score_adj higher than another process is preferred for 
> kill, REGARDLESS of memory usage.  This leads to priority inversion, the 
> user is unable to always define the same process to be killed by the 
> generic oom killer and the lowmem killer.  This is what happens when a 
> tunable with a very clear and defined purpose is used for other reasons.
> 
> I'd seriously consider not accepting any additional hacks on top of this 
> code until the implementation is rewritten.
> 

Hi David,

Thanks for your advice.

I have a stupid question, what's the main difference between lmk and oom?
1) lmk is called when reclaim memory, and oom is called when alloc failed in slow path.
2) lmk has several lowmem thresholds and oom is not.
3) others?

Thanks,
Xishi Qiu

> .
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
  2016-02-17  0:35   ` David Rientjes
  2016-02-17  8:30     ` Xishi Qiu
@ 2016-02-17 18:10     ` Michal Hocko
  2016-02-18  6:51     ` Xishi Qiu
       [not found]     ` <CAF7GXvqr2dmc7CUcs_OmfYnEA9jE_Db4kGGG1HJyYYLhC6Bgew@mail.gmail.com>
  3 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-02-17 18:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Greg Kroah-Hartman, Xishi Qiu, arve, riandrews, devel,
	zhong jiang, LKML, Linux MM

On Tue 16-02-16 16:35:39, David Rientjes wrote:
> On Tue, 16 Feb 2016, Greg Kroah-Hartman wrote:
> 
> > On Tue, Feb 16, 2016 at 05:37:05PM +0800, Xishi Qiu wrote:
> > > Currently tasksize in lowmem_scan() only calculate rss, and not include swap.
> > > But usually smart phones enable zram, so swap space actually use ram.
> > 
> > Yes, but does that matter for this type of calculation?  I need an ack
> > from the android team before I could ever take such a core change to
> > this code...
> > 
> 
> The calculation proposed in this patch is the same as the generic oom 
> killer, it's an estimate of the amount of memory that will be freed if it 
> is killed and can exit.  This is better than simply get_mm_rss().
> 
> However, I think we seriously need to re-consider the implementation of 
> the lowmem killer entirely.  It currently abuses the use of TIF_MEMDIE, 
> which should ideally only be set for one thread on the system since it 
> allows unbounded access to global memory reserves.
> 
> It also abuses the user-visible /proc/self/oom_score_adj tunable: this 
> tunable is used by the generic oom killer to bias or discount a proportion 
> of memory from a process's usage.  This is the only supported semantic of 
> the tunable.  The lowmem killer uses it as a strict prioritization, so any 
> process with oom_score_adj higher than another process is preferred for 
> kill, REGARDLESS of memory usage.  This leads to priority inversion, the 
> user is unable to always define the same process to be killed by the 
> generic oom killer and the lowmem killer.  This is what happens when a 
> tunable with a very clear and defined purpose is used for other reasons.
> 
> I'd seriously consider not accepting any additional hacks on top of this 
> code until the implementation is rewritten.

Fully agreed!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
  2016-02-17  8:30     ` Xishi Qiu
@ 2016-02-17 22:42       ` David Rientjes
  0 siblings, 0 replies; 10+ messages in thread
From: David Rientjes @ 2016-02-17 22:42 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Greg Kroah-Hartman, arve, riandrews, devel, zhong jiang, LKML, Linux MM

On Wed, 17 Feb 2016, Xishi Qiu wrote:

> Hi David,
> 
> Thanks for your advice.
> 
> I have a stupid question, what's the main difference between lmk and oom?

Hi Xishi, it's not a stupid question at all!

Low memory killer appears to be implemented as a generic shrinker that 
iterates through the tasklist and tries to free memory before the generic 
oom killer.  It has two tunables, "adj" and "minfree": "minfree" describes 
what class of processes are eligible based on how many free pages are left 
on the system and "adj" defines that class by using oom_score_adj values.

So LMK is trying to free memory before all memory is depleted based on 
heuristics for systems that load the driver whereas the generic oom killer 
is called to kill a process when reclaim has failed to free any memory and 
there's no forward progress.

> 1) lmk is called when reclaim memory, and oom is called when alloc failed in slow path.

Yeah, and I don't think LMK provides any sort of guarantee against all 
memory being fully depleted before it can run, so it would probably be 
best effort.

> 2) lmk has several lowmem thresholds and oom is not.

Right, and it abuses oom_score_adj, which is a generic oom killer tunable 
to define priorities to kill at different levels of memory availability.

> 3) others?
> 

LMK also abuses TIF_MEMDIE which is used by the generic oom killer to 
allow a process to free memory.  Since the system is out of memory when it 
is called, a process often needs additional memory to even exit, so we set 
TIF_MEMDIE to ignore zone watermarks in the page allocator.  LMK should 
not be using this, there should already be memory available for it to 
allocate from.

To fix these issues with LMK, I think it should:

 - send SIGKILL to terminate a process in lowmem situations, but not
   set TIF_MEMDIE and implement its own way of determining when to kill
   additional processes, and

 - introduce its own tunable to define the priority of kill when it runs
   rather than oom_score_adj, which is a proportion of memory to bias
   against, not a priority at all.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
  2016-02-17  0:35   ` David Rientjes
  2016-02-17  8:30     ` Xishi Qiu
  2016-02-17 18:10     ` Michal Hocko
@ 2016-02-18  6:51     ` Xishi Qiu
  2016-02-23  0:54       ` David Rientjes
       [not found]     ` <CAF7GXvqr2dmc7CUcs_OmfYnEA9jE_Db4kGGG1HJyYYLhC6Bgew@mail.gmail.com>
  3 siblings, 1 reply; 10+ messages in thread
From: Xishi Qiu @ 2016-02-18  6:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Greg Kroah-Hartman, arve, riandrews, devel, zhong jiang, LKML, Linux MM

On 2016/2/17 8:35, David Rientjes wrote:

> On Tue, 16 Feb 2016, Greg Kroah-Hartman wrote:
> 
>> On Tue, Feb 16, 2016 at 05:37:05PM +0800, Xishi Qiu wrote:
>>> Currently tasksize in lowmem_scan() only calculate rss, and not include swap.
>>> But usually smart phones enable zram, so swap space actually use ram.
>>
>> Yes, but does that matter for this type of calculation?  I need an ack
>> from the android team before I could ever take such a core change to
>> this code...
>>
> 
> The calculation proposed in this patch is the same as the generic oom 
> killer, it's an estimate of the amount of memory that will be freed if it 
> is killed and can exit.  This is better than simply get_mm_rss().
> 
> However, I think we seriously need to re-consider the implementation of 
> the lowmem killer entirely.  It currently abuses the use of TIF_MEMDIE, 
> which should ideally only be set for one thread on the system since it 
> allows unbounded access to global memory reserves.
> 

Hi David,

Does somebody do the work of re-implementation of the lowmem killer entirely
now? Could you give me some details? e.g. when and how?

Here are another two questions.
1) lmk has several lowmem thresholds, it's "lowmem_minfree[]", and the value is
static definition, so is it reasonable for different memory size(e.g. 2G/3G/4G...)
of smart phones?
2) There are many adjustable arguments in /proc/sys/vm/, and the default value
maybe not benefit for smart phones, so any suggestions?

Thanks,
Xishi Qiu

> It also abuses the user-visible /proc/self/oom_score_adj tunable: this 
> tunable is used by the generic oom killer to bias or discount a proportion 
> of memory from a process's usage.  This is the only supported semantic of 
> the tunable.  The lowmem killer uses it as a strict prioritization, so any 
> process with oom_score_adj higher than another process is preferred for 
> kill, REGARDLESS of memory usage.  This leads to priority inversion, the 
> user is unable to always define the same process to be killed by the 
> generic oom killer and the lowmem killer.  This is what happens when a 
> tunable with a very clear and defined purpose is used for other reasons.
> 
> I'd seriously consider not accepting any additional hacks on top of this 
> code until the implementation is rewritten.
> 
> .
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
       [not found]     ` <CAF7GXvqr2dmc7CUcs_OmfYnEA9jE_Db4kGGG1HJyYYLhC6Bgew@mail.gmail.com>
@ 2016-02-18 10:21       ` Xishi Qiu
  2016-02-23  0:50         ` David Rientjes
  0 siblings, 1 reply; 10+ messages in thread
From: Xishi Qiu @ 2016-02-18 10:21 UTC (permalink / raw)
  To: Figo.zhang, David Rientjes
  Cc: Greg Kroah-Hartman, arve, riandrews, devel, zhong jiang, LKML, Linux MM

On 2016/2/18 15:55, Figo.zhang wrote:

> 
> 
> 2016-02-17 8:35 GMT+08:00 David Rientjes <rientjes@google.com <mailto:rientjes@google.com>>:
> 
>     On Tue, 16 Feb 2016, Greg Kroah-Hartman wrote:
> 
>     > On Tue, Feb 16, 2016 at 05:37:05PM +0800, Xishi Qiu wrote:
>     > > Currently tasksize in lowmem_scan() only calculate rss, and not include swap.
>     > > But usually smart phones enable zram, so swap space actually use ram.
>     >
>     > Yes, but does that matter for this type of calculation?  I need an ack
>     > from the android team before I could ever take such a core change to
>     > this code...
>     >
> 
>     The calculation proposed in this patch is the same as the generic oom
>     killer, it's an estimate of the amount of memory that will be freed if it
>     is killed and can exit.  This is better than simply get_mm_rss().
> 
>     However, I think we seriously need to re-consider the implementation of
>     the lowmem killer entirely.  It currently abuses the use of TIF_MEMDIE,
>     which should ideally only be set for one thread on the system since it
>     allows unbounded access to global memory reserves.
> 
> 
> 
> i don't understand why it need wait 1 second:
> 

Hi David,

How about kill more processes at one time?

Usually loading camera will alloc 300-500M memory immediately, so call lmk
repeatedly is a waste of time.

And can we reclaim memory at one time instead of reclaim-alloc-reclaim-alloc...
in this situation? e.g. use try_to_free_pages(), set nr_to_reclaim=300M

Thanks,
Xishi Qiu

> if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
>    time_before_eq(jiffies, lowmem_deathpending_timeout)) {
> task_unlock(p);
> rcu_read_unlock();
> return 0;                             <= why return rather than continue?
> }
> 
> and it will retry and wait many CPU times if one task holding the TIF_MEMDI.
>    shrink_slab_node()   
>        while()
>            shrinker->scan_objects();
>                      lowmem_scan()
>                                  if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
>                                        time_before_eq(jiffies, lowmem_deathpending_timeout)) 
> 
>  
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
  2016-02-18 10:21       ` Xishi Qiu
@ 2016-02-23  0:50         ` David Rientjes
  0 siblings, 0 replies; 10+ messages in thread
From: David Rientjes @ 2016-02-23  0:50 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Figo.zhang, Greg Kroah-Hartman, arve, riandrews, devel,
	zhong jiang, LKML, Linux MM

On Thu, 18 Feb 2016, Xishi Qiu wrote:

> How about kill more processes at one time?
> 
> Usually loading camera will alloc 300-500M memory immediately, so call lmk
> repeatedly is a waste of time.
> 
> And can we reclaim memory at one time instead of reclaim-alloc-reclaim-alloc...
> in this situation? e.g. use try_to_free_pages(), set nr_to_reclaim=300M
> 

I don't use the lmk myself and it's never been used on my phone, so I 
can't speak for the usecase.  However, killing more than one process at a 
time is generally a bad idea because it can allow processes to deplete 
memory reserves which may be small on these systems.  The lmk cannot be 
considered a hotpath, so waiting for a process to exit and free its memory 
for a small amount of time is generally not a bad trade-off when the 
alternative is to kill many processes (perhaps unnecessarily), open 
yourself up to livelock, and free memory that is potentially unneeded.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()
  2016-02-18  6:51     ` Xishi Qiu
@ 2016-02-23  0:54       ` David Rientjes
  0 siblings, 0 replies; 10+ messages in thread
From: David Rientjes @ 2016-02-23  0:54 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Greg Kroah-Hartman, arve, riandrews, devel, zhong jiang, LKML, Linux MM

On Thu, 18 Feb 2016, Xishi Qiu wrote:

> Does somebody do the work of re-implementation of the lowmem killer entirely
> now? Could you give me some details? e.g. when and how?
> 

I don't know of any plans or anybody working on reimplementing it to be 
correct, I simply don't think we should pile more patches on top of an 
already broken design.

> Here are another two questions.
> 1) lmk has several lowmem thresholds, it's "lowmem_minfree[]", and the value is
> static definition, so is it reasonable for different memory size(e.g. 2G/3G/4G...)
> of smart phones?

It looks like it is configurable from userspace and actually defaults to 
6MB, 8MB, 16MB, and 64MB in the kernel.  Reimplementing the lmk would have 
to take this design decision into consideration and replace it with 
something that would not cause existing userspace to break.

A sane implementation would be to do what vmpressure does in the kernel, 
and that is to signal userspace when certain thresholds are met.  
Userspace could then issue a SIGKILL to processes based on priority 
(/proc/pid/oom_score_adj is world-readable by default) and the oom killer 
will grant these processes access to memory reserves immediately if they 
cannot allocate the memory needed to exit.

> 2) There are many adjustable arguments in /proc/sys/vm/, and the default value
> maybe not benefit for smart phones, so any suggestions?
> 

I would assume that vm sysctls, like any other sysctls, would be tuned 
with initscripts depending on the configuration, if necessary.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-02-23  0:54 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-16  9:37 [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan() Xishi Qiu
2016-02-16 17:38 ` Greg Kroah-Hartman
2016-02-17  0:35   ` David Rientjes
2016-02-17  8:30     ` Xishi Qiu
2016-02-17 22:42       ` David Rientjes
2016-02-17 18:10     ` Michal Hocko
2016-02-18  6:51     ` Xishi Qiu
2016-02-23  0:54       ` David Rientjes
     [not found]     ` <CAF7GXvqr2dmc7CUcs_OmfYnEA9jE_Db4kGGG1HJyYYLhC6Bgew@mail.gmail.com>
2016-02-18 10:21       ` Xishi Qiu
2016-02-23  0:50         ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).