linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
@ 2015-10-01 10:48 Pintu Kumar
  2015-10-01 13:29 ` Anshuman Khandual
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Pintu Kumar @ 2015-10-01 10:48 UTC (permalink / raw)
  To: akpm, minchan, dave, pintu.k, mhocko, koct9i, rientjes, hannes,
	penguin-kernel, bywxiaobai, mgorman, vbabka, js1304,
	kirill.shutemov, alexander.h.duyck, sasha.levin, cl,
	fengguang.wu, linux-kernel, linux-mm
  Cc: cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr, c.rajkumar,
	sreenathd

This patch maintains the number of OOM invocations and the number of OOM
kills in /proc/vmstat.
These counters are helpful during sluggishness, aging, or long-duration tests.
Currently, when an OOM happens, it can only be seen in the kernel ring buffer.
But during long-duration tests, the dmesg buffer and /var/log/messages* may
be overwritten.
So, just like the other counters, OOM events can also be maintained in
/proc/vmstat.
They can also be read when all kernel logging is disabled.

A snapshot of the result of an overnight test is shown below:
$ cat /proc/vmstat
oom_stall 610
oom_kill_count 1763

Here, oom_stall indicates that the kernel entered the OOM path 610 times,
while around 1763 OOM kills happened.
OOM is bad for any system, so these counters can help the developer tune
the memory requirements, at least during initial bring-up.

Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
---
 include/linux/vm_event_item.h |    2 ++
 mm/oom_kill.c                 |    2 ++
 mm/page_alloc.c               |    2 +-
 mm/vmstat.c                   |    2 ++
 4 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef8..ade0851 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -57,6 +57,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
+		OOM_STALL,
+		OOM_KILL_COUNT,
 		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
 		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
 		UNEVICTABLE_PGRESCUED,	/* rescued from noreclaim list */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 03b612b..e79caed 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -570,6 +570,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	count_vm_event(OOM_KILL_COUNT);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -600,6 +601,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 				task_pid_nr(p), p->comm);
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+			count_vm_event(OOM_KILL_COUNT);
 		}
 	rcu_read_unlock();
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9bcfd70..1d82210 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2761,7 +2761,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
-
+	count_vm_event(OOM_STALL);
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886..f054265 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -808,6 +808,8 @@ const char * const vmstat_text[] = {
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
 #endif
+	"oom_stall",
+	"oom_kill_count",
 	"unevictable_pgs_culled",
 	"unevictable_pgs_scanned",
 	"unevictable_pgs_rescued",
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-01 10:48 [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter Pintu Kumar
@ 2015-10-01 13:29 ` Anshuman Khandual
  2015-10-05  6:19   ` PINTU KUMAR
  2015-10-01 13:38 ` Michal Hocko
  2015-10-12 13:33 ` [PATCH 1/1] mm: vmstat: Add OOM victims " Pintu Kumar
  2 siblings, 1 reply; 20+ messages in thread
From: Anshuman Khandual @ 2015-10-01 13:29 UTC (permalink / raw)
  To: Pintu Kumar, akpm, minchan, dave, mhocko, koct9i, rientjes,
	hannes, penguin-kernel, bywxiaobai, mgorman, vbabka, js1304,
	kirill.shutemov, alexander.h.duyck, sasha.levin, cl,
	fengguang.wu, linux-kernel, linux-mm
  Cc: cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr, c.rajkumar,
	sreenathd

On 10/01/2015 04:18 PM, Pintu Kumar wrote:
> This patch maintains number of oom calls and number of oom kill
> count in /proc/vmstat.
> It is helpful during sluggish, aging or long duration tests.
> Currently if the OOM happens, it can be only seen in kernel ring buffer.
> But during long duration tests, all the dmesg and /var/log/messages* could
> be overwritten.
> So, just like other counters, the oom can also be maintained in
> /proc/vmstat.
> It can be also seen if all logs are disabled in kernel.

Makes sense.

> 
> A snapshot of the result of over night test is shown below:
> $ cat /proc/vmstat
> oom_stall 610
> oom_kill_count 1763
> 
> Here, oom_stall indicates that there are 610 times, kernel entered into OOM
> cases. However, there were around 1763 oom killing happens.
> The OOM is bad for the any system. So, this counter can help the developer
> in tuning the memory requirement at least during initial bringup.

Can you please fix the formatting of the commit message above ?

> 
> Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
> ---
>  include/linux/vm_event_item.h |    2 ++
>  mm/oom_kill.c                 |    2 ++
>  mm/page_alloc.c               |    2 +-
>  mm/vmstat.c                   |    2 ++
>  4 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 2b1cef8..ade0851 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -57,6 +57,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> +		OOM_STALL,
> +		OOM_KILL_COUNT,

Removing the COUNT suffix will be better and in sync with the others.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-01 10:48 [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter Pintu Kumar
  2015-10-01 13:29 ` Anshuman Khandual
@ 2015-10-01 13:38 ` Michal Hocko
  2015-10-05  6:12   ` PINTU KUMAR
  2015-10-12 13:33 ` [PATCH 1/1] mm: vmstat: Add OOM victims " Pintu Kumar
  2 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2015-10-01 13:38 UTC (permalink / raw)
  To: Pintu Kumar
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

On Thu 01-10-15 16:18:43, Pintu Kumar wrote:
> This patch maintains number of oom calls and number of oom kill
> count in /proc/vmstat.
> It is helpful during sluggish, aging or long duration tests.
> Currently if the OOM happens, it can be only seen in kernel ring buffer.
> But during long duration tests, all the dmesg and /var/log/messages* could
> be overwritten.
> So, just like other counters, the oom can also be maintained in
> /proc/vmstat.
> It can be also seen if all logs are disabled in kernel.
> 
> A snapshot of the result of over night test is shown below:
> $ cat /proc/vmstat
> oom_stall 610
> oom_kill_count 1763
> 
> Here, oom_stall indicates that there are 610 times, kernel entered into OOM
> cases. However, there were around 1763 oom killing happens.

This alone looks quite suspicious. Unless you have tasks which share the
address space without being in the same thread group, this shouldn't
happen on such a large scale.
</me looks into the patch>
And indeed the patch is incorrect. You are only counting OOMs from the
page allocator slow path. You are missing all the OOM invocations from
the page fault path.
The placement inside __alloc_pages_may_oom looks quite arbitrary as
well. You are not counting events where we are OOM but somebody else is
holding the oom_mutex, yet you do count the last attempt before going really
OOM. Then we have cases which do not invoke the OOM killer but which are
counted into oom_stall as well. I am not sure whether they should be, because
I am not quite sure about the semantics of the counter in the first place.
What is it supposed to tell us? How many times the system had to go into
emergency OOM steps? How many times the direct reclaim didn't make any
progress so we can consider the system OOM?

oom_kill_count is a slightly misleading name because it suggests how
many times oom_kill was called, but in fact it counts the OOM victims.
Not sure whether this information is that useful, but the semantics are
clear at least.

> The OOM is bad for the any system. So, this counter can help the developer
> in tuning the memory requirement at least during initial bringup.
> 
> Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
> ---
>  include/linux/vm_event_item.h |    2 ++
>  mm/oom_kill.c                 |    2 ++
>  mm/page_alloc.c               |    2 +-
>  mm/vmstat.c                   |    2 ++
>  4 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 2b1cef8..ade0851 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -57,6 +57,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> +		OOM_STALL,
> +		OOM_KILL_COUNT,
>  		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
>  		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
>  		UNEVICTABLE_PGRESCUED,	/* rescued from noreclaim list */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 03b612b..e79caed 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -570,6 +570,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	 * space under its control.
>  	 */
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	count_vm_event(OOM_KILL_COUNT);
>  	mark_oom_victim(victim);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> @@ -600,6 +601,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  				task_pid_nr(p), p->comm);
>  			task_unlock(p);
>  			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> +			count_vm_event(OOM_KILL_COUNT);
>  		}
>  	rcu_read_unlock();
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9bcfd70..1d82210 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2761,7 +2761,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		schedule_timeout_uninterruptible(1);
>  		return NULL;
>  	}
> -
> +	count_vm_event(OOM_STALL);
>  	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1fd0886..f054265 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -808,6 +808,8 @@ const char * const vmstat_text[] = {
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
>  #endif
> +	"oom_stall",
> +	"oom_kill_count",
>  	"unevictable_pgs_culled",
>  	"unevictable_pgs_scanned",
>  	"unevictable_pgs_rescued",
> -- 
> 1.7.9.5

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-01 13:38 ` Michal Hocko
@ 2015-10-05  6:12   ` PINTU KUMAR
  2015-10-05 12:22     ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-05  6:12 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

Hi,

> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: Thursday, October 01, 2015 7:09 PM
> To: Pintu Kumar
> Cc: akpm@linux-foundation.org; minchan@kernel.org; dave@stgolabs.net;
> koct9i@gmail.com; rientjes@google.com; hannes@cmpxchg.org; penguin-
> kernel@i-love.sakura.ne.jp; bywxiaobai@163.com; mgorman@suse.de;
> vbabka@suse.cz; js1304@gmail.com; kirill.shutemov@linux.intel.com;
> alexander.h.duyck@redhat.com; sasha.levin@oracle.com; cl@linux.com;
> fengguang.wu@intel.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com;
> sreenathd@samsung.com
> Subject: Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
> 
> On Thu 01-10-15 16:18:43, Pintu Kumar wrote:
> > This patch maintains number of oom calls and number of oom kill count
> > in /proc/vmstat.
> > It is helpful during sluggish, aging or long duration tests.
> > Currently if the OOM happens, it can be only seen in kernel ring buffer.
> > But during long duration tests, all the dmesg and /var/log/messages*
> > could be overwritten.
> > So, just like other counters, the oom can also be maintained in
> > /proc/vmstat.
> > It can be also seen if all logs are disabled in kernel.
> >
> > A snapshot of the result of over night test is shown below:
> > $ cat /proc/vmstat
> > oom_stall 610
> > oom_kill_count 1763
> >
> > Here, oom_stall indicates that there are 610 times, kernel entered
> > into OOM cases. However, there were around 1763 oom killing happens.
> 
> This alone looks quite suspicious. Unless you have tasks which share the
> address space without being in the same thread group this shouldn't happen
> in such a large scale.

Yes, this accounts for out_of_memory calls even from memory cgroups.
Please check a few snapshots of the dmesg output captured during overnight
tests.
........
[49479.078033]  [2:      xxxxxxxx:20874] Memory cgroup out of memory: Kill
process 20880 (xxxxxxx) score 112 or sacrifice child
[49480.910430]  [2:      xxxxxxxx:20882] Memory cgroup out of memory: Kill
process 20888 (xxxxxxxx) score 112 or sacrifice child
[49567.046203]  [0:        yyyyyyy:  548] Out of memory: Kill process 20458
(zzzzzzzzzz) score 102 or sacrifice child
[49567.346588]  [0:        yyyyyyy:  548] Out of memory: Kill process 21102
(zzzzzzzzzz) score 104 or sacrifice child
.........
The _out of memory_ count in dmesg dump output exactly matches the number in
/proc/vmstat -> oom_kill_count

> </me looks into the patch>
> And indeed the patch is incorrect. You are only counting OOMs from the page
> allocator slow path. You are missing all the OOM invocations from the page
> fault path.

Sorry, I am not sure what exactly you mean. Please point out the places I am
missing.
Actually, I tried to add it at a generic place, that is, oom_kill_process,
which is called by out_of_memory(...).
Are you talking about pagefault_out_of_memory(...)?
But that already calls out_of_memory, no?

> The placement inside __alloc_pages_may_oom looks quite arbitrary as well. You
> are not counting events where we are OOM but somebody is holding the
> oom_mutex but you do count last attempt before going really OOM. Then we
> have cases which do not invoke OOM killer which are counted into oom_stall as
> well. I am not sure whether they should because I am not quite sure about the
> semantic of the counter in the first place.

Ok. Yes, it can be added right after entering __alloc_pages_may_oom.
I will make the changes.
Actually, I knowingly skipped the oom_lock case, because in our 3.10 kernel we
had note_oom_kill(..) added right after this check.
So, I also added it at exactly the same place.
Ok, I can make the necessary changes if the oom_lock case also matters.

> What is it supposed to tell us? How many times the system had to go into
> emergency OOM steps? How many times the direct reclaim didn't make any
> progress so we can consider the system OOM?
> 
Yes, exactly; oom_stall can tell how many times OOM is invoked in the system.
It can also tell how many times direct reclaim fails completely.
Currently, we don't have any counter for direct reclaim success/failure.
Also, oom_kill_process will not be invoked for higher orders
(above PAGE_ALLOC_COSTLY_ORDER).
But the allocation will still enter the OOM path and end up as a straight page
allocation failure.

> oom_kill_count has a slightly misleading names because it suggests how many
> times oom_kill was called but in fact it counts the oom victims.
> Not sure whether this information is so much useful but the semantic is clear
> at least.
> 
Ok, I agree about the semantics of the name oom_kill_count.
If possible, please suggest a better name.
How about the following names?
oom_victim_count ?
oom_nr_killed ?
oom_nr_victim ?

> > The OOM is bad for the any system. So, this counter can help the
> > developer in tuning the memory requirement at least during initial bringup.
> >
> > Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
> > ---
> >  include/linux/vm_event_item.h |    2 ++
> >  mm/oom_kill.c                 |    2 ++
> >  mm/page_alloc.c               |    2 +-
> >  mm/vmstat.c                   |    2 ++
> >  4 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/vm_event_item.h
> > b/include/linux/vm_event_item.h index 2b1cef8..ade0851 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -57,6 +57,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN,
> > PSWPOUT,  #ifdef CONFIG_HUGETLB_PAGE
> >  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,  #endif
> > +		OOM_STALL,
> > +		OOM_KILL_COUNT,
> >  		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
> >  		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
> >  		UNEVICTABLE_PGRESCUED,	/* rescued from noreclaim list */
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 03b612b..e79caed
> > 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -570,6 +570,7 @@ void oom_kill_process(struct oom_control *oc, struct
> task_struct *p,
> >  	 * space under its control.
> >  	 */
> >  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> > +	count_vm_event(OOM_KILL_COUNT);
> >  	mark_oom_victim(victim);
> >  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-
> rss:%lukB\n",
> >  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> @@
> > -600,6 +601,7 @@ void oom_kill_process(struct oom_control *oc, struct
> task_struct *p,
> >  				task_pid_nr(p), p->comm);
> >  			task_unlock(p);
> >  			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> > +			count_vm_event(OOM_KILL_COUNT);
> >  		}
> >  	rcu_read_unlock();
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9bcfd70..1d82210
> > 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2761,7 +2761,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned
> int order,
> >  		schedule_timeout_uninterruptible(1);
> >  		return NULL;
> >  	}
> > -
> > +	count_vm_event(OOM_STALL);
> >  	/*
> >  	 * Go through the zonelist yet one more time, keep very high watermark
> >  	 * here, this is only to catch a parallel oom killing, we must fail
> > if diff --git a/mm/vmstat.c b/mm/vmstat.c index 1fd0886..f054265
> > 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -808,6 +808,8 @@ const char * const vmstat_text[] = {
> >  	"htlb_buddy_alloc_success",
> >  	"htlb_buddy_alloc_fail",
> >  #endif
> > +	"oom_stall",
> > +	"oom_kill_count",
> >  	"unevictable_pgs_culled",
> >  	"unevictable_pgs_scanned",
> >  	"unevictable_pgs_rescued",
> > --
> > 1.7.9.5
> 
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-01 13:29 ` Anshuman Khandual
@ 2015-10-05  6:19   ` PINTU KUMAR
  0 siblings, 0 replies; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-05  6:19 UTC (permalink / raw)
  To: 'Anshuman Khandual',
	akpm, minchan, dave, mhocko, koct9i, rientjes, hannes,
	penguin-kernel, bywxiaobai, mgorman, vbabka, js1304,
	kirill.shutemov, alexander.h.duyck, sasha.levin, cl,
	fengguang.wu, linux-kernel, linux-mm
  Cc: cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr, c.rajkumar,
	sreenathd

Hi,

> -----Original Message-----
> From: Anshuman Khandual [mailto:khandual@linux.vnet.ibm.com]
> Sent: Thursday, October 01, 2015 7:00 PM
> To: Pintu Kumar; akpm@linux-foundation.org; minchan@kernel.org;
> dave@stgolabs.net; mhocko@suse.cz; koct9i@gmail.com; rientjes@google.com;
> hannes@cmpxchg.org; penguin-kernel@i-love.sakura.ne.jp;
> bywxiaobai@163.com; mgorman@suse.de; vbabka@suse.cz; js1304@gmail.com;
> kirill.shutemov@linux.intel.com; alexander.h.duyck@redhat.com;
> sasha.levin@oracle.com; cl@linux.com; fengguang.wu@intel.com; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org
> Cc: cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com;
> sreenathd@samsung.com
> Subject: Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
> 
> On 10/01/2015 04:18 PM, Pintu Kumar wrote:
> > This patch maintains number of oom calls and number of oom kill count
> > in /proc/vmstat.
> > It is helpful during sluggish, aging or long duration tests.
> > Currently if the OOM happens, it can be only seen in kernel ring buffer.
> > But during long duration tests, all the dmesg and /var/log/messages*
> > could be overwritten.
> > So, just like other counters, the oom can also be maintained in
> > /proc/vmstat.
> > It can be also seen if all logs are disabled in kernel.
> 
> Makes sense.
> 
> >
> > A snapshot of the result of over night test is shown below:
> > $ cat /proc/vmstat
> > oom_stall 610
> > oom_kill_count 1763
> >
> > Here, oom_stall indicates that there are 610 times, kernel entered
> > into OOM cases. However, there were around 1763 oom killing happens.
> > The OOM is bad for the any system. So, this counter can help the
> > developer in tuning the memory requirement at least during initial bringup.
> 
> Can you please fix the formatting of the commit message above ?
> 
Not sure if there is any formatting issue here; I cannot see it.
checkpatch returns no errors/warnings.
Please point out exactly where the issue is.

> >
> > Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
> > ---
> >  include/linux/vm_event_item.h |    2 ++
> >  mm/oom_kill.c                 |    2 ++
> >  mm/page_alloc.c               |    2 +-
> >  mm/vmstat.c                   |    2 ++
> >  4 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/vm_event_item.h
> > b/include/linux/vm_event_item.h index 2b1cef8..ade0851 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -57,6 +57,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN,
> > PSWPOUT,  #ifdef CONFIG_HUGETLB_PAGE
> >  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,  #endif
> > +		OOM_STALL,
> > +		OOM_KILL_COUNT,
> 
> Removing the COUNT will be better and in sync with others.

Ok, this was also suggested by Michal Hocko and is being discussed in another
thread.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-05  6:12   ` PINTU KUMAR
@ 2015-10-05 12:22     ` Michal Hocko
  2015-10-06  6:59       ` PINTU KUMAR
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2015-10-05 12:22 UTC (permalink / raw)
  To: PINTU KUMAR
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

On Mon 05-10-15 11:42:49, PINTU KUMAR wrote:
[...]
> > > A snapshot of the result of over night test is shown below:
> > > $ cat /proc/vmstat
> > > oom_stall 610
> > > oom_kill_count 1763
> > >
> > > Here, oom_stall indicates that there are 610 times, kernel entered
> > > into OOM cases. However, there were around 1763 oom killing happens.
> > 
> > This alone looks quite suspicious. Unless you have tasks which share the
> > address
> > space without being in the same thread group this shouldn't happen in such a
> > large scale.
> 
> Yes, this accounts for out_of_memory even from memory cgroups.
> Please check few snapshots of dmesg outputs captured during over-night tests.

OK, that would explain why the second counter is so much larger than
oom_stall. And that alone should have been a red flag IMO. Why should
memcg OOM killer events be accounted together with the global ones? How do you
distinguish the two?

> ........
> [49479.078033]  [2:      xxxxxxxx:20874] Memory cgroup out of memory: Kill
> process 20880 (xxxxxxx) score 112 or sacrifice child
> [49480.910430]  [2:      xxxxxxxx:20882] Memory cgroup out of memory: Kill
> process 20888 (xxxxxxxx) score 112 or sacrifice child
> [49567.046203]  [0:        yyyyyyy:  548] Out of memory: Kill process 20458
> (zzzzzzzzzz) score 102 or sacrifice child
> [49567.346588]  [0:        yyyyyyy:  548] Out of memory: Kill process 21102
> (zzzzzzzzzz) score 104 or sacrifice child
> .........
> The _out of memory_ count in dmesg dump output exactly matches the number in
> /proc/vmstat -> oom_kill_count
> 
> > </me looks into the patch>
> > And indeed the patch is incorrect. You are only counting OOMs from the page
> > allocator slow path. You are missing all the OOM invocations from the page
> > fault
> > path.
> 
> Sorry, I am not sure what exactly you mean. Please point me out if I am missing
> some places.
> Actually, I tried to add it at generic place that is; oom_kill_process, which is
> called by out_of_memory(...).
> Are you talking about: pagefault_out_of_memory(...) ?
> But, this is already calling: out_of_memory. No?

Sorry, I wasn't clear enough here. I was talking about the oom_stall counter
here, not the oom_kill_count one.

[...]
> > What is it supposed to tell us? How many times the system had to go into
> > emergency OOM steps? How many times the direct reclaim didn't make any
> > progress so we can consider the system OOM?
> > 
> Yes, exactly, oom_stall can tell, how many times OOM is invoked in the system.
> Yes, it can also tell how many times direct_reclaim fails completely.
> Currently, we don't have any counter for direct_reclaim success/fail.

So why don't we add one? Direct reclaim failure is a clearly defined
event and it also can be evaluated reasonably against allocstall.

> Also, oom_kill_process will not be invoked for higher orders
> (PAGE_ALLOC_COSTLY_ORDER).
> But, it will enter OOM and results into straight page allocation failure.

Yes, there are other reasons not to invoke the OOM killer, or to prevent the
actual killing if chances are high that we can go without it. This is the
reason I am asking about the exact semantics.

> > oom_kill_count has a slightly misleading names because it suggests how many
> > times oom_kill was called but in fact it counts the oom victims.
> > Not sure whether this information is so much useful but the semantic is clear
> > at least.
> > 
> Ok, agree about the semantic of the name: oom_kill_count.
> If possible please suggest a better name.
> How about the following names?
> oom_victim_count ?
> oom_nr_killed ?
> oom_nr_victim ?

nr_oom_victims?

I am still not sure how useful this counter would be, though. Sure, the
log ringbuffer might overflow (the risk can be reduced by reducing the
loglevel), but how much would it help to know that we had an additional N
OOM victims? From my experience, checking the OOM reports which are still
in the logbuffer is sufficient to see whether there is a memory leak,
pinned memory or continuous memory pressure. Your experience might be
different, so it would be nice to mention that in the changelog.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-05 12:22     ` Michal Hocko
@ 2015-10-06  6:59       ` PINTU KUMAR
  2015-10-06 15:41         ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-06  6:59 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

Hi,

> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: Monday, October 05, 2015 5:53 PM
> To: PINTU KUMAR
> Cc: akpm@linux-foundation.org; minchan@kernel.org; dave@stgolabs.net;
> koct9i@gmail.com; rientjes@google.com; hannes@cmpxchg.org; penguin-
> kernel@i-love.sakura.ne.jp; bywxiaobai@163.com; mgorman@suse.de;
> vbabka@suse.cz; js1304@gmail.com; kirill.shutemov@linux.intel.com;
> alexander.h.duyck@redhat.com; sasha.levin@oracle.com; cl@linux.com;
> fengguang.wu@intel.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com;
> sreenathd@samsung.com
> Subject: Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
> 
> On Mon 05-10-15 11:42:49, PINTU KUMAR wrote:
> [...]
> > > > A snapshot of the result of over night test is shown below:
> > > > $ cat /proc/vmstat
> > > > oom_stall 610
> > > > oom_kill_count 1763
> > > >
> > > > Here, oom_stall indicates that there are 610 times, kernel entered
> > > > into OOM cases. However, there were around 1763 oom killing happens.
> > >
> > > This alone looks quite suspicious. Unless you have tasks which share
> > > the address space without being in the same thread group this
> > > shouldn't happen in such a large scale.
> >
> > Yes, this accounts for out_of_memory even from memory cgroups.
> > Please check few snapshots of dmesg outputs captured during over-night
> > tests.
> 
> OK, that would explain why the second counter is so much larger than
> oom_stall.
> And that alone should have been a red flag IMO. Why should be memcg OOM
> killer events accounted together with the global? How do you distinguish the
> two?
> 
Actually, here we are just interested in knowing that an OOM kill happened,
be it global, memcg or otherwise.
Once we know OOM kills are happening, we can easily find the cause by enabling
logs.
Normally, in a production system, all system logs are disabled.

> > ........
> > [49479.078033]  [2:      xxxxxxxx:20874] Memory cgroup out of memory: Kill
> > process 20880 (xxxxxxx) score 112 or sacrifice child
> > [49480.910430]  [2:      xxxxxxxx:20882] Memory cgroup out of memory: Kill
> > process 20888 (xxxxxxxx) score 112 or sacrifice child
> > [49567.046203]  [0:        yyyyyyy:  548] Out of memory: Kill process 20458
> > (zzzzzzzzzz) score 102 or sacrifice child
> > [49567.346588]  [0:        yyyyyyy:  548] Out of memory: Kill process 21102
> > (zzzzzzzzzz) score 104 or sacrifice child .........
> > The _out of memory_ count in dmesg dump output exactly matches the
> > number in /proc/vmstat -> oom_kill_count
> >
> > > </me looks into the patch>
> > > And indeed the patch is incorrect. You are only counting OOMs from
> > > the page allocator slow path. You are missing all the OOM
> > > invocations from the page fault path.
> >
> > Sorry, I am not sure what exactly you mean. Please point me out if I
> > am missing some places.
> > Actually, I tried to add it at generic place that is;
> > oom_kill_process, which is called by out_of_memory(...).
> > Are you talking about: pagefault_out_of_memory(...) ?
> > But, this is already calling: out_of_memory. No?
> 
> Sorry, I wasn't clear enough here. I was talking about oom_stall counter here
> not oom_kill_count one.
> 
Ok, I got your point.
oom_kill_process is called from 2 places:
1) out_of_memory
2) mem_cgroup_out_of_memory

And out_of_memory is actually called from 3 places:
1) __alloc_pages_may_oom
2) pagefault_out_of_memory
3) moom_callback (sysrq.c)

Thus, in this case, the oom_stall counter can be added in 4 places (at the
beginning of each):
1) __alloc_pages_may_oom
2) mem_cgroup_out_of_memory
3) pagefault_out_of_memory
4) moom_callback (sysrq.c)

For cases {2,3,4}, we could have counted at a single place in out_of_memory,
but that would result in counting it twice, because __alloc_pages_may_oom also
calls out_of_memory.
If there is any better idea, please let me know.
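For illustration only, a minimal sketch of the single-place variant, assuming
it is acceptable to count once at the top of out_of_memory() (so the allocator
slow path is not counted a second time); OOM_STALL is the event proposed by
this patch, everything else is elided:

	/* mm/oom_kill.c -- sketch only, not the posted patch */
	bool out_of_memory(struct oom_control *oc)
	{
		/* ... existing early checks (oom_killer_disabled etc.) ... */

		/*
		 * Count every entry into the OOM path exactly once, no matter
		 * whether we got here from the allocator slow path, the page
		 * fault path or the sysrq handler.
		 */
		count_vm_event(OOM_STALL);

		/* ... existing victim selection and oom_kill_process() ... */
	}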

> [...]
> > > What is it supposed to tell us? How many times the system had to go
> > > into emergency OOM steps? How many times the direct reclaim didn't
> > > make any progress so we can consider the system OOM?
> > >
> > Yes, exactly, oom_stall can tell, how many times OOM is invoked in the
> > system.
> > Yes, it can also tell how many times direct_reclaim fails completely.
> > Currently, we don't have any counter for direct_reclaim success/fail.
> 
> So why don't we add one? Direct reclaim failure is a clearly defined event and
> it also can be evaluated reasonably against allocstall.
> 
Yes, direct_reclaim success/fail is also planned ahead.
Maybe something like:
direct_reclaim_alloc_success
direct_reclaim_alloc_fail

But then I thought oom_kill is more important than this, so I pushed this one
first.

> > Also, oom_kill_process will not be invoked for higher orders
> > (PAGE_ALLOC_COSTLY_ORDER).
> > But, it will enter OOM and results into straight page allocation failure.
> 
> Yes there are other reasons to not invoke OOM killer or to prevent actual
> killing if chances are high we can go without it. This is the reason I am
> asking about the exact semantic.
> 
> > > oom_kill_count has a slightly misleading names because it suggests
> > > how many times oom_kill was called but in fact it counts the oom victims.
> > > Not sure whether this information is so much useful but the semantic
> > > is clear at least.
> > >
> > Ok, agree about the semantic of the name: oom_kill_count.
> > If possible please suggest a better name.
> > How about the following names?
> > oom_victim_count ?
> > oom_nr_killed ?
> > oom_nr_victim ?
> 
> nr_oom_victims?
> 
Ok, nr_oom_victims is also a nice name. If all agree, I can change to this
name. Please confirm.

> I am still not sure how useful this counter would be, though. Sure the log
> ringbuffer might overflow (the risk can be reduced by reducing the
> loglevel) but how much it would help to know that we had additional N OOM
> victims? From my experience checking the OOM reports which are still in the
> logbuffer are sufficient to see whether there is a memory leak, pinned memory
> or a continuous memory pressure. Your experience might be different so it
> would be nice to mention that in the changelog.

Ok.
As I said earlier, normally all logs are disabled in a production system.
But we can still access /proc/vmstat. The OOM would have happened in the
system earlier, but the logs would have been over-written.
The /proc/vmstat counters are then the only place which can tell whether the
system ever entered an OOM case.
Once we know for sure that OOM happened in the system, we can enable all logs
in the system to reproduce the OOM scenarios and analyze further.
It can also help in the initial tuning of the system for its memory needs.
In the embedded world, we normally try to keep the system from entering kernel
OOM as far as possible.
For example, in Android, we have the LMK (low memory killer) driver that
controls the OOM behavior. But most of the time these LMK thresholds are
statically controlled.
Now, with this OOM counter, we can control the LMK behavior dynamically.
For example, in LMK we can check whether oom_stall ever becomes non-zero,
which means the system is hitting the OOM state. At this stage we can
immediately trigger the killing from user space or the LMK driver.
A similar use case and requirement exists for Tizen, which controls OOM from
user space (without LMK).
It can also point to sluggish behavior in the system during a long run.
These are just a few use cases; more can be thought of.
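For illustration only (not part of the patch), a small user-space monitor
along these lines could poll the proposed counter from /proc/vmstat; the
counter name is the one introduced here, everything else is a hypothetical
sketch:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: read one counter from /proc/vmstat by name. */
static unsigned long read_vmstat(const char *name)
{
	char key[64];
	unsigned long val = 0, v;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lu", key, &v) == 2) {
		if (!strcmp(key, name)) {
			val = v;
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long last = read_vmstat("oom_kill_count");

	for (;;) {
		unsigned long now = read_vmstat("oom_kill_count");

		if (now > last) {
			printf("kernel OOM kills so far: %lu\n", now);
			/* a user-space LMK could react here, e.g. kill early */
			last = now;
		}
		sleep(5);
	}
	return 0;
}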

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-06  6:59       ` PINTU KUMAR
@ 2015-10-06 15:41         ` Michal Hocko
  2015-10-07 14:48           ` PINTU KUMAR
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2015-10-06 15:41 UTC (permalink / raw)
  To: PINTU KUMAR
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

On Tue 06-10-15 12:29:52, PINTU KUMAR wrote:
[...]
> > OK, that would explain why the second counter is so much larger than
> > oom_stall.
> > And that alone should have been a red flag IMO. Why should be memcg OOM
> > killer events accounted together with the global? How do you distinguish the
> > two?
> > 
> Actually, here, we are just interested in knowing oom_kill. Let it be either
> global, memcg or others.
> Once we know there are oom kill happening, we can easily find it by enabling
> logs.
> Normally in production system, all system logs will be disabled.

This doesn't make much sense to me. So you find out that _an oom killer_
was invoked but you have logs disabled. What now? You can hardly find
out what has happened and why it has happened. What is the point then?
Wait for another one to come? That might never happen.

What is even more confusing is the mixing of memcg and global OOM
conditions. They are really different things. The memcg API will even give
you a notification about the OOM event.

[...]
> > Sorry, I wasn't clear enough here. I was talking about oom_stall counter here
> > not
> > oom_kill_count one.
> > 
> Ok, I got your point.
> Oom_kill_process, is called from 2 places:
> 1) out_of_memory
> 2) mem_cgroup_out_of_memory
> 
> And, out_of_memory is actually called from 3 places:
> 1) alloc_pages_may_oom
> 2) pagefault_out_of_memory
> 3) moom_callback (sysirq.c)
> 
> Thus, in this case, the oom_stall counter can be added in 4 places (in the
> beginning).
> 1) alloc_pages_may_oom
> 2) mem_cgroup_out_of_memory
> 3) pagefault_out_of_memory
> 4) moom_callback (sysirq.c)
> 
> For, case {2,3,4}, we could have actually called at one place in out_of_memory,

Why would you even consider 4 for oom_stall? This is an administrator
order to kill a memory hog. The system might be in good shape, just the
memory hog is misbehaving. I realize this is not a usual use case, but if
oom_stall is supposed to measure memory pressure of some sort then
binding it to a user action is the wrong thing to do.

> But this result into calling it 2 times because alloc_pages_may_oom also call
> out_of_memory.
> If there is any better idea, please let me know.

I think you are focusing too much on the implementation before you are
clear on what the desired semantics should be.

> > > > What is it supposed to tell us? How many times the system had to go
> > > > into emergency OOM steps? How many times the direct reclaim didn't
> > > > make any progress so we can consider the system OOM?
> > > >
> > > Yes, exactly, oom_stall can tell, how many times OOM is invoked in the
> > > system.
> > > Yes, it can also tell how many times direct_reclaim fails completely.
> > > Currently, we don't have any counter for direct_reclaim success/fail.
> > 
> > So why don't we add one? Direct reclaim failure is a clearly defined event and
> > it
> > also can be evaluated reasonably against allocstall.
> > 
> Yes, direct_reclaim success/fail is also planned ahead.
> May be something like:
> direct_reclaim_alloc_success
> direct_reclaim_alloc_fail

We already have allocstall, so allocstall_noprogress or whatever better
name should be sufficient.
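For illustration only (the event name is just the one floated above and is
hypothetical), such a counter could be bumped where the slow path already
looks at the direct reclaim result, roughly:

	/* mm/page_alloc.c slow path -- sketch only, hypothetical event name */
	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
					    &did_some_progress);
	if (!did_some_progress)
		/* direct reclaim ran but reclaimed nothing at all */
		count_vm_event(ALLOCSTALL_NOPROGRESS);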

[...]

> > I am still not sure how useful this counter would be, though. Sure the log
> > ringbuffer might overflow (the risk can be reduced by reducing the
> > loglevel) but how much it would help to know that we had additional N OOM
> > victims? From my experience checking the OOM reports which are still in the
> > logbuffer are sufficient to see whether there is a memory leak, pinned memory
> > or a continuous memory pressure. Your experience might be different so it
> > would be nice to mention that in the changelog.
> 
> Ok. 
> As I said earlier, normally all logs will be disabled in production system.
> But, we can access /proc/vmstat. The oom would have happened in the system
> Earlier, but the logs would have over-written.
> The /proc/vmstat is the only counter which can tell, if ever system entered into
> oom cases.
> Once we know for sure that oom happened in the system, then we can enable all
> logs in the system to reproduce the oom scenarios to analyze further.

Why is reducing the loglevel not sufficient here? The output should be
considerably reduced and the chances of overflowing the ringbuffer reduced as
well.

> Also it can help in initial tuning of the system for the memory needs of the
> system.
> In embedded world, we normally try to avoid the system to enter into kernel OOM
> as far as possible.

Which means that you should follow a completely different metric IMO.
oom_stall is way too late. It fires at the time when no reclaim progress could
be made and we are already OOM.

> For example, in Android, we have LMK (low memory killer) driver that controls
> the OOM behavior. But most of the time these LMK threshold are statically
> controlled.
>
> Now with this oom counter we can dynamically control the LMK behavior.
> For example, in LMK we can check, if ever oom_stall becomes 1, that means system
> is hitting OOM state. At this stage we can immediately trigger the OOM killing
> from user space or LMK driver.

If you see oom_stall then you are basically OOM and the global OOM
killer will fire. Intervening with another party just sounds like a
terrible idea to me.

> Similar user case and requirement is there for Tizen that controls OOM from user
> space (without LMK).

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-06 15:41         ` Michal Hocko
@ 2015-10-07 14:48           ` PINTU KUMAR
  2015-10-08 14:18             ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-07 14:48 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

Hi,

> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: Tuesday, October 06, 2015 9:12 PM
> To: PINTU KUMAR
> Cc: akpm@linux-foundation.org; minchan@kernel.org; dave@stgolabs.net;
> koct9i@gmail.com; rientjes@google.com; hannes@cmpxchg.org; penguin-
> kernel@i-love.sakura.ne.jp; bywxiaobai@163.com; mgorman@suse.de;
> vbabka@suse.cz; js1304@gmail.com; kirill.shutemov@linux.intel.com;
> alexander.h.duyck@redhat.com; sasha.levin@oracle.com; cl@linux.com;
> fengguang.wu@intel.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com;
> sreenathd@samsung.com
> Subject: Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
> 
> On Tue 06-10-15 12:29:52, PINTU KUMAR wrote:
> [...]
> > > OK, that would explain why the second counter is so much larger than
> > > oom_stall.
> > > And that alone should have been a red flag IMO. Why should be memcg
> > > OOM killer events accounted together with the global? How do you
> > > distinguish the two?
> > >
> > Actually, here, we are just interested in knowing oom_kill. Let it be
> > either global, memcg or others.
> > Once we know there are oom kill happening, we can easily find it by
> > enabling logs.
> > Normally in production system, all system logs will be disabled.
> 
> This doesn't make much sense to me. So you find out that _an oom killer_ was
> invoked but you have logs disabled. What now? You can hardly find out what
> has happened and why it has happened. What is the point then?
> Wait for another one to come? This might be never.
> 
Ok, let me explain the real case that we have experienced.
In our case, we have a low memory killer in user space itself that is invoked
based on some memory threshold.
Something like: below a 100MB threshold, start killing until the system comes
back to 150MB.
During our long-duration ageing test (more than 72 hours) we observed that
many applications were killed.
Now, we were not sure whether the killing happened in user space or kernel
space.
When we looked at the kernel logs, they were spread over many files such as
/var/log/{messages, messages.0, messages.1, messages.2, messages.3, etc.},
but none of the logs contained kernel OOM messages, although there were some
LMK kills in user space.
Then, in another round of tests, we kept dumping the _dmesg_ output to a file
after each iteration.
After 3 days of tests, this time we observed that the dmesg output dump
contained many kernel OOM messages.
Now, this dumping is not feasible every time. And instead of counting manually
in a log file, we wanted to know the number of OOM kills that happened during
these tests.
So we decided to add a counter in /proc/vmstat to track kernel OOM kills, and
to monitor it during our ageing test.
Basically, we wanted to tune our user-space LMK killer for different threshold
values, so that we can completely avoid the kernel OOM kill.
So, just by looking at this counter, we were able to tune the LMK threshold
values without depending on the kernel log messages.

Also, on most of the systems /var/log/messages is not present and we just
depend on the kernel dmesg output, which is pretty small for longer runs.
Even if we reduce the loglevel to 4, it may not be enough to capture all logs.

> What is even more confusing is the mixing of memcg and global oom conditions.
> They are really different things. Memcg API will even give you notification
> about the OOM event.
> 
Ok, you are suggesting dividing the oom_kill counter into 2 parts (global &
memcg)?
Maybe something like:
nr_oom_victims
nr_memcg_oom_victims

> [...]
> > > Sorry, I wasn't clear enough here. I was talking about oom_stall
> > > counter here not oom_kill_count one.
> > >
> > Ok, I got your point.
> > Oom_kill_process, is called from 2 places:
> > 1) out_of_memory
> > 2) mem_cgroup_out_of_memory
> >
> > And, out_of_memory is actually called from 3 places:
> > 1) alloc_pages_may_oom
> > 2) pagefault_out_of_memory
> > 3) moom_callback (sysirq.c)
> >
> > Thus, in this case, the oom_stall counter can be added in 4 places (in
> > the beginning).
> > 1) alloc_pages_may_oom
> > 2) mem_cgroup_out_of_memory
> > 3) pagefault_out_of_memory
> > 4) moom_callback (sysirq.c)
> >
> > For, case {2,3,4}, we could have actually called at one place in
> > out_of_memory,
> 
> Why would you even consider 4 for oom_stall? This is an administrator order to
> kill a memory hog. The system might be in a good shape just the memory hog is
> misbehaving. I realize this is not a usual usecase but if oom_stall is
> supposed to
> measure a memory pressure of some sort then binding it to a user action is
> wrong thing to do.
> 
I think oom_stall is not so important, so we can drop it. It also creates
confusion with memcg and others and makes things more complicated, so I am
thinking of removing it.
The more important thing is nr_oom_victims.
I think this should be sufficient.

> > But this result into calling it 2 times because alloc_pages_may_oom
> > also call out_of_memory.
> > If there is any better idea, please let me know.
> 
> I think you are focusing too much on the implementation before you are clear
> in what should be the desired semantic.
> 
> > > > > What is it supposed to tell us? How many times the system had to
> > > > > go into emergency OOM steps? How many times the direct reclaim
> > > > > didn't make any progress so we can consider the system OOM?
> > > > >
> > > > Yes, exactly, oom_stall can tell, how many times OOM is invoked in
> > > > the system.
> > > > Yes, it can also tell how many times direct_reclaim fails completely.
> > > > Currently, we don't have any counter for direct_reclaim success/fail.
> > >
> > > So why don't we add one? Direct reclaim failure is a clearly defined
> > > event and it also can be evaluated reasonably against allocstall.
> > >
> > Yes, direct_reclaim success/fail is also planned ahead.
> > May be something like:
> > direct_reclaim_alloc_success
> > direct_reclaim_alloc_fail
> 
> We already have alloc_stall so all_stall_noprogress or whatever better name
> should be sufficient.
> 
Ok, we can discuss this later and finalize the name.

> [...]
> 
> > > I am still not sure how useful this counter would be, though. Sure
> > > the log ringbuffer might overflow (the risk can be reduced by
> > > reducing the
> > > loglevel) but how much it would help to know that we had additional
> > > N OOM victims? From my experience checking the OOM reports which are
> > > still in the logbuffer are sufficient to see whether there is a
> > > memory leak, pinned memory or a continuous memory pressure. Your
> > > experience might be different so it would be nice to mention that in the
> changelog.
> >
> > Ok.
> > As I said earlier, normally all logs will be disabled in production system.
> > But, we can access /proc/vmstat. The oom would have happened in the
> > system Earlier, but the logs would have over-written.
> > The /proc/vmstat is the only counter which can tell, if ever system
> > entered into oom cases.
> > Once we know for sure that oom happened in the system, then we can
> > enable all logs in the system to reproduce the oom scenarios to analyze
> further.
> 
> Why reducing the loglevel is not sufficient here? The output should be
> considerably reduced and chances to overflow the ringbuffer reduced as well.
> 
I think I explained it above.
On most of the systems /var/log/messages is not present and we just depend on
the kernel dmesg output, which is pretty small for longer runs.

> > Also it can help in initial tuning of the system for the memory needs
> > of the system.
> > In embedded world, we normally try to avoid the system to enter into
> > kernel OOM as far as possible.
> 
> Which means that you should follow a completely different metric IMO.
> oom_stall is way too late. It is at the time when no reclaim progress could be
> done and we are OOM already.
> 
> > For example, in Android, we have LMK (low memory killer) driver that
> > controls the OOM behavior. But most of the time these LMK threshold
> > are statically controlled.
> >
> > Now with this oom counter we can dynamically control the LMK behavior.
> > For example, in LMK we can check, if ever oom_stall becomes 1, that
> > means system is hitting OOM state. At this stage we can immediately
> > trigger the OOM killing from user space or LMK driver.
> 
> If you see oom_stall then you are basically OOM and the global OOM killer will
> fire. Intervening with other party just sounds like a terrible idea to me.
> 
> > Similar user case and requirement is there for Tizen that controls OOM
> > from user space (without LMK).
> 
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-07 14:48           ` PINTU KUMAR
@ 2015-10-08 14:18             ` Michal Hocko
  2015-10-08 16:06               ` PINTU KUMAR
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2015-10-08 14:18 UTC (permalink / raw)
  To: PINTU KUMAR
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

On Wed 07-10-15 20:18:16, PINTU KUMAR wrote:
[...]
> Ok, let me explain the real case that we have experienced.
> In our case, we have low memory killer in user space itself that invoked based
> on some memory threshold.
> Something like, below 100MB threshold starting killing until it comes back to
> 150MB.
> During our long duration ageing test (more than 72 hours) we observed that many
> applications are killed.
> Now, we were not sure if killing happens in user space or kernel space.
> When we saw the kernel logs, it generated many logs such as;
> /var/log/{messages, messages.0, messages.1, messages.2, messages.3, etc.}
> But, none of the logs contains kernel OOM messages. Although there were some LMK
> kill in user space.
> Then in another round of test we keep dumping _dmesg_ output to a file after
> each iteration.
> After 3 days of tests this time we observed that dmesg output dump contains many
> kernel oom messages.

I am confused. So you suspect that the OOM report didn't get to
/var/log/messages while it was in dmesg?

> Now, every time this dumping is not feasible. And instead of counting manually
> in log file, we wanted to know number of oom kills happened during this tests.
> So we decided to add a counter in /proc/vmstat to track the kernel oom_kill, and
> monitor it during our ageing test.
>
> Basically, we wanted to tune our user space LMK killer for different threshold
> values, so that we can completely avoid the kernel oom kill.
> So, just by looking into this counter, we could able to tune the LMK threshold
> values without depending on the kernel log messages.

Wouldn't a trace point suit you better for this particular use case
considering this is a testing environment?
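For illustration only: the event below is hypothetical (no such tracepoint is
claimed to exist in this tree), but it sketches what a tracepoint-based
approach could look like, e.g. next to the existing events in
include/trace/events/oom.h:

/* Hypothetical tracepoint, sketch only */
TRACE_EVENT(oom_victim_killed,

	TP_PROTO(pid_t pid, const char *comm),

	TP_ARGS(pid, comm),

	TP_STRUCT__entry(
		__field(pid_t, pid)
		__array(char, comm, TASK_COMM_LEN)
	),

	TP_fast_assign(
		__entry->pid = pid;
		memcpy(__entry->comm, comm, TASK_COMM_LEN);
	),

	TP_printk("pid=%d comm=%s", __entry->pid, __entry->comm)
);

The test harness could then enable it through debugfs and count hits from
trace_pipe during the run, without any log scraping.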
 
> Also, in most of the system /var/log/messages are not present and we just
> depends on kernel dmesg output, which is petty small for longer run.
> Even if we reduce the loglevel to 4, it may not be suitable to capture all logs.

Hmm, I would consider a logless system considerably crippled, but I see
your point and I can imagine that especially small devices might try
to save every single byte of storage. Such a system is basically
undebuggable IMO, but it still might be interesting to see OOM killer
traces.
 
> > What is even more confusing is the mixing of memcg and global oom
> > conditions.  They are really different things. Memcg API will even
> > give you notification about the OOM event.
> > 
> Ok, you are suggesting to divide the oom_kill counter into 2 parts (global &
> memcg) ?
> May be something like:
> nr_oom_victims
> nr_memcg_oom_victims

You do not need the latter. The memcg interface already provides you with a
notification API, and if a counter is _really_ needed then it should be
per-memcg, not a global cumulative number.
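For reference, a minimal sketch of the (cgroup v1) memcg OOM notification
mentioned above, assuming a group mounted at /sys/fs/cgroup/memory/mygroup
(the path is only an example):

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Sketch: get notified about memcg OOM events instead of a global counter. */
int main(void)
{
	char buf[64];
	uint64_t events;
	int efd = eventfd(0, 0);
	int ofd = open("/sys/fs/cgroup/memory/mygroup/memory.oom_control",
		       O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
		       O_WRONLY);

	if (efd < 0 || ofd < 0 || cfd < 0) {
		perror("setup");
		return 1;
	}
	/* register: "<eventfd> <fd of memory.oom_control>" */
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	if (write(cfd, buf, strlen(buf)) < 0) {
		perror("register");
		return 1;
	}
	/* blocks until the group hits an OOM condition */
	while (read(efd, &events, sizeof(events)) == sizeof(events))
		printf("memcg OOM events: %llu\n",
		       (unsigned long long)events);
	return 0;
}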
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-08 14:18             ` Michal Hocko
@ 2015-10-08 16:06               ` PINTU KUMAR
  2015-10-08 16:30                 ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-08 16:06 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

Hi,

Thank you very much for your reply and comments.

> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: Thursday, October 08, 2015 7:49 PM
> To: PINTU KUMAR
> Cc: akpm@linux-foundation.org; minchan@kernel.org; dave@stgolabs.net;
> koct9i@gmail.com; rientjes@google.com; hannes@cmpxchg.org; penguin-
> kernel@i-love.sakura.ne.jp; bywxiaobai@163.com; mgorman@suse.de;
> vbabka@suse.cz; js1304@gmail.com; kirill.shutemov@linux.intel.com;
> alexander.h.duyck@redhat.com; sasha.levin@oracle.com; cl@linux.com;
> fengguang.wu@intel.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com;
> sreenathd@samsung.com
> Subject: Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
> 
> On Wed 07-10-15 20:18:16, PINTU KUMAR wrote:
> [...]
> > Ok, let me explain the real case that we have experienced.
> > In our case, we have low memory killer in user space itself that
> > invoked based on some memory threshold.
> > Something like, below 100MB threshold starting killing until it comes
> > back to 150MB.
> > During our long duration ageing test (more than 72 hours) we observed
> > that many applications are killed.
> > Now, we were not sure if killing happens in user space or kernel space.
> > When we saw the kernel logs, it generated many logs such as;
> > /var/log/{messages, messages.0, messages.1, messages.2, messages.3,
> > etc.} But, none of the logs contains kernel OOM messages. Although
> > there were some LMK kill in user space.
> > Then in another round of test we keep dumping _dmesg_ output to a file
> > after each iteration.
> > After 3 days of tests this time we observed that dmesg output dump
> > contains many kernel oom messages.
> 
> I am confused. So you suspect that the OOM report didn't get to
> /var/log/messages while it was in dmesg?

No, I mean that all the /var/log/messages files were overwritten (after 3
days), or were cleared due to storage space constraints, so the oom kill
logs were not visible.
So, in our ageing test scripts, we keep dumping the dmesg output during
the tests, roughly like this:
# repeated for more than 300 iterations; APP_LIST is a placeholder for
# the list of test applications
for app in $APP_LIST
do
	$app &                            # launch the application from cmdline
	sleep 10
	dmesg -c >> /var/log/dmesg.log    # append and clear the kernel ring buffer
done
After 3 days, when we analyzed the dump, we found that dmesg.log contained
some OOM messages, whereas these OOM logs were not found in
/var/log/messages.
Maybe we do heavy logging, because in the ageing test we enable maximum
functionality (WiFi, BT, GPS, fully loaded system).

I hope it is clear now. If not, please ask me for more information.

> 
> > Now, every time this dumping is not feasible. And instead of counting
> > manually in log file, we wanted to know number of oom kills happened during
> this tests.
> > So we decided to add a counter in /proc/vmstat to track the kernel
> > oom_kill, and monitor it during our ageing test.
> >
> > Basically, we wanted to tune our user space LMK killer for different
> > threshold values, so that we can completely avoid the kernel oom kill.
> > So, just by looking into this counter, we could able to tune the LMK
> > threshold values without depending on the kernel log messages.
> 
> Wouldn't a trace point suit you better for this particular use case
considering this
> is a testing environment?
> 
Tracing for the oom_kill count?
Actually, tracing-related configs are normally disabled in the release binary,
and it is not always feasible to perform tracing for such long duration tests.
By that argument, the same would hold for the other counters as well.

> > Also, in most of the system /var/log/messages are not present and we
> > just depends on kernel dmesg output, which is petty small for longer run.
> > Even if we reduce the loglevel to 4, it may not be suitable to capture all
logs.
> 
> Hmm, I would consider a logless system considerably crippled but I see your
> point and I can imagine that especially small devices might try to save every
> single B of the storage. Such a system is basically undebugable IMO but it
still
> might be interesting to see OOM killer traces.
> 
Exactly, some of the small embedded systems might have 512MB, 256MB, 128MB,
or even less RAM, and the storage space will be 8GB or below.
In such a system we cannot afford heavy log files, and exact tuning and
stability are most important.
Even the tracing / profiling configs will be disabled to the lowest level
to reduce the kernel code size as well.

> > > What is even more confusing is the mixing of memcg and global oom
> > > conditions.  They are really different things. Memcg API will even
> > > give you notification about the OOM event.
> > >
> > Ok, you are suggesting to divide the oom_kill counter into 2 parts
> > (global &
> > memcg) ?
> > May be something like:
> > nr_oom_victims
> > nr_memcg_oom_victims
> 
> You do not need the later. Memcg interface already provides you with a
> notification API and if a counter is _really_ needed then it should be
per-memcg
> not a global cumulative number.

Ok, for memory cgroups, do you mean this one?
sh-3.2# cat /sys/fs/cgroup/memory/memory.oom_control
oom_kill_disable 0
under_oom 0

I am actually a bit confused about what to do next.
Shall I push a new patch set with just the nr_oom_victims counter?

Please let me know if more information is missing, or if you have any
more suggestions; I would be really glad to hear them.
Thank you very much for all your suggestions and review so far.


> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-08 16:06               ` PINTU KUMAR
@ 2015-10-08 16:30                 ` Michal Hocko
  2015-10-09 12:59                   ` PINTU KUMAR
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2015-10-08 16:30 UTC (permalink / raw)
  To: PINTU KUMAR
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

On Thu 08-10-15 21:36:24, PINTU KUMAR wrote:
[...]
> Whereas, these OOM logs were not found in /var/log/messages.
> May be we do heavy logging because in ageing test we enable maximum
> functionality (Wifi, BT, GPS, fully loaded system).

If you swamp your logs so heavily that even critical messages won't make
it into the log files then your logging is basically useless for
anything serious. But that is not really that important.

> Hope, it is clear now. If not, please ask me for more information.
> 
> > 
> > > Now, every time this dumping is not feasible. And instead of counting
> > > manually in log file, we wanted to know number of oom kills happened during
> > this tests.
> > > So we decided to add a counter in /proc/vmstat to track the kernel
> > > oom_kill, and monitor it during our ageing test.
> > >
> > > Basically, we wanted to tune our user space LMK killer for different
> > > threshold values, so that we can completely avoid the kernel oom kill.
> > > So, just by looking into this counter, we could able to tune the LMK
> > > threshold values without depending on the kernel log messages.
> > 
> > Wouldn't a trace point suit you better for this particular use case
> > considering this
> > is a testing environment?
> > 
> Tracing for oom_kill count?
> Actually, tracing related configs will be normally disabled in release binary.

Yes but your use case described a testing environment.

> And it is not always feasible to perform tracing for such long duration tests.

I do not see why long duration would be a problem. Each tracepoint can
be enabled separately.

> Then it should be valid for other counters as well.
> 
> > > Also, in most of the system /var/log/messages are not present and we
> > > just depends on kernel dmesg output, which is petty small for longer run.
> > > Even if we reduce the loglevel to 4, it may not be suitable to capture all
> logs.
> > 
> > Hmm, I would consider a logless system considerably crippled but I see your
> > point and I can imagine that especially small devices might try to save every
> > single B of the storage. Such a system is basically undebugable IMO but it
> still
> > might be interesting to see OOM killer traces.
> > 
> Exactly, some of the small embedded systems might be having 512MB, 256MB, 128MB,
> or even lesser.
> Also, the storage space will be 8GB or below.
> In such a system we cannot afford heavy log files and exact tuning and stability
> is most important.

And that is what log level is for. If your logs are heavy with error
levels then you are far from being production ready... ;)

> Even all tracing / profiling configs will be disabled to lowest level for
> reducing kernel code size as well.

What level is that? crit? Is err really that noisy?
 
[...]
> > > Ok, you are suggesting to divide the oom_kill counter into 2 parts
> > > (global &
> > > memcg) ?
> > > May be something like:
> > > nr_oom_victims
> > > nr_memcg_oom_victims
> > 
> > You do not need the later. Memcg interface already provides you with a
> > notification API and if a counter is _really_ needed then it should be
> > per-memcg
> > not a global cumulative number.
> 
> Ok, for memory cgroups, you mean to say this one?
> sh-3.2# cat /sys/fs/cgroup/memory/memory.oom_control
> oom_kill_disable 0
> under_oom 0

Yes this is the notification API.
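
Roughly, a minimal sketch of how that registration looks (assuming cgroup v1
with the memory controller mounted at /sys/fs/cgroup/memory and a pre-created
group, here called "mygroup" purely for illustration): the eventfd is
registered against memory.oom_control via cgroup.event_control, and the read
blocks until that memcg hits OOM.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	/* "mygroup" is a placeholder memcg; adjust the paths as needed */
	const char *oom  = "/sys/fs/cgroup/memory/mygroup/memory.oom_control";
	const char *ctrl = "/sys/fs/cgroup/memory/mygroup/cgroup.event_control";
	char reg[64];
	uint64_t events;
	int efd = eventfd(0, 0);
	int ofd = open(oom, O_RDONLY);
	int cfd = open(ctrl, O_WRONLY);

	if (efd < 0 || ofd < 0 || cfd < 0) {
		perror("setup");
		return 1;
	}
	/* registration string: "<eventfd> <fd of memory.oom_control>" */
	snprintf(reg, sizeof(reg), "%d %d", efd, ofd);
	if (write(cfd, reg, strlen(reg)) < 0) {
		perror("register");
		return 1;
	}
	/* blocks until the memcg enters OOM, no polling required */
	if (read(efd, &events, sizeof(events)) == sizeof(events))
		printf("memcg OOM notifications: %llu\n",
		       (unsigned long long)events);
	return 0;
}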

> I am actually confused here what to do next?
> Shall I push a new patch set with just:
> nr_oom_victims counter ?

Yes, you can repost with a better description of typical usage
scenarios. I cannot say I would be completely sold to this because
the only relevant usecase I've heard so far is the logless system
which is pretty much a corner case. This is not a reason to nack it
though. It is definitely better than the original oom_stall suggestion
because it has a clear semantic at least.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
  2015-10-08 16:30                 ` Michal Hocko
@ 2015-10-09 12:59                   ` PINTU KUMAR
  0 siblings, 0 replies; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-09 12:59 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: akpm, minchan, dave, koct9i, rientjes, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd


> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: Thursday, October 08, 2015 10:01 PM
> To: PINTU KUMAR
> Cc: akpm@linux-foundation.org; minchan@kernel.org; dave@stgolabs.net;
> koct9i@gmail.com; rientjes@google.com; hannes@cmpxchg.org; penguin-
> kernel@i-love.sakura.ne.jp; bywxiaobai@163.com; mgorman@suse.de;
> vbabka@suse.cz; js1304@gmail.com; kirill.shutemov@linux.intel.com;
> alexander.h.duyck@redhat.com; sasha.levin@oracle.com; cl@linux.com;
> fengguang.wu@intel.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com;
> sreenathd@samsung.com
> Subject: Re: [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter
> 
> On Thu 08-10-15 21:36:24, PINTU KUMAR wrote:
> [...]
> > Whereas, these OOM logs were not found in /var/log/messages.
> > May be we do heavy logging because in ageing test we enable maximum
> > functionality (Wifi, BT, GPS, fully loaded system).
> 
> If you swamp your logs so heavily that even critical messages won't make it
into
> the log files then your logging is basically useless for anything serious. But
that is
> not really that important.
> 
> > Hope, it is clear now. If not, please ask me for more information.
> >
> > >
> > > > Now, every time this dumping is not feasible. And instead of
> > > > counting manually in log file, we wanted to know number of oom
> > > > kills happened during
> > > this tests.
> > > > So we decided to add a counter in /proc/vmstat to track the kernel
> > > > oom_kill, and monitor it during our ageing test.
> > > >
> > > > Basically, we wanted to tune our user space LMK killer for
> > > > different threshold values, so that we can completely avoid the kernel
oom
> kill.
> > > > So, just by looking into this counter, we could able to tune the
> > > > LMK threshold values without depending on the kernel log messages.
> > >
> > > Wouldn't a trace point suit you better for this particular use case
> > > considering this is a testing environment?
> > >
> > Tracing for oom_kill count?
> > Actually, tracing related configs will be normally disabled in release
binary.
> 
> Yes but your use case described a testing environment.
> 
> > And it is not always feasible to perform tracing for such long duration
tests.
> 
> I do not see why long duration would be a problem. Each tracepoint can be
> enabled separatelly.
> 
> > Then it should be valid for other counters as well.
> >
> > > > Also, in most of the system /var/log/messages are not present and
> > > > we just depends on kernel dmesg output, which is petty small for longer
> run.
> > > > Even if we reduce the loglevel to 4, it may not be suitable to
> > > > capture all
> > logs.
> > >
> > > Hmm, I would consider a logless system considerably crippled but I
> > > see your point and I can imagine that especially small devices might
> > > try to save every single B of the storage. Such a system is
> > > basically undebugable IMO but it
> > still
> > > might be interesting to see OOM killer traces.
> > >
> > Exactly, some of the small embedded systems might be having 512MB,
> > 256MB, 128MB, or even lesser.
> > Also, the storage space will be 8GB or below.
> > In such a system we cannot afford heavy log files and exact tuning and
> > stability is most important.
> 
> And that is what log level is for. If your logs are heavy with error levels
then you
> are far from being production ready... ;)
> 
> > Even all tracing / profiling configs will be disabled to lowest level
> > for reducing kernel code size as well.
> 
> What level is that? crit? Is err really that noisy?
> 
No, I was talking about kernel configs. Normally we keep some profiling/tracing
related configs disabled on low memory systems, to save some kernel code size.
The point is that it is not always easy for all systems to depend heavily on
logging and tracing; otherwise, the other counters would also not be required.
We thought that the /proc/vmstat output (which is available in practically all
systems, small or big, embedded or not) could quickly tell us what has really
happened.

> [...]
> > > > Ok, you are suggesting to divide the oom_kill counter into 2 parts
> > > > (global &
> > > > memcg) ?
> > > > May be something like:
> > > > nr_oom_victims
> > > > nr_memcg_oom_victims
> > >
> > > You do not need the later. Memcg interface already provides you with
> > > a notification API and if a counter is _really_ needed then it
> > > should be per-memcg not a global cumulative number.
> >
> > Ok, for memory cgroups, you mean to say this one?
> > sh-3.2# cat /sys/fs/cgroup/memory/memory.oom_control
> > oom_kill_disable 0
> > under_oom 0
> 
> Yes this is the notification API.
> 
> > I am actually confused here what to do next?
> > Shall I push a new patch set with just:
> > nr_oom_victims counter ?
> 
> Yes you can repost with a better description about a typical usage scenarios.
I
> cannot say I would be completely sold to this because the only relevant
usecase
> I've heard so far is the logless system which is pretty much a corner case.
This is
> not a reason to nack it though. It is definitely better than the original
oom_stall
> suggestion because it has a clear semantic at least.

Ok, thank you very much for your suggestions.
I agree, oom_stall is not so important.
I will try to submit a new patch set with only _nr_oom_victims_, along with
a description of the use cases that I came across.
If anybody else can point out other use cases, please let me know.
I will be happy to try them and share the results.

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat counter
  2015-10-01 10:48 [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter Pintu Kumar
  2015-10-01 13:29 ` Anshuman Khandual
  2015-10-01 13:38 ` Michal Hocko
@ 2015-10-12 13:33 ` Pintu Kumar
  2015-10-12 14:28   ` [RESEND PATCH " Pintu Kumar
  2015-10-12 14:44   ` [PATCH " PINTU KUMAR
  2 siblings, 2 replies; 20+ messages in thread
From: Pintu Kumar @ 2015-10-12 13:33 UTC (permalink / raw)
  To: akpm, minchan, dave, pintu.k, mhocko, koct9i, rientjes, hannes,
	penguin-kernel, bywxiaobai, mgorman, vbabka, js1304,
	kirill.shutemov, alexander.h.duyck, sasha.levin, cl,
	fengguang.wu, linux-kernel, linux-mm
  Cc: cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr, c.rajkumar,
	sreenathd

This patch maintains a count of oom kill victims in /proc/vmstat.
Currently, we depend on the kernel logs when a kernel OOM occurs.
But a kernel OOM can go unnoticed by the developer, as it can
silently kill some background applications/services.
In some small embedded systems, it is possible that the OOM is captured
in the logs but then overwritten because of the ring buffer.
Thus this interface can quickly help the user analyze whether any
OOM kills happened in the past, or whether the system has ever
entered the oom kill stage so far.

Thus, it can be beneficial in the following cases:
1. Users can monitor the kernel oom kill scenario without looking into
   the kernel logs.
2. It can help in tuning the watermark level in the system.
3. It can help in tuning the low memory killer behavior in user space.
4. It can be helpful on a logless system or if klogd logging
   (/var/log/messages) is disabled.

A snapshot of the result of a 3-day overnight test is shown below:
System: ARM Cortex A7, 1GB RAM, 8GB EMMC
Linux: 3.10.xx
Category: reference smart phone device
Loglevel: 7
Conditions: Fully loaded, BT/WiFi/GPS ON
Tests: auto launching of ~30+ apps using test scripts, in a loop for
3 days.
At the end of tests, check:
$ cat /proc/vmstat
nr_oom_victims 6

As we can see, there were 6 oom kill victims.

OOM is bad for any system, so this counter can help in quickly
tuning the OOM behavior of the system, without depending on the logs.

Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
---
 include/linux/vm_event_item.h |    1 +
 mm/oom_kill.c                 |    2 ++
 mm/page_alloc.c               |    1 -
 mm/vmstat.c                   |    1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef8..dd2600d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -57,6 +57,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
+		NR_OOM_VICTIMS,
 		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
 		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
 		UNEVICTABLE_PGRESCUED,	/* rescued from noreclaim list */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 03b612b..802b8a1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -570,6 +570,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	count_vm_event(NR_OOM_VICTIMS);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -600,6 +601,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 				task_pid_nr(p), p->comm);
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+			count_vm_event(NR_OOM_VICTIMS);
 		}
 	rcu_read_unlock();
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9bcfd70..fafb09d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2761,7 +2761,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
-
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886..8503a2e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -808,6 +808,7 @@ const char * const vmstat_text[] = {
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
 #endif
+	"nr_oom_victims",
 	"unevictable_pgs_culled",
 	"unevictable_pgs_scanned",
 	"unevictable_pgs_rescued",
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RESEND PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat counter
  2015-10-12 13:33 ` [PATCH 1/1] mm: vmstat: Add OOM victims " Pintu Kumar
@ 2015-10-12 14:28   ` Pintu Kumar
  2015-10-14  3:05     ` David Rientjes
  2015-10-12 14:44   ` [PATCH " PINTU KUMAR
  1 sibling, 1 reply; 20+ messages in thread
From: Pintu Kumar @ 2015-10-12 14:28 UTC (permalink / raw)
  To: akpm, minchan, dave, pintu.k, mhocko, koct9i, rientjes, hannes,
	penguin-kernel, bywxiaobai, mgorman, vbabka, js1304,
	kirill.shutemov, alexander.h.duyck, sasha.levin, cl,
	fengguang.wu, linux-kernel, linux-mm
  Cc: cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr, c.rajkumar,
	sreenathd

This patch maintains a count of oom kill victims in /proc/vmstat.
Currently, we depend on the kernel logs when a kernel OOM occurs.
But a kernel OOM can go unnoticed by the developer, as it can
silently kill some background applications/services.
In some small embedded systems, it is possible that the OOM is captured
in the logs but then overwritten because of the ring buffer.
Thus this interface can quickly help the user analyze whether any
OOM kills happened in the past, or whether the system has ever
entered the oom kill stage so far.

Thus, it can be beneficial in the following cases:
1. Users can monitor the kernel oom kill scenario without looking into
   the kernel logs.
2. It can help in tuning the watermark level in the system.
3. It can help in tuning the low memory killer behavior in user space.
4. It can be helpful on a logless system or if klogd logging
   (/var/log/messages) is disabled.

A snapshot of the result of a 3-day overnight test is shown below:
System: ARM Cortex A7, 1GB RAM, 8GB EMMC
Linux: 3.10.xx
Category: reference smart phone device
Loglevel: 7
Conditions: Fully loaded, BT/WiFi/GPS ON
Tests: auto launching of ~30+ apps using test scripts, in a loop for
3 days.
At the end of tests, check:
$ cat /proc/vmstat
nr_oom_victims 6

As we can see, there were 6 oom kill victims.

OOM is bad for any system, so this counter can help in quickly
tuning the OOM behavior of the system, without depending on the logs.

Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
---
V2: Removed oom_stall, Suggested By: Michal Hocko <mhocko@kernel.org>
    Renamed oom_kill_count to nr_oom_victims,
    Suggested By: Michal Hocko <mhocko@kernel.org>
    Suggested By: Anshuman Khandual <khandual@linux.vnet.ibm.com>

 include/linux/vm_event_item.h |    1 +
 mm/oom_kill.c                 |    2 ++
 mm/page_alloc.c               |    1 -
 mm/vmstat.c                   |    1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef8..dd2600d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -57,6 +57,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
+		NR_OOM_VICTIMS,
 		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
 		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
 		UNEVICTABLE_PGRESCUED,	/* rescued from noreclaim list */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 03b612b..802b8a1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -570,6 +570,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	count_vm_event(NR_OOM_VICTIMS);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -600,6 +601,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 				task_pid_nr(p), p->comm);
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+			count_vm_event(NR_OOM_VICTIMS);
 		}
 	rcu_read_unlock();
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9bcfd70..fafb09d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2761,7 +2761,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
-
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886..8503a2e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -808,6 +808,7 @@ const char * const vmstat_text[] = {
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
 #endif
+	"nr_oom_victims",
 	"unevictable_pgs_culled",
 	"unevictable_pgs_scanned",
 	"unevictable_pgs_rescued",
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat counter
  2015-10-12 13:33 ` [PATCH 1/1] mm: vmstat: Add OOM victims " Pintu Kumar
  2015-10-12 14:28   ` [RESEND PATCH " Pintu Kumar
@ 2015-10-12 14:44   ` PINTU KUMAR
  1 sibling, 0 replies; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-12 14:44 UTC (permalink / raw)
  To: akpm, minchan, dave, mhocko, koct9i, rientjes, hannes,
	penguin-kernel, bywxiaobai, mgorman, vbabka, js1304,
	kirill.shutemov, alexander.h.duyck, sasha.levin, cl,
	fengguang.wu, linux-kernel, linux-mm
  Cc: cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr, c.rajkumar,
	sreenathd

Hi,

Sorry, I forgot to mention the V2 update.
I will highlight the V2 changes and RESEND.

> -----Original Message-----
> From: Pintu Kumar [mailto:pintu.k@samsung.com]
> Sent: Monday, October 12, 2015 7:03 PM
> To: akpm@linux-foundation.org; minchan@kernel.org; dave@stgolabs.net;
> pintu.k@samsung.com; mhocko@suse.cz; koct9i@gmail.com;
> rientjes@google.com; hannes@cmpxchg.org; penguin-kernel@i-
> love.sakura.ne.jp; bywxiaobai@163.com; mgorman@suse.de; vbabka@suse.cz;
> js1304@gmail.com; kirill.shutemov@linux.intel.com;
> alexander.h.duyck@redhat.com; sasha.levin@oracle.com; cl@linux.com;
> fengguang.wu@intel.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org
> Cc: cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com;
> sreenathd@samsung.com
> Subject: [PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat counter
> 
> This patch maintains a count of oom kill victims in /proc/vmstat.
> Currently, we depend on the kernel logs when a kernel OOM occurs.
> But a kernel OOM can go unnoticed by the developer, as it can silently
> kill some background applications/services.
> In some small embedded systems, it is possible that the OOM is captured in
> the logs but then overwritten because of the ring buffer.
> Thus this interface can quickly help the user analyze whether there were
> any OOM kills in the past, or whether the system has ever entered
> the oom kill stage so far.
> 
> Thus, it can be beneficial under following cases:
> 1. User can monitor kernel oom kill scenario without looking into the
>    kernel logs.
> 2. It can help in tuning the watermark level in the system.
> 3. It can help in tuning the low memory killer behavior in user space.
> 4. It can be helpful on a logless system or if klogd logging
>    (/var/log/messages) are disabled.
> 
> A snapshot of the result of 3 days of over night test is shown below:
> System: ARM Cortex A7, 1GB RAM, 8GB EMMC
> Linux: 3.10.xx
> Category: reference smart phone device
> Loglevel: 7
> Conditions: Fully loaded, BT/WiFi/GPS ON
> Tests: auto launching of ~30+ apps using test scripts, in a loop for
> 3 days.
> At the end of tests, check:
> $ cat /proc/vmstat
> nr_oom_victims 6
> 
> As we noticed, there were around 6 oom kill victims.
> 
> The OOM is bad for any system. So, this counter can help in quickly tuning the
> OOM behavior of the system, without depending on the logs.
> 
> Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
> ---
>  include/linux/vm_event_item.h |    1 +
>  mm/oom_kill.c                 |    2 ++
>  mm/page_alloc.c               |    1 -
>  mm/vmstat.c                   |    1 +
>  4 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 2b1cef8..dd2600d 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -57,6 +57,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN,
> PSWPOUT,  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,  #endif
> +		NR_OOM_VICTIMS,
>  		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
>  		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
>  		UNEVICTABLE_PGRESCUED,	/* rescued from noreclaim list */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 03b612b..802b8a1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -570,6 +570,7 @@ void oom_kill_process(struct oom_control *oc, struct
> task_struct *p,
>  	 * space under its control.
>  	 */
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	count_vm_event(NR_OOM_VICTIMS);
>  	mark_oom_victim(victim);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-
> rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> @@ -600,6 +601,7 @@ void oom_kill_process(struct oom_control *oc, struct
> task_struct *p,
>  				task_pid_nr(p), p->comm);
>  			task_unlock(p);
>  			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> +			count_vm_event(NR_OOM_VICTIMS);
>  		}
>  	rcu_read_unlock();
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9bcfd70..fafb09d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2761,7 +2761,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned
> int order,
>  		schedule_timeout_uninterruptible(1);
>  		return NULL;
>  	}
> -
>  	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
diff --git
> a/mm/vmstat.c b/mm/vmstat.c index 1fd0886..8503a2e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -808,6 +808,7 @@ const char * const vmstat_text[] = {
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
>  #endif
> +	"nr_oom_victims",
>  	"unevictable_pgs_culled",
>  	"unevictable_pgs_scanned",
>  	"unevictable_pgs_rescued",
> --
> 1.7.9.5

Regards,
Pintu


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RESEND PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat counter
  2015-10-12 14:28   ` [RESEND PATCH " Pintu Kumar
@ 2015-10-14  3:05     ` David Rientjes
  2015-10-14 13:41       ` PINTU KUMAR
  0 siblings, 1 reply; 20+ messages in thread
From: David Rientjes @ 2015-10-14  3:05 UTC (permalink / raw)
  To: Pintu Kumar
  Cc: akpm, minchan, dave, mhocko, koct9i, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar, sreenathd

On Mon, 12 Oct 2015, Pintu Kumar wrote:

> This patch maintains a count of oom kill victims in /proc/vmstat.
> Currently, we depend on the kernel logs when a kernel OOM occurs.
> But a kernel OOM can go unnoticed by the developer, as it can
> silently kill some background applications/services.
> In some small embedded systems, it is possible that the OOM is captured
> in the logs but then overwritten because of the ring buffer.
> Thus this interface can quickly help the user analyze whether any
> OOM kills happened in the past, or whether the system has ever
> entered the oom kill stage so far.
> 
> Thus, it can be beneficial under following cases:
> 1. User can monitor kernel oom kill scenario without looking into the
>    kernel logs.

I'm not sure how helpful that would be since we don't know anything about 
the oom kill itself, only that at some point during the uptime there were 
oom kills.

> 2. It can help in tuning the watermark level in the system.

I disagree with this one, because we can encounter oom kills due to 
fragmentation rather than low memory conditions for high-order 
allocations.  The amount of free memory may be substantially higher than 
all zone watermarks.

> 3. It can help in tuning the low memory killer behavior in user space.

Same reason as above.

> 4. It can be helpful on a logless system or if klogd logging
>    (/var/log/messages) are disabled.
> 

This would be similar to point (1) above, and I question how helpful it 
would be.  I notice that all oom kills (system, cpuset, mempolicy, and 
memcg) are treated equally in this case and there's no way to 
differentiate them.  That would lead me to believe that you are targeting 
this change for systems that don't use mempolicies or cgroups.  That's 
fine, but I doubt it will be helpful for anybody else.

> A snapshot of the result of 3 days of over night test is shown below:
> System: ARM Cortex A7, 1GB RAM, 8GB EMMC
> Linux: 3.10.xx
> Category: reference smart phone device
> Loglevel: 7
> Conditions: Fully loaded, BT/WiFi/GPS ON
> Tests: auto launching of ~30+ apps using test scripts, in a loop for
> 3 days.
> At the end of tests, check:
> $ cat /proc/vmstat
> nr_oom_victims 6
> 
> As we noticed, there were around 6 oom kill victims.
> 
> The OOM is bad for any system. So, this counter can help in quickly
> tuning the OOM behavior of the system, without depending on the logs.
> 

NACK to the patch since it isn't justified.

We've long had a desire to have a better oom reporting mechanism rather 
than just the kernel log.  It seems like you're feeling the same pain.  I 
think it would be better to have an eventfd notifier for system oom 
conditions so we can track kernel oom kills (and conditions) in 
userspace.  I have a patch for that, and it works quite well when 
userspace is mlocked with a buffer in memory.

If you are only interested in a strict count of system oom kills, this 
could then easily be implemented without adding vmstat counters.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RESEND PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat counter
  2015-10-14  3:05     ` David Rientjes
@ 2015-10-14 13:41       ` PINTU KUMAR
  2015-10-14 22:04         ` David Rientjes
  0 siblings, 1 reply; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-14 13:41 UTC (permalink / raw)
  To: 'David Rientjes'
  Cc: akpm, minchan, dave, mhocko, koct9i, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar

Hi,

Thank you very much for your review and comments.

> -----Original Message-----
> From: David Rientjes [mailto:rientjes@google.com]
> Sent: Wednesday, October 14, 2015 8:36 AM
> To: Pintu Kumar
> Cc: akpm@linux-foundation.org; minchan@kernel.org; dave@stgolabs.net;
> mhocko@suse.cz; koct9i@gmail.com; hannes@cmpxchg.org; penguin-kernel@i-
> love.sakura.ne.jp; bywxiaobai@163.com; mgorman@suse.de; vbabka@suse.cz;
> js1304@gmail.com; kirill.shutemov@linux.intel.com;
> alexander.h.duyck@redhat.com; sasha.levin@oracle.com; cl@linux.com;
> fengguang.wu@intel.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com;
> sreenathd@samsung.com
> Subject: Re: [RESEND PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat
> counter
> 
> On Mon, 12 Oct 2015, Pintu Kumar wrote:
> 
> > This patch maintains a count of oom kill victims in /proc/vmstat.
> > Currently, we depend on the kernel logs when a kernel OOM occurs.
> > But a kernel OOM can go unnoticed by the developer, as it can
> > silently kill some background applications/services.
> > In some small embedded systems, it is possible that the OOM is
> > captured in the logs but then overwritten because of the ring buffer.
> > Thus this interface can quickly help the user analyze whether
> > any OOM kills happened in the past, or whether the system
> > has ever entered the oom kill stage so far.
> >
> > Thus, it can be beneficial under following cases:
> > 1. User can monitor kernel oom kill scenario without looking into the
> >    kernel logs.
> 
> I'm not sure how helpful that would be since we don't know anything about the
> oom kill itself, only that at some point during the uptime there were oom
kills.
> 
Not sure about others.
For me it was very helpful during sluggishness and long duration ageing tests.
With this, I don't have to look into the logs manually;
I just monitor this count in a script.
The moment I see nr_oom_victims > 1, I know that a kernel OOM must have
happened and I need to take the log dump.
So then I do: dmesg >> oom_logs.txt
Or I even stop the tests for further tuning.
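
A minimal sketch of the kind of monitor I mean (written in C here only for
illustration, a shell loop works equally well; it assumes the nr_oom_victims
counter from this patch and appends to oom_logs.txt as above):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* read the nr_oom_victims counter out of /proc/vmstat (0 if absent) */
static unsigned long oom_victims(void)
{
	char line[128];
	unsigned long val = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "nr_oom_victims %lu", &val) == 1)
			break;
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long last = oom_victims();

	for (;;) {
		unsigned long now = oom_victims();

		if (now > last) {
			/* counter increased: dump the kernel ring buffer */
			system("dmesg >> oom_logs.txt");
			last = now;
		}
		sleep(10);
	}
	return 0;
}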

> > 2. It can help in tuning the watermark level in the system.
> 
> I disagree with this one, because we can encounter oom kills due to
> fragmentation rather than low memory conditions for high-order allocations.
> The amount of free memory may be substantially higher than all zone
> watermarks.
> 
AFAIK, the kernel oom happens only for lower orders (up to
PAGE_ALLOC_COSTLY_ORDER).
For higher orders we get a page allocation failure instead.

> > 3. It can help in tuning the low memory killer behavior in user space.
> 
> Same reason as above.
> 
> > 4. It can be helpful on a logless system or if klogd logging
> >    (/var/log/messages) are disabled.
> >
> 
> This would be similar to point (1) above, and I question how helpful it would
be.
> I notice that all oom kills (system, cpuset, mempolicy, and
> memcg) are treated equally in this case and there's no way to differentiate
them.
> That would lead me to believe that you are targeting this change for systems
> that don't use mempolicies or cgroups.  That's fine, but I doubt it will be
helpful
> for anybody else.
> 
No, we are not targeting any specific category.
Our goal is simple: track and report a kernel oom kill as soon as it occurs.

> > A snapshot of the result of 3 days of over night test is shown below:
> > System: ARM Cortex A7, 1GB RAM, 8GB EMMC
> > Linux: 3.10.xx
> > Category: reference smart phone device
> > Loglevel: 7
> > Conditions: Fully loaded, BT/WiFi/GPS ON
> > Tests: auto launching of ~30+ apps using test scripts, in a loop for
> > 3 days.
> > At the end of tests, check:
> > $ cat /proc/vmstat
> > nr_oom_victims 6
> >
> > As we noticed, there were around 6 oom kill victims.
> >
> > The OOM is bad for any system. So, this counter can help in quickly
> > tuning the OOM behavior of the system, without depending on the logs.
> >
> 
> NACK to the patch since it isn't justified.
> 
> We've long had a desire to have a better oom reporting mechanism rather than
> just the kernel log.  It seems like you're feeling the same pain.  I think it
would be
> better to have an eventfd notifier for system oom conditions so we can track
> kernel oom kills (and conditions) in userspace.  I have a patch for that, and
it
> works quite well when userspace is mlocked with a buffer in memory.
> 
Ok, this would be interesting.
Can you point me to the patches?
I will quickly check if it is useful for us.

> If you are only interested in a strict count of system oom kills, this could
then
> easily be implemented without adding vmstat counters.
>
We are interested only in knowing when a kernel OOM occurs, not even in the
oom victim count, so that we can tune something in user space to avoid or
delay it as far as possible.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RESEND PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat counter
  2015-10-14 13:41       ` PINTU KUMAR
@ 2015-10-14 22:04         ` David Rientjes
  2015-10-15 14:35           ` PINTU KUMAR
  0 siblings, 1 reply; 20+ messages in thread
From: David Rientjes @ 2015-10-14 22:04 UTC (permalink / raw)
  To: PINTU KUMAR
  Cc: akpm, minchan, dave, mhocko, koct9i, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar

On Wed, 14 Oct 2015, PINTU KUMAR wrote:

> For me it was very helpful during sluggish and long duration ageing tests.
> With this, I don't have to look into the logs manually.
> I just monitor this count in a script. 
> The moment I get nr_oom_victims > 1, I know that kernel OOM would have happened
> and I need to take the log dump.
> So, then I do: dmesg >> oom_logs.txt
> Or, even stop the tests for further tuning.
> 

I think eventfd(2) was created for that purpose, to avoid the constant 
polling that you would have to do to check nr_oom_victims and then take a 
snapshot.

> > I disagree with this one, because we can encounter oom kills due to
> > fragmentation rather than low memory conditions for high-order allocations.
> > The amount of free memory may be substantially higher than all zone
> > watermarks.
> > 
> AFAIK, kernel oom happens only for lower-order (PAGE_ALLOC_COSTLY_ORDER).
> For higher-order we get page allocation failure.
> 

Order-3 is included.  I've seen machines with _gigabytes_ of free memory 
in ZONE_NORMAL on a node and have an order-3 page allocation failure that 
called the oom killer.

> > We've long had a desire to have a better oom reporting mechanism rather than
> > just the kernel log.  It seems like you're feeling the same pain.  I think it
> would be
> > better to have an eventfd notifier for system oom conditions so we can track
> > kernel oom kills (and conditions) in userspace.  I have a patch for that, and
> it
> > works quite well when userspace is mlocked with a buffer in memory.
> > 
> Ok, this would be interesting.
> Can you point me to the patches?
> I will quickly check if it is useful for us.
> 

https://lwn.net/Articles/589404.  It's invasive and isn't upstream.  I 
would like to restructure that patchset to avoid the memcg trickery and 
allow for a root-only eventfd(2) notification through procfs on system 
oom.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RESEND PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat counter
  2015-10-14 22:04         ` David Rientjes
@ 2015-10-15 14:35           ` PINTU KUMAR
  0 siblings, 0 replies; 20+ messages in thread
From: PINTU KUMAR @ 2015-10-15 14:35 UTC (permalink / raw)
  To: 'David Rientjes'
  Cc: akpm, minchan, dave, mhocko, koct9i, hannes, penguin-kernel,
	bywxiaobai, mgorman, vbabka, js1304, kirill.shutemov,
	alexander.h.duyck, sasha.levin, cl, fengguang.wu, linux-kernel,
	linux-mm, cpgs, pintu_agarwal, pintu.ping, vishnu.ps, rohit.kr,
	c.rajkumar

Hi,

> -----Original Message-----
> From: David Rientjes [mailto:rientjes@google.com]
> Sent: Thursday, October 15, 2015 3:35 AM
> To: PINTU KUMAR
> Cc: akpm@linux-foundation.org; minchan@kernel.org; dave@stgolabs.net;
> mhocko@suse.cz; koct9i@gmail.com; hannes@cmpxchg.org; penguin-kernel@i-
> love.sakura.ne.jp; bywxiaobai@163.com; mgorman@suse.de; vbabka@suse.cz;
> js1304@gmail.com; kirill.shutemov@linux.intel.com;
> alexander.h.duyck@redhat.com; sasha.levin@oracle.com; cl@linux.com;
> fengguang.wu@intel.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cpgs@samsung.com; pintu_agarwal@yahoo.com; pintu.ping@gmail.com;
> vishnu.ps@samsung.com; rohit.kr@samsung.com; c.rajkumar@samsung.com
> Subject: RE: [RESEND PATCH 1/1] mm: vmstat: Add OOM victims count in vmstat
> counter
> 
> On Wed, 14 Oct 2015, PINTU KUMAR wrote:
> 
> > For me it was very helpful during sluggish and long duration ageing tests.
> > With this, I don't have to look into the logs manually.
> > I just monitor this count in a script.
> > The moment I get nr_oom_victims > 1, I know that kernel OOM would have
> > happened and I need to take the log dump.
> > So, then I do: dmesg >> oom_logs.txt
> > Or, even stop the tests for further tuning.
> >
> 
> I think eventfd(2) was created for that purpose, to avoid the constant polling
> that you would have to do to check nr_oom_victims and then take a snapshot.
> 
> > > I disagree with this one, because we can encounter oom kills due to
> > > fragmentation rather than low memory conditions for high-order
allocations.
> > > The amount of free memory may be substantially higher than all zone
> > > watermarks.
> > >
> > AFAIK, kernel oom happens only for lower-order
> (PAGE_ALLOC_COSTLY_ORDER).
> > For higher-order we get page allocation failure.
> >
> 
> Order-3 is included.  I've seen machines with _gigabytes_ of free memory in
> ZONE_NORMAL on a node and have an order-3 page allocation failure that
> called the oom killer.
> 
Yes, if PAGE_ALLOC_COSTLY_ORDER is defined as 3, then order-3 will be included
for OOM. But that's fine; we are just interested in knowing whether the system
entered the oom state.
That's the reason I earlier also added _oom_stall_, to know whether the system
ever entered oom but ended up with a page allocation failure instead of an
oom kill.

> > > We've long had a desire to have a better oom reporting mechanism
> > > rather than just the kernel log.  It seems like you're feeling the
> > > same pain.  I think it
> > would be
> > > better to have an eventfd notifier for system oom conditions so we
> > > can track kernel oom kills (and conditions) in userspace.  I have a
> > > patch for that, and
> > it
> > > works quite well when userspace is mlocked with a buffer in memory.
> > >
> > Ok, this would be interesting.
> > Can you point me to the patches?
> > I will quickly check if it is useful for us.
> >
> 
> https://lwn.net/Articles/589404.  It's invasive and isn't upstream.  I would
like to
> restructure that patchset to avoid the memcg trickery and allow for a
root-only
> eventfd(2) notification through procfs on system oom.

I am interested only in the global oom case and not memcg. We have memcg
enabled, but I think even a memcg oom will finally invoke _oom_kill_process_.
So, I am interested in a patchset that can trigger notifications from
oom_kill_process as soon as any victim is killed.
Sorry, in your patchset I could not actually locate the system_oom
notification patch.
If you have a similar patchset, please point me to it.
It would be really helpful.
Thank you!


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-10-15 14:35 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-01 10:48 [PATCH 1/1] mm: vmstat: Add OOM kill count in vmstat counter Pintu Kumar
2015-10-01 13:29 ` Anshuman Khandual
2015-10-05  6:19   ` PINTU KUMAR
2015-10-01 13:38 ` Michal Hocko
2015-10-05  6:12   ` PINTU KUMAR
2015-10-05 12:22     ` Michal Hocko
2015-10-06  6:59       ` PINTU KUMAR
2015-10-06 15:41         ` Michal Hocko
2015-10-07 14:48           ` PINTU KUMAR
2015-10-08 14:18             ` Michal Hocko
2015-10-08 16:06               ` PINTU KUMAR
2015-10-08 16:30                 ` Michal Hocko
2015-10-09 12:59                   ` PINTU KUMAR
2015-10-12 13:33 ` [PATCH 1/1] mm: vmstat: Add OOM victims " Pintu Kumar
2015-10-12 14:28   ` [RESEND PATCH " Pintu Kumar
2015-10-14  3:05     ` David Rientjes
2015-10-14 13:41       ` PINTU KUMAR
2015-10-14 22:04         ` David Rientjes
2015-10-15 14:35           ` PINTU KUMAR
2015-10-12 14:44   ` [PATCH " PINTU KUMAR
