* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Minchan Kim @ 2017-01-10 23:52 UTC
  To: linux-kernel; +Cc: mhocko, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

Hi Michal,

Sorry for the late review. Actually, I just reviewed your patch
[RFC PATCH 2/2] mm, vmscan: cleanup inactive_list_is_low and
came up with some questions. Here they go.

On Thu, Jan 05, 2017 at 03:46:36PM -0800, akpm@linux-foundation.org wrote:
> 
> The patch titled
>      Subject: mm, vmscan: add mm_vmscan_inactive_list_is_low tracepoint
> has been added to the -mm tree.  Its filename is
>      mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch
> 
> This patch should soon appear at
>     http://ozlabs.org/~akpm/mmots/broken-out/mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch
> and later at
>     http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch
> 
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
> 
> ------------------------------------------------------
> From: Michal Hocko <mhocko@suse.com>
> Subject: mm, vmscan: add mm_vmscan_inactive_list_is_low tracepoint
> 
> Currently we have tracepoints for both active and inactive LRU lists
> reclaim but we do not have any which would tell us why we decided to
> age the active list.  Without that it is quite hard to diagnose
> active/inactive lists balancing.  Add mm_vmscan_inactive_list_is_low
> tracepoint to tell us this information.
> 
> Link: http://lkml.kernel.org/r/20170104101942.4860-8-mhocko@kernel.org
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/trace/events/vmscan.h |   40 ++++++++++++++++++++++++++++++++
>  mm/vmscan.c                   |   23 +++++++++++-------
>  2 files changed, 54 insertions(+), 9 deletions(-)
> 
> diff -puN include/trace/events/vmscan.h~mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint include/trace/events/vmscan.h
> --- a/include/trace/events/vmscan.h~mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint
> +++ a/include/trace/events/vmscan.h
> @@ -15,6 +15,7 @@
>  #define RECLAIM_WB_MIXED	0x0010u
>  #define RECLAIM_WB_SYNC		0x0004u /* Unused, all reclaim async */
>  #define RECLAIM_WB_ASYNC	0x0008u
> +#define RECLAIM_WB_LRU		(RECLAIM_WB_ANON|RECLAIM_WB_FILE)
>  
>  #define show_reclaim_flags(flags)				\
>  	(flags) ? __print_flags(flags, "|",			\
> @@ -426,6 +427,45 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active,
>  		show_reclaim_flags(__entry->reclaim_flags))
>  );
>  
> +TRACE_EVENT(mm_vmscan_inactive_list_is_low,
> +
> +	TP_PROTO(int nid, int reclaim_idx,
> +		unsigned long total_inactive, unsigned long inactive,
> +		unsigned long total_active, unsigned long active,
> +		unsigned long ratio, int file),
> +
> +	TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file),
> +
> +	TP_STRUCT__entry(
> +		__field(int, nid)
> +		__field(int, reclaim_idx)
> +		__field(unsigned long, total_inactive)
> +		__field(unsigned long, inactive)
> +		__field(unsigned long, total_active)
> +		__field(unsigned long, active)
> +		__field(unsigned long, ratio)
> +		__field(int, reclaim_flags)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nid = nid;
> +		__entry->reclaim_idx = reclaim_idx;
> +		__entry->total_inactive = total_inactive;
> +		__entry->inactive = inactive;
> +		__entry->total_active = total_active;
> +		__entry->active = active;
> +		__entry->ratio = ratio;
> +		__entry->reclaim_flags = trace_shrink_flags(file) & RECLAIM_WB_LRU;
> +	),
> +
> +	TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s",
> +		__entry->nid,
> +		__entry->reclaim_idx,
> +		__entry->total_inactive, __entry->inactive,
> +		__entry->total_active, __entry->active,
> +		__entry->ratio,
> +		show_reclaim_flags(__entry->reclaim_flags))
> +);
>  #endif /* _TRACE_VMSCAN_H */
>  
>  /* This part must be outside protection */
> diff -puN mm/vmscan.c~mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint
> +++ a/mm/vmscan.c
> @@ -2039,11 +2039,11 @@ static void shrink_active_list(unsigned
>   *   10TB     320        32GB
>   */
>  static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> -						struct scan_control *sc)
> +						struct scan_control *sc, bool trace)
>  {
>  	unsigned long inactive_ratio;
> -	unsigned long inactive;
> -	unsigned long active;
> +	unsigned long total_inactive, inactive;
> +	unsigned long total_active, active;
>  	unsigned long gb;
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	int zid;
> @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
>  	if (!file && !total_swap_pages)
>  		return false;
>  
> -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
>  

the decision to deactivate is based on the eligible zones' LRU size,
not the whole node's, so why do we need to trace all zones' LRU sizes?
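
For context, "eligible" here means zones at or below sc->reclaim_idx. A
minimal sketch of an eligible-size helper for the global, non-memcg case,
assuming the per-zone NR_ZONE_LRU_BASE counters of this kernel generation
(memcg lruvecs would need the per-memcg zone sizes instead; this is not the
exact code from the series):

static unsigned long lru_size_eligible(struct lruvec *lruvec,
				       enum lru_list lru, int reclaim_idx)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	unsigned long size = 0;
	int zid;

	/* only zones at or below reclaim_idx can satisfy the allocation */
	for (zid = 0; zid <= reclaim_idx; zid++) {
		struct zone *zone = &pgdat->node_zones[zid];

		if (!managed_zone(zone))
			continue;
		size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
	}

	return size;
}

as opposed to lruvec_lru_size(lruvec, lru), which counts the whole node.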

>  	/*
>  	 * For zone-constrained allocations, it is necessary to check if
> @@ -2085,6 +2085,11 @@ static bool inactive_list_is_low(struct
>  	else
>  		inactive_ratio = 1;
>  
> +	if (trace)
> +		trace_mm_vmscan_inactive_list_is_low(pgdat->node_id,
> +				sc->reclaim_idx,
> +				total_inactive, inactive,
> +				total_active, active, inactive_ratio, file);
>  	return inactive * inactive_ratio < active;
>  }
>  
> @@ -2092,7 +2097,7 @@ static unsigned long shrink_list(enum lr
>  				 struct lruvec *lruvec, struct scan_control *sc)
>  {
>  	if (is_active_lru(lru)) {
> -		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
> +		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
>  			shrink_active_list(nr_to_scan, lruvec, sc, lru);
>  		return 0;
>  	}
> @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
>  	 * lruvec even if it has plenty of old anonymous pages unless the
>  	 * system is under heavy pressure.
>  	 */
> -	if (!inactive_list_is_low(lruvec, true, sc) &&
> +	if (!inactive_list_is_low(lruvec, true, sc, false) &&

Hmm, I was curious why you added the trace boolean argument and found it here.
Yes, this place is not directly related to deactivation, but couldn't we
trace it unconditionally? With that, we could see why the VM reclaims only
file-backed pages from a slow device even though there are enough anonymous
pages on fast swap like zram.

So I suggest tracing it unconditionally, together with sc->priority. With
that, we could catch such problems.
What do you think?
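
A rough sketch of what that could look like at this call site (hypothetical,
not part of the posted patch; carrying sc->priority in the event would
additionally need a new field in the tracepoint):

	/* hypothetical: trace this balancing check unconditionally too */
	if (!inactive_list_is_low(lruvec, true, sc, true) &&
	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
		scan_balance = SCAN_FILE;
		goto out;
	}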

>  	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
>  		scan_balance = SCAN_FILE;
>  		goto out;
> @@ -2448,7 +2453,7 @@ static void shrink_node_memcg(struct pgl
>  	 * Even if we did not try to evict anon pages at all, we want to
>  	 * rebalance the anon lru active/inactive ratio.
>  	 */
> -	if (inactive_list_is_low(lruvec, false, sc))
> +	if (inactive_list_is_low(lruvec, false, sc, true))
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
>  }
> @@ -3098,7 +3103,7 @@ static void age_active_anon(struct pglis
>  	do {
>  		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
>  
> -		if (inactive_list_is_low(lruvec, false, sc))
> +		if (inactive_list_is_low(lruvec, false, sc, true))
>  			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  					   sc, LRU_ACTIVE_ANON);
>  
> _
> 
> Patches currently in -mm which might be from mhocko@suse.com are
> 
> mm-slab-make-sure-that-kmalloc_max_size-will-fit-into-max_order.patch
> bpf-do-not-use-kmalloc_shift_max.patch
> mm-fix-remote-numa-hits-statistics.patch
> mm-get-rid-of-__gfp_other_node.patch
> mm-throttle-show_mem-from-warn_alloc.patch
> mm-trace-extract-compaction_status-and-zone_type-to-a-common-header.patch
> oom-trace-add-oom-detection-tracepoints.patch
> oom-trace-add-compaction-retry-tracepoint.patch
> mm-vmscan-remove-unused-mm_vmscan_memcg_isolate.patch
> mm-vmscan-add-active-list-aging-tracepoint.patch
> mm-vmscan-add-active-list-aging-tracepoint-update.patch
> mm-vmscan-show-the-number-of-skipped-pages-in-mm_vmscan_lru_isolate.patch
> mm-vmscan-show-lru-name-in-mm_vmscan_lru_isolate-tracepoint.patch
> mm-vmscan-extract-shrink_page_list-reclaim-counters-into-a-struct.patch
> mm-vmscan-enhance-mm_vmscan_lru_shrink_inactive-tracepoint.patch
> mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch
> trace-vmscan-postprocess-sync-with-tracepoints-updates.patch
> 


* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Michal Hocko @ 2017-01-11 15:52 UTC
  To: Minchan Kim; +Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

On Wed 11-01-17 08:52:50, Minchan Kim wrote:
[...]
> > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> >  	if (!file && !total_swap_pages)
> >  		return false;
> >  
> > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> >  
> 
> the decision to deactivate is based on the eligible zones' LRU size,
> not the whole node's, so why do we need to trace all zones' LRU sizes?

Strictly speaking, the total_ counters are not necessary for making the
decision. I found reporting those numbers useful regardless, because they
also give us information about how large the eligible portion of the LRU
list is. We do not have any other tracepoint which would report that.
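
To illustrate with made-up numbers in the event's own format: a lowmem
request on a 32b machine might report

  mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=1 total_inactive=150000 inactive=4200 total_active=180000 active=3900 ratio=1 flags=RECLAIM_WB_FILE

where the total_ fields show that only a small fraction of the node's file
LRU sits in zones eligible for this request.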
 
[...]
> > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> >  	 * lruvec even if it has plenty of old anonymous pages unless the
> >  	 * system is under heavy pressure.
> >  	 */
> > -	if (!inactive_list_is_low(lruvec, true, sc) &&
> > +	if (!inactive_list_is_low(lruvec, true, sc, false) &&
> 
> Hmm, I was curious why you added the trace boolean argument and found it here.
> Yes, this place is not directly related to deactivation, but couldn't we
> trace it unconditionally?

I've had it like that when I was debugging the mentioned bug and found
it a bit disturbing. It generated more output than I would like, and it
wasn't really clear which code path it had been called from.

> With that, we could see why the VM reclaims only
> file-backed pages from a slow device even though there are enough anonymous
> pages on fast swap like zram.

That would be something for a separate tracepoint in g_s_c

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Minchan Kim @ 2017-01-12  5:12 UTC
  To: Michal Hocko
  Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

Hello,

On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> [...]
> > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > >  	if (!file && !total_swap_pages)
> > >  		return false;
> > >  
> > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > >  
> > 
> > the decision to deactivate is based on the eligible zones' LRU size,
> > not the whole node's, so why do we need to trace all zones' LRU sizes?
> 
> Strictly speaking, the total_ counters are not necessary for making the
> decision. I found reporting those numbers useful regardless, because they
> also give us information about how large the eligible portion of the LRU
> list is. We do not have any other tracepoint which would report that.

The patch doesn't say anything about why it's useful. Could you explain why
it's useful and why inactive_list_is_low is the right place?

Don't get me wrong, please. I don't want to bother you.
I just don't want to add random stuff, even if it's a tracepoint for
debugging.

>  
> [...]
> > > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> > >  	 * lruvec even if it has plenty of old anonymous pages unless the
> > >  	 * system is under heavy pressure.
> > >  	 */
> > > -	if (!inactive_list_is_low(lruvec, true, sc) &&
> > > +	if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > 
> > Hmm, I was curious why you added the trace boolean argument and found it here.
> > Yes, this place is not directly related to deactivation, but couldn't we
> > trace it unconditionally?
> 
> I've had it like that when I was debugging the mentioned bug and found
> it a bit disturbing. It generated more output than I would like, and it
> wasn't really clear which code path it had been called from.

Indeed.

Personally, I would move inactive_list_is_low into shrink_active_list,
with shrink_active_list calling inactive_list_is_low(...., true)
unconditionally, so that the code becomes simpler/clearer; but that still
cannot remove the trace boolean variable, whose removal is what I want.
So it's okay if you love your version.

> 
> > With that, we could see why the VM reclaims only
> > file-backed pages from a slow device even though there are enough anonymous
> > pages on fast swap like zram.
> 
> That would be something for a separate tracepoint in g_s_c

Agree.


* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Michal Hocko @ 2017-01-12  8:15 UTC
  To: Minchan Kim; +Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> Hello,
> 
> On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > [...]
> > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > >  	if (!file && !total_swap_pages)
> > > >  		return false;
> > > >  
> > > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > >  
> > > 
> > > the decision to deactivate is based on the eligible zones' LRU size,
> > > not the whole node's, so why do we need to trace all zones' LRU sizes?
> > 
> > Strictly speaking, the total_ counters are not necessary for making the
> > decision. I found reporting those numbers useful regardless, because they
> > also give us information about how large the eligible portion of the LRU
> > list is. We do not have any other tracepoint which would report that.
> 
> The patch doesn't say anything about why it's useful. Could you explain why
> it's useful and why inactive_list_is_low is the right place?
> 
> Don't get me wrong, please. I don't want to bother you.
> I just don't want to add random stuff, even if it's a tracepoint for
> debugging.

This doesn't sound random to me. We simply do not have a full picture
on 32b systems without this information, especially when memcgs are
involved and the global numbers are spread over different LRUs.
 
> > [...]
> > > > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> > > >  	 * lruvec even if it has plenty of old anonymous pages unless the
> > > >  	 * system is under heavy pressure.
> > > >  	 */
> > > > -	if (!inactive_list_is_low(lruvec, true, sc) &&
> > > > +	if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > > 
> > > Hmm, I was curious why you added the trace boolean argument and found it here.
> > > Yes, this place is not directly related to deactivation, but couldn't we
> > > trace it unconditionally?
> > 
> > I've had it like that when I was debugging the mentioned bug and found
> > it a bit disturbing. It generated more output than I would like, and it
> > wasn't really clear which code path it had been called from.
> 
> Indeed.
> 
> Personally, I would move inactive_list_is_low into shrink_active_list,
> with shrink_active_list calling inactive_list_is_low(...., true)
> unconditionally, so that the code becomes simpler/clearer; but that still
> cannot remove the trace boolean variable, whose removal is what I want.
> So it's okay if you love your version.

I am not sure I am following. Why is the additional parameter a problem?
-- 
Michal Hocko
SUSE Labs


* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Minchan Kim @ 2017-01-12  8:48 UTC
  To: Michal Hocko
  Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > Hello,
> > 
> > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > [...]
> > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > >  	if (!file && !total_swap_pages)
> > > > >  		return false;
> > > > >  
> > > > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > >  
> > > > 
> > > > the decision to deactivate is based on the eligible zones' LRU size,
> > > > not the whole node's, so why do we need to trace all zones' LRU sizes?
> > > 
> > > Strictly speaking, the total_ counters are not necessary for making the
> > > decision. I found reporting those numbers useful regardless, because they
> > > also give us information about how large the eligible portion of the LRU
> > > list is. We do not have any other tracepoint which would report that.
> > 
> > The patch doesn't say anything about why it's useful. Could you explain why
> > it's useful and why inactive_list_is_low is the right place?
> > 
> > Don't get me wrong, please. I don't want to bother you.
> > I just don't want to add random stuff, even if it's a tracepoint for
> > debugging.
> 
> This doesn't sound random to me. We simply do not have a full picture
> on 32b systems without this information, especially when memcgs are
> involved and the global numbers are spread over different LRUs.

Could you elaborate on that?

"
Currently we have tracepoints for both active and inactive LRU lists
reclaim but we do not have any which would tell us why we decided to
age the active list.  Without that it is quite hard to diagnose
active/inactive lists balancing.  Add mm_vmscan_inactive_list_is_low
tracepoint to tell us this information.
"

Your description says "why we decided to age the active list".
So, what's needed?

>  
> > > [...]
> > > > > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> > > > >  	 * lruvec even if it has plenty of old anonymous pages unless the
> > > > >  	 * system is under heavy pressure.
> > > > >  	 */
> > > > > -	if (!inactive_list_is_low(lruvec, true, sc) &&
> > > > > +	if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > > > 
> > > > Hmm, I was curious why you added the trace boolean argument and found it here.
> > > > Yes, this place is not directly related to deactivation, but couldn't we
> > > > trace it unconditionally?
> > > 
> > > I've had it like that when I was debugging the mentioned bug and found
> > > it a bit disturbing. It generated more output than I would like, and it
> > > wasn't really clear which code path it had been called from.
> > 
> > Indeed.
> > 
> > Personally, I would move inactive_list_is_low into shrink_active_list,
> > with shrink_active_list calling inactive_list_is_low(...., true)
> > unconditionally, so that the code becomes simpler/clearer; but that still
> > cannot remove the trace boolean variable, whose removal is what I want.
> > So it's okay if you love your version.
> 
> I am not sure I am following. Why is the additional parameter a problem?

Well, to me, it's not elegant, is it? If we need such a boolean variable
to control whether to show the trace, it means either it's not a good
place or we should think about refactoring.

> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Michal Hocko @ 2017-01-12  9:10 UTC
  To: Minchan Kim; +Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > Hello,
> > > 
> > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > [...]
> > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > >  	if (!file && !total_swap_pages)
> > > > > >  		return false;
> > > > > >  
> > > > > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > >  
> > > > > 
> > > > > the decision to deactivate is based on the eligible zones' LRU size,
> > > > > not the whole node's, so why do we need to trace all zones' LRU sizes?
> > > > 
> > > > Strictly speaking, the total_ counters are not necessary for making the
> > > > decision. I found reporting those numbers useful regardless, because they
> > > > also give us information about how large the eligible portion of the LRU
> > > > list is. We do not have any other tracepoint which would report that.
> > > 
> > > The patch doesn't say anything about why it's useful. Could you explain why
> > > it's useful and why inactive_list_is_low is the right place?
> > > 
> > > Don't get me wrong, please. I don't want to bother you.
> > > I just don't want to add random stuff, even if it's a tracepoint for
> > > debugging.
> > 
> > This doesn't sound random to me. We simply do not have a full picture
> > on 32b systems without this information, especially when memcgs are
> > involved and the global numbers are spread over different LRUs.
> 
> Could you elaborate on that?

The problem with 32b systems is that you can only consider a part of the
LRU for lowmem requests. While we have global counters to see how many
lowmem inactive/active pages we have, those get distributed to memcg LRUs,
and that distribution is impossible to guess. So my thinking is that it can
become a real head scratcher to realize why certain active LRUs are aged
while others are not. This was the case when I was debugging the last issue
which triggered all this. All of a sudden I saw many invocations where
inactive and active were zero, which seemed weird, until I realized that
those were memcg lruvecs, which is what the total numbers told me...

Later on I would like to add a memcg identifier to the vmscan
tracepoints but I didn't get there yet.
 
> "
> Currently we have tracepoints for both active and inactive LRU lists
> reclaim but we do not have any which would tell us why we decided to
> age the active list.  Without that it is quite hard to diagnose
> active/inactive lists balancing.  Add mm_vmscan_inactive_list_is_low
> tracepoint to tell us this information.
> "
> 
> Your description says "why we decided to age the active list".
> So, what's needed?
> 
> >  
> > > > [...]
> > > > > > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> > > > > >  	 * lruvec even if it has plenty of old anonymous pages unless the
> > > > > >  	 * system is under heavy pressure.
> > > > > >  	 */
> > > > > > -	if (!inactive_list_is_low(lruvec, true, sc) &&
> > > > > > +	if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > > > > 
> > > > > Hmm, I was curious why you added the trace boolean argument and found it here.
> > > > > Yes, this place is not directly related to deactivation, but couldn't we
> > > > > trace it unconditionally?
> > > > 
> > > > I've had it like that when I was debugging the mentioned bug and found
> > > > it a bit disturbing. It generated more output than I would like, and it
> > > > wasn't really clear which code path it had been called from.
> > > 
> > > Indeed.
> > > 
> > > Personally, I would move inactive_list_is_low into shrink_active_list,
> > > with shrink_active_list calling inactive_list_is_low(...., true)
> > > unconditionally, so that the code becomes simpler/clearer; but that still
> > > cannot remove the trace boolean variable, whose removal is what I want.
> > > So it's okay if you love your version.
> > 
> > I am not sure I am following. Why is the additional parameter a problem?
> 
> Well, to me, it's not elegant, is it? If we need such a boolean variable
> to control whether to show the trace, it means either it's not a good
> place or we should think about refactoring.

But even when you refactor the code, there will be other callers of
inactive_list_is_low outside of shrink_active_list...

-- 
Michal Hocko
SUSE Labs


* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Minchan Kim @ 2017-01-13  1:37 UTC
  To: Michal Hocko
  Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

Hello,

On Thu, Jan 12, 2017 at 10:10:17AM +0100, Michal Hocko wrote:
> On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> > On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > > Hello,
> > > > 
> > > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > > [...]
> > > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > > >  	if (!file && !total_swap_pages)
> > > > > > >  		return false;
> > > > > > >  
> > > > > > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > >  
> > > > > > 
> > > > > > the decision to deactivate is based on the eligible zones' LRU size,
> > > > > > not the whole node's, so why do we need to trace all zones' LRU sizes?
> > > > > 
> > > > > Strictly speaking, the total_ counters are not necessary for making the
> > > > > decision. I found reporting those numbers useful regardless, because they
> > > > > also give us information about how large the eligible portion of the LRU
> > > > > list is. We do not have any other tracepoint which would report that.
> > > > 
> > > > The patch doesn't say anything about why it's useful. Could you explain why
> > > > it's useful and why inactive_list_is_low is the right place?
> > > > 
> > > > Don't get me wrong, please. I don't want to bother you.
> > > > I just don't want to add random stuff, even if it's a tracepoint for
> > > > debugging.
> > > 
> > > This doesn't sound random to me. We simply do not have a full picture
> > > on 32b systems without this information, especially when memcgs are
> > > involved and the global numbers are spread over different LRUs.
> > 
> > Could you elaborate on that?
> 
> The problem with 32b systems is that you can only consider a part of the
> LRU for lowmem requests. While we have global counters to see how many
> lowmem inactive/active pages we have, those get distributed to memcg LRUs,
> and that distribution is impossible to guess. So my thinking is that it can
> become a real head scratcher to realize why certain active LRUs are aged
> while others are not. This was the case when I was debugging the last issue
> which triggered all this. All of a sudden I saw many invocations where
> inactive and active were zero, which seemed weird, until I realized that
> those were memcg lruvecs, which is what the total numbers told me...

Hmm, it seems I am missing something. AFAIU, what you need is just a memcg
identifier, not all the LRU sizes. If that isn't the case, please describe in
more detail the use case for all the LRU sizes in that particular tracepoint.

> 
> Later on I would like to add a memcg identifier to the vmscan
> tracepoints but I didn't get there yet.
>  
> > "
> > Currently we have tracepoints for both active and inactive LRU lists
> > reclaim but we do not have any which would tell us why we decided to
> > age the active list.  Without that it is quite hard to diagnose
> > active/inactive lists balancing.  Add mm_vmscan_inactive_list_is_low
> > tracepoint to tell us this information.
> > "
> > 
> > Your description says "why we decided to age the active list".
> > So, what's needed?
> > 
> > >  
> > > > > [...]
> > > > > > > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> > > > > > >  	 * lruvec even if it has plenty of old anonymous pages unless the
> > > > > > >  	 * system is under heavy pressure.
> > > > > > >  	 */
> > > > > > > -	if (!inactive_list_is_low(lruvec, true, sc) &&
> > > > > > > +	if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > > > > > 
> > > > > > Hmm, I was curious why you added the trace boolean argument and found it here.
> > > > > > Yes, this place is not directly related to deactivation, but couldn't we
> > > > > > trace it unconditionally?
> > > > > 
> > > > > I've had it like that when I was debugging the mentioned bug and found
> > > > > it a bit disturbing. It generated more output than I would like, and it
> > > > > wasn't really clear which code path it had been called from.
> > > > 
> > > > Indeed.
> > > > 
> > > > Personally, I would move inactive_list_is_low into shrink_active_list,
> > > > with shrink_active_list calling inactive_list_is_low(...., true)
> > > > unconditionally, so that the code becomes simpler/clearer; but that still
> > > > cannot remove the trace boolean variable, whose removal is what I want.
> > > > So it's okay if you love your version.
> > > 
> > > I am not sure I am following. Why is the additional parameter a problem?
> > 
> > Well, to me, it's not elegant, is it? If we need such a boolean variable
> > to control whether to show the trace, it means either it's not a good
> > place or we should think about refactoring.
> 
> But even when you refactor the code, there will be other callers of
> inactive_list_is_low outside of shrink_active_list...

Yes, that's why I said "it's okay if you love your version". However,
we can do refactoring to remove "bool trace", and it even makes the code
more readable, I believe.

From 06eb7201d781155a8dee7e72fbb8423ec8175223 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 13 Jan 2017 10:13:36 +0900
Subject: [PATCH] mm: refactor inactive_list_is_low

Recently, Michal Hocko added a tracepoint to inactive_list_is_low
to catch why the VM decided to age the active list, in order to
understand active/inactive balancing problems. Unfortunately, that
added a "bool trace" parameter to inactive_list_is_low to suppress
tracing at some call sites. That is not elegant to me, so this
patch tries to clean it up.

Most callers use inactive_list_is_low to decide on active list
demotion, but one site (i.e., get_scan_count) uses it for another
purpose: reclaiming the file LRU forcefully. The deactivation sites
call it together with shrink_active_list, which means
inactive_list_is_low can be moved into shrink_active_list.

One more thing this patch does is remove "ratio" from the
tracepoint, because it can be recovered by post-processing in a
script via simple math.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/trace/events/vmscan.h |  9 +++-----
 mm/vmscan.c                   | 51 ++++++++++++++++++++++++-------------------
 2 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 27e8a5c..406ea95 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -432,9 +432,9 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
 	TP_PROTO(int nid, int reclaim_idx,
 		unsigned long total_inactive, unsigned long inactive,
 		unsigned long total_active, unsigned long active,
-		unsigned long ratio, int file),
+		int file),
 
-	TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file),
+	TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, file),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
@@ -443,7 +443,6 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
 		__field(unsigned long, inactive)
 		__field(unsigned long, total_active)
 		__field(unsigned long, active)
-		__field(unsigned long, ratio)
 		__field(int, reclaim_flags)
 	),
 
@@ -454,16 +453,14 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
 		__entry->inactive = inactive;
 		__entry->total_active = total_active;
 		__entry->active = active;
-		__entry->ratio = ratio;
 		__entry->reclaim_flags = trace_shrink_flags(file) & RECLAIM_WB_LRU;
 	),
 
-	TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s",
+	TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld flags=%s",
 		__entry->nid,
 		__entry->reclaim_idx,
 		__entry->total_inactive, __entry->inactive,
 		__entry->total_active, __entry->active,
-		__entry->ratio,
 		show_reclaim_flags(__entry->reclaim_flags))
 );
 #endif /* _TRACE_VMSCAN_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 75cdf68..6890c21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -150,6 +150,7 @@ unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
+static bool inactive_list_is_low(bool file, unsigned long, unsigned long);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -1962,6 +1963,22 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	unsigned long inactive, active;
+	enum lru_list inactive_lru = file * LRU_FILE;
+	enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
+	bool deactivate;
+
+	inactive = lruvec_lru_size_eligibe_zones(lruvec, file * LRU_FILE,
+					sc->reclaim_idx);
+	active = lruvec_lru_size_eligibe_zones(lruvec, file * LRU_FILE +
+					LRU_ACTIVE, sc->reclaim_idx);
+	deactivate = inactive_list_is_low(file, inactive, active);
+	trace_mm_vmscan_inactive_list_is_low(pgdat->node_id,
+			sc->reclaim_idx,
+			lruvec_lru_size(lruvec, inactive_lru), inactive,
+			lruvec_lru_size(lruvec, active_lru), active, file);
+	if (!deactivate)
+		return;
 
 	lru_add_drain();
 
@@ -2073,13 +2090,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
  *    1TB     101        10GB
  *   10TB     320        32GB
  */
-static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
-						struct scan_control *sc, bool trace)
+static bool inactive_list_is_low(bool file,
+			unsigned long inactive, unsigned long active)
 {
 	unsigned long inactive_ratio;
-	unsigned long inactive, active;
-	enum lru_list inactive_lru = file * LRU_FILE;
-	enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
 	unsigned long gb;
 
 	/*
@@ -2089,22 +2103,12 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	if (!file && !total_swap_pages)
 		return false;
 
-	inactive = lruvec_lru_size_eligibe_zones(lruvec, inactive_lru, sc->reclaim_idx);
-	active = lruvec_lru_size_eligibe_zones(lruvec, active_lru, sc->reclaim_idx);
-
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
 		inactive_ratio = int_sqrt(10 * gb);
 	else
 		inactive_ratio = 1;
 
-	if (trace)
-		trace_mm_vmscan_inactive_list_is_low(lruvec_pgdat(lruvec)->node_id,
-				sc->reclaim_idx,
-				lruvec_lru_size(lruvec, inactive_lru), inactive,
-				lruvec_lru_size(lruvec, active_lru), active,
-				inactive_ratio, file);
-
 	return inactive * inactive_ratio < active;
 }
 
@@ -2112,8 +2116,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
-			shrink_active_list(nr_to_scan, lruvec, sc, lru);
+		shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
 
@@ -2153,6 +2156,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	enum lru_list lru;
 	bool some_scanned;
 	int pass;
+	unsigned long inactive, active;
 
 	/*
 	 * If the zone or memcg is small, nr[l] can be 0.  This
@@ -2243,7 +2247,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * lruvec even if it has plenty of old anonymous pages unless the
 	 * system is under heavy pressure.
 	 */
-	if (!inactive_list_is_low(lruvec, true, sc, false) &&
+	inactive = lruvec_lru_size_eligibe_zones(lruvec,
+				LRU_FILE, sc->reclaim_idx);
+	active = lruvec_lru_size_eligibe_zones(lruvec,
+				LRU_FILE + LRU_ACTIVE, sc->reclaim_idx);
+	if (!inactive_list_is_low(true, inactive, active) &&
 	    lruvec_lru_size_eligibe_zones(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
@@ -2468,9 +2476,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_list_is_low(lruvec, false, sc, true))
-		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
-				   sc, LRU_ACTIVE_ANON);
+	shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON);
 }
 
 /* Use reclaim/compaction for costly allocs or under memory pressure */
@@ -3118,8 +3124,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 
-		if (inactive_list_is_low(lruvec, false, sc, true))
-			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
+		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 					   sc, LRU_ACTIVE_ANON);
 
 		memcg = mem_cgroup_iter(NULL, memcg, NULL);
-- 
2.7.4


* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Michal Hocko @ 2017-01-13  7:47 UTC
  To: Minchan Kim; +Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

On Fri 13-01-17 10:37:24, Minchan Kim wrote:
> Hello,
> 
> On Thu, Jan 12, 2017 at 10:10:17AM +0100, Michal Hocko wrote:
> > On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> > > On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > > > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > > > Hello,
> > > > > 
> > > > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > > > [...]
> > > > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > > > >  	if (!file && !total_swap_pages)
> > > > > > > >  		return false;
> > > > > > > >  
> > > > > > > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > >  
> > > > > > > 
> > > > > > > the decision to deactivate is based on the eligible zones' LRU size,
> > > > > > > not the whole node's, so why do we need to trace all zones' LRU sizes?
> > > > > > 
> > > > > > Strictly speaking, the total_ counters are not necessary for making the
> > > > > > decision. I found reporting those numbers useful regardless, because they
> > > > > > also give us information about how large the eligible portion of the LRU
> > > > > > list is. We do not have any other tracepoint which would report that.
> > > > > 
> > > > > The patch doesn't say anything about why it's useful. Could you explain why
> > > > > it's useful and why inactive_list_is_low is the right place?
> > > > > 
> > > > > Don't get me wrong, please. I don't want to bother you.
> > > > > I just don't want to add random stuff, even if it's a tracepoint for
> > > > > debugging.
> > > > 
> > > > This doesn't sound random to me. We simply do not have a full picture
> > > > on 32b systems without this information, especially when memcgs are
> > > > involved and the global numbers are spread over different LRUs.
> > > 
> > > Could you elaborate on that?
> > 
> > The problem with 32b systems is that you can only consider a part of the
> > LRU for lowmem requests. While we have global counters to see how many
> > lowmem inactive/active pages we have, those get distributed to memcg LRUs,
> > and that distribution is impossible to guess. So my thinking is that it can
> > become a real head scratcher to realize why certain active LRUs are aged
> > while others are not. This was the case when I was debugging the last issue
> > which triggered all this. All of a sudden I saw many invocations where
> > inactive and active were zero, which seemed weird, until I realized that
> > those were memcg lruvecs, which is what the total numbers told me...
> 
> Hmm, it seems I am missing something. AFAIU, what you need is just a memcg
> identifier, not all the LRU sizes. If that isn't the case, please describe in
> more detail the use case for all the LRU sizes in that particular tracepoint.

Having a memcg id would definitely be helpful, but that alone wouldn't tell
us how the lowmem is distributed. To be honest, I really fail to see why
this bothers you all that much.
 
[...]
> > > > I am not sure I am following. Why is the additional parameter a problem?
> > > 
> > > Well, to me, it's not elegant, is it? If we need such a boolean variable
> > > to control whether to show the trace, it means either it's not a good
> > > place or we should think about refactoring.
> > 
> > But even when you refactor the code, there will be other callers of
> > inactive_list_is_low outside of shrink_active_list...
> 
> Yes, that's why I said "it's okay if you love your version". However,
> we can do refactoring to remove "bool trace", and it even makes the code
> more readable, I believe.
> 
> From 06eb7201d781155a8dee7e72fbb8423ec8175223 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Fri, 13 Jan 2017 10:13:36 +0900
> Subject: [PATCH] mm: refactor inactive_list_is_low
> 
> Recently, Michal Hocko added a tracepoint to inactive_list_is_low
> to catch why the VM decided to age the active list, in order to
> understand active/inactive balancing problems. Unfortunately, that
> added a "bool trace" parameter to inactive_list_is_low to suppress
> tracing at some call sites. That is not elegant to me, so this
> patch tries to clean it up.
> 
> Most callers use inactive_list_is_low to decide on active list
> demotion, but one site (i.e., get_scan_count) uses it for another
> purpose: reclaiming the file LRU forcefully. The deactivation sites
> call it together with shrink_active_list, which means
> inactive_list_is_low can be moved into shrink_active_list.
> 
> One more thing this patch does is remove "ratio" from the
> tracepoint, because it can be recovered by post-processing in a
> script via simple math.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/trace/events/vmscan.h |  9 +++-----
>  mm/vmscan.c                   | 51 ++++++++++++++++++++++++-------------------
>  2 files changed, 31 insertions(+), 29 deletions(-)

this cleanup adds more lines than it removes. I think reporting the
ratio is helpful because it doesn't cost us anything, while calculating
it later is just a bit annoying.
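
For what it's worth, the recomputation amounts to one line of math; a
post-processing sketch in C, assuming 4K pages (PAGE_SHIFT == 12, so
30 - PAGE_SHIFT == 18) and using truncated sqrt() to mimic the kernel's
int_sqrt():

#include <math.h>

/* recover the removed ratio from the event's inactive/active fields */
static unsigned long recompute_ratio(unsigned long inactive,
				     unsigned long active)
{
	unsigned long gb = (inactive + active) >> 18;	/* pages -> GiB */

	return gb ? (unsigned long)sqrt(10.0 * gb) : 1;	/* ~ int_sqrt(10 * gb) */
}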

> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 27e8a5c..406ea95 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -432,9 +432,9 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
>  	TP_PROTO(int nid, int reclaim_idx,
>  		unsigned long total_inactive, unsigned long inactive,
>  		unsigned long total_active, unsigned long active,
> -		unsigned long ratio, int file),
> +		int file),
>  
> -	TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file),
> +	TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, file),
>  
>  	TP_STRUCT__entry(
>  		__field(int, nid)
> @@ -443,7 +443,6 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
>  		__field(unsigned long, inactive)
>  		__field(unsigned long, total_active)
>  		__field(unsigned long, active)
> -		__field(unsigned long, ratio)
>  		__field(int, reclaim_flags)
>  	),
>  
> @@ -454,16 +453,14 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
>  		__entry->inactive = inactive;
>  		__entry->total_active = total_active;
>  		__entry->active = active;
> -		__entry->ratio = ratio;
>  		__entry->reclaim_flags = trace_shrink_flags(file) & RECLAIM_WB_LRU;
>  	),
>  
> -	TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s",
> +	TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld flags=%s",
>  		__entry->nid,
>  		__entry->reclaim_idx,
>  		__entry->total_inactive, __entry->inactive,
>  		__entry->total_active, __entry->active,
> -		__entry->ratio,
>  		show_reclaim_flags(__entry->reclaim_flags))
>  );
>  #endif /* _TRACE_VMSCAN_H */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 75cdf68..6890c21 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -150,6 +150,7 @@ unsigned long vm_total_pages;
>  
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
> +static bool inactive_list_is_low(bool file, unsigned long, unsigned long);
>  
>  #ifdef CONFIG_MEMCG
>  static bool global_reclaim(struct scan_control *sc)
> @@ -1962,6 +1963,22 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	isolate_mode_t isolate_mode = 0;
>  	int file = is_file_lru(lru);
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +	unsigned long inactive, active;
> +	enum lru_list inactive_lru = file * LRU_FILE;
> +	enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
> +	bool deactivate;
> +
> +	inactive = lruvec_lru_size_eligibe_zones(lruvec, file * LRU_FILE,
> +					sc->reclaim_idx);
> +	active = lruvec_lru_size_eligibe_zones(lruvec, file * LRU_FILE +
> +					LRU_ACTIVE, sc->reclaim_idx);
> +	deactivate = inactive_list_is_low(file, inactive, active);
> +	trace_mm_vmscan_inactive_list_is_low(pgdat->node_id,
> +			sc->reclaim_idx,
> +			lruvec_lru_size(lruvec, inactive_lru), inactive,
> +			lruvec_lru_size(lruvec, active_lru), active, file);
> +	if (!deactivate)
> +		return;
>  
>  	lru_add_drain();
>  
> @@ -2073,13 +2090,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
>   *    1TB     101        10GB
>   *   10TB     320        32GB
>   */
> -static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> -						struct scan_control *sc, bool trace)
> +static bool inactive_list_is_low(bool file,
> +			unsigned long inactive, unsigned long active)
>  {
>  	unsigned long inactive_ratio;
> -	unsigned long inactive, active;
> -	enum lru_list inactive_lru = file * LRU_FILE;
> -	enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
>  	unsigned long gb;
>  
>  	/*
> @@ -2089,22 +2103,12 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>  	if (!file && !total_swap_pages)
>  		return false;
>  
> -	inactive = lruvec_lru_size_eligibe_zones(lruvec, inactive_lru, sc->reclaim_idx);
> -	active = lruvec_lru_size_eligibe_zones(lruvec, active_lru, sc->reclaim_idx);
> -
>  	gb = (inactive + active) >> (30 - PAGE_SHIFT);
>  	if (gb)
>  		inactive_ratio = int_sqrt(10 * gb);
>  	else
>  		inactive_ratio = 1;
>  
> -	if (trace)
> -		trace_mm_vmscan_inactive_list_is_low(lruvec_pgdat(lruvec)->node_id,
> -				sc->reclaim_idx,
> -				lruvec_lru_size(lruvec, inactive_lru), inactive,
> -				lruvec_lru_size(lruvec, active_lru), active,
> -				inactive_ratio, file);
> -
>  	return inactive * inactive_ratio < active;
>  }
>  
> @@ -2112,8 +2116,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>  				 struct lruvec *lruvec, struct scan_control *sc)
>  {
>  	if (is_active_lru(lru)) {
> -		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
> -			shrink_active_list(nr_to_scan, lruvec, sc, lru);
> +		shrink_active_list(nr_to_scan, lruvec, sc, lru);
>  		return 0;
>  	}
>  
> @@ -2153,6 +2156,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	enum lru_list lru;
>  	bool some_scanned;
>  	int pass;
> +	unsigned long inactive, active;
>  
>  	/*
>  	 * If the zone or memcg is small, nr[l] can be 0.  This
> @@ -2243,7 +2247,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * lruvec even if it has plenty of old anonymous pages unless the
>  	 * system is under heavy pressure.
>  	 */
> -	if (!inactive_list_is_low(lruvec, true, sc, false) &&
> +	inactive = lruvec_lru_size_eligibe_zones(lruvec,
> +				LRU_FILE, sc->reclaim_idx);
> +	active = lruvec_lru_size_eligibe_zones(lruvec,
> +				LRU_FILE + LRU_ACTIVE, sc->reclaim_idx);
> +	if (!inactive_list_is_low(true, inactive, active) &&
>  	    lruvec_lru_size_eligibe_zones(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
>  		scan_balance = SCAN_FILE;
>  		goto out;
> @@ -2468,9 +2476,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	 * Even if we did not try to evict anon pages at all, we want to
>  	 * rebalance the anon lru active/inactive ratio.
>  	 */
> -	if (inactive_list_is_low(lruvec, false, sc, true))
> -		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> -				   sc, LRU_ACTIVE_ANON);
> +	shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON);
>  }
>  
>  /* Use reclaim/compaction for costly allocs or under memory pressure */
> @@ -3118,8 +3124,7 @@ static void age_active_anon(struct pglist_data *pgdat,
>  	do {
>  		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
>  
> -		if (inactive_list_is_low(lruvec, false, sc, true))
> -			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> +		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  					   sc, LRU_ACTIVE_ANON);
>  
>  		memcg = mem_cgroup_iter(NULL, memcg, NULL);
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
From: Minchan Kim @ 2017-01-13  8:57 UTC
  To: Michal Hocko
  Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

On Fri, Jan 13, 2017 at 08:47:07AM +0100, Michal Hocko wrote:
> On Fri 13-01-17 10:37:24, Minchan Kim wrote:
> > Hello,
> > 
> > On Thu, Jan 12, 2017 at 10:10:17AM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> > > > On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > > > > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > > > > [...]
> > > > > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > > > > >  	if (!file && !total_swap_pages)
> > > > > > > > >  		return false;
> > > > > > > > >  
> > > > > > > > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > > >  
> > > > > > > > 
> > > > > > > > the decision of deactivating is based on eligible zone's LRU size,
> > > > > > > > not whole zone so why should we need to get a trace of all zones's LRU?
> > > > > > > 
> > > > > > > Strictly speaking, the total_ counters are not necessary for making the
> > > > > > > decision. I found reporting those numbers useful regardless because this
> > > > > > > will give us also an information how large is the eligible portion of
> > > > > > > the LRU list. We do not have any other tracepoint which would report
> > > > > > > that.
> > > > > > 
> > > > > > The patch doesn't say anything why it's useful. Could you tell why it's
> > > > > > useful and inactive_list_is_low should be right place?
> > > > > > 
> > > > > > Don't get me wrong, please. I don't want to bother you.
> > > > > > I really don't want to add random stuff although it's tracepoint for
> > > > > > debugging.
> > > > > 
> > > > > This doesn't sounds random to me. We simply do not have a full picture
> > > > > on 32b systems without this information. Especially when memcgs are
> > > > > involved and global numbers spread over different LRUs.
> > > > 
> > > > Could you elaborate it?
> > > 
> > > The problem with 32b systems is that you only can consider a part of the
> > > LRU for the lowmem requests. While we have global counters to see how
> > > much lowmem inactive/active pages we have, those get distributed to
> > > memcg LRUs. And that distribution is impossible to guess. So my thinking
> > > is that it can become a real head scratcher to realize why certain
> > > active LRUs are aged while others are not. This was the case when I was
> > > debugging the last issue which triggered all this. All of the sudden I
> > > have seen many invocations when inactive and active were zero which
> > > sounded weird, until I realized that those are memcg's lruvec which is
> > > what total numbers told me...
> > 
> > Hmm, it seems I am missing something. AFAIU, what you need is just a
> > memcg identifier, not all the LRU sizes. If that isn't so, please
> > describe the use case for all the LRU sizes in that particular
> > tracepoint in more detail.
> 
> Having the memcg id would definitely be helpful, but that alone wouldn't
> tell us how the lowmem is distributed. To be honest, I really fail to
> see why this bothers you so much.

Because I fail to understand why you need the additional all-zones LRU
stats in inactive_list_is_low. With a clear understanding, we can judge
whether they are really needed and whether this is the right place to
achieve the goal.

Could you give a concrete example? It would really help me understand
why this is needed.
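
For context, the two kinds of numbers being debated are derived roughly
like this in the patched inactive_list_is_low() (a sketch reconstructed
from the quoted diff and the 4.9/4.10-era mm/vmscan.c; details may
differ from the actual patch):

	unsigned long total_inactive, inactive, total_active, active;
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	int zid;

	/* whole-node LRU sizes, reported by the tracepoint only */
	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
	total_active = active = lruvec_lru_size(lruvec,
						file * LRU_FILE + LRU_ACTIVE);

	/* subtract zones above sc->reclaim_idx; the rest is "eligible" */
	for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
		struct zone *zone = &pgdat->node_zones[zid];

		if (!managed_zone(zone))
			continue;

		inactive -= min(inactive, zone_page_state(zone,
				NR_ZONE_LRU_BASE + file * LRU_FILE));
		active -= min(active, zone_page_state(zone,
				NR_ZONE_LRU_BASE + file * LRU_FILE +
				LRU_ACTIVE));
	}

The deactivation decision itself uses only the eligible pair; the
total_ pair exists purely to be reported.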

>  
> [...]
> > > > > I am not sure I am following. Why is the additional parameter a problem?
> > > > 
> > > > Well, to me, it's not elegant, is it? If we need such a boolean variable
> > > > to control whether to emit the trace, it means either it's not a good
> > > > place or we should think about refactoring.
> > > 
> > > But even when you refactor the code, there will be other callers of
> > > inactive_list_is_low outside of shrink_active_list...
> > 
> > Yes, that's why I said "it's okay if you love your version". However,
> > we can refactor to remove "bool trace", and I believe it even makes the
> > code more readable.
> > 
> > From 06eb7201d781155a8dee7e72fbb8423ec8175223 Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan@kernel.org>
> > Date: Fri, 13 Jan 2017 10:13:36 +0900
> > Subject: [PATCH] mm: refactoring inactive_list_is_low
> > 
> > Recently, Michal Hocko added a tracepoint to inactive_list_is_low
> > to capture why the VM decided to age the active list, to help
> > diagnose active/inactive balancing problems. Unfortunately, that
> > added a "bool trace" parameter to inactive_list_is_low to suppress
> > tracing at some call sites. That is not elegant to me, so this patch
> > tries to clean it up.
> > 
> > Most inactive_list_is_low callers use it to decide on active list
> > demotion, but one site (i.e., get_scan_count) uses it for a different
> > purpose: reclaiming the file LRU forcefully. The deactivation sites
> > all call it together with shrink_active_list, which means the
> > inactive_list_is_low check could live inside shrink_active_list.
> > 
> > One more thing this patch does is remove "ratio" from the tracepoint,
> > because it can be recovered by post-processing in a script via simple
> > math.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/trace/events/vmscan.h |  9 +++-----
> >  mm/vmscan.c                   | 51 ++++++++++++++++++++++++-------------------
> >  2 files changed, 31 insertions(+), 29 deletions(-)
> 
> this cleanup adds more lines than it removes. I think reporting the

The difference is just marginal, because the function names are really
long and I want to keep the 80-column rule.
Anyway, I'm not insisting on pushing this patch, although I still think
it's not nice to add a boolean variable to control whether to trace.
It's not my main interest.

> ratio is helpful because it doesn't cost us anything, while calculating
> it later is just a bit annoying.

I really cannot imagine a case where the inactive_ratio value is
helpful for debugging.
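
For readers without the patch body at hand, the refactor described in
the commit message above has roughly this shape (an illustrative sketch
only, with simplified signatures; the actual patch differs in detail):

	static void shrink_active_list(unsigned long nr_to_scan,
				       struct lruvec *lruvec,
				       struct scan_control *sc,
				       enum lru_list lru)
	{
		/*
		 * The low-inactive check moves into the deactivation
		 * path itself, so inactive_list_is_low() no longer
		 * needs a "bool trace" parameter to suppress the
		 * tracepoint at non-deactivation call sites.
		 */
		if (!inactive_list_is_low(lruvec, is_file_lru(lru), sc))
			return;

		/* ... isolate and demote active pages as before ... */
	}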

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
  2017-01-13  8:57                 ` Minchan Kim
@ 2017-01-13  9:10                   ` Michal Hocko
  2017-01-17  6:45                     ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-01-13  9:10 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

On Fri 13-01-17 17:57:34, Minchan Kim wrote:
> On Fri, Jan 13, 2017 at 08:47:07AM +0100, Michal Hocko wrote:
> > On Fri 13-01-17 10:37:24, Minchan Kim wrote:
> > > Hello,
> > > 
> > > On Thu, Jan 12, 2017 at 10:10:17AM +0100, Michal Hocko wrote:
> > > > On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> > > > > On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > > > > > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > > > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > > > > > [...]
> > > > > > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > > > > > >  	if (!file && !total_swap_pages)
> > > > > > > > > >  		return false;
> > > > > > > > > >  
> > > > > > > > > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > > > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > > > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > > > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > > > >  
> > > > > > > > > 
> > > > > > > > > the decision to deactivate is based on the eligible zones' LRU size,
> > > > > > > > > not the whole node's, so why do we need a trace of all zones' LRU?
> > > > > > > > 
> > > > > > > > Strictly speaking, the total_ counters are not necessary for making the
> > > > > > > > decision. I found reporting those numbers useful regardless, because they
> > > > > > > > also tell us how large the eligible portion of the LRU list is. We do not
> > > > > > > > have any other tracepoint which would report that.
> > > > > > > 
> > > > > > > The patch doesn't say anything about why it's useful. Could you explain
> > > > > > > why it's useful and why inactive_list_is_low is the right place?
> > > > > > > 
> > > > > > > Don't get me wrong, please. I don't want to bother you.
> > > > > > > I just really don't want to add random stuff, even if it's a tracepoint
> > > > > > > for debugging.
> > > > > > 
> > > > > > This doesn't sound random to me. We simply do not have a full picture
> > > > > > on 32b systems without this information, especially when memcgs are
> > > > > > involved and the global numbers are spread over different LRUs.
> > > > > 
> > > > > Could you elaborate?
> > > > 
> > > > The problem with 32b systems is that you can only consider a part of the
> > > > LRU for lowmem requests. While we have global counters to see how many
> > > > lowmem inactive/active pages we have, those get distributed to memcg
> > > > LRUs, and that distribution is impossible to guess. So my thinking is
> > > > that it can become a real head-scratcher to work out why certain active
> > > > LRUs are aged while others are not. This was the case with the last
> > > > issue I was debugging, which triggered all this. All of a sudden I saw
> > > > many invocations where inactive and active were zero, which seemed
> > > > weird, until I realized that those were memcg lruvecs, which is what
> > > > the total numbers told me...
> > > 
> > > Hmm, it seems I am missing something. AFAIU, what you need is just a
> > > memcg identifier, not all the LRU sizes. If that isn't so, please
> > > describe the use case for all the LRU sizes in that particular
> > > tracepoint in more detail.
> > 
> > Having the memcg id would definitely be helpful, but that alone wouldn't
> > tell us how the lowmem is distributed. To be honest, I really fail to
> > see why this bothers you so much.
> 
> Because I fail to understand why you need the additional all-zones LRU
> stats in inactive_list_is_low. With a clear understanding, we can judge
> whether they are really needed and whether this is the right place to
> achieve the goal.
> 
> Could you give a concrete example? It would really help me understand
> why this is needed.

OK, I feel I am repeating myself, but let me try again. Without the
total_ numbers we really do not know how the lowmem is distributed over
lruvecs. There is simply no way to get this information from the
existing counters if memcg is enabled.
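
Concretely: with memcg enabled, a node-wide counter is only the sum
over many per-memcg lruvecs, something like the following (illustrative
pseudo-kernel code; mem_cgroup_iter() and mem_cgroup_lruvec() are real
primitives of that era, but this summing loop itself exists nowhere):

	/* the global number is just a sum; the per-lruvec split is lost */
	unsigned long total = 0;
	struct mem_cgroup *memcg = NULL;

	while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)))
		total += lruvec_lru_size(mem_cgroup_lruvec(pgdat, memcg),
					 lru);

Knowing the sum tells you nothing about which lruvec actually holds the
lowmem pages, which is why reporting both numbers per tracepoint hit is
useful.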
 
> > [...]
> > > > > > I am not sure I am following. Why is the additional parameter a problem?
> > > > > 
> > > > > Well, to me, it's not elegant, is it? If we need such a boolean variable
> > > > > to control whether to emit the trace, it means either it's not a good
> > > > > place or we should think about refactoring.
> > > > 
> > > > But even when you refactor the code, there will be other callers of
> > > > inactive_list_is_low outside of shrink_active_list...
> > > 
> > > Yes, that's why I said "it's okay if you love your version". However,
> > > we can refactor to remove "bool trace", and I believe it even makes the
> > > code more readable.
> > > 
> > > From 06eb7201d781155a8dee7e72fbb8423ec8175223 Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim <minchan@kernel.org>
> > > Date: Fri, 13 Jan 2017 10:13:36 +0900
> > > Subject: [PATCH] mm: refactoring inactive_list_is_low
> > > 
> > > Recently, Michal Hocko added a tracepoint to inactive_list_is_low
> > > to capture why the VM decided to age the active list, to help
> > > diagnose active/inactive balancing problems. Unfortunately, that
> > > added a "bool trace" parameter to inactive_list_is_low to suppress
> > > tracing at some call sites. That is not elegant to me, so this patch
> > > tries to clean it up.
> > > 
> > > Most inactive_list_is_low callers use it to decide on active list
> > > demotion, but one site (i.e., get_scan_count) uses it for a different
> > > purpose: reclaiming the file LRU forcefully. The deactivation sites
> > > all call it together with shrink_active_list, which means the
> > > inactive_list_is_low check could live inside shrink_active_list.
> > > 
> > > One more thing this patch does is remove "ratio" from the tracepoint,
> > > because it can be recovered by post-processing in a script via simple
> > > math.
> > > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  include/trace/events/vmscan.h |  9 +++-----
> > >  mm/vmscan.c                   | 51 ++++++++++++++++++++++++-------------------
> > >  2 files changed, 31 insertions(+), 29 deletions(-)
> > 
> > this cleanup adds more lines than it removes. I think reporting the
> 
> The difference is just marginal, because the function names are really
> long and I want to keep the 80-column rule.
> Anyway, I'm not insisting on pushing this patch, although I still think
> it's not nice to add a boolean variable to control whether to trace.
> It's not my main interest.
> 
> > ratio is helpful because it doesn't cost us anything, while calculating
> > it later is just a bit annoying.
> 
> I really cannot imagine a case where the inactive_ratio value is
> helpful for debugging.

Without the ratio you cannot really tell whether the inactive list is
low. As you say, you could do that calculation when post-processing the
trace data, but why bother when the number is already at hand and we
just need to print it?
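
For reference, the ratio in question is the watermark heuristic
computed inside inactive_list_is_low(), roughly as follows (a sketch
from memory of the 4.10-era mm/vmscan.c, not the exact code):

	unsigned long gb, inactive_ratio;

	/* allow a larger active:inactive imbalance on bigger LRUs */
	gb = (inactive + active) >> (30 - PAGE_SHIFT);	/* size in GB */
	if (gb)
		inactive_ratio = int_sqrt(10 * gb);
	else
		inactive_ratio = 1;

	return inactive * inactive_ratio < active;

Recomputing this in post-processing means reimplementing int_sqrt()
against the eligible counts, which is the "bit annoying" part; printing
it at trace time is free.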

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
  2017-01-13  9:10                   ` Michal Hocko
@ 2017-01-17  6:45                     ` Minchan Kim
  2017-01-17 10:12                       ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2017-01-17  6:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

Hello,

On Fri, Jan 13, 2017 at 10:10:09AM +0100, Michal Hocko wrote:
> On Fri 13-01-17 17:57:34, Minchan Kim wrote:
> > On Fri, Jan 13, 2017 at 08:47:07AM +0100, Michal Hocko wrote:
> > > On Fri 13-01-17 10:37:24, Minchan Kim wrote:
> > > > Hello,
> > > > 
> > > > On Thu, Jan 12, 2017 at 10:10:17AM +0100, Michal Hocko wrote:
> > > > > On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> > > > > > On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > > > > > > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > > > > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > > > > > > [...]
> > > > > > > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > > > > > > >  	if (!file && !total_swap_pages)
> > > > > > > > > > >  		return false;
> > > > > > > > > > >  
> > > > > > > > > > > -	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > > > > > -	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > > > > > +	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > > > > > +	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > > > > >  
> > > > > > > > > > 
> > > > > > > > > > the decision to deactivate is based on the eligible zones' LRU size,
> > > > > > > > > > not the whole node's, so why do we need a trace of all zones' LRU?
> > > > > > > > > 
> > > > > > > > > Strictly speaking, the total_ counters are not necessary for making the
> > > > > > > > > decision. I found reporting those numbers useful regardless, because they
> > > > > > > > > also tell us how large the eligible portion of the LRU list is. We do not
> > > > > > > > > have any other tracepoint which would report that.
> > > > > > > > 
> > > > > > > > The patch doesn't say anything about why it's useful. Could you explain
> > > > > > > > why it's useful and why inactive_list_is_low is the right place?
> > > > > > > > 
> > > > > > > > Don't get me wrong, please. I don't want to bother you.
> > > > > > > > I just really don't want to add random stuff, even if it's a tracepoint
> > > > > > > > for debugging.
> > > > > > > 
> > > > > > > This doesn't sound random to me. We simply do not have a full picture
> > > > > > > on 32b systems without this information, especially when memcgs are
> > > > > > > involved and the global numbers are spread over different LRUs.
> > > > > > 
> > > > > > Could you elaborate?
> > > > > 
> > > > > The problem with 32b systems is that you can only consider a part of the
> > > > > LRU for lowmem requests. While we have global counters to see how many
> > > > > lowmem inactive/active pages we have, those get distributed to memcg
> > > > > LRUs, and that distribution is impossible to guess. So my thinking is
> > > > > that it can become a real head-scratcher to work out why certain active
> > > > > LRUs are aged while others are not. This was the case with the last
> > > > > issue I was debugging, which triggered all this. All of a sudden I saw
> > > > > many invocations where inactive and active were zero, which seemed
> > > > > weird, until I realized that those were memcg lruvecs, which is what
> > > > > the total numbers told me...
> > > > 
> > > > Hmm, it seems I am missing something. AFAIU, what you need is just a
> > > > memcg identifier, not all the LRU sizes. If that isn't so, please
> > > > describe the use case for all the LRU sizes in that particular
> > > > tracepoint in more detail.
> > > 
> > > Having the memcg id would definitely be helpful, but that alone wouldn't
> > > tell us how the lowmem is distributed. To be honest, I really fail to
> > > see why this bothers you so much.
> > 
> > Because I fail to understand why you need the additional all-zones LRU
> > stats in inactive_list_is_low. With a clear understanding, we can judge
> > whether they are really needed and whether this is the right place to
> > achieve the goal.
> > 
> > Could you give a concrete example? It would really help me understand
> > why this is needed.
> 
> OK, I feel I am repeating myself, but let me try again. Without the
> total_ numbers we really do not know how the lowmem is distributed over
> lruvecs. There is simply no way to get this information from the
> existing counters if memcg is enabled.

I still can't clearly understand why you need to know the distribution.
Anyway, even if we need it, why do you think inactive_list_is_low in
particular is the right place?

Actually, IMO, there is no need to insert any tracepoint in
inactive_list_is_low. Instead, if we added a tracepoint in
get_scan_count to record each LRU list size and
nr[LRU_{INACTIVE,ACTIVE}_{ANON|FILE}], it could be more general and
more helpful.
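
Such a tracepoint might look roughly like this (purely hypothetical:
neither this event nor its trace_ call exists in the kernel, and the
name and field list are made up for illustration):

	/* hypothetical: emitted once per get_scan_count() invocation */
	trace_mm_vmscan_get_scan_count(pgdat->node_id, sc->reclaim_idx,
			lruvec_lru_size(lruvec, LRU_INACTIVE_ANON),
			lruvec_lru_size(lruvec, LRU_ACTIVE_ANON),
			lruvec_lru_size(lruvec, LRU_INACTIVE_FILE),
			lruvec_lru_size(lruvec, LRU_ACTIVE_FILE),
			nr[LRU_INACTIVE_ANON], nr[LRU_ACTIVE_ANON],
			nr[LRU_INACTIVE_FILE], nr[LRU_ACTIVE_FILE]);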

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree
  2017-01-17  6:45                     ` Minchan Kim
@ 2017-01-17 10:12                       ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-01-17 10:12 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, hillf.zj, mgorman, vbabka, mm-commits, linux-mm

On Tue 17-01-17 15:45:31, Minchan Kim wrote:
[...]
> Actually, IMO, there is no need to insert any tracepoint in inactive_list_is_low.
> Instead, if we add tracepint in get_scan_count to record each LRU list size and
> nr[LRU_{INACTIVE,ACTIVE}_{ANON|FILE}], it could be general and more helpful.

You are free to propose patches, of course. I just worry that you are
overthinking this. This is no rocket science, really. We have a set of
tracepoints at places where we make a decision. Having a tracepoint in
inactive_list_is_low sounds like a proper fit to me; get_scan_count has
a different responsibility. We might disagree on that, though, but as
long as you preserve the debuggability I won't be opposed.

I really do not see much point in discussing this further and spending
more time repeating arguments. After all, the whole point of the series
was to make debugging easier, which I believe is the case. Different
people debug differently, so it is not really all that surprising that
we disagree on some parts. I really consider these tracepoints a
debugging aid, and exporting more rather than less has proven useful in
the past. The worst thing really is when the numbers do not make sense
because you are missing part of the picture. I definitely agree with
you on the general objective of keeping these debugging tools out of
hot paths, not letting them be too disruptive, and not letting them
spill over into the regular code and cause a maintenance burden, but I
_believe_ that is not the case here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-01-17 10:12 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <586edadc.figmHAGrTxvM7Wei%akpm@linux-foundation.org>
2017-01-10 23:52 ` + mm-vmscan-add-mm_vmscan_inactive_list_is_low-tracepoint.patch added to -mm tree Minchan Kim
2017-01-11 15:52   ` Michal Hocko
2017-01-12  5:12     ` Minchan Kim
2017-01-12  8:15       ` Michal Hocko
2017-01-12  8:48         ` Minchan Kim
2017-01-12  9:10           ` Michal Hocko
2017-01-13  1:37             ` Minchan Kim
2017-01-13  7:47               ` Michal Hocko
2017-01-13  8:57                 ` Minchan Kim
2017-01-13  9:10                   ` Michal Hocko
2017-01-17  6:45                     ` Minchan Kim
2017-01-17 10:12                       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).