Date: Wed, 4 Jan 2017 14:07:22 +0900
From: Minchan Kim <minchan@kernel.org>
To: Michal Hocko
Cc: Hillf Danton, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Johannes Weiner, Vlastimil Babka, Rik van Riel, LKML
Subject: Re: [PATCH 2/7] mm, vmscan: add active list aging tracepoint
Message-ID: <20170104050722.GA17166@bbox>
References: <20161228153032.10821-1-mhocko@kernel.org>
 <20161228153032.10821-3-mhocko@kernel.org>
 <20161229053359.GA1815@bbox>
 <20161229075243.GA29208@dhcp22.suse.cz>
 <20161230014853.GA4184@bbox>
 <20161230092636.GA13301@dhcp22.suse.cz>
 <20161230160456.GA7267@bbox>
 <20161230163742.GK13301@dhcp22.suse.cz>
 <20170103050328.GA15700@bbox>
 <20170103082122.GA30111@dhcp22.suse.cz>
In-Reply-To: <20170103082122.GA30111@dhcp22.suse.cz>

On Tue, Jan 03, 2017 at 09:21:22AM +0100, Michal Hocko wrote:
> On Tue 03-01-17 14:03:28, Minchan Kim wrote:
> > Hi Michal,
> > 
> > On Fri, Dec 30, 2016 at 05:37:42PM +0100, Michal Hocko wrote:
> > > On Sat 31-12-16 01:04:56, Minchan Kim wrote:
> > > [...]
> > > > > From 5f1bc22ad1e54050b4da3228d68945e70342ebb6 Mon Sep 17 00:00:00 2001
> > > > > From: Michal Hocko
> > > > > Date: Tue, 27 Dec 2016 13:18:20 +0100
> > > > > Subject: [PATCH] mm, vmscan: add active list aging tracepoint
> > > > > 
> > > > > Our reclaim process has several tracepoints to tell us more about how
> > > > > things are progressing. We are, however, missing a tracepoint to track
> > > > > active list aging. Introduce mm_vmscan_lru_shrink_active which reports
> > > > 
> > > > I agree with this part.
> > > > 
> > > > > the number of
> > > > > 	- nr_scanned, nr_taken pages to tell us the LRU isolation
> > > > > 	  effectiveness.
> > > > 
> > > > I agree with nr_taken for knowing the shrinking effectiveness, but I
> > > > don't agree with nr_scanned. If we want to know the LRU isolation
> > > > effectiveness via nr_scanned and nr_taken, isolate_lru_pages will do.
> > > 
> > > Yes it will. On the other hand the number is there and there is no
> > > additional overhead, maintenance or otherwise, to provide that number.
> > 
> > You are adding some instructions, so how can you claim there is no
> > overhead?
> 
> There should be close to zero overhead when the tracepoint is disabled
> (we pay only one more argument when the function is called). Is this
> really worth discussing in this cold path? We are talking about
> reclaim here.

I am talking about why we should add pointless code there at all,
regardless of the overhead. We loop in that path over and over, so
blindly added instructions do add overhead, even if you consider it
trivial.

> > Let's ask whether it's measurable. Although it's not big in this
> > particular case, it would become measurable if everyone started
> > saying "it's trivial, so what's the problem with adding a few
> > instructions even though they are duplicated?"
> > 
> > You already said "LRU isolation effectiveness".
> > It should be done there, in isolate_lru_pages, and it already is. You
> > need other, stronger reasons if you want to add the duplicated work.
> 
> isolate_lru_pages is certainly there but you have to enable a trace
> point for that. Sometimes it is quite useful to get a reasonably good
> picture even without all the vmscan tracepoints enabled because they
> can generate quite a lot of output. So if the counter is available I

If someone wants to see "isolation effectiveness", he should enable
mm_vmscan_lru_isolate, which was born for exactly that and carries more
helpful information. Think of it the opposite way: if a user only wants
to look at an active list aging problem and is not interested in "LRU
isolation effectiveness", you are adding meaningless output for him,
and with your patch he has no way to turn it off.

> see no reason to exclude it, especially when it can provide useful
> information. One of the most frustrating debugging experiences is when

I have said this several times: please consider what happens if everyone
begins adding extra parameters to every tracepoint, parameters we can
already get via other tracepoints, justified by "it just might be useful
in some specific context". Could you be happy with that, really?

> you are missing some part of the information and have to guess which
> part is that and patch, rebuild the kernel and hope to reproduce it
> again in the same/similar way.

No need to rebuild. Just enable mm_vmscan_lru_isolate (see the tracefs
example at the end of this mail).

> There are two things about this and other tracepoint patches in general
> I believe. 1) Is the tracepoint useful? and 2) Do we have to go through
> extra hoops to show tracepoint data?
> 
> I guess we are in agreement that the answer for 1 is yes. And

Yep.

> regarding 2, all the data we are showing are there or trivially
> retrieved without touching _any_ hot path. Some of it might be duplicated

Currently, you are relying on an unfortunate modularization just to add
unnecessary information to the tracepoint. I removed nr_scanned from
your patch; look below.

./scripts/bloat-o-meter vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/6 up/down: 0/-147 (-147)
function                                               old   new  delta
perf_trace_mm_vmscan_lru_shrink_active                 264   256     -8
trace_raw_output_mm_vmscan_lru_shrink_active           203   193    -10
trace_event_raw_event_mm_vmscan_lru_shrink_active      241   225    -16
print_fmt_mm_vmscan_lru_shrink_active                  458   426    -32
shrink_active_list                                    1265  1232    -33
trace_event_define_fields_mm_vmscan_lru_shrink_active  384   336    -48
Total: Before=26268743, After=26268596, chg -0.00%

Let's take it further: we can factor the LRU isolation accounting out of
shrink_[in]active_list, which is cleaner, I think.

From 1053968d526427ecad96b682aa586701c4ecfc84 Mon Sep 17 00:00:00 2001
From: Minchan Kim
Date: Wed, 4 Jan 2017 10:04:36 +0900
Subject: [PATCH] factor out LRU isolation accounting

Not-yet-signed-off-by: Minchan Kim
---
 include/trace/events/vmscan.h | 14 +++++----
 mm/vmscan.c                   | 68 ++++++++++++++++++------------------
 2 files changed, 37 insertions(+), 45 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 79b3cd9c7048..5fc3a94a14cd 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -364,14 +364,15 @@ TRACE_EVENT(mm_vmscan_writepage,
 TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 
 	TP_PROTO(int nid,
-		unsigned long nr_scanned, unsigned long nr_reclaimed,
+		unsigned long nr_taken,
+		unsigned long nr_reclaimed,
 		int priority, int file),
 
-	TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
+	TP_ARGS(nid, nr_taken, nr_reclaimed, priority, file),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
-		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
 		__field(unsigned long, nr_reclaimed)
 		__field(int, priority)
 		__field(int, reclaim_flags)
@@ -379,15 +380,16 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 
 	TP_fast_assign(
 		__entry->nid = nid;
-		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
 		__entry->nr_reclaimed = nr_reclaimed;
 		__entry->priority = priority;
 		__entry->reclaim_flags = trace_shrink_flags(file);
 	),
 
-	TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+	TP_printk("nid=%d nr_taken=%ld nr_reclaimed=%ld priority=%d flags=%s",
 		__entry->nid,
-		__entry->nr_scanned, __entry->nr_reclaimed,
+		__entry->nr_taken,
+		__entry->nr_reclaimed,
 		__entry->priority,
 		show_reclaim_flags(__entry->reclaim_flags))
 );
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 37ccd4e0b349..74f55f39f963 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1454,16 +1454,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
  * @nr_to_scan: The number of pages to look through on the list.
  * @lruvec: The LRU vector to pull pages from.
  * @dst: The temp list to put pages on to.
- * @nr_scanned: The number of pages that were scanned.
  * @sc: The scan_control struct for this reclaim session
  * @mode: One of the LRU isolation modes
  * @lru: LRU list id for isolating
  *
  * returns how many pages were moved onto *@dst.
  */
-static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
+static unsigned long isolate_lru_pages(struct pglist_data *pgdat,
+		unsigned long nr_to_scan,
 		struct lruvec *lruvec, struct list_head *dst,
-		unsigned long *nr_scanned, struct scan_control *sc,
+		struct scan_control *sc,
 		isolate_mode_t mode, enum lru_list lru)
 {
 	struct list_head *src = &lruvec->lists[lru];
@@ -1471,8 +1471,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
 	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
 	unsigned long scan, nr_pages;
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	LIST_HEAD(pages_skipped);
+	int file = is_file_lru(lru);
 
+	spin_lock_irq(&pgdat->lru_lock);
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
 					!list_empty(src);) {
 		struct page *page;
@@ -1540,10 +1543,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		list_splice(&pages_skipped, src);
 	}
-	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
 	update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
+
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
+	reclaim_stat->recent_scanned[file] += nr_taken;
+
+	if (global_reclaim(sc))
+		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, scan);
+	if (is_active_lru(lru)) {
+		__count_vm_events(PGREFILL, scan);
+	} else {
+		if (current_is_kswapd())
+			__count_vm_events(PGSCAN_KSWAPD, scan);
+		else
+			__count_vm_events(PGSCAN_DIRECT, scan);
+	}
+	spin_unlock_irq(&pgdat->lru_lock);
+
 	return nr_taken;
 }
 
@@ -1735,7 +1753,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		     struct scan_control *sc, enum lru_list lru)
 {
 	LIST_HEAD(page_list);
-	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
 	unsigned long nr_dirty = 0;
@@ -1746,7 +1763,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
 	if (!inactive_reclaimable_pages(lruvec, sc, lru))
 		return 0;
@@ -1766,23 +1782,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&pgdat->lru_lock);
-
-	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
-				     &nr_scanned, sc, isolate_mode, lru);
-
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
-	reclaim_stat->recent_scanned[file] += nr_taken;
-
-	if (global_reclaim(sc)) {
-		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
-		if (current_is_kswapd())
-			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
-		else
-			__count_vm_events(PGSCAN_DIRECT, nr_scanned);
-	}
-	spin_unlock_irq(&pgdat->lru_lock);
-
+	nr_taken = isolate_lru_pages(pgdat, nr_to_scan, lruvec, &page_list,
+					sc, isolate_mode, lru);
 	if (nr_taken == 0)
 		return 0;
 
@@ -1866,7 +1867,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
 
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
-			nr_scanned, nr_reclaimed,
+			nr_taken,
+			nr_reclaimed,
 			sc->priority, file);
 	return nr_reclaimed;
 }
@@ -1943,18 +1945,17 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			       enum lru_list lru)
 {
 	unsigned long nr_taken;
-	unsigned long nr_scanned;
 	unsigned long vm_flags;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
 	struct page *page;
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	unsigned nr_deactivate, nr_activate;
 	unsigned nr_rotated = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
 	lru_add_drain();
 
@@ -1963,19 +1964,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&pgdat->lru_lock);
-
-	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
-				     &nr_scanned, sc, isolate_mode, lru);
-
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
-	reclaim_stat->recent_scanned[file] += nr_taken;
-
-	if (global_reclaim(sc))
-		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
-	__count_vm_events(PGREFILL, nr_scanned);
-
-	spin_unlock_irq(&pgdat->lru_lock);
+	nr_taken = isolate_lru_pages(pgdat, nr_to_scan, lruvec, &l_hold,
+					sc, isolate_mode, lru);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
-- 
2.7.4

With this:

./scripts/bloat-o-meter vmlinux.old vmlinux.new.new
add/remove: 1/1 grow/shrink: 0/9 up/down: 1394/-1636 (-242)
function                                               old   new  delta
isolate_lru_pages                                        -  1394  +1394
print_fmt_mm_vmscan_lru_shrink_inactive                359   355     -4
vermagic                                                64    58     -6
perf_trace_mm_vmscan_lru_shrink_active                 264   256     -8
trace_raw_output_mm_vmscan_lru_shrink_active           203   193    -10
trace_event_raw_event_mm_vmscan_lru_shrink_active      241   225    -16
print_fmt_mm_vmscan_lru_shrink_active                  458   426    -32
trace_event_define_fields_mm_vmscan_lru_shrink_active  384   336    -48
shrink_inactive_list                                  1430  1271   -159
shrink_active_list                                    1265  1082   -183
isolate_lru_pages.isra                                1170     -  -1170
Total: Before=26268743, After=26268501, chg -0.00%

We save 242 bytes of text, and 424 bytes of binary size:

#> ls -l vmlinux.old vmlinux.new.new
194092840 vmlinux.old
194092416 vmlinux.new.new

> with other tracepoints but that can be helpful because you do not have
> all the tracepoints enabled all the time. So unless you see this
> particular thing as a roadblock I would rather keep it.

I didn't expect this thread to become so lengthy; to me, it was not
worth this much discussion. I have done my best to explain my position
with valid points, I think, and I don't want to keep going around in
circles.

If you still don't agree, split the patch: one patch with only the
necessary things and nr_scanned removed, which I am happy to ack, and
on top of it another patch adding nr_scanned with your reasoning. I
will reply to that thread with my counter-argument, and let's watch
whether the maintainer takes it or not. If he takes it, that is a good
indication that we can easily add extra tracepoint fields on the
grounds of "it might be helpful to someone although it's redundant",
and we should not stand in the way of others who want to do the same
in the future.

> > > The inactive counterpart does that for quite some time already. So why
> > 
> > That alone is not a reason. If it is duplicated there, it would be
> > better to fix that rather than add more duplicated work to make both
> > sides match.
> 
> I really do not see this as a bad thing.
> 
> > > exactly does that matter? Don't take me wrong but isn't this more on a
> > > nit picking side than necessary? Or do I just misunderstand your
> > > concerns? It is not like we are providing a stable user API as the
> > 
> > My concern is that I don't see what benefit we get from that
> > duplicated work. If it doesn't benefit us, I don't want to add it.
> > I hope you can think of other, more solid reasons.
> 
> > > tracepoint is clearly implementation specific and not something to be
> > > used for anything other than debugging.
> > 
> > My point is that we already have a way to see "LRU isolation
> > effectiveness", namely isolate_lru_pages.
> > 
> > > > > 
> > > > > - nr_rotated pages which tells us that we are hitting referenced
> > > > >   pages which are deactivated. If this is a large part of the
> > > > >   reported nr_deactivated pages then the active list is too small
> > > > 
> > > > It might be, but not exactly. If your goal is to know the LRU size,
> > > > it can be done in get_scan_count. I tend to agree that the LRU size
> > > > is helpful for performance analysis, because a shrinking LRU size
> > > > signals memory shortage and, in turn, a performance drop.
> > > 
> > > No, I am not really interested in the exact size but rather in
> > > finding out whether we are aging the active list too early...
> > 
> > Could you elaborate on how we can spot early aging of the active list
> > from nr_rotated?
> 
> If you see too many referenced pages on the active list then they have
> been used since being promoted, and that is an indication that they
> might be reclaimed too early. If you are debugging a performance issue
> and see this happening then it is a good place to look.

This is better than "the active list is too small". I hope you update
the description along these lines.
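
P.S. Since "just enable mm_vmscan_lru_isolate" came up several times,
here is a minimal tracefs session for the record. This is only a sketch:
it assumes tracefs is mounted under /sys/kernel/debug/tracing, and the
sample output line is illustrative since the exact fields depend on the
kernel version.

  # Enable only the isolation tracepoint; no rebuild, no other vmscan
  # events enabled.
  echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_lru_isolate/enable

  # Generate some memory pressure, then read events as they arrive.
  cat /sys/kernel/debug/tracing/trace_pipe
  # kswapd0-53 ... mm_vmscan_lru_isolate: isolate_mode=0 classzone=1
  #   order=0 nr_requested=32 nr_scanned=32 nr_taken=32 file=1

  # Turn it off again.
  echo 0 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_lru_isolate/enable

Comparing nr_scanned with nr_taken in that output is exactly the "LRU
isolation effectiveness" discussed above, with no extra tracepoint
fields needed.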