linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
@ 2023-01-25  0:54 Minchan Kim
  2023-01-25  8:04 ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2023-01-25  0:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML, Michal Hocko,
	Minchan Kim

The madvise LRU manipulation APIs need to scan address ranges to find
present pages in the page tables and then provide advice hints for them.

Like the pg[scan/steal] counts in vmstat, madvise_pg[scanned/hinted]
show the efficiency of proactive reclaim, so this patch adds those two
statistics to vmstat.

	madvise_pgscanned, madvise_pghinted

Since proactive reclaim using process_madvise(2) as a userland
memory policy is popular (e.g., Android ActivityManagerService),
these stats are helpful for knowing how efficiently the policy
is working.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---

* From v1 - https://lore.kernel.org/linux-mm/20230117231632.2734737-1-minchan@kernel.org/
  * not relying on the pageout for accounting - mhocko
  * drop unnecessary changes - mhocko
  
 include/linux/vm_event_item.h | 2 ++
 mm/madvise.c                  | 8 ++++++++
 mm/vmstat.c                   | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..3c117858946d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -52,6 +52,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSCAN_FILE,
 		PGSTEAL_ANON,
 		PGSTEAL_FILE,
+		MADVISE_PGSCANNED,
+		MADVISE_PGHINTED,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
diff --git a/mm/madvise.c b/mm/madvise.c
index 7db6622f8293..d2624e77f729 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -344,6 +344,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	spinlock_t *ptl;
 	struct folio *folio = NULL;
 	LIST_HEAD(folio_list);
+	unsigned int nr_scanned = 0;
+	unsigned int nr_hinted = 0;
 	bool pageout_anon_only_filter;
 
 	if (fatal_signal_pending(current))
@@ -357,6 +359,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		pmd_t orig_pmd;
 		unsigned long next = pmd_addr_end(addr, end);
 
+		nr_scanned += HPAGE_PMD_NR;
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
 		if (!ptl)
@@ -414,6 +417,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			}
 		} else
 			folio_deactivate(folio);
+		nr_hinted += HPAGE_PMD_NR;
 huge_unlock:
 		spin_unlock(ptl);
 		if (pageout)
@@ -431,6 +435,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
+		nr_scanned++;
 
 		if (pte_none(ptent))
 			continue;
@@ -508,6 +513,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			}
 		} else
 			folio_deactivate(folio);
+		nr_hinted++;
 	}
 
 	arch_leave_lazy_mmu_mode();
@@ -515,6 +521,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	if (pageout)
 		reclaim_pages(&folio_list);
 	cond_resched();
+	count_vm_events(MADVISE_PGSCANNED, nr_scanned);
+	count_vm_events(MADVISE_PGHINTED, nr_hinted);
 
 	return 0;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..84acc90820e1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1283,6 +1283,8 @@ const char * const vmstat_text[] = {
 	"pgscan_file",
 	"pgsteal_anon",
 	"pgsteal_file",
+	"madvise_pgscanned",
+	"madvise_pghinted",
 
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
-- 
2.39.1.405.gd4c25cc71f-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-25  0:54 [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout] Minchan Kim
@ 2023-01-25  8:04 ` Michal Hocko
  2023-01-25 16:36   ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2023-01-25  8:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> madvise LRU manipulation APIs need to scan address ranges to find
> present pages at page table and provides advice hints for them.
> 
> Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> shows the proactive reclaim efficiency so this patch adds those
> two statistics in vmstat.
> 
> 	madvise_pgscanned, madvise_pghinted
> 
> Since proactive reclaim using process_madvise(2) as userland
> memory policy is popular(e.g,. Android ActivityManagerService),
> those stats are helpful to know how efficiently the policy works
> well.

The use case description is still too vague. What are those values useful
for? Is there anything actionable based on those numbers? How do you
deal with multiple parties using madvise resp. process_madvise, given
that their stats get combined?

In the previous version I also pointed out that this might be
easily achieved by tracepoints. Your counterargument was convenience
in large-scale monitoring, without going into much detail. Presumably
this is because your fleet-based monitoring already collects
/proc/vmstat while tracepoint-based monitoring would require additional
changes. This alone is a rather weak argument to be honest, because
deploying tracepoint monitoring is quite trivial and can be done
outside of the said memory reclaim agent.
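
For illustration, a minimal sketch of what such a tracepoint could look
like. The event name and fields below are hypothetical (no such
tracepoint exists in the kernel today); it is only meant to show that
the same numbers can be emitted per call, aggregated by ftrace/bpf
tooling, and attributed to the calling task, which a single global
counter cannot do:

	/* Hypothetical event; not part of this patch or the current kernel. */
	TRACE_EVENT(madvise_cold_or_pageout,

		TP_PROTO(unsigned long nr_scanned, unsigned long nr_hinted,
			 int behavior),

		TP_ARGS(nr_scanned, nr_hinted, behavior),

		TP_STRUCT__entry(
			__field(unsigned long,	nr_scanned)
			__field(unsigned long,	nr_hinted)
			__field(int,		behavior)
		),

		TP_fast_assign(
			__entry->nr_scanned	= nr_scanned;
			__entry->nr_hinted	= nr_hinted;
			__entry->behavior	= behavior;
		),

		TP_printk("nr_scanned=%lu nr_hinted=%lu behavior=%d",
			  __entry->nr_scanned, __entry->nr_hinted,
			  __entry->behavior)
	);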

> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> 
> * From v1 - https://lore.kernel.org/linux-mm/20230117231632.2734737-1-minchan@kernel.org/
>   * not relying on the pageout for accounting - mhocko
>   * drop unnecessary changes - mhocko
>   
>  include/linux/vm_event_item.h | 2 ++
>  mm/madvise.c                  | 8 ++++++++
>  mm/vmstat.c                   | 2 ++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 7f5d1caf5890..3c117858946d 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -52,6 +52,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		PGSCAN_FILE,
>  		PGSTEAL_ANON,
>  		PGSTEAL_FILE,
> +		MADVISE_PGSCANNED,
> +		MADVISE_PGHINTED,
>  #ifdef CONFIG_NUMA
>  		PGSCAN_ZONE_RECLAIM_FAILED,
>  #endif
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 7db6622f8293..d2624e77f729 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -344,6 +344,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	spinlock_t *ptl;
>  	struct folio *folio = NULL;
>  	LIST_HEAD(folio_list);
> +	unsigned int nr_scanned = 0;
> +	unsigned int nr_hinted = 0;
>  	bool pageout_anon_only_filter;
>  
>  	if (fatal_signal_pending(current))
> @@ -357,6 +359,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		pmd_t orig_pmd;
>  		unsigned long next = pmd_addr_end(addr, end);
>  
> +		nr_scanned += HPAGE_PMD_NR;
>  		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
>  		ptl = pmd_trans_huge_lock(pmd, vma);
>  		if (!ptl)
> @@ -414,6 +417,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  			}
>  		} else
>  			folio_deactivate(folio);
> +		nr_hinted += HPAGE_PMD_NR;
>  huge_unlock:
>  		spin_unlock(ptl);
>  		if (pageout)
> @@ -431,6 +435,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	arch_enter_lazy_mmu_mode();
>  	for (; addr < end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> +		nr_scanned++;
>  
>  		if (pte_none(ptent))
>  			continue;
> @@ -508,6 +513,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  			}
>  		} else
>  			folio_deactivate(folio);
> +		nr_hinted++;
>  	}
>  
>  	arch_leave_lazy_mmu_mode();
> @@ -515,6 +521,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	if (pageout)
>  		reclaim_pages(&folio_list);
>  	cond_resched();
> +	count_vm_events(MADVISE_PGSCANNED, nr_scanned);
> +	count_vm_events(MADVISE_PGHINTED, nr_hinted);
>  
>  	return 0;
>  }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1ea6a5ce1c41..84acc90820e1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1283,6 +1283,8 @@ const char * const vmstat_text[] = {
>  	"pgscan_file",
>  	"pgsteal_anon",
>  	"pgsteal_file",
> +	"madvise_pgscanned",
> +	"madvise_pghinted",
>  
>  #ifdef CONFIG_NUMA
>  	"zone_reclaim_failed",
> -- 
> 2.39.1.405.gd4c25cc71f-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-25  8:04 ` Michal Hocko
@ 2023-01-25 16:36   ` Minchan Kim
  2023-01-25 17:07     ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2023-01-25 16:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > madvise LRU manipulation APIs need to scan address ranges to find
> > present pages at page table and provides advice hints for them.
> > 
> > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > shows the proactive reclaim efficiency so this patch adds those
> > two statistics in vmstat.
> > 
> > 	madvise_pgscanned, madvise_pghinted
> > 
> > Since proactive reclaim using process_madvise(2) as userland
> > memory policy is popular(e.g,. Android ActivityManagerService),
> > those stats are helpful to know how efficiently the policy works
> > well.
> 
> The usecase description is still too vague. What are those values useful
> for? Is there anything actionable based on those numbers? How do you
> deal with multiple parties using madvise resp. process_madvise so that
> their stats are combined?

The metric helps with monitoring system MM health across the fleet and
with experimental tuning of different policies from the centralized
userland memory daemon. That's a really good fit for vmstat along with
the other MM metrics.

> 
> In the previous version I have also pointed out that this might be
> easily achieved by tracepoints. Your counterargument was a convenience
> in a large scale monitoring without going much into details. Presumably
> this is because your fleet based monitoring already collects
> /proc/vmstat while tracepoints based monitoring would require additional
> changes. This alone is rather weak argument to be honest because
> deploying tracepoints monitoring is quite trivial and can be done
> outside of the said memory reclaim agent.

The convenience matters, but that's not my argument.

I think using a tracepoint for a system-wide metric makes no sense, even
though the tracepoint could be extended with bpf or a histogram trigger
to get accumulated counters for a system metric.

The tracepoint is the next step if we want a further breakdown once
something strange happens. That's why we have separate levels of metrics
to narrow a problem down, rather than implementing every metric with
tracepoints. Please look at the vmstat fields: almost every field would
face the same question you asked, "how do you break it down if multiple
processes were involved in contributing to the metric?"

I am fine with adding a tracepoint as well as the vmstat fields for a
further breakdown, but relying only on tracepoints and friends for a
system-global metric doesn't make sense.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-25 16:36   ` Minchan Kim
@ 2023-01-25 17:07     ` Michal Hocko
  2023-01-25 18:07       ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2023-01-25 17:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Wed 25-01-23 08:36:02, Minchan Kim wrote:
> On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> > On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > > madvise LRU manipulation APIs need to scan address ranges to find
> > > present pages at page table and provides advice hints for them.
> > > 
> > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > > shows the proactive reclaim efficiency so this patch adds those
> > > two statistics in vmstat.
> > > 
> > > 	madvise_pgscanned, madvise_pghinted
> > > 
> > > Since proactive reclaim using process_madvise(2) as userland
> > > memory policy is popular(e.g,. Android ActivityManagerService),
> > > those stats are helpful to know how efficiently the policy works
> > > well.
> > 
> > The usecase description is still too vague. What are those values useful
> > for? Is there anything actionable based on those numbers? How do you
> > deal with multiple parties using madvise resp. process_madvise so that
> > their stats are combined?
> 
> The metric helps monitoing system MM health under fleet and experimental
> tuning with diffrent policies from the centralized userland memory daemon.

That is just too vague for me to imagine anything more specific than "we
have numbers and we can show them in a report". What does it actually
mean that madvise_pgscanned is high? Or that pghinted / pgscanned is
low (that you tend to manually reclaim sparse mappings)?

> That's really good fit under vmstat along with other MM metrics.
> 
> > 
> > In the previous version I have also pointed out that this might be
> > easily achieved by tracepoints. Your counterargument was a convenience
> > in a large scale monitoring without going much into details. Presumably
> > this is because your fleet based monitoring already collects
> > /proc/vmstat while tracepoints based monitoring would require additional
> > changes. This alone is rather weak argument to be honest because
> > deploying tracepoints monitoring is quite trivial and can be done
> > outside of the said memory reclaim agent.
> 
> The convenience matters but that's not my argument. 
> 
> Ithink using tracepoint for system metric makes no sense even though
> the tracepoint could be extended by using bpf or histogram trigger to
> get the accumulated counters for system metric.

System-wide metrics collection by ftrace is a common use case. I
really do not follow your argument here. There are certainly cases where
ftrace is a suboptimal solution - e.g. when the cumulative data couldn't
have been collected early on for one reason or another (e.g. system
uptime is already high when you decide to start collecting). But you
have stated there is data collection happening, so what prevents
collecting this along with everything else?
 
> The tracepoint is the next step if we want to know further breakdown
> once something strange happens. That's why we have separate level metric
> system to narrow problem down rather than implementing all the metric
> with tracepoint. Please, look at vmstat fields. Almost every fields
> would have same question you asked "how do you break down if multiple
> processes were invovled to contribute the metric?"

Yes, we used to be much more willing to add counters, partly because
runtime debugging capabilities in the past were not as good as the ones
we have these days.

> I am fine if you suggest adding tracepoint as well as the vmstat fields
> for further breakdown but only relying on tracepoint and frineds for
> system global metric doesn't make sense.

We have to agree to disagree here. I am not going to nack this but I
disagree with this patch because the justification is just too vague and
also those numbers cannot really be attributed to anybody performing
madvise to actually evaluate that activity.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-25 17:07     ` Michal Hocko
@ 2023-01-25 18:07       ` Minchan Kim
  2023-01-25 21:37         ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2023-01-25 18:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote:
> On Wed 25-01-23 08:36:02, Minchan Kim wrote:
> > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> > > On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > > > madvise LRU manipulation APIs need to scan address ranges to find
> > > > present pages at page table and provides advice hints for them.
> > > > 
> > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > > > shows the proactive reclaim efficiency so this patch adds those
> > > > two statistics in vmstat.
> > > > 
> > > > 	madvise_pgscanned, madvise_pghinted
> > > > 
> > > > Since proactive reclaim using process_madvise(2) as userland
> > > > memory policy is popular(e.g,. Android ActivityManagerService),
> > > > those stats are helpful to know how efficiently the policy works
> > > > well.
> > > 
> > > The usecase description is still too vague. What are those values useful
> > > for? Is there anything actionable based on those numbers? How do you
> > > deal with multiple parties using madvise resp. process_madvise so that
> > > their stats are combined?
> > 
> > The metric helps monitoing system MM health under fleet and experimental
> > tuning with diffrent policies from the centralized userland memory daemon.
> 
> That is just too vague for me to imagine anything more specific then, we
> have numbers and we can show them in a report. What does it actually
> mean that madvise_pgscanned is high. Or that pghinted / pgscanned is
> low (that you tend to manually reclaim sparse mappings)?

If that's low, it means the userspace daemon's current tuning/policy is
inefficient or too aggressive, since it is working on address spaces
of processes which don't have enough memory the hint can work on (e.g.,
shared addresses, cold address ranges or some special address ranges like
VM_PFNMAP). So sometimes we can detect a regression to find the culprit,
or get a chance to look into better ideas to improve.
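
As a rough illustration of how that signal would be consumed, here is a
minimal userspace sketch that reads the two counters proposed by this
patch from /proc/vmstat and prints the hinted/scanned ratio. The counter
names come from the patch; the helper and how the ratio would be acted
on are assumptions for the example only:

	#include <stdio.h>
	#include <string.h>

	/* Scan /proc/vmstat for a single named counter; returns 0 if absent. */
	static unsigned long long read_vmstat(const char *name)
	{
		char key[64];
		unsigned long long val = 0, found = 0;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 0;
		while (fscanf(f, "%63s %llu", key, &val) == 2) {
			if (!strcmp(key, name)) {
				found = val;
				break;
			}
		}
		fclose(f);
		return found;
	}

	int main(void)
	{
		unsigned long long scanned = read_vmstat("madvise_pgscanned");
		unsigned long long hinted = read_vmstat("madvise_pghinted");

		/* A low ratio means the daemon mostly walks ranges it cannot hint. */
		if (scanned)
			printf("madvise hint ratio: %.2f\n",
			       (double)hinted / scanned);
		return 0;
	}

Periodic snapshots of that ratio per device are the kind of fleet-wide
trending described above.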

> 
> > That's really good fit under vmstat along with other MM metrics.
> > 
> > > 
> > > In the previous version I have also pointed out that this might be
> > > easily achieved by tracepoints. Your counterargument was a convenience
> > > in a large scale monitoring without going much into details. Presumably
> > > this is because your fleet based monitoring already collects
> > > /proc/vmstat while tracepoints based monitoring would require additional
> > > changes. This alone is rather weak argument to be honest because
> > > deploying tracepoints monitoring is quite trivial and can be done
> > > outside of the said memory reclaim agent.
> > 
> > The convenience matters but that's not my argument. 
> > 
> > Ithink using tracepoint for system metric makes no sense even though
> > the tracepoint could be extended by using bpf or histogram trigger to
> > get the accumulated counters for system metric.
> 
> System wide metrics data collection by ftrace is a common use case. I
> really do not follow your argument here. There are certainly cases where
> ftrace is suboptimal solution - e.g. when the cumulative data couldn't
> have been collected early on for one reason or another (e.g. system
> uptime is already high when you decide to start collecting). But you
> have stated there is data collection happening so what does prevent
> collecting this just along with anything else.
>  
> > The tracepoint is the next step if we want to know further breakdown
> > once something strange happens. That's why we have separate level metric
> > system to narrow problem down rather than implementing all the metric
> > with tracepoint. Please, look at vmstat fields. Almost every fields
> > would have same question you asked "how do you break down if multiple
> > processes were invovled to contribute the metric?"
> 
> Yes, we tended to be much more willing to add counters. Partly because
> runtime debugging capabilities were not that great in the past as we
> have these days.
> 
> > I am fine if you suggest adding tracepoint as well as the vmstat fields
> > for further breakdown but only relying on tracepoint and frineds for
> > system global metric doesn't make sense.
> 
> We have to agree to disagree here. I am not going to nack this but I
> disagree with this patch because the justification is just too vague and
> also those numbers cannot really be attributed to anybody performing
> madvise to actually evaluate that activity.
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-25 18:07       ` Minchan Kim
@ 2023-01-25 21:37         ` Michal Hocko
  2023-01-25 22:21           ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2023-01-25 21:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Wed 25-01-23 10:07:49, Minchan Kim wrote:
> On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote:
> > On Wed 25-01-23 08:36:02, Minchan Kim wrote:
> > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > > > > madvise LRU manipulation APIs need to scan address ranges to find
> > > > > present pages at page table and provides advice hints for them.
> > > > > 
> > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > > > > shows the proactive reclaim efficiency so this patch adds those
> > > > > two statistics in vmstat.
> > > > > 
> > > > > 	madvise_pgscanned, madvise_pghinted
> > > > > 
> > > > > Since proactive reclaim using process_madvise(2) as userland
> > > > > memory policy is popular(e.g,. Android ActivityManagerService),
> > > > > those stats are helpful to know how efficiently the policy works
> > > > > well.
> > > > 
> > > > The usecase description is still too vague. What are those values useful
> > > > for? Is there anything actionable based on those numbers? How do you
> > > > deal with multiple parties using madvise resp. process_madvise so that
> > > > their stats are combined?
> > > 
> > > The metric helps monitoing system MM health under fleet and experimental
> > > tuning with diffrent policies from the centralized userland memory daemon.
> > 
> > That is just too vague for me to imagine anything more specific then, we
> > have numbers and we can show them in a report. What does it actually
> > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is
> > low (that you tend to manually reclaim sparse mappings)?
> 
> If that's low, it means the userspace daemon's current tune/policy are
> inefficient or too aggressive since it is working on address spacess
> of processes which don't have enough memory the hint can work(e.g.,
> shared addresses, cold address ranges or some special address ranges like
> VM_PFNMAP) so sometime, we can detect regression to find culprit or
> have a chance to look into better ideas to improve.

Are you sure this is a really meaningful metric? Just consider a large
and sparsely populated mapping. This can be a perfect candidate for a
user space reclaim target (e.g. consider a mapping covering a large
matrix or other similar data structure). pghinted/pgscanned would be
really small while the reclaim efficiency could be quite high in that
case, wouldn't it?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-25 21:37         ` Michal Hocko
@ 2023-01-25 22:21           ` Minchan Kim
  2023-01-26  8:50             ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2023-01-25 22:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Wed, Jan 25, 2023 at 10:37:59PM +0100, Michal Hocko wrote:
> On Wed 25-01-23 10:07:49, Minchan Kim wrote:
> > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 08:36:02, Minchan Kim wrote:
> > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > > > > > madvise LRU manipulation APIs need to scan address ranges to find
> > > > > > present pages at page table and provides advice hints for them.
> > > > > > 
> > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > > > > > shows the proactive reclaim efficiency so this patch adds those
> > > > > > two statistics in vmstat.
> > > > > > 
> > > > > > 	madvise_pgscanned, madvise_pghinted
> > > > > > 
> > > > > > Since proactive reclaim using process_madvise(2) as userland
> > > > > > memory policy is popular(e.g,. Android ActivityManagerService),
> > > > > > those stats are helpful to know how efficiently the policy works
> > > > > > well.
> > > > > 
> > > > > The usecase description is still too vague. What are those values useful
> > > > > for? Is there anything actionable based on those numbers? How do you
> > > > > deal with multiple parties using madvise resp. process_madvise so that
> > > > > their stats are combined?
> > > > 
> > > > The metric helps monitoing system MM health under fleet and experimental
> > > > tuning with diffrent policies from the centralized userland memory daemon.
> > > 
> > > That is just too vague for me to imagine anything more specific then, we
> > > have numbers and we can show them in a report. What does it actually
> > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is
> > > low (that you tend to manually reclaim sparse mappings)?
> > 
> > If that's low, it means the userspace daemon's current tune/policy are
> > inefficient or too aggressive since it is working on address spacess
> > of processes which don't have enough memory the hint can work(e.g.,
> > shared addresses, cold address ranges or some special address ranges like
> > VM_PFNMAP) so sometime, we can detect regression to find culprit or
> > have a chance to look into better ideas to improve.
> 
> Are you sure this is really meaningful metric? Just consider a large and
> sparsely populated mapping. This can be a perfect candidate for user
> space reclaim target (e.g. consider a mapping covering a large matrix
> or other similar data structure). pghinted/pgscanned would be really
> small while the reclaim efficiency could be quite high in that case,
> wouldn't it?

Why do you think it's efficient? It needs to spend quite a few CPU
cycles scanning to find only a few pages to evict. I don't see how it's
efficient if that happens a lot.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-25 22:21           ` Minchan Kim
@ 2023-01-26  8:50             ` Michal Hocko
  2023-01-26  8:51               ` Michal Hocko
  2023-01-26 17:10               ` Minchan Kim
  0 siblings, 2 replies; 15+ messages in thread
From: Michal Hocko @ 2023-01-26  8:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Wed 25-01-23 14:21:35, Minchan Kim wrote:
> On Wed, Jan 25, 2023 at 10:37:59PM +0100, Michal Hocko wrote:
> > On Wed 25-01-23 10:07:49, Minchan Kim wrote:
> > > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 08:36:02, Minchan Kim wrote:
> > > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> > > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > > > > > > madvise LRU manipulation APIs need to scan address ranges to find
> > > > > > > present pages at page table and provides advice hints for them.
> > > > > > > 
> > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > > > > > > shows the proactive reclaim efficiency so this patch adds those
> > > > > > > two statistics in vmstat.
> > > > > > > 
> > > > > > > 	madvise_pgscanned, madvise_pghinted
> > > > > > > 
> > > > > > > Since proactive reclaim using process_madvise(2) as userland
> > > > > > > memory policy is popular(e.g,. Android ActivityManagerService),
> > > > > > > those stats are helpful to know how efficiently the policy works
> > > > > > > well.
> > > > > > 
> > > > > > The usecase description is still too vague. What are those values useful
> > > > > > for? Is there anything actionable based on those numbers? How do you
> > > > > > deal with multiple parties using madvise resp. process_madvise so that
> > > > > > their stats are combined?
> > > > > 
> > > > > The metric helps monitoing system MM health under fleet and experimental
> > > > > tuning with diffrent policies from the centralized userland memory daemon.
> > > > 
> > > > That is just too vague for me to imagine anything more specific then, we
> > > > have numbers and we can show them in a report. What does it actually
> > > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is
> > > > low (that you tend to manually reclaim sparse mappings)?
> > > 
> > > If that's low, it means the userspace daemon's current tune/policy are
> > > inefficient or too aggressive since it is working on address spacess
> > > of processes which don't have enough memory the hint can work(e.g.,
> > > shared addresses, cold address ranges or some special address ranges like
> > > VM_PFNMAP) so sometime, we can detect regression to find culprit or
> > > have a chance to look into better ideas to improve.
> > 
> > Are you sure this is really meaningful metric? Just consider a large and
> > sparsely populated mapping. This can be a perfect candidate for user
> > space reclaim target (e.g. consider a mapping covering a large matrix
> > or other similar data structure). pghinted/pgscanned would be really
> > small while the reclaim efficiency could be quite high in that case,
> > wouldn't it?
> 
> Why do you think it's efficient? It need to spend quite CPU cycle to
> scan a few of pages to evict. I don't see it's efficient if it happens
> quite a lot.

Because it doesn't really matter how many page tables you have to scan;
what matters is how easily you can reclaim the memory behind them. It is
the memory that matters. Just consider THP vs. 4k backed address ranges.
You are going to scan much more for the latter by design. That doesn't
really mean that this is a worse candidate for reclaim and that you
should be focusing only on THP backed mappings. See?

I suspect you are trying to mimic the pgscan/pgsteal effectiveness
metric on the address space, but that is a fundamentally different thing.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-26  8:50             ` Michal Hocko
@ 2023-01-26  8:51               ` Michal Hocko
  2023-01-26 17:10               ` Minchan Kim
  1 sibling, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2023-01-26  8:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Thu 26-01-23 09:50:38, Michal Hocko wrote:
> On Wed 25-01-23 14:21:35, Minchan Kim wrote:
> > On Wed, Jan 25, 2023 at 10:37:59PM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 10:07:49, Minchan Kim wrote:
> > > > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 08:36:02, Minchan Kim wrote:
> > > > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> > > > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > > > > > > > madvise LRU manipulation APIs need to scan address ranges to find
> > > > > > > > present pages at page table and provides advice hints for them.
> > > > > > > > 
> > > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > > > > > > > shows the proactive reclaim efficiency so this patch adds those
> > > > > > > > two statistics in vmstat.
> > > > > > > > 
> > > > > > > > 	madvise_pgscanned, madvise_pghinted
> > > > > > > > 
> > > > > > > > Since proactive reclaim using process_madvise(2) as userland
> > > > > > > > memory policy is popular(e.g,. Android ActivityManagerService),
> > > > > > > > those stats are helpful to know how efficiently the policy works
> > > > > > > > well.
> > > > > > > 
> > > > > > > The usecase description is still too vague. What are those values useful
> > > > > > > for? Is there anything actionable based on those numbers? How do you
> > > > > > > deal with multiple parties using madvise resp. process_madvise so that
> > > > > > > their stats are combined?
> > > > > > 
> > > > > > The metric helps monitoing system MM health under fleet and experimental
> > > > > > tuning with diffrent policies from the centralized userland memory daemon.
> > > > > 
> > > > > That is just too vague for me to imagine anything more specific then, we
> > > > > have numbers and we can show them in a report. What does it actually
> > > > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is
> > > > > low (that you tend to manually reclaim sparse mappings)?
> > > > 
> > > > If that's low, it means the userspace daemon's current tune/policy are
> > > > inefficient or too aggressive since it is working on address spacess
> > > > of processes which don't have enough memory the hint can work(e.g.,
> > > > shared addresses, cold address ranges or some special address ranges like
> > > > VM_PFNMAP) so sometime, we can detect regression to find culprit or
> > > > have a chance to look into better ideas to improve.
> > > 
> > > Are you sure this is really meaningful metric? Just consider a large and
> > > sparsely populated mapping. This can be a perfect candidate for user
> > > space reclaim target (e.g. consider a mapping covering a large matrix
> > > or other similar data structure). pghinted/pgscanned would be really
> > > small while the reclaim efficiency could be quite high in that case,
> > > wouldn't it?
> > 
> > Why do you think it's efficient? It need to spend quite CPU cycle to
> > scan a few of pages to evict. I don't see it's efficient if it happens
> > quite a lot.
> 
> Because it doesn't really matter how many page tables you have to scan
> but how easily you can reclaim the memory behind that. Because it is the
> memory that matters. Just consider THP vs. 4k backed address ranges. You
> are going to scan much more for latter by design. That doesn't really
> mean that this is a worse candidate for reclaim and you should be only
> focusing on THP backed mappings. See?
> 
> I suspect you try to mimic pgscan/pgsteal effectivness metric on the

dang. I meant pgsteal/pgscan

> address space but that is a fundamentally different thing.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-26  8:50             ` Michal Hocko
  2023-01-26  8:51               ` Michal Hocko
@ 2023-01-26 17:10               ` Minchan Kim
  2023-01-26 19:58                 ` Michal Hocko
  1 sibling, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2023-01-26 17:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote:
> On Wed 25-01-23 14:21:35, Minchan Kim wrote:
> > On Wed, Jan 25, 2023 at 10:37:59PM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 10:07:49, Minchan Kim wrote:
> > > > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 08:36:02, Minchan Kim wrote:
> > > > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> > > > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > > > > > > > madvise LRU manipulation APIs need to scan address ranges to find
> > > > > > > > present pages at page table and provides advice hints for them.
> > > > > > > > 
> > > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > > > > > > > shows the proactive reclaim efficiency so this patch adds those
> > > > > > > > two statistics in vmstat.
> > > > > > > > 
> > > > > > > > 	madvise_pgscanned, madvise_pghinted
> > > > > > > > 
> > > > > > > > Since proactive reclaim using process_madvise(2) as userland
> > > > > > > > memory policy is popular(e.g,. Android ActivityManagerService),
> > > > > > > > those stats are helpful to know how efficiently the policy works
> > > > > > > > well.
> > > > > > > 
> > > > > > > The usecase description is still too vague. What are those values useful
> > > > > > > for? Is there anything actionable based on those numbers? How do you
> > > > > > > deal with multiple parties using madvise resp. process_madvise so that
> > > > > > > their stats are combined?
> > > > > > 
> > > > > > The metric helps monitoing system MM health under fleet and experimental
> > > > > > tuning with diffrent policies from the centralized userland memory daemon.
> > > > > 
> > > > > That is just too vague for me to imagine anything more specific then, we
> > > > > have numbers and we can show them in a report. What does it actually
> > > > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is
> > > > > low (that you tend to manually reclaim sparse mappings)?
> > > > 
> > > > If that's low, it means the userspace daemon's current tune/policy are
> > > > inefficient or too aggressive since it is working on address spacess
> > > > of processes which don't have enough memory the hint can work(e.g.,
> > > > shared addresses, cold address ranges or some special address ranges like
> > > > VM_PFNMAP) so sometime, we can detect regression to find culprit or
> > > > have a chance to look into better ideas to improve.
> > > 
> > > Are you sure this is really meaningful metric? Just consider a large and
> > > sparsely populated mapping. This can be a perfect candidate for user
> > > space reclaim target (e.g. consider a mapping covering a large matrix
> > > or other similar data structure). pghinted/pgscanned would be really
> > > small while the reclaim efficiency could be quite high in that case,
> > > wouldn't it?
> > 
> > Why do you think it's efficient? It need to spend quite CPU cycle to
> > scan a few of pages to evict. I don't see it's efficient if it happens
> > quite a lot.
> 
> Because it doesn't really matter how many page tables you have to scan
> but how easily you can reclaim the memory behind that. Because it is the

I really don't follow your claim here. Efficiency is input vs. output.
For these advices, the input is the number of scanned ptes and the
output is the number of hinted pages.
If you repeatedly need to scan a sparse, huge address range to reclaim
just a few pages, that's really inefficient.
If you keep hinting non-populated, already swapped-out or
hint-cannot-work address ranges, that's really inefficient.

Where do you see the problem here? What exactly do you mean by "how
easily" in this context?

> memory that matters. Just consider THP vs. 4k backed address ranges. You
> are going to scan much more for latter by design. That doesn't really
> mean that this is a worse candidate for reclaim and you should be only
> focusing on THP backed mappings. See?

I don't understand your point here. The stat doesn't aim to make such a
decision. If THP pages show a good yield from the efficiency above,
that's good. If 4K pages show a bad yield, should we focus on THP
pages? How could you reach such a decision from the stat?

The stat just shows the current health of the system and flags something
regressed/improved compared to before, to initiate further investigation.

> 
> I suspect you try to mimic pgscan/pgsteal effectivness metric on the
> address space but that is a fundamentally different thing.

I don't see anything different, fundamentally.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-26 17:10               ` Minchan Kim
@ 2023-01-26 19:58                 ` Michal Hocko
  2023-01-27  0:08                   ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2023-01-26 19:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Thu 26-01-23 09:10:46, Minchan Kim wrote:
> On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote:
[...]
> > I suspect you try to mimic pgscan/pgsteal effectivness metric on the
> > address space but that is a fundamentally different thing.
> 
> I don't see anything different, fundamentally.

OK, this really explains our disconnect here. Your metric reports
nr_page_tables (nr_scanned) and the number of aged and potentially
reclaimed pages. You do not know whether that reclaim was successful. So
you effectively learn how many pages had already been unmapped before
your call. Can this sometimes be useful? Probably yes. Does it say
anything about the reclaim efficiency? I do not think so. You could have
hit pinned pages or countless other conditions under which those pages
couldn't have been reclaimed and have stayed mapped after the madvise call.

pgsteal tells you how many pages from those scanned have been reclaimed.
See the difference?

Also I do not find the information about how many non-present ptes have
been scanned super interesting. Sure, that is burnt time as well, but to
me it would be much more valuable to see how many of the resident pages
could actually have been reclaimed. Because that tells whether your
reclaim target was a good choice, and IMHO that is valuable information
for a user space memory reclaim agent.

Again, consider a large, sparsely mapped but mostly inactive memory
range and a condensed, active one with the same rss. The reclaim could
have been successful for the former while not for the latter. Your
metric would give rather misleading numbers, don't you think?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-26 19:58                 ` Michal Hocko
@ 2023-01-27  0:08                   ` Minchan Kim
  2023-01-27  9:48                     ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2023-01-27  0:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Thu, Jan 26, 2023 at 08:58:57PM +0100, Michal Hocko wrote:
> On Thu 26-01-23 09:10:46, Minchan Kim wrote:
> > On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote:
> [...]
> > > I suspect you try to mimic pgscan/pgsteal effectivness metric on the
> > > address space but that is a fundamentally different thing.
> > 
> > I don't see anything different, fundamentally.
> 
> OK, this really explains our disconnect here. Your metric reports
> nr_page_tables (nr_scanned) and number of aged and potentially reclaimed
> pages. You do not know whether that reclaim was successful. So you
> effectively learn how many pages have already been unmapped before your
> call. Can this be sometimes useful? Probably yes. Does it say anything
> about the reclaim efficiency? I do not think so. You could have hit
> pinned pages or countless other conditions why those pages couldn't have
> been reclaimed and they have stayed mapped after madvise call.
> 
> pgsteal tells you how many pages from those scanned have been reclaimed.
> See the difference?

That's why my previous version kept counting the exact number of
reclaimed/deactivated pages, but I changed my mind since I observed that
the majority of failures came from already-paged-out ranges and shared
pages rather than from the minor, countless other conditions in real
practice. Without finding present pages, the madvise hints couldn't do
anything from the beginning, and that's the major cost we are facing.

Saying it again, I don't think the global stat can cover all the minor
cases you are insisting on, and I agree a tracepoint could do a better
job of pinpointing root causes, but the global stat still has a role in
providing basic ground to sense something abnormal and guide us to the
next steps in an easier/more efficient way.

> 
> Also I do not find information about how many non-present ptes have
> been scann super interesting. Sure that is a burnt time as well but to
> me it would be much more valuable information to see how many of those
> resident could have been actually reclaimed. Because that tells whether
> your reclaim target was a good choice and IMHO that is a valuable
> information for user space memory reclaim agent.

That's exactly what I had in the previous version. If you believe it's
the right direction, I am okay.

> 
> Again consider a large sparsely mapped memory but mostly inactive memory
> and a condensed active one with the same rss. The reclaim could have
> been successful for the former while not on the latter. Your matric
> would give a rather misleading numbers, don't you think?
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-27  0:08                   ` Minchan Kim
@ 2023-01-27  9:48                     ` Michal Hocko
  2023-01-28  3:00                       ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2023-01-27  9:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Thu 26-01-23 16:08:43, Minchan Kim wrote:
> On Thu, Jan 26, 2023 at 08:58:57PM +0100, Michal Hocko wrote:
> > On Thu 26-01-23 09:10:46, Minchan Kim wrote:
> > > On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote:
> > [...]
> > > > I suspect you try to mimic pgscan/pgsteal effectivness metric on the
> > > > address space but that is a fundamentally different thing.
> > > 
> > > I don't see anything different, fundamentally.
> > 
> > OK, this really explains our disconnect here. Your metric reports
> > nr_page_tables (nr_scanned) and number of aged and potentially reclaimed
> > pages. You do not know whether that reclaim was successful. So you
> > effectively learn how many pages have already been unmapped before your
> > call. Can this be sometimes useful? Probably yes. Does it say anything
> > about the reclaim efficiency? I do not think so. You could have hit
> > pinned pages or countless other conditions why those pages couldn't have
> > been reclaimed and they have stayed mapped after madvise call.
> > 
> > pgsteal tells you how many pages from those scanned have been reclaimed.
> > See the difference?
> 
> That's why my previous version kept counting exact number of reclaimed/
> deactivated pages but I changed mind since I observed majority of failure
> happened from already-paged-out ranges and shared pages rather than minor
> countless other conditions in real practice. Without finding present pages,
> the mavise hints couldn't do anything from the beginning and that's the
> major cost we are facing.

I cannot really comment on your user space reclaim policy but I would
have expected that you at least check for rss before trying to use
madvise on the range. Learning that from the operation sounds like a
suboptimal policy to me.
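
To make that concrete, a rough sketch of the kind of pre-check being
suggested, for a daemon that targets a whole process. The coarse
/proc/<pid>/statm resident check, the helper names and the raw syscall()
calls are illustrative assumptions (a real policy would likely look at
per-range rss via smaps), and recent kernel/libc headers are assumed:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <sys/mman.h>

	#ifndef MADV_PAGEOUT
	#define MADV_PAGEOUT 21	/* fallback for older userspace headers */
	#endif

	/* The second field of /proc/<pid>/statm is the resident page count. */
	static long resident_pages(pid_t pid)
	{
		char path[64];
		long size, resident = -1;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/statm", pid);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%ld %ld", &size, &resident) != 2)
			resident = -1;
		fclose(f);
		return resident;
	}

	/* Only issue the hint when something is actually resident. */
	long pageout_range(pid_t pid, void *addr, size_t len)
	{
		struct iovec iov = { .iov_base = addr, .iov_len = len };
		int pidfd;
		long ret;

		if (resident_pages(pid) <= 0)
			return 0;	/* nothing resident: skip the walk */

		pidfd = syscall(SYS_pidfd_open, pid, 0);
		if (pidfd < 0)
			return -1;
		ret = syscall(SYS_process_madvise, pidfd, &iov, 1,
			      MADV_PAGEOUT, 0);
		close(pidfd);
		return ret;
	}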

> Saing again, I don't think the global stat could cover all the minor
> you are insisting and I agree tracepoint could do better jobs to pinpoint
> root causes but the global stat still have a role to provides basic ground
> to sense abnormal and guides us moving next steps with easier interface/
> efficient way.

I hate to repeat myself, but the more we discuss this the more I am
convinced that vmstat is a bad fit. Sooner or later you will end up
realizing that nr_reclaimed/nr_scanned is an insufficient metric because
you would need to learn more about those reclaim failures. Really what
you want is to have a tracepoint with a full reclaim metric and grow
monitoring tooling around that. This will deal with the major design
flaw of the global stat mentioned earlier (that you cannot attribute
specific stats to the corresponding madvise caller).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-27  9:48                     ` Michal Hocko
@ 2023-01-28  3:00                       ` Minchan Kim
  2023-01-30 11:12                         ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2023-01-28  3:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Fri, Jan 27, 2023 at 10:48:25AM +0100, Michal Hocko wrote:
> On Thu 26-01-23 16:08:43, Minchan Kim wrote:
> > On Thu, Jan 26, 2023 at 08:58:57PM +0100, Michal Hocko wrote:
> > > On Thu 26-01-23 09:10:46, Minchan Kim wrote:
> > > > On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote:
> > > [...]
> > > > > I suspect you try to mimic pgscan/pgsteal effectivness metric on the
> > > > > address space but that is a fundamentally different thing.
> > > > 
> > > > I don't see anything different, fundamentally.
> > > 
> > > OK, this really explains our disconnect here. Your metric reports
> > > nr_page_tables (nr_scanned) and number of aged and potentially reclaimed
> > > pages. You do not know whether that reclaim was successful. So you
> > > effectively learn how many pages have already been unmapped before your
> > > call. Can this be sometimes useful? Probably yes. Does it say anything
> > > about the reclaim efficiency? I do not think so. You could have hit
> > > pinned pages or countless other conditions why those pages couldn't have
> > > been reclaimed and they have stayed mapped after madvise call.
> > > 
> > > pgsteal tells you how many pages from those scanned have been reclaimed.
> > > See the difference?
> > 
> > That's why my previous version kept counting exact number of reclaimed/
> > deactivated pages but I changed mind since I observed majority of failure
> > happened from already-paged-out ranges and shared pages rather than minor
> > countless other conditions in real practice. Without finding present pages,
> > the mavise hints couldn't do anything from the beginning and that's the
> > major cost we are facing.
> 
> I cannot really comment on your user space reclaim policy but I would
> have expected that you at least check for rss before trying to use
> madvise on the range. Learning that from the operation sounds like a
> suboptimal policy to me.

The current rss can't say where the present pages are within a huge
address space. And that's not what I want from the operation; I want to
keep monitoring the trend across the fleet.


> 
> > Saing again, I don't think the global stat could cover all the minor
> > you are insisting and I agree tracepoint could do better jobs to pinpoint
> > root causes but the global stat still have a role to provides basic ground
> > to sense abnormal and guides us moving next steps with easier interface/
> > efficient way.
> 
> I hate to repeat myself but the more we discuss this the more I am
> convinced that vmstat is a bad fit. Sooner or later you end up realizing
> that nr_reclaimed/nr_scanned is insufficient metric because you would
> need to learn more anout those reclaim failures. Really what you want is
> to have a tracepoint with a full reclaim metric and grow monitoring tooling
> around that. This will deal with the major design flaw of global stat
> mentioned ealier (that you cannot attribute specific stats to the
> corresponding madvise caller).

Then, let me ask you back.

Which statistics among the current vmstat fields, or the pending ones
(to be merged), among the accumulated counter stats sound reasonable to
be part of vmstat rather than a tracepoint, from your perspective?

Almost every stat has corner cases for various reasons, and people would
want to know the reason at process, context, function or block scope
depending on how they want to use the stat.
Even the tracepoints you love couldn't tell all the detail they want
without adding more and more of them as the code keeps changing.
However, unlike your worry, people have used such high-level, vague
vmstat fields very well to understand/monitor system health, even though
they have various miscounting cases, because they know the corner cases
are really minor.

I am really curious what metric we could ever add to vmstat instead of
as a tracepoint in the future if we follow your logic.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
  2023-01-28  3:00                       ` Minchan Kim
@ 2023-01-30 11:12                         ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2023-01-30 11:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Suren Baghdasaryan, Matthew Wilcox, linux-mm, LKML

On Fri 27-01-23 19:00:15, Minchan Kim wrote:
[...]
> Then, let me ask back to you.
> 
> What statistcis in the current vmstat fields or pending fields
> (to be merged) among accumulated counter stats sound reasonable
> to be part of vmstat fields not tracepoint from your perspective?

Most of those could be replaced, but for historical reasons a counter
was the only solution back then. Some metrics make a lot of sense these
days as well. Regular snapshots of vmstat can give a nice overview of
the _system_ reclaim activity.

> Almost every stat would have corner cases by various reasons and
> people would want to know the reason from process, context, function
> or block scope depending on how they want to use the stat.
> Even, tracepoint you're loving couldn't tell all the detail what they want
> without adding more and more as on growing code chages.

Quite possible but tracepoints are much easier to modify and shape to a
particular need.

> However, unlike your worry, people has used such an high level vague
> vmstat fields very well to understand/monitor system health even though
> it has various miscounting cases since they know the corner cases
> are really minor.
> 
> I am really curious what metric we could add in the vmstat instead of
> tracepoint in future if we follow your logic. 

I would say that we should be more and more conservative when extending
vmstat counters and use tracing instead as much as possible. I can
imagine there could be cases where tracing is not a preferable option.
Then we can judge case by case. So far you have presented no real
argument, except that you already collect vmstat on a larger scale and
that would be easier (essentially free from the tool-modification POV).
That is a weak argument, especially with the major design flaw already
mentioned.

I do not have much more to add here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread
