linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm:workingset use real time to judge activity of the file page
@ 2019-04-04  3:30 Zhaoyang Huang
  2019-04-04  7:15 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Zhaoyang Huang @ 2019-04-04  3:30 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Pavel Tatashin, Joonsoo Kim,
	David Rientjes, Zhaoyang Huang, Roman Gushchin, Jeff Layton,
	Matthew Wilcox, linux-mm, linux-kernel

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

The current implementation judges the refault period of each page by
the number of pages evicted in between, which is imprecise because
evictions of other files strongly affect the result for the current cache.
Add a timestamp to the workingset shadow entry, plus a refault ratio,
to measure a file page's activity. This reduces the influence of other
files (the average refault ratio reflects the memory state of the whole
system).
The patch was tested on an Android system by comparing the launch time
of an application under heavy memory consumption. Launch time decreased
by 50% and page faults during the test decreased by 80%.

Signed-off-by: Zhaoyang Huang <huangzhaoyang@gmail.com>
---
 include/linux/mmzone.h |  2 ++
 mm/workingset.c        | 24 +++++++++++++++++-------
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..c38ba0a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -240,6 +240,8 @@ struct lruvec {
 	atomic_long_t			inactive_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
+	atomic_long_t			refaults_ratio;
+	atomic_long_t			prev_fault;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c..6361853 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -159,7 +159,7 @@
 			 NODES_SHIFT +	\
 			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
-
+#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
 /*
  * Eviction timestamps need to be able to cover the full range of
  * actionable refaults. However, bits are tight in the radix tree
@@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, unsigned long *prev_jiffp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	unsigned long prev_jiff;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	entry >>= EVICTION_JIFFIES;
+	prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*prev_jiffp = prev_jiff;
 }
 
 /**
@@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
 	unsigned long refault;
 	struct pglist_data *pgdat;
 	int memcgid;
+	unsigned long refault_ratio;
+	unsigned long prev_jiff;
+	unsigned long avg_refault_time;
+	unsigned long refault_time;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);
 
 	rcu_read_lock();
 	/*
@@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
 	 * list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
-
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
-
-	if (refault_distance <= active_file) {
+	lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;
+	refault_time = jiffies - prev_jiff;
+	avg_refault_time = refault_distance / lruvec->refaults_ratio;
+	if (refault_time <= avg_refault_time) {
 		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 		rcu_read_unlock();
 		return true;
@@ -521,7 +531,7 @@ static int __init workingset_init(void)
 	 * some more pages at runtime, so keep working with up to
 	 * double the initial memory by using totalram_pages as-is.
 	 */
-	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
 	max_order = fls_long(totalram_pages - 1);
 	if (max_order > timestamp_bits)
 		bucket_order = max_order - timestamp_bits;
-- 
1.9.1



* Re: [PATCH] mm:workingset use real time to judge activity of the file page
  2019-04-04  3:30 [PATCH] mm:workingset use real time to judge activity of the file page Zhaoyang Huang
@ 2019-04-04  7:15 ` Michal Hocko
  2019-04-05  3:13   ` Zhaoyang Huang
  2019-04-04 16:39 ` Johannes Weiner
  2019-04-05  3:24 ` Matthew Wilcox
  2 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2019-04-04  7:15 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Andrew Morton, Vlastimil Babka, Joonsoo Kim, David Rientjes,
	Zhaoyang Huang, Roman Gushchin, Jeff Layton, Matthew Wilcox,
	linux-mm, linux-kernel, Pavel Tatashin, Johannes Weiner

[Fixup email for Pavel and add Johannes]

On Thu 04-04-19 11:30:17, Zhaoyang Huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> The current implementation judges the refault period of each page by
> the number of pages evicted in between, which is imprecise because
> evictions of other files strongly affect the result for the current cache.
> Add a timestamp to the workingset shadow entry, plus a refault ratio,
> to measure a file page's activity. This reduces the influence of other
> files (the average refault ratio reflects the memory state of the whole
> system).
> The patch was tested on an Android system by comparing the launch time
> of an application under heavy memory consumption. Launch time decreased
> by 50% and page faults during the test decreased by 80%.
> 
> Signed-off-by: Zhaoyang Huang <huangzhaoyang@gmail.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/workingset.c        | 24 +++++++++++++++++-------
>  2 files changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 32699b2..c38ba0a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -240,6 +240,8 @@ struct lruvec {
>  	atomic_long_t			inactive_age;
>  	/* Refaults at the time of last reclaim cycle */
>  	unsigned long			refaults;
> +	atomic_long_t			refaults_ratio;
> +	atomic_long_t			prev_fault;
>  #ifdef CONFIG_MEMCG
>  	struct pglist_data *pgdat;
>  #endif
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 40ee02c..6361853 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -159,7 +159,7 @@
>  			 NODES_SHIFT +	\
>  			 MEM_CGROUP_ID_SHIFT)
>  #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
> -
> +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
>  /*
>   * Eviction timestamps need to be able to cover the full range of
>   * actionable refaults. However, bits are tight in the radix tree
> @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
>  	eviction >>= bucket_order;
>  	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
>  	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> +	eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
>  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
>  
>  	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
>  }
>  
>  static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> -			  unsigned long *evictionp)
> +			  unsigned long *evictionp, unsigned long *prev_jiffp)
>  {
>  	unsigned long entry = (unsigned long)shadow;
>  	int memcgid, nid;
> +	unsigned long prev_jiff;
>  
>  	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +	entry >>= EVICTION_JIFFIES;
> +	prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
>  	nid = entry & ((1UL << NODES_SHIFT) - 1);
>  	entry >>= NODES_SHIFT;
>  	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
> @@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
>  	*memcgidp = memcgid;
>  	*pgdat = NODE_DATA(nid);
>  	*evictionp = entry << bucket_order;
> +	*prev_jiffp = prev_jiff;
>  }
>  
>  /**
> @@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
>  	unsigned long refault;
>  	struct pglist_data *pgdat;
>  	int memcgid;
> +	unsigned long refault_ratio;
> +	unsigned long prev_jiff;
> +	unsigned long avg_refault_time;
> +	unsigned long refault_time;
>  
> -	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
> +	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);
>  
>  	rcu_read_lock();
>  	/*
> @@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
>  	 * list is not a problem.
>  	 */
>  	refault_distance = (refault - eviction) & EVICTION_MASK;
> -
>  	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
> -
> -	if (refault_distance <= active_file) {
> +	lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;
> +	refault_time = jiffies - prev_jiff;
> +	avg_refault_time = refault_distance / lruvec->refaults_ratio;
> +	if (refault_time <= avg_refault_time) {
>  		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
>  		rcu_read_unlock();
>  		return true;
> @@ -521,7 +531,7 @@ static int __init workingset_init(void)
>  	 * some more pages at runtime, so keep working with up to
>  	 * double the initial memory by using totalram_pages as-is.
>  	 */
> -	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
> +	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
>  	max_order = fls_long(totalram_pages - 1);
>  	if (max_order > timestamp_bits)
>  		bucket_order = max_order - timestamp_bits;
> -- 
> 1.9.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm:workingset use real time to judge activity of the file page
  2019-04-04  3:30 [PATCH] mm:workingset use real time to judge activity of the file page Zhaoyang Huang
  2019-04-04  7:15 ` Michal Hocko
@ 2019-04-04 16:39 ` Johannes Weiner
  2019-04-04 23:23   ` Zhaoyang Huang
  2019-04-05  3:24 ` Matthew Wilcox
  2 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2019-04-04 16:39 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Andrew Morton, Vlastimil Babka, Pavel Tatashin, Joonsoo Kim,
	David Rientjes, Zhaoyang Huang, Roman Gushchin, Jeff Layton,
	Matthew Wilcox, linux-mm, linux-kernel

On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> The current implementation judges the refault period of each page by
> the number of pages evicted in between, which is imprecise because
> evictions of other files strongly affect the result for the current cache.
> Add a timestamp to the workingset shadow entry, plus a refault ratio,
> to measure a file page's activity. This reduces the influence of other
> files (the average refault ratio reflects the memory state of the whole
> system).

I don't understand what exactly you're saying here, can you please
elaborate?

The reason it's using distances instead of absolute time is because
the ordering of the LRU is relative and not based on absolute time.

E.g. if a page is accessed every 500ms, it depends on all other pages
to determine whether this page is at the head or the tail of the LRU.

So when you refault, in order to determine the relative position of
the refaulted page in the LRU, you have to compare it to how fast that
LRU is moving. The absolute refault time, or the average time between
refaults, is not comparable to what's already in memory.


* Re: [PATCH] mm:workingset use real time to judge activity of the file page
  2019-04-04 16:39 ` Johannes Weiner
@ 2019-04-04 23:23   ` Zhaoyang Huang
  2019-04-05 19:34     ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Zhaoyang Huang @ 2019-04-04 23:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Pavel Tatashin, Joonsoo Kim,
	David Rientjes, Zhaoyang Huang, Roman Gushchin, Jeff Layton,
	Matthew Wilcox, open list:MEMORY MANAGEMENT, LKML

On Fri, Apr 5, 2019 at 12:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > The current implementation judges the refault period of each page by
> > the number of pages evicted in between, which is imprecise because
> > evictions of other files strongly affect the result for the current cache.
> > Add a timestamp to the workingset shadow entry, plus a refault ratio,
> > to measure a file page's activity. This reduces the influence of other
> > files (the average refault ratio reflects the memory state of the whole
> > system).
>
> I don't understand what exactly you're saying here, can you please
> elaborate?
>
> The reason it's using distances instead of absolute time is because
> the ordering of the LRU is relative and not based on absolute time.
>
> E.g. if a page is accessed every 500ms, it depends on all other pages
> to determine whether this page is at the head or the tail of the LRU.
>
> So when you refault, in order to determine the relative position of
> the refaulted page in the LRU, you have to compare it to how fast that
> LRU is moving. The absolute refault time, or the average time between
> refaults, is not comparable to what's already in memory.
How do you know how long it took to drop these pages? Actually, a
quick drop of a large number of pages will be wrongly treated as a
slow drop rather than as the severe situation it is. That is to say,
100 pages dropped per millisecond and 100 pages dropped per second have
the same effect on the refault distance calculation, which may leave
this page cache under-protected in the former scenario and introduce
page thrashing. This matters especially during global reclaim: a round
of kswapd reclaim woken up by a high-order allocation or by a large
number of single-page allocations may result in all pages within the
node being counted in the same LRU. This commit mitigates the above by
comparing the refault time of a single page with avg_refault_time =
delta_lru_reclaimed_pages / avg_refault_ratio (refault_ratio =
lru->inactive_age / time).


* Re: [PATCH] mm:workingset use real time to judge activity of the file page
  2019-04-04  7:15 ` Michal Hocko
@ 2019-04-05  3:13   ` Zhaoyang Huang
  0 siblings, 0 replies; 9+ messages in thread
From: Zhaoyang Huang @ 2019-04-05  3:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, Joonsoo Kim, David Rientjes,
	Zhaoyang Huang, Roman Gushchin, Jeff Layton, Matthew Wilcox,
	open list:MEMORY MANAGEMENT, LKML, Pavel Tatashin,
	Johannes Weiner, geng.ren

Resending via the right mailing list, with the comments rewritten by ZY.

On Thu, Apr 4, 2019 at 3:15 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> [Fixup email for Pavel and add Johannes]
>
> On Thu 04-04-19 11:30:17, Zhaoyang Huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > The current implementation judges the refault period of each page by
> > the number of pages evicted in between, which is imprecise because
> > evictions of other files strongly affect the result for the current cache.
> > Add a timestamp to the workingset shadow entry, plus a refault ratio,
> > to measure a file page's activity. This reduces the influence of other
> > files (the average refault ratio reflects the memory state of the whole
> > system).
> > The patch was tested on an Android system by comparing the launch time
> > of an application under heavy memory consumption. Launch time decreased
> > by 50% and page faults during the test decreased by 80%.
> >
I don't understand what exactly you're saying here, can you please elaborate?

The reason it's using distances instead of absolute time is because
the ordering of the LRU is relative and not based on absolute time.

E.g. if a page is accessed every 500ms, it depends on all other pages
to determine whether this page is at the head or the tail of the LRU.

So when you refault, in order to determine the relative position of
the refaulted page in the LRU, you have to compare it to how fast that
LRU is moving. The absolute refault time, or the average time between
refaults, is not comparable to what's already in memory.

Comment by ZY:
The current implementation has difficulty evaluating the refault
period when a huge number of file pages are dropped within a short
time, which may be caused by a high-order allocation or by continuous
single-page allocations in kswapd. In that case, a page with a big
refault_distance is wrongly deemed INACTIVE and reclaimed earlier than
it should be, which leads to page thrashing. So we introduce
'avg_refault_time' and 'refault_ratio' to judge whether the refault
distance accumulated gradually or was caused by a burst of reclaim.
That is to say, a big refault_distance built up over a long time would
still be treated as inactive, as the result of comparing its refault
time with the ideal time avg_refault_time, where avg_refault_time =
delta_lru_reclaimed_pages / avg_refault_ratio (refault_ratio =
lru->inactive_age / time).
> > Signed-off-by: Zhaoyang Huang <huangzhaoyang@gmail.com>
> > ---
> >  include/linux/mmzone.h |  2 ++
> >  mm/workingset.c        | 24 +++++++++++++++++-------
> >  2 files changed, 19 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 32699b2..c38ba0a 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -240,6 +240,8 @@ struct lruvec {
> >       atomic_long_t                   inactive_age;
> >       /* Refaults at the time of last reclaim cycle */
> >       unsigned long                   refaults;
> > +     atomic_long_t                   refaults_ratio;
> > +     atomic_long_t                   prev_fault;
> >  #ifdef CONFIG_MEMCG
> >       struct pglist_data *pgdat;
> >  #endif
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 40ee02c..6361853 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -159,7 +159,7 @@
> >                        NODES_SHIFT +  \
> >                        MEM_CGROUP_ID_SHIFT)
> >  #define EVICTION_MASK        (~0UL >> EVICTION_SHIFT)
> > -
> > +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
> >  /*
> >   * Eviction timestamps need to be able to cover the full range of
> >   * actionable refaults. However, bits are tight in the radix tree
> > @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
> >       eviction >>= bucket_order;
> >       eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
> >       eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> > +     eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
> >       eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
> >
> >       return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
> >  }
> >
> >  static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> > -                       unsigned long *evictionp)
> > +                       unsigned long *evictionp, unsigned long *prev_jiffp)
> >  {
> >       unsigned long entry = (unsigned long)shadow;
> >       int memcgid, nid;
> > +     unsigned long prev_jiff;
> >
> >       entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> > +     entry >>= EVICTION_JIFFIES;
> > +     prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
> >       nid = entry & ((1UL << NODES_SHIFT) - 1);
> >       entry >>= NODES_SHIFT;
> >       memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
> > @@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> >       *memcgidp = memcgid;
> >       *pgdat = NODE_DATA(nid);
> >       *evictionp = entry << bucket_order;
> > +     *prev_jiffp = prev_jiff;
> >  }
> >
> >  /**
> > @@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
> >       unsigned long refault;
> >       struct pglist_data *pgdat;
> >       int memcgid;
> > +     unsigned long refault_ratio;
> > +     unsigned long prev_jiff;
> > +     unsigned long avg_refault_time;
> > +     unsigned long refault_time;
> >
> > -     unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
> > +     unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);
> >
> >       rcu_read_lock();
> >       /*
> > @@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
> >        * list is not a problem.
> >        */
> >       refault_distance = (refault - eviction) & EVICTION_MASK;
> > -
> >       inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
> > -
> > -     if (refault_distance <= active_file) {
> > +     lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;
> > +     refault_time = jiffies - prev_jiff;
> > +     avg_refault_time = refault_distance / lruvec->refaults_ratio;
> > +     if (refault_time <= avg_refault_time) {
> >               inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> >               rcu_read_unlock();
> >               return true;
> > @@ -521,7 +531,7 @@ static int __init workingset_init(void)
> >        * some more pages at runtime, so keep working with up to
> >        * double the initial memory by using totalram_pages as-is.
> >        */
> > -     timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
> > +     timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
> >       max_order = fls_long(totalram_pages - 1);
> >       if (max_order > timestamp_bits)
> >               bucket_order = max_order - timestamp_bits;
> > --
> > 1.9.1
>
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH] mm:workingset use real time to judge activity of the file page
  2019-04-04  3:30 [PATCH] mm:workingset use real time to judge activity of the file page Zhaoyang Huang
  2019-04-04  7:15 ` Michal Hocko
  2019-04-04 16:39 ` Johannes Weiner
@ 2019-04-05  3:24 ` Matthew Wilcox
  2 siblings, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2019-04-05  3:24 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Andrew Morton, Vlastimil Babka, Pavel Tatashin, Joonsoo Kim,
	David Rientjes, Zhaoyang Huang, Roman Gushchin, Jeff Layton,
	Matthew Wilcox, linux-mm, linux-kernel

On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> +++ b/mm/workingset.c
> @@ -159,7 +159,7 @@
>  			 NODES_SHIFT +	\
>  			 MEM_CGROUP_ID_SHIFT)
>  #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
> -
> +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
>  /*
>   * Eviction timestamps need to be able to cover the full range of
>   * actionable refaults. However, bits are tight in the radix tree
> @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
>  	eviction >>= bucket_order;
>  	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
>  	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> +	eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
>  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

... this isn't against current, or even 5.0.

>  	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +	entry >>= EVICTION_JIFFIES;
> +	prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;

These two lines are in the wrong order.  So you're getting (effectively) a
random answer in your 'prev_jiff', which means your testing isn't thorough
enough.  I suspect you're only testing cases you're expecting to improve,
and you aren't testing to make sure that other cases don't regress.



* Re: [PATCH] mm:workingset use real time to judge activity of the file page
  2019-04-04 23:23   ` Zhaoyang Huang
@ 2019-04-05 19:34     ` Johannes Weiner
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2019-04-05 19:34 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Andrew Morton, Vlastimil Babka, Pavel Tatashin, Joonsoo Kim,
	David Rientjes, Zhaoyang Huang, Roman Gushchin, Jeff Layton,
	Matthew Wilcox, open list:MEMORY MANAGEMENT, LKML

On Fri, Apr 05, 2019 at 07:23:46AM +0800, Zhaoyang Huang wrote:
> On Fri, Apr 5, 2019 at 12:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > > The current implementation judges the refault period of each page by
> > > > the number of pages evicted in between, which is imprecise because
> > > > evictions of other files strongly affect the result for the current cache.
> > > > Add a timestamp to the workingset shadow entry, plus a refault ratio,
> > > > to measure a file page's activity. This reduces the influence of other
> > > > files (the average refault ratio reflects the memory state of the whole
> > > > system).
> >
> > I don't understand what exactly you're saying here, can you please
> > elaborate?
> >
> > The reason it's using distances instead of absolute time is because
> > the ordering of the LRU is relative and not based on absolute time.
> >
> > E.g. if a page is accessed every 500ms, it depends on all other pages
> > to determine whether this page is at the head or the tail of the LRU.
> >
> > So when you refault, in order to determine the relative position of
> > the refaulted page in the LRU, you have to compare it to how fast that
> > LRU is moving. The absolute refault time, or the average time between
> > refaults, is not comparable to what's already in memory.
> How do you know how long it took to drop these pages? Actually, a
> quick drop of a large number of pages will be wrongly treated as a
> slow drop rather than as the severe situation it is. That is to say,
> 100 pages dropped per millisecond and 100 pages dropped per second have
> the same effect on the refault distance calculation, which may leave
> this page cache under-protected in the former scenario and introduce
> page thrashing. This matters especially during global reclaim: a round
> of kswapd reclaim woken up by a high-order allocation or by a large
> number of single-page allocations may result in all pages within the
> node being counted in the same LRU. This commit mitigates the above by
> comparing the refault time of a single page with avg_refault_time =
> delta_lru_reclaimed_pages / avg_refault_ratio (refault_ratio =
> lru->inactive_age / time).

When something like a higher-order allocation drops a large number of
file pages, it's *intentional* that the pages that were evicted before
them become less valuable and less likely to be activated on
refault. There is a finite amount of in-memory LRU space and the pages
that have been evicted the most recently have precedence because they
have the highest proven access frequency.

Of course, when a large amount of the cache that was pushed out in
between is not re-used again, and doesn't claim its space in memory,
it would be great if we could then activate the older pages that *are*
re-used again in their stead.

But that would require us being able to look into the future. When an
old page refaults, we don't know if a younger page is still going to
refault with a shorter refault distance or not. If it won't, then we
were right to activate it. If it will refault, then we put something
on the active list whose reuse frequency is too low to be able to fit
into memory, and we thrash the hottest pages in the system.

As Matthew says, you are fairly randomly making refault activations
more aggressive (especially with that timestamp unpacking bug), and
while that expectedly boosts workload transition / startup, it comes
at the cost of disrupting stable states because you can flood a very
active in-ram workingset with completely cold cache pages simply
because they refault uniformly wrt each other.


* Re: [PATCH] mm:workingset use real time to judge activity of the file page
  2019-04-04  2:01 Zhaoyang Huang
@ 2019-04-07  0:43 ` Suren Baghdasaryan
  0 siblings, 0 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2019-04-07  0:43 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Andrew Morton, Vlastimil Babka, Pavel Tatashin, Joonsoo Kim,
	David Rientjes, Roman Gushchin, Jeff Layton, Matthew Wilcox,
	linux-mm, LKML

On Wed, Apr 3, 2019 at 7:03 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> From: Zhaoyang Huang <Zhaoyang Huang@unisoc.com>
>
> The current implementation judges the refault period of each page by
> the number of pages evicted in between, which is imprecise.
> Add a timestamp to the workingset shadow entry to measure
> a file page's activity.
>
> The patch was tested on an Android system by comparing the launch time
> of an application under heavy memory consumption. Launch time decreased
> by 50% and page faults during the test decreased by 80%.
>
> Signed-off-by: Zhaoyang Huang <huangzhaoyang@gmail.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/workingset.c        | 24 +++++++++++++++++-------
>  2 files changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 32699b2..c38ba0a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -240,6 +240,8 @@ struct lruvec {
>         atomic_long_t                   inactive_age;
>         /* Refaults at the time of last reclaim cycle */
>         unsigned long                   refaults;
> +       atomic_long_t                   refaults_ratio;
> +       atomic_long_t                   prev_fault;
>  #ifdef CONFIG_MEMCG
>         struct pglist_data *pgdat;
>  #endif
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 40ee02c..6361853 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -159,7 +159,7 @@
>                          NODES_SHIFT +  \
>                          MEM_CGROUP_ID_SHIFT)
>  #define EVICTION_MASK  (~0UL >> EVICTION_SHIFT)
> -
> +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
>  /*
>   * Eviction timestamps need to be able to cover the full range of
>   * actionable refaults. However, bits are tight in the radix tree
> @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
>         eviction >>= bucket_order;
>         eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
>         eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> +       eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
>         eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
>
>         return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
>  }
>
>  static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> -                         unsigned long *evictionp)
> +                         unsigned long *evictionp, unsigned long *prev_jiffp)
>  {
>         unsigned long entry = (unsigned long)shadow;
>         int memcgid, nid;
> +       unsigned long prev_jiff;
>
>         entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +       entry >>= EVICTION_JIFFIES;
> +       prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
>         nid = entry & ((1UL << NODES_SHIFT) - 1);
>         entry >>= NODES_SHIFT;
>         memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
> @@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
>         *memcgidp = memcgid;
>         *pgdat = NODE_DATA(nid);
>         *evictionp = entry << bucket_order;
> +       *prev_jiffp = prev_jiff;
>  }
>
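
A note on the two hunks above: as posted, pack_shadow() ORs in
(jiffies >> EVICTION_JIFFIES) without masking it to EVICTION_JIFFIES bits,
and unpack_shadow() shifts the timestamp field out before masking, so
prev_jiff appears to be read from the node-id bits instead. Below is a
minimal sketch of a masked round trip. It is not the patch's code: the
sk_ names and the field widths are assumptions chosen for illustration,
and the real widths depend on the kernel configuration.

```c
#include <stdint.h>

/* Hypothetical field widths for illustration; real values are config-dependent. */
#define SK_TAG_SHIFT      2   /* stands in for RADIX_TREE_EXCEPTIONAL_SHIFT */
#define SK_JIFFIES_BITS   8   /* BITS_PER_LONG >> 3 on a 64-bit build */
#define SK_NODES_SHIFT    6
#define SK_MEMCG_ID_SHIFT 16
#define SK_MASK(bits)     ((UINT64_C(1) << (bits)) - 1)

struct sk_shadow {
	uint64_t eviction;
	uint64_t memcgid;
	uint64_t nid;
	uint64_t stamp;	/* low SK_JIFFIES_BITS bits of (jiffies >> SK_JIFFIES_BITS) */
};

static uint64_t sk_pack(uint64_t eviction, uint64_t memcgid, uint64_t nid,
			uint64_t now)
{
	uint64_t e = eviction;

	e = (e << SK_MEMCG_ID_SHIFT) | (memcgid & SK_MASK(SK_MEMCG_ID_SHIFT));
	e = (e << SK_NODES_SHIFT) | (nid & SK_MASK(SK_NODES_SHIFT));
	/* Mask the coarse timestamp so it cannot spill into the other fields. */
	e = (e << SK_JIFFIES_BITS) |
	    ((now >> SK_JIFFIES_BITS) & SK_MASK(SK_JIFFIES_BITS));
	return e << SK_TAG_SHIFT;
}

static struct sk_shadow sk_unpack(uint64_t entry)
{
	struct sk_shadow s;

	entry >>= SK_TAG_SHIFT;
	/* Read each field first, then shift it out: the reverse of sk_pack(). */
	s.stamp = entry & SK_MASK(SK_JIFFIES_BITS);
	entry >>= SK_JIFFIES_BITS;
	s.nid = entry & SK_MASK(SK_NODES_SHIFT);
	entry >>= SK_NODES_SHIFT;
	s.memcgid = entry & SK_MASK(SK_MEMCG_ID_SHIFT);
	entry >>= SK_MEMCG_ID_SHIFT;
	s.eviction = entry;
	return s;
}
```

With these widths, packing eviction 12345, memcg 42 and node 3 at
jiffies 0x12345 and then unpacking returns the same three values plus
the coarse stamp 0x23.
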
>  /**
> @@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
>         unsigned long refault;
>         struct pglist_data *pgdat;
>         int memcgid;
> +       unsigned long refault_ratio;
> +       unsigned long prev_jiff;
> +       unsigned long avg_refault_time;
> +       unsigned long refault_time;
>
> -       unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
> +       unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);
>
>         rcu_read_lock();
>         /*
> @@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
>          * list is not a problem.
>          */
>         refault_distance = (refault - eviction) & EVICTION_MASK;
> -
>         inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
> -
> -       if (refault_distance <= active_file) {
> +       lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;

I also wonder how many times the division above yields a 0...

> +       refault_time = jiffies - prev_jiff;
> +       avg_refault_time = refault_distance / lruvec->refaults_ratio;

and then used here as a denominator.
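
To make the concern concrete: with unsigned integer division,
inactive_age / jiffies is 0 whenever fewer pages have been evicted than
jiffies have elapsed, and the following division by that ratio is then
undefined. The sketch below is one possible way to avoid both problems
by multiplying before dividing. The sk_ helper names are hypothetical,
and this is not a proposal for what the patch should do, only an
illustration of the arithmetic.

```c
#include <stdint.h>

/* The ratio as the patch computes it: truncates to 0 when inactive_age < now. */
static uint64_t sk_ratio_truncating(uint64_t inactive_age, uint64_t now)
{
	return now ? inactive_age / now : 0;
}

/*
 * distance / (inactive_age / now), rearranged as distance * now / inactive_age
 * so that a small eviction rate is not truncated to 0 first.
 * Returns 0 and stores the result in *avg, or -1 if the value is undefined
 * (no evictions yet) or the product does not fit in 64 bits.
 */
static int sk_avg_refault_time(uint64_t inactive_age, uint64_t now,
			       uint64_t distance, uint64_t *avg)
{
	if (inactive_age == 0)
		return -1;
	if (now != 0 && distance > UINT64_MAX / now)
		return -1;
	*avg = distance * now / inactive_age;
	return 0;
}
```

For example, 5000 evictions after 1000000 jiffies give a truncated ratio
of 0, while the rearranged form yields an average refault time of 40000
jiffies for a refault distance of 200.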

> +       if (refault_time <= avg_refault_time) {
>                 inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
>                 rcu_read_unlock();
>                 return true;
> @@ -521,7 +531,7 @@ static int __init workingset_init(void)
>          * some more pages at runtime, so keep working with up to
>          * double the initial memory by using totalram_pages as-is.
>          */
> -       timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
> +       timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
>         max_order = fls_long(totalram_pages - 1);
>         if (max_order > timestamp_bits)
>                 bucket_order = max_order - timestamp_bits;
> --
> 1.9.1
>


* [PATCH] mm:workingset use real time to judge activity of the file page
@ 2019-04-04  2:01 Zhaoyang Huang
  2019-04-07  0:43 ` Suren Baghdasaryan
  0 siblings, 1 reply; 9+ messages in thread
From: Zhaoyang Huang @ 2019-04-04  2:01 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Pavel Tatashin, Joonsoo Kim,
	David Rientjes, Roman Gushchin, Jeff Layton, Matthew Wilcox,
	linux-mm, linux-kernel

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

The previous implementation uses the number of refaulted pages to
judge the refault period of each page, which is imprecise.
This patch adds a timestamp to the workingset shadow entry to measure
a file page's activity.

The patch was tested on an Android system by comparing the launch time
of an application under heavy memory consumption. Launch time decreased
by 50% and page faults during the test decreased by 80%.

Signed-off-by: Zhaoyang Huang <huangzhaoyang@gmail.com>
---
 include/linux/mmzone.h |  2 ++
 mm/workingset.c        | 24 +++++++++++++++++-------
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..c38ba0a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -240,6 +240,8 @@ struct lruvec {
 	atomic_long_t			inactive_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
+	atomic_long_t			refaults_ratio;
+	atomic_long_t			prev_fault;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c..6361853 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -159,7 +159,7 @@
 			 NODES_SHIFT +	\
 			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
-
+#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
 /*
  * Eviction timestamps need to be able to cover the full range of
  * actionable refaults. However, bits are tight in the radix tree
@@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, unsigned long *prev_jiffp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	unsigned long prev_jiff;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	entry >>= EVICTION_JIFFIES;
+	prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*prev_jiffp = prev_jiff;
 }
 
 /**
@@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
 	unsigned long refault;
 	struct pglist_data *pgdat;
 	int memcgid;
+	unsigned long refault_ratio;
+	unsigned long prev_jiff;
+	unsigned long avg_refault_time;
+	unsigned long refault_time;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);
 
 	rcu_read_lock();
 	/*
@@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
 	 * list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
-
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
-
-	if (refault_distance <= active_file) {
+	lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;
+	refault_time = jiffies - prev_jiff;
+	avg_refault_time = refault_distance / lruvec->refaults_ratio;
+	if (refault_time <= avg_refault_time) {
 		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 		rcu_read_unlock();
 		return true;
@@ -521,7 +531,7 @@ static int __init workingset_init(void)
 	 * some more pages at runtime, so keep working with up to
 	 * double the initial memory by using totalram_pages as-is.
 	 */
-	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
 	max_order = fls_long(totalram_pages - 1);
 	if (max_order > timestamp_bits)
 		bucket_order = max_order - timestamp_bits;
-- 
1.9.1
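
The last hunk takes EVICTION_JIFFIES bits away from the eviction
counter, which raises bucket_order, and the timestamp field itself only
covers a limited window before it wraps. The sketch below works through
that arithmetic. The sk_ helpers are hypothetical, and the example
values (an EVICTION_SHIFT of 18, HZ=250, a timestamp field masked to
EVICTION_JIFFIES bits) are assumptions for illustration, not values
taken from the patch.

```c
#include <stdint.h>

/* timestamp_bits as computed in the patched workingset_init(). */
static unsigned int sk_timestamp_bits(unsigned int bits_per_long,
				      unsigned int eviction_shift)
{
	return bits_per_long - eviction_shift - (bits_per_long >> 3);
}

/* bucket_order: how many low bits of the eviction counter are dropped. */
static unsigned int sk_bucket_order(unsigned int max_order,
				    unsigned int timestamp_bits)
{
	return max_order > timestamp_bits ? max_order - timestamp_bits : 0;
}

/*
 * A field of J bits holding (jiffies >> J) distinguishes 2^J coarse ticks
 * of 2^J jiffies each, so it wraps after 2^(2J) jiffies.
 */
static uint64_t sk_stamp_window_jiffies(unsigned int bits_per_long)
{
	unsigned int j = bits_per_long >> 3;

	return UINT64_C(1) << (2 * j);
}
```

Under these assumptions a 32-bit build keeps 10 timestamp bits instead
of 14, so a machine with max_order 20 goes from bucket_order 6 to 10.
The time window is 65536 jiffies on 64-bit (roughly 262 seconds at
HZ=250) and 256 jiffies on 32-bit (about one second at HZ=250).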




Thread overview: 9+ messages
2019-04-04  3:30 [PATCH] mm:workingset use real time to judge activity of the file page Zhaoyang Huang
2019-04-04  7:15 ` Michal Hocko
2019-04-05  3:13   ` Zhaoyang Huang
2019-04-04 16:39 ` Johannes Weiner
2019-04-04 23:23   ` Zhaoyang Huang
2019-04-05 19:34     ` Johannes Weiner
2019-04-05  3:24 ` Matthew Wilcox
  -- strict thread matches above, loose matches on Subject: below --
2019-04-04  2:01 Zhaoyang Huang
2019-04-07  0:43 ` Suren Baghdasaryan
