linux-kernel.vger.kernel.org archive mirror
* [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
@ 2020-11-20  8:27 Alex Shi
  2020-11-20 23:19 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Alex Shi @ 2020-11-20  8:27 UTC (permalink / raw)
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Yu Zhao,
	Michal Hocko, linux-mm, linux-kernel

The current relock logic changes the lru_lock whenever a new lruvec is
found, so if two memcgs are reading files or allocating pages at the
same time, they can end up holding the lru_lock alternately, waiting
on each other because of the fairness property of ticket spinlocks.

This patch sorts the pages by lruvec and holds each lru_lock only once
in the above scenario, which avoids the fairness-induced waiting on
lock reacquisition. With this change, vm-scalability/case-lru-file-readtwice
gains ~5% performance on my 2-socket * 20-core * HT machine.

Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 49 insertions(+), 8 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 490553f3f9ef..c787b38bf9c0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 	trace_mm_lru_insertion(page, lru);
 }
 
+struct lruvecs {
+	struct list_head lists[PAGEVEC_SIZE];
+	struct lruvec *vecs[PAGEVEC_SIZE];
+};
+
+/* Sort pvec pages on their lruvec */
+int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
+{
+	int i, j, nr_lruvec;
+	struct page *page;
+	struct lruvec *lruvec = NULL;
+
+	lruvecs->vecs[0] = NULL;
+	for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
+		page = pvec->pages[i];
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+		/* Try to find a same lruvec */
+		for (j = 0; j <= nr_lruvec; j++)
+			if (lruvec == lruvecs->vecs[j])
+				break;
+
+		/* A new lruvec */
+		if (j > nr_lruvec) {
+			INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
+			lruvecs->vecs[nr_lruvec] = lruvec;
+			j = nr_lruvec++;
+			lruvecs->vecs[nr_lruvec] = 0;
+		}
+
+		list_add_tail(&page->lru, &lruvecs->lists[j]);
+	}
+
+	return nr_lruvec;
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
-	struct lruvec *lruvec = NULL;
+	int i, nr_lruvec;
 	unsigned long flags = 0;
+	struct page *page;
+	struct lruvecs lruvecs;
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+	for (i = 0; i < nr_lruvec; i++) {
+		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
+		while (!list_empty(&lruvecs.lists[i])) {
+			page = lru_to_page(&lruvecs.lists[i]);
+			list_del(&page->lru);
+			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
+		}
+		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
-- 
2.29.GIT


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-20  8:27 [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add Alex Shi
@ 2020-11-20 23:19 ` Andrew Morton
  2020-11-23  4:46   ` Alex Shi
  2020-11-25 15:38 ` Vlastimil Babka
  2020-11-26  4:52 ` Yu Zhao
  2 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2020-11-20 23:19 UTC (permalink / raw)
  To: Alex Shi
  Cc: Konstantin Khlebnikov, Hugh Dickins, Yu Zhao, Michal Hocko,
	linux-mm, linux-kernel

On Fri, 20 Nov 2020 16:27:27 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:

> The current relock logical will change lru_lock when found a new
> lruvec, so if 2 memcgs are reading file or alloc page at same time,
> they could hold the lru_lock alternately, and wait for each other for
> fairness attribute of ticket spin lock.
> 
> This patch will sort that all lru_locks and only hold them once in
> above scenario. That could reduce fairness waiting for lock reget.
> Than, vm-scalability/case-lru-file-readtwice could get ~5% performance
> gain on my 2P*20core*HT machine.

But what happens when all or most of the pages belong to the same
lruvec?  This sounds like the common case - won't it suffer?



* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-20 23:19 ` Andrew Morton
@ 2020-11-23  4:46   ` Alex Shi
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-11-23  4:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, Hugh Dickins, Yu Zhao, Michal Hocko,
	linux-mm, linux-kernel



On 2020/11/21 7:19 AM, Andrew Morton wrote:
> On Fri, 20 Nov 2020 16:27:27 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> 
>> The current relock logical will change lru_lock when found a new
>> lruvec, so if 2 memcgs are reading file or alloc page at same time,
>> they could hold the lru_lock alternately, and wait for each other for
>> fairness attribute of ticket spin lock.
>>
>> This patch will sort that all lru_locks and only hold them once in
>> above scenario. That could reduce fairness waiting for lock reget.
>> Than, vm-scalability/case-lru-file-readtwice could get ~5% performance
>> gain on my 2P*20core*HT machine.
> 
> But what happens when all or most of the pages belong to the same
> lruvec?  This sounds like the common case - won't it suffer?
> 
Hi Andrew,

My testing shows no regression in this situation, e.g. on an unmodified
CentOS 7 setup; for lru-sensitive cases the most time is still spent on
the lru_lock.

Thanks
Alex


* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-20  8:27 [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add Alex Shi
  2020-11-20 23:19 ` Andrew Morton
@ 2020-11-25 15:38 ` Vlastimil Babka
  2020-11-26  3:12   ` Alex Shi
  2020-11-26  4:52 ` Yu Zhao
  2 siblings, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2020-11-25 15:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Yu Zhao,
	Michal Hocko, linux-mm, linux-kernel

On 11/20/20 9:27 AM, Alex Shi wrote:
> The current relock logical will change lru_lock when found a new
> lruvec, so if 2 memcgs are reading file or alloc page at same time,
> they could hold the lru_lock alternately, and wait for each other for
> fairness attribute of ticket spin lock.
> 
> This patch will sort that all lru_locks and only hold them once in
> above scenario. That could reduce fairness waiting for lock reget.
> Than, vm-scalability/case-lru-file-readtwice could get ~5% performance
> gain on my 2P*20core*HT machine.

Hm, once you sort the pages like this, it's a shame not to splice them 
instead of doing more list_del() + list_add() iterations. update_lru_size() 
could also be called once?

> Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 49 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 490553f3f9ef..c787b38bf9c0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>   	trace_mm_lru_insertion(page, lru);
>   }
>   
> +struct lruvecs {
> +	struct list_head lists[PAGEVEC_SIZE];
> +	struct lruvec *vecs[PAGEVEC_SIZE];
> +};
> +
> +/* Sort pvec pages on their lruvec */
> +int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
> +{
> +	int i, j, nr_lruvec;
> +	struct page *page;
> +	struct lruvec *lruvec = NULL;
> +
> +	lruvecs->vecs[0] = NULL;
> +	for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
> +		page = pvec->pages[i];
> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
> +		/* Try to find a same lruvec */
> +		for (j = 0; j <= nr_lruvec; j++)
> +			if (lruvec == lruvecs->vecs[j])
> +				break;
> +
> +		/* A new lruvec */
> +		if (j > nr_lruvec) {
> +			INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
> +			lruvecs->vecs[nr_lruvec] = lruvec;
> +			j = nr_lruvec++;
> +			lruvecs->vecs[nr_lruvec] = 0;
> +		}
> +
> +		list_add_tail(&page->lru, &lruvecs->lists[j]);
> +	}
> +
> +	return nr_lruvec;
> +}
> +
>   /*
>    * Add the passed pages to the LRU, then drop the caller's refcount
>    * on them.  Reinitialises the caller's pagevec.
>    */
>   void __pagevec_lru_add(struct pagevec *pvec)
>   {
> -	int i;
> -	struct lruvec *lruvec = NULL;
> +	int i, nr_lruvec;
>   	unsigned long flags = 0;
> +	struct page *page;
> +	struct lruvecs lruvecs;
>   
> -	for (i = 0; i < pagevec_count(pvec); i++) {
> -		struct page *page = pvec->pages[i];
> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>   
> -		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);
> +	for (i = 0; i < nr_lruvec; i++) {
> +		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
> +		while (!list_empty(&lruvecs.lists[i])) {
> +			page = lru_to_page(&lruvecs.lists[i]);
> +			list_del(&page->lru);
> +			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
> +		}
> +		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
>   	}
> -	if (lruvec)
> -		unlock_page_lruvec_irqrestore(lruvec, flags);
> +
>   	release_pages(pvec->pages, pvec->nr);
>   	pagevec_reinit(pvec);
>   }
> 



* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-25 15:38 ` Vlastimil Babka
@ 2020-11-26  3:12   ` Alex Shi
  2020-11-26 11:05     ` Vlastimil Babka
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Shi @ 2020-11-26  3:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Yu Zhao,
	Michal Hocko, linux-mm, linux-kernel



On 2020/11/25 11:38 PM, Vlastimil Babka wrote:
> On 11/20/20 9:27 AM, Alex Shi wrote:
>> The current relock logical will change lru_lock when found a new
>> lruvec, so if 2 memcgs are reading file or alloc page at same time,
>> they could hold the lru_lock alternately, and wait for each other for
>> fairness attribute of ticket spin lock.
>>
>> This patch will sort that all lru_locks and only hold them once in
>> above scenario. That could reduce fairness waiting for lock reget.
>> Than, vm-scalability/case-lru-file-readtwice could get ~5% performance
>> gain on my 2P*20core*HT machine.
> 
> Hm, once you sort the pages like this, it's a shame not to splice them instead of more list_del() + list_add() iterations. update_lru_size() could be also called once?

Yes, it looks like a good idea to use splice instead of list_del/list_add,
but pages may be on different lru lists within the same lruvec, and may
also come from different zones. That could mean iterating 5 times for the
different lists, and more for zones...

So I gave up on that attempt.


* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-20  8:27 [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add Alex Shi
  2020-11-20 23:19 ` Andrew Morton
  2020-11-25 15:38 ` Vlastimil Babka
@ 2020-11-26  4:52 ` Yu Zhao
  2020-11-26  6:39   ` Alex Shi
  2 siblings, 1 reply; 23+ messages in thread
From: Yu Zhao @ 2020-11-26  4:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Michal Hocko,
	linux-mm, linux-kernel

On Fri, Nov 20, 2020 at 04:27:27PM +0800, Alex Shi wrote:
> The current relock logical will change lru_lock when found a new
> lruvec, so if 2 memcgs are reading file or alloc page at same time,
> they could hold the lru_lock alternately, and wait for each other for
> fairness attribute of ticket spin lock.
> 
> This patch will sort that all lru_locks and only hold them once in
> above scenario. That could reduce fairness waiting for lock reget.
> Than, vm-scalability/case-lru-file-readtwice could get ~5% performance
> gain on my 2P*20core*HT machine.
> 
> Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 49 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 490553f3f9ef..c787b38bf9c0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>  	trace_mm_lru_insertion(page, lru);
>  }
>  
> +struct lruvecs {
> +	struct list_head lists[PAGEVEC_SIZE];
> +	struct lruvec *vecs[PAGEVEC_SIZE];
> +};
> +
> +/* Sort pvec pages on their lruvec */
> +int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
> +{
> +	int i, j, nr_lruvec;
> +	struct page *page;
> +	struct lruvec *lruvec = NULL;
> +
> +	lruvecs->vecs[0] = NULL;
> +	for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
> +		page = pvec->pages[i];
> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
> +		/* Try to find a same lruvec */
> +		for (j = 0; j <= nr_lruvec; j++)
> +			if (lruvec == lruvecs->vecs[j])
> +				break;
> +
> +		/* A new lruvec */
> +		if (j > nr_lruvec) {
> +			INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
> +			lruvecs->vecs[nr_lruvec] = lruvec;
> +			j = nr_lruvec++;
> +			lruvecs->vecs[nr_lruvec] = 0;
> +		}
> +
> +		list_add_tail(&page->lru, &lruvecs->lists[j]);
> +	}
> +
> +	return nr_lruvec;
> +}
> +
>  /*
>   * Add the passed pages to the LRU, then drop the caller's refcount
>   * on them.  Reinitialises the caller's pagevec.
>   */
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
> -	int i;
> -	struct lruvec *lruvec = NULL;
> +	int i, nr_lruvec;
>  	unsigned long flags = 0;
> +	struct page *page;
> +	struct lruvecs lruvecs;
>  
> -	for (i = 0; i < pagevec_count(pvec); i++) {
> -		struct page *page = pvec->pages[i];
> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);

Simply looping pvec multiple times (15 at most) for different lruvecs
would be better because 1) it requires no extra data structures and
therefore has better cache locality (theoretically faster) 2) it only
loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
impact on Android and Chrome OS.

> -		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);
> +	for (i = 0; i < nr_lruvec; i++) {
> +		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
> +		while (!list_empty(&lruvecs.lists[i])) {
> +			page = lru_to_page(&lruvecs.lists[i]);
> +			list_del(&page->lru);
> +			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
> +		}
> +		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
>  	}
> -	if (lruvec)
> -		unlock_page_lruvec_irqrestore(lruvec, flags);
> +
>  	release_pages(pvec->pages, pvec->nr);
>  	pagevec_reinit(pvec);
>  }
> -- 
> 2.29.GIT
> 


* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-26  4:52 ` Yu Zhao
@ 2020-11-26  6:39   ` Alex Shi
  2020-11-26  7:24     ` Yu Zhao
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Shi @ 2020-11-26  6:39 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Michal Hocko,
	linux-mm, linux-kernel



On 2020/11/26 12:52 PM, Yu Zhao wrote:
>>   */
>>  void __pagevec_lru_add(struct pagevec *pvec)
>>  {
>> -	int i;
>> -	struct lruvec *lruvec = NULL;
>> +	int i, nr_lruvec;
>>  	unsigned long flags = 0;
>> +	struct page *page;
>> +	struct lruvecs lruvecs;
>>  
>> -	for (i = 0; i < pagevec_count(pvec); i++) {
>> -		struct page *page = pvec->pages[i];
>> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
> Simply looping pvec multiple times (15 at most) for different lruvecs
> would be better because 1) it requires no extra data structures and
> therefore has better cache locality (theoretically faster) 2) it only
> loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
> impact on Android and Chrome OS.
> 

With multiple memcgs, it does help a lot; I got a 30% gain on the
readtwice case. But yes, w/o MEMCG and NUMA it's good to keep the old
behavior. So would you like to send a proposal for this?

Thanks
Alex


* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-26  6:39   ` Alex Shi
@ 2020-11-26  7:24     ` Yu Zhao
  2020-11-26  8:09       ` Alex Shi
  2020-11-26 11:22       ` Vlastimil Babka
  0 siblings, 2 replies; 23+ messages in thread
From: Yu Zhao @ 2020-11-26  7:24 UTC (permalink / raw)
  To: Alex Shi
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Michal Hocko,
	linux-mm, linux-kernel

On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
> 
> 
> On 2020/11/26 12:52 PM, Yu Zhao wrote:
> >>   */
> >>  void __pagevec_lru_add(struct pagevec *pvec)
> >>  {
> >> -	int i;
> >> -	struct lruvec *lruvec = NULL;
> >> +	int i, nr_lruvec;
> >>  	unsigned long flags = 0;
> >> +	struct page *page;
> >> +	struct lruvecs lruvecs;
> >>  
> >> -	for (i = 0; i < pagevec_count(pvec); i++) {
> >> -		struct page *page = pvec->pages[i];
> >> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
> > Simply looping pvec multiple times (15 at most) for different lruvecs
> > would be better because 1) it requires no extra data structures and
> > therefore has better cache locality (theoretically faster) 2) it only
> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
> > impact on Android and Chrome OS.
> > 
> 
> With multiple memcgs, it do help a lot, I had gotten 30% grain on readtwice
> case. but yes, w/o MEMCG and NUMA, it's good to keep old behavior. So 
> would you like has a proposal for this?

Oh, no, I'm not against your idea. I was saying it doesn't seem
necessary to sort -- a nested loop would just do the job given
pagevec is small.

diff --git a/mm/swap.c b/mm/swap.c
index cb3794e13b48..1d238edc2907 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
+	int i, j;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
+		if (!page)
+			continue;
+
 		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+
+		for (j = i; j < pagevec_count(pvec); j++) {
+			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
+			    page_memcg(pvec->pages[j]) != page_memcg(page))
+				continue;
+
+			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
+			pvec->pages[j] = NULL;
+		}
 	}
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);


* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-26  7:24     ` Yu Zhao
@ 2020-11-26  8:09       ` Alex Shi
  2020-11-26 11:22       ` Vlastimil Babka
  1 sibling, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-11-26  8:09 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Michal Hocko,
	linux-mm, linux-kernel



On 2020/11/26 3:24 PM, Yu Zhao wrote:
> Oh, no, I'm not against your idea. I was saying it doesn't seem
> necessary to sort -- a nested loop would just do the job given
> pagevec is small.
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index cb3794e13b48..1d238edc2907 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>   */
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
> -	int i;
> +	int i, j;
>  	struct lruvec *lruvec = NULL;
>  	unsigned long flags = 0;
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
>  
> +		if (!page)
> +			continue;
> +
>  		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);
> +
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
> +			    page_memcg(pvec->pages[j]) != page_memcg(page))
> +				continue;
> +
> +			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
> +			pvec->pages[j] = NULL;
> +		}

Uh, I have to say your method is better than mine.
And this could be reused for all relock_page_lruvec() sites. I expect this
could speed up lru performance a lot!


>  	}
>  	if (lruvec)
>  		unlock_page_lruvec_irqrestore(lruvec, flags);


* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-26  3:12   ` Alex Shi
@ 2020-11-26 11:05     ` Vlastimil Babka
  0 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2020-11-26 11:05 UTC (permalink / raw)
  To: Alex Shi
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Yu Zhao,
	Michal Hocko, linux-mm, linux-kernel

On 11/26/20 4:12 AM, Alex Shi wrote:
> 
> 
> On 2020/11/25 11:38 PM, Vlastimil Babka wrote:
>> On 11/20/20 9:27 AM, Alex Shi wrote:
>>> The current relock logical will change lru_lock when found a new
>>> lruvec, so if 2 memcgs are reading file or alloc page at same time,
>>> they could hold the lru_lock alternately, and wait for each other for
>>> fairness attribute of ticket spin lock.
>>>
>>> This patch will sort that all lru_locks and only hold them once in
>>> above scenario. That could reduce fairness waiting for lock reget.
>>> Than, vm-scalability/case-lru-file-readtwice could get ~5% performance
>>> gain on my 2P*20core*HT machine.
>> 
>> Hm, once you sort the pages like this, it's a shame not to splice them instead of more list_del() + list_add() iterations. update_lru_size() could be also called once?
> 
> Yes, looks it's a good idea to use splice instead of list_del/add, but pages
> may on different lru list in a same lruvec, and also may come from different
> zones. That could involve 5 cycles for different lists, and more for zones...

Hmm, zones wouldn't affect splicing (there's a per-node lru these days), but 
would affect accounting. And yeah, there are 5 lru lists, and we probably need 
to be under lru lock to stabilize where a page belongs, so pre-sorting without 
lock wouldn't be safe? Bummer.

> I give up the try.
> 



* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-26  7:24     ` Yu Zhao
  2020-11-26  8:09       ` Alex Shi
@ 2020-11-26 11:22       ` Vlastimil Babka
  2020-11-26 15:44         ` Vlastimil Babka
  1 sibling, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2020-11-26 11:22 UTC (permalink / raw)
  To: Yu Zhao, Alex Shi
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Michal Hocko,
	linux-mm, linux-kernel

On 11/26/20 8:24 AM, Yu Zhao wrote:
> On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>> 
>> 
>> On 2020/11/26 12:52 PM, Yu Zhao wrote:
>> >>   */
>> >>  void __pagevec_lru_add(struct pagevec *pvec)
>> >>  {
>> >> -	int i;
>> >> -	struct lruvec *lruvec = NULL;
>> >> +	int i, nr_lruvec;
>> >>  	unsigned long flags = 0;
>> >> +	struct page *page;
>> >> +	struct lruvecs lruvecs;
>> >>  
>> >> -	for (i = 0; i < pagevec_count(pvec); i++) {
>> >> -		struct page *page = pvec->pages[i];
>> >> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>> > Simply looping pvec multiple times (15 at most) for different lruvecs
>> > would be better because 1) it requires no extra data structures and
>> > therefore has better cache locality (theoretically faster) 2) it only
>> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
>> > impact on Android and Chrome OS.
>> > 
>> 
>> With multiple memcgs, it do help a lot, I had gotten 30% grain on readtwice
>> case. but yes, w/o MEMCG and NUMA, it's good to keep old behavior. So 
>> would you like has a proposal for this?
> 
> Oh, no, I'm not against your idea. I was saying it doesn't seem
> necessary to sort -- a nested loop would just do the job given
> pagevec is small.

Yeah, that could work. The worst case doesn't look nice (O(n^2)), but it should be 
rather rare to have pages from many different memcgs/nodes in one pagevec?

Maybe with the following change? Avoids going through the nulls if we got lucky 
(or have !MEMCG !NUMA).

> diff --git a/mm/swap.c b/mm/swap.c
> index cb3794e13b48..1d238edc2907 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>    */
>   void __pagevec_lru_add(struct pagevec *pvec)
>   {
> -	int i;
> +	int i, j;

int i, j, n;

>   	struct lruvec *lruvec = NULL;
>   	unsigned long flags = 0;
>   

n = pagevec_count(pvec);
>   	for (i = 0; i < pagevec_count(pvec); i++) {

    	for (i = 0; n; i++) {

>   		struct page *page = pvec->pages[i];
>   
> +		if (!page)
> +			continue;
> +
>   		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);

		n--;

> +
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
> +			    page_memcg(pvec->pages[j]) != page_memcg(page))
> +				continue;
> +
> +			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
> +			pvec->pages[j] = NULL;

			n--;
> +		}
>   	}
>   	if (lruvec)
>   		unlock_page_lruvec_irqrestore(lruvec, flags);
> 



* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-26 11:22       ` Vlastimil Babka
@ 2020-11-26 15:44         ` Vlastimil Babka
  2020-11-26 15:55           ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2020-11-26 15:44 UTC (permalink / raw)
  To: Yu Zhao, Alex Shi, Matthew Wilcox
  Cc: Konstantin Khlebnikov, Andrew Morton, Hugh Dickins, Michal Hocko,
	linux-mm, linux-kernel

On 11/26/20 12:22 PM, Vlastimil Babka wrote:
> On 11/26/20 8:24 AM, Yu Zhao wrote:
>> On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>>> 
>>> 
>>> On 2020/11/26 12:52 PM, Yu Zhao wrote:
>>> >>   */
>>> >>  void __pagevec_lru_add(struct pagevec *pvec)
>>> >>  {
>>> >> -	int i;
>>> >> -	struct lruvec *lruvec = NULL;
>>> >> +	int i, nr_lruvec;
>>> >>  	unsigned long flags = 0;
>>> >> +	struct page *page;
>>> >> +	struct lruvecs lruvecs;
>>> >>  
>>> >> -	for (i = 0; i < pagevec_count(pvec); i++) {
>>> >> -		struct page *page = pvec->pages[i];
>>> >> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>>> > Simply looping pvec multiple times (15 at most) for different lruvecs
>>> > would be better because 1) it requires no extra data structures and
>>> > therefore has better cache locality (theoretically faster) 2) it only
>>> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
>>> > impact on Android and Chrome OS.
>>> > 
>>> 
>>> With multiple memcgs, it do help a lot, I had gotten 30% grain on readtwice
>>> case. but yes, w/o MEMCG and NUMA, it's good to keep old behavior. So 
>>> would you like has a proposal for this?
>> 
>> Oh, no, I'm not against your idea. I was saying it doesn't seem
>> necessary to sort -- a nested loop would just do the job given
>> pagevec is small.
> 
> Yeah that could work. The worst case doesn't look nice (O(n^2)) but it should be
> rather rare to have pages from really multiple memcgs/nodes?

However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes 63^2, 
it starts to be somewhat more worrying.

[1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/

> Maybe with the following change? Avoids going through the nulls if we got lucky
> (or have !MEMCG !NUMA).
> 
>> diff --git a/mm/swap.c b/mm/swap.c
>> index cb3794e13b48..1d238edc2907 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>>    */
>>   void __pagevec_lru_add(struct pagevec *pvec)
>>   {
>> -	int i;
>> +	int i, j;
> 
> int i, j, n;
> 
>>   	struct lruvec *lruvec = NULL;
>>   	unsigned long flags = 0;
>>   
> 
> n = pagevec_count(pvec);
>>   	for (i = 0; i < pagevec_count(pvec); i++) {
> 
>      	for (i = 0; n; i++) {
> 
>>   		struct page *page = pvec->pages[i];
>>   
>> +		if (!page)
>> +			continue;
>> +
>>   		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>> -		__pagevec_lru_add_fn(page, lruvec);
> 
> 		n--;
> 
>> +
>> +		for (j = i; j < pagevec_count(pvec); j++) {
>> +			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
>> +			    page_memcg(pvec->pages[j]) != page_memcg(page))
>> +				continue;
>> +
>> +			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
>> +			pvec->pages[j] = NULL;
> 
> 			n--;
>> +		}
>>   	}
>>   	if (lruvec)
>>   		unlock_page_lruvec_irqrestore(lruvec, flags);
>> 
> 



* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-26 15:44         ` Vlastimil Babka
@ 2020-11-26 15:55           ` Matthew Wilcox
  2020-11-27  3:14             ` Alex Shi
                               ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-11-26 15:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Yu Zhao, Alex Shi, Konstantin Khlebnikov, Andrew Morton,
	Hugh Dickins, Michal Hocko, linux-mm, linux-kernel

On Thu, Nov 26, 2020 at 04:44:04PM +0100, Vlastimil Babka wrote:
> However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes
> 63^2, it starts to be somewhat more worrying.
> 
> [1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/

Well, Tim wanted it ;-)

I would suggest that rather than an insertion sort (or was it a bubble
sort?), we should be using a Shell sort.  It's ideal for these kinds of
smallish arrays.

https://en.wikipedia.org/wiki/Shellsort


* Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add
  2020-11-26 15:55           ` Matthew Wilcox
@ 2020-11-27  3:14             ` Alex Shi
  2020-12-01  8:02             ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
  2020-12-25  9:59             ` [RFC PATCH 0/4] pre sort pages on lruvec in pagevec Alex Shi
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-11-27  3:14 UTC (permalink / raw)
  To: Matthew Wilcox, Vlastimil Babka
  Cc: Yu Zhao, Konstantin Khlebnikov, Andrew Morton, Hugh Dickins,
	Michal Hocko, linux-mm, linux-kernel



On 2020/11/26 11:55 PM, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 04:44:04PM +0100, Vlastimil Babka wrote:
>> However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes
>> 63^2, it starts to be somewhat more worrying.
>>
>> [1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/
> 
> Well, Tim wanted it ;-)
> 
> I would suggest that rather than an insertion sort (or was it a bubble
> sort?), we should be using a Shell sort.  It's ideal for these kinds of
> smallish arrays.
> 
> https://en.wikipedia.org/wiki/Shellsort
> 

Ah, that looks like a perfect fit! I'll look into it. :)

Thanks!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn
  2020-11-26 15:55           ` Matthew Wilcox
  2020-11-27  3:14             ` Alex Shi
@ 2020-12-01  8:02             ` Alex Shi
  2020-12-01  8:02               ` [PATCH 2/3] mm/swap.c: bail out early for no memcg and no numa Alex Shi
                                 ` (2 more replies)
  2020-12-25  9:59             ` [RFC PATCH 0/4] pre sort pages on lruvec in pagevec Alex Shi
  2 siblings, 3 replies; 23+ messages in thread
From: Alex Shi @ 2020-12-01  8:02 UTC (permalink / raw)
  To: vbabka, alex.shi
  Cc: Konstantin Khlebnikov, Hugh Dickins, Yu Zhao, Michal Hocko,
	Matthew Wilcox (Oracle),
	Andrew Morton, linux-mm, linux-kernel

Pages in a pagevec may belong to different lruvecs, so we have to relock
in pagevec_lru_move_fn(); each relock may make the current CPU wait a
long time on the same lock because of spinlock fairness.

Before per-memcg lru_lock, we had to bear the relock, since the spinlock
was the only way to serialize a page's memcg/lruvec. Now TestClearPageLRU
can be used to isolate pages exclusively and stabilize the page's
lruvec/memcg. That gives us a chance to sort the pages by lruvec before
the move in pagevec_lru_move_fn(), so we no longer suffer the spinlock's
fairness wait.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 79 insertions(+), 13 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 490553f3f9ef..17d8990e5ca7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -201,29 +201,95 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 }
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
+/* Pratt's gaps for shell sort, https://en.wikipedia.org/wiki/Shellsort */
+static int gaps[] = { 6, 4, 3, 2, 1, 0};
+
+/* Shell sort pagevec[] on page's lruvec.*/
+static void shell_sort(struct pagevec *pvec, unsigned long *lvaddr)
+{
+	int g, i, j, n = pagevec_count(pvec);
+
+	for (g=0; gaps[g] > 0 && gaps[g] <= n/2; g++) {
+		int gap = gaps[g];
+
+		for (i = gap; i < n; i++) {
+			unsigned long tmp = lvaddr[i];
+			struct page *page = pvec->pages[i];
+
+			for (j = i - gap; j >= 0 && lvaddr[j] > tmp; j -= gap) {
+				lvaddr[j + gap] = lvaddr[j];
+				pvec->pages[j + gap] = pvec->pages[j];
+			}
+			lvaddr[j + gap] = tmp;
+			pvec->pages[j + gap] = page;
+		}
+	}
+}
+
+/* Get lru bit cleared page and their lruvec address, release the others */
+void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
+		unsigned long *lvaddr)
+{
+	int i, j;
+	struct pagevec busypv;
+
+	pagevec_init(&busypv);
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		pvec->pages[i] = NULL;
+
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page)) {
+			pagevec_add(&busypv, page);
+			continue;
+		}
+		lvaddr[j++] = (unsigned long)
+				mem_cgroup_page_lruvec(page, page_pgdat(page));
+		pagevec_add(isopv, page);
+	}
+	pagevec_reinit(pvec);
+	if (pagevec_count(&busypv))
+		release_pages(busypv.pages, busypv.nr);
+
+	shell_sort(isopv, lvaddr);
+}
+
 static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
-	int i;
+	int i, n;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
+	unsigned long lvaddr[PAGEVEC_SIZE];
+	struct pagevec isopv;
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+	pagevec_init(&isopv);
 
-		/* block memcg migration during page moving between lru */
-		if (!TestClearPageLRU(page))
-			continue;
+	sort_isopv(pvec, &isopv, lvaddr);
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		(*move_fn)(page, lruvec);
+	n = pagevec_count(&isopv);
+	if (!n)
+		return;
 
-		SetPageLRU(page);
+	lruvec = (struct lruvec *)lvaddr[0];
+	spin_lock_irqsave(&lruvec->lru_lock, flags);
+
+	for (i = 0; i < n; i++) {
+		/* lock new lruvec if lruvec changes, we have sorted them */
+		if (lruvec != (struct lruvec *)lvaddr[i]) {
+			spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+			lruvec = (struct lruvec *)lvaddr[i];
+			spin_lock_irqsave(&lruvec->lru_lock, flags);
+		}
+
+		(*move_fn)(isopv.pages[i], lruvec);
+
+		SetPageLRU(isopv.pages[i]);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
-	release_pages(pvec->pages, pvec->nr);
-	pagevec_reinit(pvec);
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+	release_pages(isopv.pages, isopv.nr);
 }
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
-- 
2.29.GIT


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 2/3] mm/swap.c: bail out early for no memcg and no numa
  2020-12-01  8:02             ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
@ 2020-12-01  8:02               ` Alex Shi
  2020-12-01  8:02               ` [PATCH 3/3] mm/swap.c: extend the usage to pagevec_lru_add Alex Shi
  2020-12-01  8:10               ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Michal Hocko
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-12-01  8:02 UTC (permalink / raw)
  To: vbabka, alex.shi
  Cc: Konstantin Khlebnikov, Hugh Dickins, Yu Zhao, Michal Hocko,
	Matthew Wilcox (Oracle),
	Andrew Morton, linux-mm, linux-kernel

If a system has memcg disabled and only one NUMA node, as on many
embedded systems, there is just one lruvec in the whole system and no
need to sort the pagevec; in that situation we can skip the sorting.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 17d8990e5ca7..814809845700 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -264,12 +264,17 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	unsigned long flags = 0;
 	unsigned long lvaddr[PAGEVEC_SIZE];
 	struct pagevec isopv;
+	struct pagevec *pv;
 
-	pagevec_init(&isopv);
-
-	sort_isopv(pvec, &isopv, lvaddr);
+	if (!mem_cgroup_disabled() || num_online_nodes() > 1) {
+		pagevec_init(&isopv);
+		sort_isopv(pvec, &isopv, lvaddr);
+		pv = &isopv;
+	} else {
+		pv = pvec;
+	}
 
-	n = pagevec_count(&isopv);
+	n = pagevec_count(pv);
 	if (!n)
 		return;
 
@@ -284,12 +289,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&lruvec->lru_lock, flags);
 		}
 
-		(*move_fn)(isopv.pages[i], lruvec);
+		(*move_fn)(pv->pages[i], lruvec);
 
-		SetPageLRU(isopv.pages[i]);
+		SetPageLRU(pv->pages[i]);
 	}
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
-	release_pages(isopv.pages, isopv.nr);
+	release_pages(pv->pages, pv->nr);
 }
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
-- 
2.29.GIT


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 3/3] mm/swap.c: extend the usage to pagevec_lru_add
  2020-12-01  8:02             ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
  2020-12-01  8:02               ` [PATCH 2/3] mm/swap.c: bail out early for no memcg and no numa Alex Shi
@ 2020-12-01  8:02               ` Alex Shi
  2020-12-01  8:10               ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Michal Hocko
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-12-01  8:02 UTC (permalink / raw)
  To: vbabka, alex.shi
  Cc: Konstantin Khlebnikov, Hugh Dickins, Yu Zhao, Michal Hocko,
	Matthew Wilcox (Oracle),
	Andrew Morton, linux-mm, linux-kernel

The only difference between __pagevec_lru_add and the other functions
that move pages between lru lists is that a page being added to an lru
list has no need for TestClearPageLRU, nor for setting the lru bit back
afterwards. So we can combine them with a clear-lru-bit switch passed as
a parameter to the sort function.

Then all the lru list operation functions can be unified.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 30 +++++++++++-------------------
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 814809845700..6a7920b2937f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -12,6 +12,7 @@
  * Started 18.12.91
  * Swap aging added 23.2.95, Stephen Tweedie.
  * Buffermem limits added 12.3.98, Rik van Riel.
+ * Pre-sort pagevec added 1.12.20, Alex Shi.
  */
 
 #include <linux/mm.h>
@@ -227,8 +228,8 @@ static void shell_sort(struct pagevec *pvec, unsigned long *lvaddr)
 }
 
 /* Get lru bit cleared page and their lruvec address, release the others */
-void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
-		unsigned long *lvaddr)
+static void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
+		unsigned long *lvaddr, bool clearlru)
 {
 	int i, j;
 	struct pagevec busypv;
@@ -241,7 +242,7 @@ void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
 		pvec->pages[i] = NULL;
 
 		/* block memcg migration during page moving between lru */
-		if (!TestClearPageLRU(page)) {
+		if (clearlru && !TestClearPageLRU(page)) {
 			pagevec_add(&busypv, page);
 			continue;
 		}
@@ -265,10 +266,13 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	unsigned long lvaddr[PAGEVEC_SIZE];
 	struct pagevec isopv;
 	struct pagevec *pv;
+	bool clearlru;
+
+	clearlru = pvec != this_cpu_ptr(&lru_pvecs.lru_add);
 
 	if (!mem_cgroup_disabled() || num_online_nodes() > 1) {
 		pagevec_init(&isopv);
-		sort_isopv(pvec, &isopv, lvaddr);
+		sort_isopv(pvec, &isopv, lvaddr, clearlru);
 		pv = &isopv;
 	} else {
 		pv = pvec;
@@ -291,7 +295,8 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 		(*move_fn)(pv->pages[i], lruvec);
 
-		SetPageLRU(pv->pages[i]);
+		if (clearlru)
+			SetPageLRU(pv->pages[i]);
 	}
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 	release_pages(pv->pages, pv->nr);
@@ -1086,20 +1091,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
-	struct lruvec *lruvec = NULL;
-	unsigned long flags = 0;
-
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
-	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
-	release_pages(pvec->pages, pvec->nr);
-	pagevec_reinit(pvec);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
 }
 
 /**
-- 
2.29.GIT


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn
  2020-12-01  8:02             ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
  2020-12-01  8:02               ` [PATCH 2/3] mm/swap.c: bail out early for no memcg and no numa Alex Shi
  2020-12-01  8:02               ` [PATCH 3/3] mm/swap.c: extend the usage to pagevec_lru_add Alex Shi
@ 2020-12-01  8:10               ` Michal Hocko
  2020-12-01  8:20                 ` Alex Shi
  2 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2020-12-01  8:10 UTC (permalink / raw)
  To: Alex Shi
  Cc: vbabka, Konstantin Khlebnikov, Hugh Dickins, Yu Zhao,
	Matthew Wilcox (Oracle),
	Andrew Morton, linux-mm, linux-kernel

On Tue 01-12-20 16:02:13, Alex Shi wrote:
> Pages in a pagevec may belong to different lruvecs, so we have to relock
> in pagevec_lru_move_fn(); each relock may make the current CPU wait a
> long time on the same lock because of spinlock fairness.
> 
> Before per-memcg lru_lock, we had to bear the relock, since the spinlock
> was the only way to serialize a page's memcg/lruvec. Now TestClearPageLRU
> can be used to isolate pages exclusively and stabilize the page's
> lruvec/memcg. That gives us a chance to sort the pages by lruvec before
> the move in pagevec_lru_move_fn(), so we no longer suffer the spinlock's
> fairness wait.

Do you have any data to show any improvements from this?

> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/swap.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 79 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 490553f3f9ef..17d8990e5ca7 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -201,29 +201,95 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
>  }
>  EXPORT_SYMBOL_GPL(get_kernel_page);
>  
> +/* Pratt's gaps for shell sort, https://en.wikipedia.org/wiki/Shellsort */
> +static int gaps[] = { 6, 4, 3, 2, 1, 0};
> +
> +/* Shell sort pagevec[] on page's lruvec.*/
> +static void shell_sort(struct pagevec *pvec, unsigned long *lvaddr)
> +{
> +	int g, i, j, n = pagevec_count(pvec);
> +
> +	for (g=0; gaps[g] > 0 && gaps[g] <= n/2; g++) {
> +		int gap = gaps[g];
> +
> +		for (i = gap; i < n; i++) {
> +			unsigned long tmp = lvaddr[i];
> +			struct page *page = pvec->pages[i];
> +
> +			for (j = i - gap; j >= 0 && lvaddr[j] > tmp; j -= gap) {
> +				lvaddr[j + gap] = lvaddr[j];
> +				pvec->pages[j + gap] = pvec->pages[j];
> +			}
> +			lvaddr[j + gap] = tmp;
> +			pvec->pages[j + gap] = page;
> +		}
> +	}
> +}
> +
> +/* Get lru bit cleared page and their lruvec address, release the others */
> +void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
> +		unsigned long *lvaddr)
> +{
> +	int i, j;
> +	struct pagevec busypv;
> +
> +	pagevec_init(&busypv);
> +
> +	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +
> +		pvec->pages[i] = NULL;
> +
> +		/* block memcg migration during page moving between lru */
> +		if (!TestClearPageLRU(page)) {
> +			pagevec_add(&busypv, page);
> +			continue;
> +		}
> +		lvaddr[j++] = (unsigned long)
> +				mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		pagevec_add(isopv, page);
> +	}
> +	pagevec_reinit(pvec);
> +	if (pagevec_count(&busypv))
> +		release_pages(busypv.pages, busypv.nr);
> +
> +	shell_sort(isopv, lvaddr);
> +}
> +
>  static void pagevec_lru_move_fn(struct pagevec *pvec,
>  	void (*move_fn)(struct page *page, struct lruvec *lruvec))
>  {
> -	int i;
> +	int i, n;
>  	struct lruvec *lruvec = NULL;
>  	unsigned long flags = 0;
> +	unsigned long lvaddr[PAGEVEC_SIZE];
> +	struct pagevec isopv;
>  
> -	for (i = 0; i < pagevec_count(pvec); i++) {
> -		struct page *page = pvec->pages[i];
> +	pagevec_init(&isopv);
>  
> -		/* block memcg migration during page moving between lru */
> -		if (!TestClearPageLRU(page))
> -			continue;
> +	sort_isopv(pvec, &isopv, lvaddr);
>  
> -		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		(*move_fn)(page, lruvec);
> +	n = pagevec_count(&isopv);
> +	if (!n)
> +		return;
>  
> -		SetPageLRU(page);
> +	lruvec = (struct lruvec *)lvaddr[0];
> +	spin_lock_irqsave(&lruvec->lru_lock, flags);
> +
> +	for (i = 0; i < n; i++) {
> +		/* lock new lruvec if lruvec changes, we have sorted them */
> +		if (lruvec != (struct lruvec *)lvaddr[i]) {
> +			spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> +			lruvec = (struct lruvec *)lvaddr[i];
> +			spin_lock_irqsave(&lruvec->lru_lock, flags);
> +		}
> +
> +		(*move_fn)(isopv.pages[i], lruvec);
> +
> +		SetPageLRU(isopv.pages[i]);
>  	}
> -	if (lruvec)
> -		unlock_page_lruvec_irqrestore(lruvec, flags);
> -	release_pages(pvec->pages, pvec->nr);
> -	pagevec_reinit(pvec);
> +	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> +	release_pages(isopv.pages, isopv.nr);
>  }
>  
>  static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
> -- 
> 2.29.GIT
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn
  2020-12-01  8:10               ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Michal Hocko
@ 2020-12-01  8:20                 ` Alex Shi
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-12-01  8:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: vbabka, Konstantin Khlebnikov, Hugh Dickins, Yu Zhao,
	Matthew Wilcox (Oracle),
	Andrew Morton, linux-mm, linux-kernel



On 2020/12/1 at 16:10, Michal Hocko wrote:
> On Tue 01-12-20 16:02:13, Alex Shi wrote:
>> Pages in a pagevec may belong to different lruvecs, so we have to relock
>> in pagevec_lru_move_fn(); each relock may make the current CPU wait a
>> long time on the same lock because of spinlock fairness.
>>
>> Before per-memcg lru_lock, we had to bear the relock, since the spinlock
>> was the only way to serialize a page's memcg/lruvec. Now TestClearPageLRU
>> can be used to isolate pages exclusively and stabilize the page's
>> lruvec/memcg. That gives us a chance to sort the pages by lruvec before
>> the move in pagevec_lru_move_fn(), so we no longer suffer the spinlock's
>> fairness wait.
> Do you have any data to show any improvements from this?
> 

Hi Michal,

Thanks for quick response.

Not yet. I am still collecting data, but judging by the lru_add result,
there should be a big gain in the multiple-memcg scenario.

Also, I don't expect a quick acceptance; I'm just sending the idea out
for comments while the thread is still warm. :)

Thanks
Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 0/4] pre sort pages on lruvec in pagevec
  2020-11-26 15:55           ` Matthew Wilcox
  2020-11-27  3:14             ` Alex Shi
  2020-12-01  8:02             ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
@ 2020-12-25  9:59             ` Alex Shi
  2020-12-25  9:59               ` [RFC PATCH 1/4] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
                                 ` (2 more replies)
  2 siblings, 3 replies; 23+ messages in thread
From: Alex Shi @ 2020-12-25  9:59 UTC (permalink / raw)
  To: willy
  Cc: tim.c.chen, Konstantin Khlebnikov, Hugh Dickins, Yu Zhao,
	Michal Hocko, Andrew Morton, linux-mm, linux-kernel

This idea was tried on the per-memcg lru lock patchset v18 and gave a
good result: about 5%~20+% performance gain on lru-lock-busy benchmarks
such as case-lru-file-readtwice.

But on the latest kernel I cannot reproduce that result on my box, and
I cannot reproduce Tim's performance gain either.

So I don't know whether it still helps in some scenario; I'm just
sending it out in case someone is interested...

Alex Shi (4):
  mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn
  mm/swap.c: bail out early for no memcg and no numa
  mm/swap.c: extend the usage to pagevec_lru_add
  mm/swap.c: no sort if all page's lruvec are same

Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

 mm/swap.c | 118 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 91 insertions(+), 27 deletions(-)

-- 
2.29.GIT


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/4] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn
  2020-12-25  9:59             ` [RFC PATCH 0/4] pre sort pages on lruvec in pagevec Alex Shi
@ 2020-12-25  9:59               ` Alex Shi
  2020-12-25  9:59               ` [RFC PATCH 2/4] mm/swap.c: bail out early for no memcg and no numa Alex Shi
  2020-12-25  9:59               ` [RFC PATCH 3/4] mm/swap.c: extend the usage to pagevec_lru_add Alex Shi
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-12-25  9:59 UTC (permalink / raw)
  To: willy
  Cc: tim.c.chen, Konstantin Khlebnikov, Hugh Dickins, Yu Zhao,
	Michal Hocko, Andrew Morton, linux-mm, linux-kernel

Pages in a pagevec may belong to different lruvecs, so we have to relock
in pagevec_lru_move_fn(); each relock may make the current CPU wait a
long time on the same lock because of spinlock fairness.

Before per-memcg lru_lock, we had to bear the relock, since the spinlock
was the only way to serialize a page's memcg/lruvec. Now TestClearPageLRU
can be used to isolate pages exclusively and stabilize the page's
lruvec/memcg. That gives us a chance to sort the pages by lruvec before
the move in pagevec_lru_move_fn(), so we no longer suffer the spinlock's
fairness wait.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 79 insertions(+), 13 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index c5363bdebe67..994641331bf7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -201,29 +201,95 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 }
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
+/* Pratt's gaps for shell sort, https://en.wikipedia.org/wiki/Shellsort */
+static int gaps[] = { 6, 4, 3, 2, 1, 0};
+
+/* Shell sort pagevec[] on page's lruvec.*/
+static void shell_sort(struct pagevec *pvec, unsigned long *lvaddr)
+{
+	int g, i, j, n = pagevec_count(pvec);
+
+	for (g=0; gaps[g] > 0 && gaps[g] <= n/2; g++) {
+		int gap = gaps[g];
+
+		for (i = gap; i < n; i++) {
+			unsigned long tmp = lvaddr[i];
+			struct page *page = pvec->pages[i];
+
+			for (j = i - gap; j >= 0 && lvaddr[j] > tmp; j -= gap) {
+				lvaddr[j + gap] = lvaddr[j];
+				pvec->pages[j + gap] = pvec->pages[j];
+			}
+			lvaddr[j + gap] = tmp;
+			pvec->pages[j + gap] = page;
+		}
+	}
+}
+
+/* Get lru bit cleared page and their lruvec address, release the others */
+void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
+		unsigned long *lvaddr)
+{
+	int i, j;
+	struct pagevec busypv;
+
+	pagevec_init(&busypv);
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		pvec->pages[i] = NULL;
+
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page)) {
+			pagevec_add(&busypv, page);
+			continue;
+		}
+		lvaddr[j++] = (unsigned long)
+				mem_cgroup_page_lruvec(page, page_pgdat(page));
+		pagevec_add(isopv, page);
+	}
+	pagevec_reinit(pvec);
+	if (pagevec_count(&busypv))
+		release_pages(busypv.pages, busypv.nr);
+
+	shell_sort(isopv, lvaddr);
+}
+
 static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
-	int i;
+	int i, n;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
+	unsigned long lvaddr[PAGEVEC_SIZE];
+	struct pagevec isopv;
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+	pagevec_init(&isopv);
 
-		/* block memcg migration during page moving between lru */
-		if (!TestClearPageLRU(page))
-			continue;
+	sort_isopv(pvec, &isopv, lvaddr);
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		(*move_fn)(page, lruvec);
+	n = pagevec_count(&isopv);
+	if (!n)
+		return;
 
-		SetPageLRU(page);
+	lruvec = (struct lruvec *)lvaddr[0];
+	spin_lock_irqsave(&lruvec->lru_lock, flags);
+
+	for (i = 0; i < n; i++) {
+		/* lock new lruvec if lruvec changes, we have sorted them */
+		if (lruvec != (struct lruvec *)lvaddr[i]) {
+			spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+			lruvec = (struct lruvec *)lvaddr[i];
+			spin_lock_irqsave(&lruvec->lru_lock, flags);
+		}
+
+		(*move_fn)(isopv.pages[i], lruvec);
+
+		SetPageLRU(isopv.pages[i]);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
-	release_pages(pvec->pages, pvec->nr);
-	pagevec_reinit(pvec);
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+	release_pages(isopv.pages, isopv.nr);
 }
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
-- 
2.29.GIT


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 2/4] mm/swap.c: bail out early for no memcg and no numa
  2020-12-25  9:59             ` [RFC PATCH 0/4] pre sort pages on lruvec in pagevec Alex Shi
  2020-12-25  9:59               ` [RFC PATCH 1/4] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
@ 2020-12-25  9:59               ` Alex Shi
  2020-12-25  9:59               ` [RFC PATCH 3/4] mm/swap.c: extend the usage to pagevec_lru_add Alex Shi
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-12-25  9:59 UTC (permalink / raw)
  To: willy
  Cc: tim.c.chen, Konstantin Khlebnikov, Hugh Dickins, Yu Zhao,
	Michal Hocko, Andrew Morton, linux-mm, linux-kernel

If a system has memcg disabled and only one NUMA node, as on many
embedded systems, there is just one lruvec in the whole system and no
need to sort the pagevec; in that situation we can skip the sorting.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 994641331bf7..bb5300b7e321 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -235,6 +235,7 @@ void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
 
 	pagevec_init(&busypv);
 
+
 	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
@@ -253,7 +254,8 @@ void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
 	if (pagevec_count(&busypv))
 		release_pages(busypv.pages, busypv.nr);
 
-	shell_sort(isopv, lvaddr);
+	if (!mem_cgroup_disabled() || num_online_nodes() > 1)
+		shell_sort(isopv, lvaddr);
 }
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
@@ -263,13 +265,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 	unsigned long lvaddr[PAGEVEC_SIZE];
-	struct pagevec isopv;
-
-	pagevec_init(&isopv);
+	struct pagevec sortedpv;
 
-	sort_isopv(pvec, &isopv, lvaddr);
+	pagevec_init(&sortedpv);
+	sort_isopv(pvec, &sortedpv, lvaddr);
 
-	n = pagevec_count(&isopv);
+	n = pagevec_count(&sortedpv);
 	if (!n)
 		return;
 
@@ -284,12 +285,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&lruvec->lru_lock, flags);
 		}
 
-		(*move_fn)(isopv.pages[i], lruvec);
+		(*move_fn)(sortedpv.pages[i], lruvec);
 
-		SetPageLRU(isopv.pages[i]);
+		SetPageLRU(sortedpv.pages[i]);
 	}
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
-	release_pages(isopv.pages, isopv.nr);
+	release_pages(sortedpv.pages, sortedpv.nr);
 }
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
-- 
2.29.GIT


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 3/4] mm/swap.c: extend the usage to pagevec_lru_add
  2020-12-25  9:59             ` [RFC PATCH 0/4] pre sort pages on lruvec in pagevec Alex Shi
  2020-12-25  9:59               ` [RFC PATCH 1/4] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
  2020-12-25  9:59               ` [RFC PATCH 2/4] mm/swap.c: bail out early for no memcg and no numa Alex Shi
@ 2020-12-25  9:59               ` Alex Shi
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Shi @ 2020-12-25  9:59 UTC (permalink / raw)
  To: willy
  Cc: tim.c.chen, Konstantin Khlebnikov, Hugh Dickins, Yu Zhao,
	Michal Hocko, Andrew Morton, linux-mm, linux-kernel

The only difference between __pagevec_lru_add and the other functions
that move pages between lru lists is that a page being added to an lru
list has no need for TestClearPageLRU, nor for setting the lru bit back
afterwards. So we can combine them with a clear-lru-bit switch passed as
a parameter to the sort function.

Then all the lru list operation functions can be unified.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index bb5300b7e321..9a2269e5099b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -12,6 +12,7 @@
  * Started 18.12.91
  * Swap aging added 23.2.95, Stephen Tweedie.
  * Buffermem limits added 12.3.98, Rik van Riel.
+ * Pre-sort pagevec added 12.1.20, Alex Shi.
  */
 
 #include <linux/mm.h>
@@ -227,8 +228,8 @@ static void shell_sort(struct pagevec *pvec, unsigned long *lvaddr)
 }
 
 /* Get lru bit cleared page and their lruvec address, release the others */
-void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
-		unsigned long *lvaddr)
+static void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
+		unsigned long *lvaddr, bool clearlru)
 {
 	int i, j;
 	struct pagevec busypv;
@@ -242,7 +243,7 @@ void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
 		pvec->pages[i] = NULL;
 
 		/* block memcg migration during page moving between lru */
-		if (!TestClearPageLRU(page)) {
+		if (clearlru && !TestClearPageLRU(page)) {
 			pagevec_add(&busypv, page);
 			continue;
 		}
@@ -266,9 +267,13 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	unsigned long flags = 0;
 	unsigned long lvaddr[PAGEVEC_SIZE];
 	struct pagevec sortedpv;
+	bool clearlru;
+
+	/* don't clear lru bit for new page adding to lru */
+	clearlru = pvec != this_cpu_ptr(&lru_pvecs.lru_add);
 
 	pagevec_init(&sortedpv);
-	sort_isopv(pvec, &sortedpv, lvaddr);
+	sort_isopv(pvec, &sortedpv, lvaddr, clearlru);
 
 	n = pagevec_count(&sortedpv);
 	if (!n)
@@ -287,7 +292,8 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 		(*move_fn)(sortedpv.pages[i], lruvec);
 
-		SetPageLRU(sortedpv.pages[i]);
+		if (clearlru)
+			SetPageLRU(sortedpv.pages[i]);
 	}
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 	release_pages(sortedpv.pages, sortedpv.nr);
@@ -1111,20 +1117,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
-	struct lruvec *lruvec = NULL;
-	unsigned long flags = 0;
-
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
-	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
-	release_pages(pvec->pages, pvec->nr);
-	pagevec_reinit(pvec);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
 }
 
 /**
-- 
2.29.GIT



end of thread, other threads:[~2020-12-25 10:02 UTC | newest]

Thread overview: 23+ messages
2020-11-20  8:27 [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add Alex Shi
2020-11-20 23:19 ` Andrew Morton
2020-11-23  4:46   ` Alex Shi
2020-11-25 15:38 ` Vlastimil Babka
2020-11-26  3:12   ` Alex Shi
2020-11-26 11:05     ` Vlastimil Babka
2020-11-26  4:52 ` Yu Zhao
2020-11-26  6:39   ` Alex Shi
2020-11-26  7:24     ` Yu Zhao
2020-11-26  8:09       ` Alex Shi
2020-11-26 11:22       ` Vlastimil Babka
2020-11-26 15:44         ` Vlastimil Babka
2020-11-26 15:55           ` Matthew Wilcox
2020-11-27  3:14             ` Alex Shi
2020-12-01  8:02             ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
2020-12-01  8:02               ` [PATCH 2/3] mm/swap.c: bail out early for no memcg and no numa Alex Shi
2020-12-01  8:02               ` [PATCH 3/3] mm/swap.c: extend the usage to pagevec_lru_add Alex Shi
2020-12-01  8:10               ` [PATCH 1/3] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Michal Hocko
2020-12-01  8:20                 ` Alex Shi
2020-12-25  9:59             ` [RFC PATCH 0/4] pre sort pages on lruvec in pagevec Alex Shi
2020-12-25  9:59               ` [RFC PATCH 1/4] mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn Alex Shi
2020-12-25  9:59               ` [RFC PATCH 2/4] mm/swap.c: bail out early for no memcg and no numa Alex Shi
2020-12-25  9:59               ` [RFC PATCH 3/4] mm/swap.c: extend the usage to pagevec_lru_add Alex Shi
