* [RFC]mm: batch activate_page() to reduce lock contention
@ 2010-07-20  7:18 Shaohua Li
  2010-07-21 16:06 ` Minchan Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Shaohua Li @ 2010-07-20  7:18 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Andi Kleen, Wu, Fengguang

The zone->lru_lock is heavily contended in workloads where activate_page()
is used frequently. We can batch activate_page() calls to reduce the lock
contention. The batched pages are added to the zone's LRU lists when the
per-CPU pool is full or when page reclaim drains them.

For example, on a 4-socket, 64-CPU system, create a sparse file and 64
processes that map the file with a shared mapping. Each process reads the
whole file and then exits. Process exit goes through unmap_vmas() and
triggers a lot of activate_page() calls. In this workload we saw about a
58% reduction in total time with the patch below.
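
As a rough illustration only (not the exact test harness; the file path,
file size and per-page stride are assumptions), the process-based workload
could look like this:

/* Sketch of the shared-mapping workload: a sparse file is mapped
 * MAP_SHARED by 64 forked processes; each child touches every page and
 * exits, so exit-time unmap_vmas() ends up calling activate_page() a lot.
 * All sizes and the file path are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC		64
#define FILE_SIZE	(1UL << 30)	/* 1 GB sparse file, assumed */

int main(void)
{
	int i, fd = open("/tmp/sparse-test", O_CREAT | O_RDWR | O_TRUNC, 0600);

	if (fd < 0 || ftruncate(fd, FILE_SIZE) < 0) {
		perror("setup");
		return 1;
	}

	for (i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			volatile char *p = mmap(NULL, FILE_SIZE, PROT_READ,
						MAP_SHARED, fd, 0);
			unsigned long sum = 0, off;

			if (p == MAP_FAILED)
				_exit(1);
			for (off = 0; off < FILE_SIZE; off += 4096)
				sum += p[off];	/* fault in every page */
			_exit(sum & 1);		/* exit -> unmap_vmas() */
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}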

But we did see some strange regressions. They are small (usually < 2%),
mostly in multithreaded tests, and none of those tests uses activate_page()
heavily. For example, on the same system we create 64 threads; each thread
creates a private mmap region and reads it. Measuring the total time we saw
about a 2% regression. Yet in that workload 99% of the time is spent in page
faults and activate_page() takes essentially no time. We don't have a good
explanation for this so far; hopefully somebody can share a hint.
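
A rough sketch of that regressing workload (the mail does not say whether
the private mapping is anonymous or file-backed, so an anonymous mapping
and the region size are assumptions here):

/* 64 threads, each mapping its own private region and reading every page
 * once; almost all time goes to minor page faults and activate_page() is
 * essentially never called. Mapping type and size are assumptions. */
#include <pthread.h>
#include <sys/mman.h>

#define NTHREADS	64
#define REGION_SIZE	(256UL << 20)	/* 256 MB per thread, assumed */

static void *reader(void *arg)
{
	volatile char *p = mmap(NULL, REGION_SIZE, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long sum = 0, off;

	(void)arg;
	if (p == MAP_FAILED)
		return NULL;
	for (off = 0; off < REGION_SIZE; off += 4096)
		sum += p[off];		/* one minor fault per page */
	munmap((void *)p, REGION_SIZE);
	return (void *)sum;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}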

Signed-off-by: Shaohua Li <shaohua.li@intel.com>

diff --git a/mm/swap.c b/mm/swap.c
index 3ce7bc3..4a3fd7f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,7 @@ int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -175,11 +176,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 /*
  * FIXME: speed this up?
  */
-void activate_page(struct page *page)
+static void __activate_page(struct page *page)
 {
 	struct zone *zone = page_zone(page);
 
-	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
@@ -192,7 +192,46 @@ void activate_page(struct page *page)
 
 		update_page_reclaim_stat(zone, page, file, 1);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+}
+
+static void activate_page_drain_cpu(int cpu)
+{
+	struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
+	struct zone *last_zone = NULL, *zone;
+	int i, j;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		zone = page_zone(pvec->pages[i]);
+		if (zone == last_zone)
+			continue;
+
+		if (last_zone)
+			spin_unlock_irq(&last_zone->lru_lock);
+		last_zone = zone;
+		spin_lock_irq(&last_zone->lru_lock);
+
+		for (j = i; j < pagevec_count(pvec); j++) {
+			struct page *page = pvec->pages[j];
+
+			if (last_zone != page_zone(page))
+				continue;
+			__activate_page(page);
+		}
+	}
+	if (last_zone)
+		spin_unlock_irq(&last_zone->lru_lock);
+	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+void activate_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		activate_page_drain_cpu(smp_processor_id());
+	put_cpu_var(activate_page_pvecs);
 }
 
 /*
@@ -297,6 +336,7 @@ static void drain_cpu_pagevecs(int cpu)
 void lru_add_drain(void)
 {
 	drain_cpu_pagevecs(get_cpu());
+	activate_page_drain_cpu(smp_processor_id());
 	put_cpu();
 }
 



* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-20  7:18 [RFC]mm: batch activate_page() to reduce lock contention Shaohua Li
@ 2010-07-21 16:06 ` Minchan Kim
  2010-07-22  0:27   ` Shaohua Li
  2010-07-22 23:49 ` Andrew Morton
  2010-07-23 15:10 ` KOSAKI Motohiro
  2 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2010-07-21 16:06 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, Andrew Morton, Andi Kleen, Wu, Fengguang

On Tue, Jul 20, 2010 at 03:18:44PM +0800, Shaohua Li wrote:
> The zone->lru_lock is heavily contented in workload where activate_page()
> is frequently used. We could do batch activate_page() to reduce the lock
> contention. The batched pages will be added into zone list when the pool
> is full or page reclaim is trying to drain them.
> 
> For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> processes shared map to the file. Each process read access the whole file and
> then exit. The process exit will do unmap_vmas() and cause a lot of
> activate_page() call. In such workload, we saw about 58% total time reduction
> with below patch.

Great :)

> 
> But we did see some strange regression. The regression is small (usually < 2%)
> and most are from multithread test and none heavily use activate_page(). For
> example, in the same system, we create 64 threads. Each thread creates a private
> mmap region and does read access. We measure the total time and saw about 2%
> regression. But in such workload, 99% time is on page fault and activate_page()
> takes no time. Very strange, we haven't a good explanation for this so far,
> hopefully somebody can share a hint.

Maybe it is due to lru_add_drain.
You are adding cost to lru_add_drain and it is called in several places.
So if we don't get any gain there, it could cause a small regression.
I might be wrong; it's just a guess.

> 
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 3ce7bc3..4a3fd7f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -39,6 +39,7 @@ int page_cluster;
>  
>  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
>  
>  /*
>   * This path almost never happens for VM activity - pages are normally
> @@ -175,11 +176,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
>  /*
>   * FIXME: speed this up?
>   */
Couldn't we remove the above comment with this patch?

> -void activate_page(struct page *page)
> +static void __activate_page(struct page *page)
>  {
>  	struct zone *zone = page_zone(page);
>  
> -	spin_lock_irq(&zone->lru_lock);
>  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
>  		int file = page_is_file_cache(page);
>  		int lru = page_lru_base_type(page);
> @@ -192,7 +192,46 @@ void activate_page(struct page *page)
>  
>  		update_page_reclaim_stat(zone, page, file, 1);
>  	}
> -	spin_unlock_irq(&zone->lru_lock);
> +}
> +
> +static void activate_page_drain_cpu(int cpu)
> +{
> +	struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
> +	struct zone *last_zone = NULL, *zone;
> +	int i, j;
> +
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		zone = page_zone(pvec->pages[i]);
> +		if (zone == last_zone)
> +			continue;
> +
> +		if (last_zone)
> +			spin_unlock_irq(&last_zone->lru_lock);
> +		last_zone = zone;
> +		spin_lock_irq(&last_zone->lru_lock);
> +
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			struct page *page = pvec->pages[j];
> +
> +			if (last_zone != page_zone(page))
> +				continue;
> +			__activate_page(page);
> +		}
> +	}
> +	if (last_zone)
> +		spin_unlock_irq(&last_zone->lru_lock);
> +	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
> +	pagevec_reinit(pvec);

In the worst case (DMA->NORMAL->HIGHMEM->DMA->NORMAL->HIGHMEM->...), the
overhead would be bigger than before. How about the following instead?
static DEFINE_PER_CPU(struct pagevec[MAX_NR_ZONES], activate_page_pvecs);
Or is that overkill?

-- 
Kind regards,
Minchan Kim


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-21 16:06 ` Minchan Kim
@ 2010-07-22  0:27   ` Shaohua Li
  2010-07-22  1:08     ` Minchan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2010-07-22  0:27 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Andrew Morton, Andi Kleen, Wu, Fengguang

On Thu, Jul 22, 2010 at 12:06:34AM +0800, Minchan Kim wrote:
> On Tue, Jul 20, 2010 at 03:18:44PM +0800, Shaohua Li wrote:
> > The zone->lru_lock is heavily contented in workload where activate_page()
> > is frequently used. We could do batch activate_page() to reduce the lock
> > contention. The batched pages will be added into zone list when the pool
> > is full or page reclaim is trying to drain them.
> > 
> > For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> > processes shared map to the file. Each process read access the whole file and
> > then exit. The process exit will do unmap_vmas() and cause a lot of
> > activate_page() call. In such workload, we saw about 58% total time reduction
> > with below patch.
> 
> Great :)
> > 
> > But we did see some strange regression. The regression is small (usually < 2%)
> > and most are from multithread test and none heavily use activate_page(). For
> > example, in the same system, we create 64 threads. Each thread creates a private
> > mmap region and does read access. We measure the total time and saw about 2%
> > regression. But in such workload, 99% time is on page fault and activate_page()
> > takes no time. Very strange, we haven't a good explanation for this so far,
> > hopefully somebody can share a hint.
> 
> Mabye it might be due to lru_add_drain. 
> You are adding cost in lru_add_drain and it is called several place.
> So if we can't get the gain in there, it could make a bit of regression.
> I might be wrong and it's a just my guessing. 
The workload with the regression doesn't invoke activate_page() much, so
activate_page_drain_cpu() is basically a nop; it shouldn't cost much.

> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 3ce7bc3..4a3fd7f 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -39,6 +39,7 @@ int page_cluster;
> >  
> >  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> > +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> >  
> >  /*
> >   * This path almost never happens for VM activity - pages are normally
> > @@ -175,11 +176,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> >  /*
> >   * FIXME: speed this up?
> >   */
> Couldn't we remove above comment by this patch?
ha, yes.

> > -void activate_page(struct page *page)
> > +static void __activate_page(struct page *page)
> >  {
> >  	struct zone *zone = page_zone(page);
> >  
> > -	spin_lock_irq(&zone->lru_lock);
> >  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> >  		int file = page_is_file_cache(page);
> >  		int lru = page_lru_base_type(page);
> > @@ -192,7 +192,46 @@ void activate_page(struct page *page)
> >  
> >  		update_page_reclaim_stat(zone, page, file, 1);
> >  	}
> > -	spin_unlock_irq(&zone->lru_lock);
> > +}
> > +
> > +static void activate_page_drain_cpu(int cpu)
> > +{
> > +	struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
> > +	struct zone *last_zone = NULL, *zone;
> > +	int i, j;
> > +
> > +	for (i = 0; i < pagevec_count(pvec); i++) {
> > +		zone = page_zone(pvec->pages[i]);
> > +		if (zone == last_zone)
> > +			continue;
> > +
> > +		if (last_zone)
> > +			spin_unlock_irq(&last_zone->lru_lock);
> > +		last_zone = zone;
> > +		spin_lock_irq(&last_zone->lru_lock);
> > +
> > +		for (j = i; j < pagevec_count(pvec); j++) {
> > +			struct page *page = pvec->pages[j];
> > +
> > +			if (last_zone != page_zone(page))
> > +				continue;
> > +			__activate_page(page);
> > +		}
> > +	}
> > +	if (last_zone)
> > +		spin_unlock_irq(&last_zone->lru_lock);
> > +	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
> > +	pagevec_reinit(pvec);
> 
> In worst case(DMA->NORMAL->HIGHMEM->DMA->NORMA->HIGHMEM->......), 
> overhead would is big than old. how about following as?
> static DEFINE_PER_CPU(struct pagevec[MAX_NR_ZONES], activate_page_pvecs);
> Is it a overkill?
activate_page_drain_cpu() is a two-level loop. In your case the drain order
will be DMA->DMA->NORMAL->NORMAL->HIGHMEM->HIGHMEM. Since the pagevec size
is 14, the loop should finish quickly.
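
To make that ordering concrete, here is a small user-space model of the
two-level loop over a fully interleaved 14-entry pagevec; the zone names
mirror the discussion and everything else is illustrative:

/* User-space model of the drain loop: entries are activated grouped by
 * zone, so an interleaved DMA/NORMAL/HIGHMEM pagevec comes out as
 * DMA,DMA,...,NORMAL,...,HIGHMEM,...  The active[] flags model the
 * PageActive() check that makes repeated visits no-ops. */
#include <stdbool.h>
#include <stdio.h>

#define PAGEVEC_SIZE 14

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };
static const char *zone_name[NR_ZONES] = { "DMA", "NORMAL", "HIGHMEM" };

int main(void)
{
	int zone_of[PAGEVEC_SIZE];
	bool active[PAGEVEC_SIZE] = { false };
	int i, j, last_zone = -1;

	for (i = 0; i < PAGEVEC_SIZE; i++)	/* worst-case interleaving */
		zone_of[i] = i % NR_ZONES;

	for (i = 0; i < PAGEVEC_SIZE; i++) {
		if (zone_of[i] == last_zone)
			continue;
		last_zone = zone_of[i];	/* lru_lock switch in the kernel */
		for (j = i; j < PAGEVEC_SIZE; j++) {
			if (zone_of[j] != last_zone || active[j])
				continue;
			active[j] = true;
			printf("activate slot %2d (%s)\n", j,
			       zone_name[last_zone]);
		}
	}
	return 0;
}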

Thanks,
Shaohua


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-22  0:27   ` Shaohua Li
@ 2010-07-22  1:08     ` Minchan Kim
  2010-07-22  5:17       ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2010-07-22  1:08 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, Andrew Morton, Andi Kleen, Wu, Fengguang

On Thu, Jul 22, 2010 at 9:27 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > But we did see some strange regression. The regression is small (usually < 2%)
>> > and most are from multithread test and none heavily use activate_page(). For
>> > example, in the same system, we create 64 threads. Each thread creates a private
>> > mmap region and does read access. We measure the total time and saw about 2%
>> > regression. But in such workload, 99% time is on page fault and activate_page()
>> > takes no time. Very strange, we haven't a good explanation for this so far,
>> > hopefully somebody can share a hint.
>>
>> Mabye it might be due to lru_add_drain.
>> You are adding cost in lru_add_drain and it is called several place.
>> So if we can't get the gain in there, it could make a bit of regression.
>> I might be wrong and it's a just my guessing.
> The workload with regression doesn't invoke too many activate_page, so
> basically activate_page_drain_cpu() is a nop, it should not take too much.

I think that's the culprit: few activate_page() calls but many
lru_add_drain() calls would lose the pagevec's benefit.
But in your scenario I don't think lru_add_drain() is called frequently,
because it is only called from things related to unmap operations or
swapping.
Do you have such a workload in the test case?

>
>> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
>> >
>> > diff --git a/mm/swap.c b/mm/swap.c
>> > index 3ce7bc3..4a3fd7f 100644
>> > --- a/mm/swap.c
>> > +++ b/mm/swap.c
>> > @@ -39,6 +39,7 @@ int page_cluster;
>> >
>> >  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
>> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>> > +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
>> >
>> >  /*
>> >   * This path almost never happens for VM activity - pages are normally
>> > @@ -175,11 +176,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
>> >  /*
>> >   * FIXME: speed this up?
>> >   */
>> Couldn't we remove above comment by this patch?
> ha, yes.
>
>> > -void activate_page(struct page *page)
>> > +static void __activate_page(struct page *page)
>> >  {
>> >     struct zone *zone = page_zone(page);
>> >
>> > -   spin_lock_irq(&zone->lru_lock);
>> >     if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
>> >             int file = page_is_file_cache(page);
>> >             int lru = page_lru_base_type(page);
>> > @@ -192,7 +192,46 @@ void activate_page(struct page *page)
>> >
>> >             update_page_reclaim_stat(zone, page, file, 1);
>> >     }
>> > -   spin_unlock_irq(&zone->lru_lock);
>> > +}
>> > +
>> > +static void activate_page_drain_cpu(int cpu)
>> > +{
>> > +   struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
>> > +   struct zone *last_zone = NULL, *zone;
>> > +   int i, j;
>> > +
>> > +   for (i = 0; i < pagevec_count(pvec); i++) {
>> > +           zone = page_zone(pvec->pages[i]);
>> > +           if (zone == last_zone)
>> > +                   continue;
>> > +
>> > +           if (last_zone)
>> > +                   spin_unlock_irq(&last_zone->lru_lock);
>> > +           last_zone = zone;
>> > +           spin_lock_irq(&last_zone->lru_lock);
>> > +
>> > +           for (j = i; j < pagevec_count(pvec); j++) {
>> > +                   struct page *page = pvec->pages[j];
>> > +
>> > +                   if (last_zone != page_zone(page))
>> > +                           continue;
>> > +                   __activate_page(page);
>> > +           }
>> > +   }
>> > +   if (last_zone)
>> > +           spin_unlock_irq(&last_zone->lru_lock);
>> > +   release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
>> > +   pagevec_reinit(pvec);
>>
>> In worst case(DMA->NORMAL->HIGHMEM->DMA->NORMA->HIGHMEM->......),
>> overhead would is big than old. how about following as?
>> static DEFINE_PER_CPU(struct pagevec[MAX_NR_ZONES], activate_page_pvecs);
>> Is it a overkill?
> activate_page_drain_cpu is a two level loop. In you case, the drain order
> will be DMA->DMA->NORMAL->NORMAL->HIGHMEM->HIGHMEM. Since pagevec size is
> 14, the loop should finish quickly.
Yes. So why do we separate the lru pagevecs into pagevec[NR_LRU_LISTS]?
I think it could remove the unnecessary looping overhead, but of course
we would use more memory.




-- 
Kind regards,
Minchan Kim


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-22  1:08     ` Minchan Kim
@ 2010-07-22  5:17       ` Shaohua Li
  2010-07-22 12:28         ` Minchan Kim
  2010-07-23  8:12         ` Wu Fengguang
  0 siblings, 2 replies; 20+ messages in thread
From: Shaohua Li @ 2010-07-22  5:17 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Andrew Morton, Andi Kleen, Wu, Fengguang

On Thu, Jul 22, 2010 at 09:08:43AM +0800, Minchan Kim wrote:
> On Thu, Jul 22, 2010 at 9:27 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > But we did see some strange regression. The regression is small (usually < 2%)
> >> > and most are from multithread test and none heavily use activate_page(). For
> >> > example, in the same system, we create 64 threads. Each thread creates a private
> >> > mmap region and does read access. We measure the total time and saw about 2%
> >> > regression. But in such workload, 99% time is on page fault and activate_page()
> >> > takes no time. Very strange, we haven't a good explanation for this so far,
> >> > hopefully somebody can share a hint.
> >>
> >> Mabye it might be due to lru_add_drain.
> >> You are adding cost in lru_add_drain and it is called several place.
> >> So if we can't get the gain in there, it could make a bit of regression.
> >> I might be wrong and it's a just my guessing.
> > The workload with regression doesn't invoke too many activate_page, so
> > basically activate_page_drain_cpu() is a nop, it should not take too much.
> 
> I think it's culprit. little call activate_page, many call lru_drain_all.
> It would make losing pagevec's benefit.
> But as your scenario, I think it doesn't call lru_drain_all frequently.
> That's because it is called when process call things related unmap
> operation or swapping.
> Do you have a such workload in test case?
Yes, I'm testing whether activate_page_drain_cpu() causes the regression.
The regression is small (< 2%) for a stress test and sometimes not stable.

> >> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> >> >
> >> > diff --git a/mm/swap.c b/mm/swap.c
> >> > index 3ce7bc3..4a3fd7f 100644
> >> > --- a/mm/swap.c
> >> > +++ b/mm/swap.c
> >> > @@ -39,6 +39,7 @@ int page_cluster;
> >> >
> >> >  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> >> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> >> > +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> >> >
> >> >  /*
> >> >   * This path almost never happens for VM activity - pages are normally
> >> > @@ -175,11 +176,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> >> >  /*
> >> >   * FIXME: speed this up?
> >> >   */
> >> Couldn't we remove above comment by this patch?
> > ha, yes.
> >
> >> > -void activate_page(struct page *page)
> >> > +static void __activate_page(struct page *page)
> >> >  {
> >> >     struct zone *zone = page_zone(page);
> >> >
> >> > -   spin_lock_irq(&zone->lru_lock);
> >> >     if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> >> >             int file = page_is_file_cache(page);
> >> >             int lru = page_lru_base_type(page);
> >> > @@ -192,7 +192,46 @@ void activate_page(struct page *page)
> >> >
> >> >             update_page_reclaim_stat(zone, page, file, 1);
> >> >     }
> >> > -   spin_unlock_irq(&zone->lru_lock);
> >> > +}
> >> > +
> >> > +static void activate_page_drain_cpu(int cpu)
> >> > +{
> >> > +   struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
> >> > +   struct zone *last_zone = NULL, *zone;
> >> > +   int i, j;
> >> > +
> >> > +   for (i = 0; i < pagevec_count(pvec); i++) {
> >> > +           zone = page_zone(pvec->pages[i]);
> >> > +           if (zone == last_zone)
> >> > +                   continue;
> >> > +
> >> > +           if (last_zone)
> >> > +                   spin_unlock_irq(&last_zone->lru_lock);
> >> > +           last_zone = zone;
> >> > +           spin_lock_irq(&last_zone->lru_lock);
> >> > +
> >> > +           for (j = i; j < pagevec_count(pvec); j++) {
> >> > +                   struct page *page = pvec->pages[j];
> >> > +
> >> > +                   if (last_zone != page_zone(page))
> >> > +                           continue;
> >> > +                   __activate_page(page);
> >> > +           }
> >> > +   }
> >> > +   if (last_zone)
> >> > +           spin_unlock_irq(&last_zone->lru_lock);
> >> > +   release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
> >> > +   pagevec_reinit(pvec);
> >>
> >> In worst case(DMA->NORMAL->HIGHMEM->DMA->NORMA->HIGHMEM->......),
> >> overhead would is big than old. how about following as?
> >> static DEFINE_PER_CPU(struct pagevec[MAX_NR_ZONES], activate_page_pvecs);
> >> Is it a overkill?
> > activate_page_drain_cpu is a two level loop. In you case, the drain order
> > will be DMA->DMA->NORMAL->NORMAL->HIGHMEM->HIGHMEM. Since pagevec size is
> > 14, the loop should finish quickly.
> Yes. so why do we separates lru pagevec with  pagevec[NR_LRU_LISTS]?
> I think It can remove looping unnecessary looping overhead but of
> course we have to use more memory.
Each node has its own set of zones, so a pagevec[MAX_NR_ZONES] doesn't work
here. And in my test, which uses activate_page() heavily, the
activate_page_drain_cpu() overhead is quite small. This isn't worth it IMO.

Thanks,
Shaohua


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-22  5:17       ` Shaohua Li
@ 2010-07-22 12:28         ` Minchan Kim
  2010-07-23  8:12         ` Wu Fengguang
  1 sibling, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2010-07-22 12:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, Andrew Morton, Andi Kleen, Wu, Fengguang

On Thu, Jul 22, 2010 at 01:17:03PM +0800, Shaohua Li wrote:
> On Thu, Jul 22, 2010 at 09:08:43AM +0800, Minchan Kim wrote:
> > On Thu, Jul 22, 2010 at 9:27 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > >> > But we did see some strange regression. The regression is small (usually < 2%)
> > >> > and most are from multithread test and none heavily use activate_page(). For
> > >> > example, in the same system, we create 64 threads. Each thread creates a private
> > >> > mmap region and does read access. We measure the total time and saw about 2%
> > >> > regression. But in such workload, 99% time is on page fault and activate_page()
> > >> > takes no time. Very strange, we haven't a good explanation for this so far,
> > >> > hopefully somebody can share a hint.
> > >>
> > >> Mabye it might be due to lru_add_drain.
> > >> You are adding cost in lru_add_drain and it is called several place.
> > >> So if we can't get the gain in there, it could make a bit of regression.
> > >> I might be wrong and it's a just my guessing.
> > > The workload with regression doesn't invoke too many activate_page, so
> > > basically activate_page_drain_cpu() is a nop, it should not take too much.
> > 
> > I think it's culprit. little call activate_page, many call lru_drain_all.
> > It would make losing pagevec's benefit.
> > But as your scenario, I think it doesn't call lru_drain_all frequently.
> > That's because it is called when process call things related unmap
> > operation or swapping.
> > Do you have a such workload in test case?
> Yes, I'm testing if activate_page_drain_cpu() causes the regression. This regression
> is small (<2%) for a stress test and sometimes not stable.

Do you mean the regression is less than 2% even when the corner case
happens? Then okay, it might be a marginal number.

> 
> > >> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > >> >
> > >> > diff --git a/mm/swap.c b/mm/swap.c
> > >> > index 3ce7bc3..4a3fd7f 100644
> > >> > --- a/mm/swap.c
> > >> > +++ b/mm/swap.c
> > >> > @@ -39,6 +39,7 @@ int page_cluster;
> > >> >
> > >> >  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> > >> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> > >> > +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> > >> >
> > >> >  /*
> > >> >   * This path almost never happens for VM activity - pages are normally
> > >> > @@ -175,11 +176,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> > >> >  /*
> > >> >   * FIXME: speed this up?
> > >> >   */
> > >> Couldn't we remove above comment by this patch?
> > > ha, yes.
> > >
> > >> > -void activate_page(struct page *page)
> > >> > +static void __activate_page(struct page *page)
> > >> >  {
> > >> >     struct zone *zone = page_zone(page);
> > >> >
> > >> > -   spin_lock_irq(&zone->lru_lock);
> > >> >     if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> > >> >             int file = page_is_file_cache(page);
> > >> >             int lru = page_lru_base_type(page);
> > >> > @@ -192,7 +192,46 @@ void activate_page(struct page *page)
> > >> >
> > >> >             update_page_reclaim_stat(zone, page, file, 1);
> > >> >     }
> > >> > -   spin_unlock_irq(&zone->lru_lock);
> > >> > +}
> > >> > +
> > >> > +static void activate_page_drain_cpu(int cpu)
> > >> > +{
> > >> > +   struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
> > >> > +   struct zone *last_zone = NULL, *zone;
> > >> > +   int i, j;
> > >> > +
> > >> > +   for (i = 0; i < pagevec_count(pvec); i++) {
> > >> > +           zone = page_zone(pvec->pages[i]);
> > >> > +           if (zone == last_zone)
> > >> > +                   continue;
> > >> > +
> > >> > +           if (last_zone)
> > >> > +                   spin_unlock_irq(&last_zone->lru_lock);
> > >> > +           last_zone = zone;
> > >> > +           spin_lock_irq(&last_zone->lru_lock);
> > >> > +
> > >> > +           for (j = i; j < pagevec_count(pvec); j++) {
> > >> > +                   struct page *page = pvec->pages[j];
> > >> > +
> > >> > +                   if (last_zone != page_zone(page))
> > >> > +                           continue;
> > >> > +                   __activate_page(page);
> > >> > +           }
> > >> > +   }
> > >> > +   if (last_zone)
> > >> > +           spin_unlock_irq(&last_zone->lru_lock);
> > >> > +   release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
> > >> > +   pagevec_reinit(pvec);
> > >>
> > >> In worst case(DMA->NORMAL->HIGHMEM->DMA->NORMA->HIGHMEM->......),
> > >> overhead would is big than old. how about following as?
> > >> static DEFINE_PER_CPU(struct pagevec[MAX_NR_ZONES], activate_page_pvecs);
> > >> Is it a overkill?
> > > activate_page_drain_cpu is a two level loop. In you case, the drain order
> > > will be DMA->DMA->NORMAL->NORMAL->HIGHMEM->HIGHMEM. Since pagevec size is
> > > 14, the loop should finish quickly.
> > Yes. so why do we separates lru pagevec with  pagevec[NR_LRU_LISTS]?
> > I think It can remove looping unnecessary looping overhead but of
> > course we have to use more memory.
> Each node has zones, so a pagevec[MAX_NR_ZONES] doesn't work here. And in my

Ah, yes. We might need a pagevec per node, but that seems to be overkill,
as you mentioned.

Feel free to add my Reviewed-by when you resend.
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

> test wich activate_page heavily used, activate_page_drain_cpu overhead is quite
> small. This isn't worthy IMO.

> 
> Thanks,
> Shaohua

-- 
Kind regards,
Minchan Kim


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-20  7:18 [RFC]mm: batch activate_page() to reduce lock contention Shaohua Li
  2010-07-21 16:06 ` Minchan Kim
@ 2010-07-22 23:49 ` Andrew Morton
  2010-07-23 15:10 ` KOSAKI Motohiro
  2 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2010-07-22 23:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, Andi Kleen, Wu, Fengguang

On Tue, 20 Jul 2010 15:18:44 +0800
Shaohua Li <shaohua.li@intel.com> wrote:

> +void activate_page(struct page *page)
> +{
> +	struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
> +
> +	page_cache_get(page);
> +	if (!pagevec_add(pvec, page))
> +		activate_page_drain_cpu(smp_processor_id());
> +	put_cpu_var(activate_page_pvecs);
>  }

uhm, could I please draw attention to the most valuable
Documentation/SubmitChecklist?  In particular,

12: Has been tested with CONFIG_PREEMPT, CONFIG_DEBUG_PREEMPT,
    CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES,
    CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_SPINLOCK_SLEEP all simultaneously
    enabled.


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-22  5:17       ` Shaohua Li
  2010-07-22 12:28         ` Minchan Kim
@ 2010-07-23  8:12         ` Wu Fengguang
  2010-07-23  8:14           ` Wu Fengguang
  1 sibling, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2010-07-23  8:12 UTC (permalink / raw)
  To: Li, Shaohua; +Cc: Minchan Kim, linux-mm, Andrew Morton, Andi Kleen

> Each node has zones, so a pagevec[MAX_NR_ZONES] doesn't work here.

It's actually pagevec[MAX_NR_ZONES][nr_cpus], where the CPU dimension
selects a NUMA node. So it looks like a worthy optimization.

Thanks,
Fengguang


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-23  8:12         ` Wu Fengguang
@ 2010-07-23  8:14           ` Wu Fengguang
  0 siblings, 0 replies; 20+ messages in thread
From: Wu Fengguang @ 2010-07-23  8:14 UTC (permalink / raw)
  To: Li, Shaohua; +Cc: Minchan Kim, linux-mm, Andrew Morton, Andi Kleen

On Fri, Jul 23, 2010 at 04:12:24PM +0800, Wu Fengguang wrote:
> > Each node has zones, so a pagevec[MAX_NR_ZONES] doesn't work here.
> 
> It's actually pagevec[MAX_NR_ZONES][nr_cpus], where the CPU dimension
> selects a NUMA node. So it looks like a worthy optimization.

Ah, sorry, please ignore this; activate_page() may be called from any CPU.

Thanks,
Fengguang


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-20  7:18 [RFC]mm: batch activate_page() to reduce lock contention Shaohua Li
  2010-07-21 16:06 ` Minchan Kim
  2010-07-22 23:49 ` Andrew Morton
@ 2010-07-23 15:10 ` KOSAKI Motohiro
  2010-07-23 15:25   ` Andi Kleen
  2010-07-26  5:08   ` Shaohua Li
  2 siblings, 2 replies; 20+ messages in thread
From: KOSAKI Motohiro @ 2010-07-23 15:10 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kosaki.motohiro, linux-mm, Andrew Morton, Andi Kleen, Wu, Fengguang

> The zone->lru_lock is heavily contented in workload where activate_page()
> is frequently used. We could do batch activate_page() to reduce the lock
> contention. The batched pages will be added into zone list when the pool
> is full or page reclaim is trying to drain them.
> 
> For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> processes shared map to the file. Each process read access the whole file and
> then exit. The process exit will do unmap_vmas() and cause a lot of
> activate_page() call. In such workload, we saw about 58% total time reduction
> with below patch.

I'm not sure about this. Why does process exit in your workload call unmap_vmas()?
Can you please explain why we can't avoid the activate_page() calls? Is this proper page activation?


> 
> But we did see some strange regression. The regression is small (usually < 2%)
> and most are from multithread test and none heavily use activate_page(). For
> example, in the same system, we create 64 threads. Each thread creates a private
> mmap region and does read access. We measure the total time and saw about 2%
> regression. But in such workload, 99% time is on page fault and activate_page()
> takes no time. Very strange, we haven't a good explanation for this so far,
> hopefully somebody can share a hint.
> 
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 3ce7bc3..4a3fd7f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -39,6 +39,7 @@ int page_cluster;
>  
>  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
>  
>  /*
>   * This path almost never happens for VM activity - pages are normally
> @@ -175,11 +176,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
>  /*
>   * FIXME: speed this up?
>   */
> -void activate_page(struct page *page)
> +static void __activate_page(struct page *page)
>  {
>  	struct zone *zone = page_zone(page);

This page_zone() call can move inside the following branch.


> -	spin_lock_irq(&zone->lru_lock);
>  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
>  		int file = page_is_file_cache(page);
>  		int lru = page_lru_base_type(page);
> @@ -192,7 +192,46 @@ void activate_page(struct page *page)
>  
>  		update_page_reclaim_stat(zone, page, file, 1);
>  	}
> -	spin_unlock_irq(&zone->lru_lock);
> +}
> +
> +static void activate_page_drain_cpu(int cpu)
> +{
> +	struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
> +	struct zone *last_zone = NULL, *zone;
> +	int i, j;
> +
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		zone = page_zone(pvec->pages[i]);
> +		if (zone == last_zone)
> +			continue;
> +
> +		if (last_zone)
> +			spin_unlock_irq(&last_zone->lru_lock);
> +		last_zone = zone;
> +		spin_lock_irq(&last_zone->lru_lock);
> +
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			struct page *page = pvec->pages[j];
> +
> +			if (last_zone != page_zone(page))
> +				continue;
> +			__activate_page(page);
> +		}
> +	}
> +	if (last_zone)
> +		spin_unlock_irq(&last_zone->lru_lock);
> +	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
> +	pagevec_reinit(pvec);
> +}

Can we unify this and ____pagevec_lru_add()? They are very similar.


> +
> +void activate_page(struct page *page)
> +{

activate_page() is called from a few places other than mark_page_accessed(),
so an early

	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page))

check would help slightly.


> +	struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
> +
> +	page_cache_get(page);
> +	if (!pagevec_add(pvec, page))
> +		activate_page_drain_cpu(smp_processor_id());

There seems to be no need for smp_processor_id(); we can pass the pvec directly.

> +	put_cpu_var(activate_page_pvecs);
>  }
>  
>  /*
> @@ -297,6 +336,7 @@ static void drain_cpu_pagevecs(int cpu)
>  void lru_add_drain(void)
>  {
>  	drain_cpu_pagevecs(get_cpu());
> +	activate_page_drain_cpu(smp_processor_id());
>  	put_cpu();
>  }
>  
> 
> 


The following code shows another way to do it.



---
 mm/swap.c |  112 ++++++++++++++++++++++++++++++++++++++----------------------
 1 files changed, 71 insertions(+), 41 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 3ce7bc3..48e6f54 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,7 @@ int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -172,27 +173,65 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 		memcg_reclaim_stat->recent_rotated[file]++;
 }
 
-/*
- * FIXME: speed this up?
- */
-void activate_page(struct page *page)
+static void __activate_page(struct page *page, void *arg)
 {
-	struct zone *zone = page_zone(page);
-
-	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+		struct zone *zone = page_zone(page);
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
+
 		del_page_from_lru_list(zone, page, lru);
 
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(zone, page, lru);
-		__count_vm_event(PGACTIVATE);
 
+		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(zone, page, file, 1);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+}
+
+static void pagevec_lru_move_fn(struct pagevec *pvec,
+				void (*move_fn)(struct page *page, void *arg),
+				void *arg)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+
+		(*move_fn)(page, arg);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+static void activate_page_drain(struct pagevec *pvec)
+{
+	pagevec_lru_move_fn(pvec, __activate_page, NULL);
+}
+
+void activate_page(struct page *page)
+{
+	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
+
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			activate_page_drain(pvec);
+		put_cpu_var(activate_page_pvecs);
+	}
 }
 
 /*
@@ -292,6 +331,10 @@ static void drain_cpu_pagevecs(int cpu)
 		pagevec_move_tail(pvec);
 		local_irq_restore(flags);
 	}
+
+	pvec = &per_cpu(activate_page_pvecs, cpu);
+	if (pagevec_count(pvec))
+		activate_page_drain(pvec);
 }
 
 void lru_add_drain(void)
@@ -398,46 +441,33 @@ void __pagevec_release(struct pagevec *pvec)
 
 EXPORT_SYMBOL(__pagevec_release);
 
+static void ____pagevec_lru_add_fn(struct page *page, void *arg)
+{
+	enum lru_list lru = (enum lru_list) arg;
+	struct zone *zone = page_zone(page);
+	int file = is_file_lru(lru);
+	int active = is_active_lru(lru);
+
+	VM_BUG_ON(PageActive(page));
+	VM_BUG_ON(PageUnevictable(page));
+	VM_BUG_ON(PageLRU(page));
+
+	SetPageLRU(page);
+	if (active)
+		SetPageActive(page);
+	update_page_reclaim_stat(zone, page, file, active);
+	add_page_to_lru_list(zone, page, lru);
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
 void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
 {
-	int i;
-	struct zone *zone = NULL;
-
-	VM_BUG_ON(is_unevictable_lru(lru));
+	pagevec_lru_move_fn(pvec, ____pagevec_lru_add_fn, (void*)lru);
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		struct zone *pagezone = page_zone(page);
-		int file;
-		int active;
-
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
-			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
-		}
-		VM_BUG_ON(PageActive(page));
-		VM_BUG_ON(PageUnevictable(page));
-		VM_BUG_ON(PageLRU(page));
-		SetPageLRU(page);
-		active = is_active_lru(lru);
-		file = is_file_lru(lru);
-		if (active)
-			SetPageActive(page);
-		update_page_reclaim_stat(zone, page, file, active);
-		add_page_to_lru_list(zone, page, lru);
-	}
-	if (zone)
-		spin_unlock_irq(&zone->lru_lock);
-	release_pages(pvec->pages, pvec->nr, pvec->cold);
-	pagevec_reinit(pvec);
 }
-
 EXPORT_SYMBOL(____pagevec_lru_add);
 
 /*
-- 
1.6.5.2





* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-23 15:10 ` KOSAKI Motohiro
@ 2010-07-23 15:25   ` Andi Kleen
  2010-07-23 18:06     ` KOSAKI Motohiro
  2010-07-26  5:08   ` Shaohua Li
  1 sibling, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2010-07-23 15:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Shaohua Li, linux-mm, Andrew Morton, Andi Kleen, Wu, Fengguang

On Sat, Jul 24, 2010 at 12:10:49AM +0900, KOSAKI Motohiro wrote:
> > The zone->lru_lock is heavily contented in workload where activate_page()
> > is frequently used. We could do batch activate_page() to reduce the lock
> > contention. The batched pages will be added into zone list when the pool
> > is full or page reclaim is trying to drain them.
> > 
> > For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> > processes shared map to the file. Each process read access the whole file and
> > then exit. The process exit will do unmap_vmas() and cause a lot of
> > activate_page() call. In such workload, we saw about 58% total time reduction
> > with below patch.
> 
> I'm not sure this. Why process exiting on your workload call unmap_vmas?

Trick question? 

Getting rid of a mm on process exit requires unmapping the vmas.

-Andi


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-23 15:25   ` Andi Kleen
@ 2010-07-23 18:06     ` KOSAKI Motohiro
  0 siblings, 0 replies; 20+ messages in thread
From: KOSAKI Motohiro @ 2010-07-23 18:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Shaohua Li, linux-mm, Andrew Morton, Wu, Fengguang

>> > For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
>> > processes shared map to the file. Each process read access the whole file and
>> > then exit. The process exit will do unmap_vmas() and cause a lot of
>> > activate_page() call. In such workload, we saw about 58% total time reduction
>> > with below patch.
>>
>> I'm not sure this. Why process exiting on your workload call unmap_vmas?
>
> Trick question?
>
> Getting rid of a mm on process exit requires unmapping the vmas.

Oops, I confused unmap_vmas() with unuse_vma(). Thanks for the correction.

Ho hum, zap_pte_range() calls mark_page_accessed(); that was introduced
fairly recently (below).

----------------------------------------------------------------------------------
commit bf3f3bc5e734706730c12a323f9b2068052aa1f0
Author: Nick Piggin <npiggin@suse.de>
Date:   Tue Jan 6 14:38:55 2009 -0800

    mm: don't mark_page_accessed in fault path
----------------------------------------------------------------------------------


So I guess we can apply the following small optimization. The intention is:
if the exiting process is the last one mapping the page, we don't need to
reflect its pte_young(), because nothing will touch the page through a
mapping again.

Of course, this is off-topic: in Shaohua's test case, 64 processes share
the mapping of the file.

-------------------------------------------------------------------------------
diff --git a/mm/memory.c b/mm/memory.c
index 833952d..ebdfcb1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -951,7 +951,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
                                if (pte_dirty(ptent))
                                        set_page_dirty(page);
                                if (pte_young(ptent) &&
-                                   likely(!VM_SequentialReadHint(vma)))
+                                   (page_mapcount(page) != 1) &&
+                                   !VM_SequentialReadHint(vma))
                                        mark_page_accessed(page);
                                rss[MM_FILEPAGES]--;
-------------------------------------------------------------------------------

One more off-topic note: I think we need to move ClearPageReferenced()
from mark_page_accessed() into __activate_page() properly; the unuse_vma()
case also needs to clear the page-referenced bit.


Anyway, I agree the original patch concept is fine. Thanks a lot for the
productive information!


* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-23 15:10 ` KOSAKI Motohiro
  2010-07-23 15:25   ` Andi Kleen
@ 2010-07-26  5:08   ` Shaohua Li
  2010-08-05 21:07     ` Andrew Morton
  1 sibling, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2010-07-26  5:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Andrew Morton, Andi Kleen, Wu, Fengguang, minchan.kim

On Fri, Jul 23, 2010 at 11:10:49PM +0800, KOSAKI Motohiro wrote:
> > The zone->lru_lock is heavily contented in workload where activate_page()
> > is frequently used. We could do batch activate_page() to reduce the lock
> > contention. The batched pages will be added into zone list when the pool
> > is full or page reclaim is trying to drain them.
> > 
> > For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> > processes shared map to the file. Each process read access the whole file and
> > then exit. The process exit will do unmap_vmas() and cause a lot of
> > activate_page() call. In such workload, we saw about 58% total time reduction
> > with below patch.
> 
> I'm not sure this. Why process exiting on your workload call unmap_vmas?
> Can you please explain why we can't stop activate_page? Is this proper page activation?
> 
> 
> > 
> > But we did see some strange regression. The regression is small (usually < 2%)
> > and most are from multithread test and none heavily use activate_page(). For
> > example, in the same system, we create 64 threads. Each thread creates a private
> > mmap region and does read access. We measure the total time and saw about 2%
> > regression. But in such workload, 99% time is on page fault and activate_page()
> > takes no time. Very strange, we haven't a good explanation for this so far,
> > hopefully somebody can share a hint.
> > 
> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 3ce7bc3..4a3fd7f 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -39,6 +39,7 @@ int page_cluster;
> >  
> >  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> > +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> >  
> >  /*
> >   * This path almost never happens for VM activity - pages are normally
> > @@ -175,11 +176,10 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> >  /*
> >   * FIXME: speed this up?
> >   */
> > -void activate_page(struct page *page)
> > +static void __activate_page(struct page *page)
> >  {
> >  	struct zone *zone = page_zone(page);
> 
> this page_zone() can move in following branch.
> 
> 
> > -	spin_lock_irq(&zone->lru_lock);
> >  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> >  		int file = page_is_file_cache(page);
> >  		int lru = page_lru_base_type(page);
> > @@ -192,7 +192,46 @@ void activate_page(struct page *page)
> >  
> >  		update_page_reclaim_stat(zone, page, file, 1);
> >  	}
> > -	spin_unlock_irq(&zone->lru_lock);
> > +}
> > +
> > +static void activate_page_drain_cpu(int cpu)
> > +{
> > +	struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
> > +	struct zone *last_zone = NULL, *zone;
> > +	int i, j;
> > +
> > +	for (i = 0; i < pagevec_count(pvec); i++) {
> > +		zone = page_zone(pvec->pages[i]);
> > +		if (zone == last_zone)
> > +			continue;
> > +
> > +		if (last_zone)
> > +			spin_unlock_irq(&last_zone->lru_lock);
> > +		last_zone = zone;
> > +		spin_lock_irq(&last_zone->lru_lock);
> > +
> > +		for (j = i; j < pagevec_count(pvec); j++) {
> > +			struct page *page = pvec->pages[j];
> > +
> > +			if (last_zone != page_zone(page))
> > +				continue;
> > +			__activate_page(page);
> > +		}
> > +	}
> > +	if (last_zone)
> > +		spin_unlock_irq(&last_zone->lru_lock);
> > +	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
> > +	pagevec_reinit(pvec);
> > +}
> 
> Can we unify this and ____pagevec_lru_add(). they are very similar.
Very good suggestions. I slightly updated the patch to reduce locking in the
worst case in pagevec_lru_move_fn(). Since the regression I mentioned before
isn't stable, I removed that comment too. Below is the updated patch.

Thanks,
Shaohua


Subject: mm: batch activate_page() to reduce lock contention

The zone->lru_lock is heavily contended in workloads where activate_page()
is used frequently. We can batch activate_page() calls to reduce the lock
contention. The batched pages are added to the zone's LRU lists when the
per-CPU pool is full or when page reclaim drains them.

For example, on a 4-socket, 64-CPU system, create a sparse file and 64
processes that map the file with a shared mapping. Each process reads the
whole file and then exits. Process exit goes through unmap_vmas() and
triggers a lot of activate_page() calls. In this workload we saw about a
58% reduction in total time with the patch below.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

diff --git a/mm/swap.c b/mm/swap.c
index 3ce7bc3..3738134 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,7 @@ int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -172,27 +173,72 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 		memcg_reclaim_stat->recent_rotated[file]++;
 }
 
-/*
- * FIXME: speed this up?
- */
-void activate_page(struct page *page)
+static void __activate_page(struct page *page, void *arg)
 {
-	struct zone *zone = page_zone(page);
-
-	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+		struct zone *zone = page_zone(page);
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
+
 		del_page_from_lru_list(zone, page, lru);
 
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(zone, page, lru);
-		__count_vm_event(PGACTIVATE);
 
+		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(zone, page, file, 1);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+}
+
+static void pagevec_lru_move_fn(struct pagevec *pvec,
+				void (*move_fn)(struct page *page, void *arg),
+				void *arg)
+{
+	struct zone *last_zone = NULL;
+	int i, j;
+	DECLARE_BITMAP(pages_done, PAGEVEC_SIZE);
+
+	bitmap_zero(pages_done, PAGEVEC_SIZE);
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		if (test_bit(i, pages_done))
+			continue;
+
+		if (last_zone)
+			spin_unlock_irq(&last_zone->lru_lock);
+		last_zone = page_zone(pvec->pages[i]);
+		spin_lock_irq(&last_zone->lru_lock);
+
+		for (j = i; j < pagevec_count(pvec); j++) {
+			struct page *page = pvec->pages[j];
+
+			if (last_zone != page_zone(page))
+				continue;
+			(*move_fn)(page, arg);
+			__set_bit(j, pages_done);
+		}
+	}
+	if (last_zone)
+		spin_unlock_irq(&last_zone->lru_lock);
+	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+static void activate_page_drain(struct pagevec *pvec)
+{
+	pagevec_lru_move_fn(pvec, __activate_page, NULL);
+}
+
+void activate_page(struct page *page)
+{
+	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
+
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			activate_page_drain(pvec);
+		put_cpu_var(activate_page_pvecs);
+	}
 }
 
 /*
@@ -292,6 +338,10 @@ static void drain_cpu_pagevecs(int cpu)
 		pagevec_move_tail(pvec);
 		local_irq_restore(flags);
 	}
+
+	pvec = &per_cpu(activate_page_pvecs, cpu);
+	if (pagevec_count(pvec))
+		activate_page_drain(pvec);
 }
 
 void lru_add_drain(void)
@@ -398,46 +448,34 @@ void __pagevec_release(struct pagevec *pvec)
 
 EXPORT_SYMBOL(__pagevec_release);
 
+static void ____pagevec_lru_add_fn(struct page *page, void *arg)
+{
+	enum lru_list lru = (enum lru_list)arg;
+	struct zone *zone = page_zone(page);
+	int file = is_file_lru(lru);
+	int active = is_active_lru(lru);
+
+	VM_BUG_ON(PageActive(page));
+	VM_BUG_ON(PageUnevictable(page));
+	VM_BUG_ON(PageLRU(page));
+
+	SetPageLRU(page);
+	if (active)
+		SetPageActive(page);
+	update_page_reclaim_stat(zone, page, file, active);
+	add_page_to_lru_list(zone, page, lru);
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
 void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
 {
-	int i;
-	struct zone *zone = NULL;
-
 	VM_BUG_ON(is_unevictable_lru(lru));
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		struct zone *pagezone = page_zone(page);
-		int file;
-		int active;
-
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
-			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
-		}
-		VM_BUG_ON(PageActive(page));
-		VM_BUG_ON(PageUnevictable(page));
-		VM_BUG_ON(PageLRU(page));
-		SetPageLRU(page);
-		active = is_active_lru(lru);
-		file = is_file_lru(lru);
-		if (active)
-			SetPageActive(page);
-		update_page_reclaim_stat(zone, page, file, active);
-		add_page_to_lru_list(zone, page, lru);
-	}
-	if (zone)
-		spin_unlock_irq(&zone->lru_lock);
-	release_pages(pvec->pages, pvec->nr, pvec->cold);
-	pagevec_reinit(pvec);
+	pagevec_lru_move_fn(pvec, ____pagevec_lru_add_fn, (void *)lru);
 }
-
 EXPORT_SYMBOL(____pagevec_lru_add);
 
 /*


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-07-26  5:08   ` Shaohua Li
@ 2010-08-05 21:07     ` Andrew Morton
  2010-08-06  3:08       ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2010-08-05 21:07 UTC (permalink / raw)
  To: Shaohua Li
  Cc: KOSAKI Motohiro, linux-mm, Andi Kleen, Wu, Fengguang, minchan.kim

On Mon, 26 Jul 2010 13:08:27 +0800
Shaohua Li <shaohua.li@intel.com> wrote:

> The zone->lru_lock is heavily contented in workload where activate_page()
> is frequently used. We could do batch activate_page() to reduce the lock
> contention. The batched pages will be added into zone list when the pool
> is full or page reclaim is trying to drain them.
> 
> For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> processes shared map to the file. Each process read access the whole file and
> then exit. The process exit will do unmap_vmas() and cause a lot of
> activate_page() call. In such workload, we saw about 58% total time reduction
> with below patch.

What happened to the 2% regression that earlier changelogs mentioned?

AFAICT the patch optimises the rare munmap() case.  But what effect
does it have upon the common case?  How do we know that it is a net
benefit?

Because the impact on kernel footprint is awful.  x86_64 allmodconfig:

   text    data     bss     dec     hex filename
   5857    1426    1712    8995    2323 mm/swap.o
   6245    1587    1840    9672    25c8 mm/swap.o

and look at x86_64 allnoconfig:

   text    data     bss     dec     hex filename
   2344     768       4    3116     c2c mm/swap.o
   2632     896       4    3532     dcc mm/swap.o

that's a uniprocessor kernel where none of this was of any use!

Looking at the patch, I'm not sure where all this bloat came from.  But
the SMP=n case is pretty bad and needs fixing, IMO.
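
For reference, numbers in this text/data/bss form are what the binutils
size(1) tool prints, so a before/after comparison can be reproduced with
something like the following (the figures shown are just the allmodconfig
"before" row quoted above, not a fresh build):

   $ size mm/swap.o
      text    data     bss     dec     hex filename
      5857    1426    1712    8995    2323 mm/swap.o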


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-08-05 21:07     ` Andrew Morton
@ 2010-08-06  3:08       ` Shaohua Li
  2010-08-25 20:03         ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2010-08-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, linux-mm, Andi Kleen, Wu, Fengguang, minchan.kim

On Fri, Aug 06, 2010 at 05:07:55AM +0800, Andrew Morton wrote:
> On Mon, 26 Jul 2010 13:08:27 +0800
> Shaohua Li <shaohua.li@intel.com> wrote:
> 
> > The zone->lru_lock is heavily contented in workload where activate_page()
> > is frequently used. We could do batch activate_page() to reduce the lock
> > contention. The batched pages will be added into zone list when the pool
> > is full or page reclaim is trying to drain them.
> > 
> > For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> > processes shared map to the file. Each process read access the whole file and
> > then exit. The process exit will do unmap_vmas() and cause a lot of
> > activate_page() call. In such workload, we saw about 58% total time reduction
> > with below patch.
> 
> What happened to the 2% regression that earlier changelogs mentioned?
The 2% regression tends to be noise. I did a bunch of tests later; the regression
isn't stable, and sometimes there is an improvement and sometimes a regression,
so I removed that part of the changelog. I mentioned this in a previous mail too.
 
> AFAICT the patch optimises the rare munmap() case.  But what effect
> does it have upon the common case?  How do we know that it is a net
> benefit?
Not just the munmap() case. There are a lot of workloads where the lru_lock is
heavily contended in activate_page(), for example some file I/O workloads.

> Because the impact on kernel footprint is awful.  x86_64 allmodconfig:
> 
>    text    data     bss     dec     hex filename
>    5857    1426    1712    8995    2323 mm/swap.o
>    6245    1587    1840    9672    25c8 mm/swap.o
> 
> and look at x86_64 allnoconfig:
> 
>    text    data     bss     dec     hex filename
>    2344     768       4    3116     c2c mm/swap.o
>    2632     896       4    3532     dcc mm/swap.o
> 
> that's a uniprocessor kernel where none of this was of any use!
> 
> Looking at the patch, I'm not sure where all this bloat came from.  But
> the SMP=n case is pretty bad and needs fixing, IMO.
Updated the patch, which reduces the footprint a little bit for SMP=n:
2472     768       4    3244     cac ../tmp/mm/swap.o
2600     768       4    3372     d2c ../tmp/mm/swap.o
We unified the lru_add and activate_page paths, which adds a little footprint.

Thanks,
Shaohua



Subject: mm: batch activate_page() to reduce lock contention

The zone->lru_lock is heavily contented in workload where activate_page()
is frequently used. We could do batch activate_page() to reduce the lock
contention. The batched pages will be added into zone list when the pool
is full or page reclaim is trying to drain them.

For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
processes shared map to the file. Each process read access the whole file and
then exit. The process exit will do unmap_vmas() and cause a lot of
activate_page() call. In such workload, we saw about 58% total time reduction
with below patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

diff --git a/mm/swap.c b/mm/swap.c
index 3ce7bc3..744883f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -172,28 +172,93 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 		memcg_reclaim_stat->recent_rotated[file]++;
 }
 
-/*
- * FIXME: speed this up?
- */
-void activate_page(struct page *page)
+static void __activate_page(struct page *page, void *arg)
 {
-	struct zone *zone = page_zone(page);
-
-	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+		struct zone *zone = page_zone(page);
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
+
 		del_page_from_lru_list(zone, page, lru);
 
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(zone, page, lru);
-		__count_vm_event(PGACTIVATE);
 
+		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(zone, page, file, 1);
 	}
+}
+
+static void pagevec_lru_move_fn(struct pagevec *pvec,
+				void (*move_fn)(struct page *page, void *arg),
+				void *arg)
+{
+	struct zone *last_zone = NULL;
+	int i, j;
+	DECLARE_BITMAP(pages_done, PAGEVEC_SIZE);
+
+	bitmap_zero(pages_done, PAGEVEC_SIZE);
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		if (test_bit(i, pages_done))
+			continue;
+
+		if (last_zone)
+			spin_unlock_irq(&last_zone->lru_lock);
+		last_zone = page_zone(pvec->pages[i]);
+		spin_lock_irq(&last_zone->lru_lock);
+
+		for (j = i; j < pagevec_count(pvec); j++) {
+			struct page *page = pvec->pages[j];
+
+			if (last_zone != page_zone(page))
+				continue;
+			(*move_fn)(page, arg);
+			__set_bit(j, pages_done);
+		}
+	}
+	if (last_zone)
+		spin_unlock_irq(&last_zone->lru_lock);
+	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+#ifdef CONFIG_SMP
+static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
+
+static void activate_page_drain(int cpu)
+{
+	struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
+
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, __activate_page, NULL);
+}
+
+void activate_page(struct page *page)
+{
+	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
+
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			pagevec_lru_move_fn(pvec, __activate_page, NULL);
+		put_cpu_var(activate_page_pvecs);
+	}
+}
+#else
+static void inline activate_page_drain(int cpu)
+{
+}
+
+void activate_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	__activate_page(page, NULL);
 	spin_unlock_irq(&zone->lru_lock);
 }
+#endif
 
 /*
  * Mark a page as having seen activity.
@@ -292,6 +357,8 @@ static void drain_cpu_pagevecs(int cpu)
 		pagevec_move_tail(pvec);
 		local_irq_restore(flags);
 	}
+
+	activate_page_drain(cpu);
 }
 
 void lru_add_drain(void)
@@ -398,46 +465,34 @@ void __pagevec_release(struct pagevec *pvec)
 
 EXPORT_SYMBOL(__pagevec_release);
 
+static void ____pagevec_lru_add_fn(struct page *page, void *arg)
+{
+	enum lru_list lru = (enum lru_list)arg;
+	struct zone *zone = page_zone(page);
+	int file = is_file_lru(lru);
+	int active = is_active_lru(lru);
+
+	VM_BUG_ON(PageActive(page));
+	VM_BUG_ON(PageUnevictable(page));
+	VM_BUG_ON(PageLRU(page));
+
+	SetPageLRU(page);
+	if (active)
+		SetPageActive(page);
+	update_page_reclaim_stat(zone, page, file, active);
+	add_page_to_lru_list(zone, page, lru);
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
 void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
 {
-	int i;
-	struct zone *zone = NULL;
-
 	VM_BUG_ON(is_unevictable_lru(lru));
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		struct zone *pagezone = page_zone(page);
-		int file;
-		int active;
-
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
-			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
-		}
-		VM_BUG_ON(PageActive(page));
-		VM_BUG_ON(PageUnevictable(page));
-		VM_BUG_ON(PageLRU(page));
-		SetPageLRU(page);
-		active = is_active_lru(lru);
-		file = is_file_lru(lru);
-		if (active)
-			SetPageActive(page);
-		update_page_reclaim_stat(zone, page, file, active);
-		add_page_to_lru_list(zone, page, lru);
-	}
-	if (zone)
-		spin_unlock_irq(&zone->lru_lock);
-	release_pages(pvec->pages, pvec->nr, pvec->cold);
-	pagevec_reinit(pvec);
+	pagevec_lru_move_fn(pvec, ____pagevec_lru_add_fn, (void *)lru);
 }
-
 EXPORT_SYMBOL(____pagevec_lru_add);
 
 /*


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-08-06  3:08       ` Shaohua Li
@ 2010-08-25 20:03         ` Andrew Morton
  2010-08-26  7:59           ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2010-08-25 20:03 UTC (permalink / raw)
  To: Shaohua Li
  Cc: KOSAKI Motohiro, linux-mm, Andi Kleen, Wu, Fengguang, minchan.kim

On Fri, 6 Aug 2010 11:08:05 +0800
Shaohua Li <shaohua.li@intel.com> wrote:

> Subject: mm: batch activate_page() to reduce lock contention
> 
> The zone->lru_lock is heavily contented in workload where activate_page()
> is frequently used. We could do batch activate_page() to reduce the lock
> contention. The batched pages will be added into zone list when the pool
> is full or page reclaim is trying to drain them.
> 
> For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> processes shared map to the file. Each process read access the whole file and
> then exit. The process exit will do unmap_vmas() and cause a lot of
> activate_page() call. In such workload, we saw about 58% total time reduction
> with below patch.
> 

Am still not happy that this bloats swap.o by 144 bytes in an
allnoconfig build for something which is only relevant to SMP builds.

> index 3ce7bc3..744883f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -172,28 +172,93 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
>  		memcg_reclaim_stat->recent_rotated[file]++;
>  }
>  
> -/*
> - * FIXME: speed this up?
> - */
> -void activate_page(struct page *page)
> +static void __activate_page(struct page *page, void *arg)
>  {
> -	struct zone *zone = page_zone(page);
> -
> -	spin_lock_irq(&zone->lru_lock);
>  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> +		struct zone *zone = page_zone(page);
>  		int file = page_is_file_cache(page);
>  		int lru = page_lru_base_type(page);
> +
>  		del_page_from_lru_list(zone, page, lru);
>  
>  		SetPageActive(page);
>  		lru += LRU_ACTIVE;
>  		add_page_to_lru_list(zone, page, lru);
> -		__count_vm_event(PGACTIVATE);
>  
> +		__count_vm_event(PGACTIVATE);
>  		update_page_reclaim_stat(zone, page, file, 1);
>  	}
> +}
> +
> +static void pagevec_lru_move_fn(struct pagevec *pvec,
> +				void (*move_fn)(struct page *page, void *arg),
> +				void *arg)
> +{
> +	struct zone *last_zone = NULL;
> +	int i, j;
> +	DECLARE_BITMAP(pages_done, PAGEVEC_SIZE);
> +
> +	bitmap_zero(pages_done, PAGEVEC_SIZE);
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		if (test_bit(i, pages_done))
> +			continue;
> +
> +		if (last_zone)
> +			spin_unlock_irq(&last_zone->lru_lock);
> +		last_zone = page_zone(pvec->pages[i]);
> +		spin_lock_irq(&last_zone->lru_lock);
> +
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			struct page *page = pvec->pages[j];
> +
> +			if (last_zone != page_zone(page))
> +				continue;
> +			(*move_fn)(page, arg);
> +			__set_bit(j, pages_done);
> +		}
> +	}
> +	if (last_zone)
> +		spin_unlock_irq(&last_zone->lru_lock);
> +	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
> +	pagevec_reinit(pvec);
> +}

This function is pretty bizarre.  It really really needs some comments
explaining what it's doing and most especially *why* it's doing it.

It's a potential O(n*nr_zones) search (I think)!  We demand proof that
it's worthwhile!

Yes, if the pagevec is filled with pages from different zones then it
will reduce the locking frequency.  But in the common case where the
pagevec has pages all from the same zone, or has contiguous runs of
pages from different zones then all that extra bitmap fiddling gained
us nothing.

(I think the search could be made more efficient by advancing `i' when
we first see last_zone!=page_zone(page), but that'd just make the code
even worse).



Also..

There's a downside/risk to this code.  A billion years ago I found
that it was pretty important that if we're going to batch pages in this
manner, it's important that ALL pages be batched via the same means. 
If 99% of the pages go through the pagevec and 1% of pages bypass the
pagevec, the LRU order gets scrambled and we can end up causing
additional disk seeks when the time comes to write things out.  The
effect was measurable.

And lo, putback_lru_pages() (at least) bypasses your new pagevecs,
potentially scrambling the LRU ordering.  Admittedly, if we're putting
back unreclaimable pages in there, the LRU is probably already pretty
scrambled.  But that's just a guess.

Even if that is addressed, we still scramble the LRU to some extent
simply because the pagevecs are per-cpu.  We already do that to some
extent when shrink_inactive_list() snips a batch of pages off the LRU
for processing.  To what extent this matters and to what extent your
new activate_page() worsens this is also unknown.

It's tricky stuff and it likes to blow up in our faces.  Much caution
is needed, please.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-08-25 20:03         ` Andrew Morton
@ 2010-08-26  7:59           ` Shaohua Li
  2010-08-26 21:30             ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2010-08-26  7:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, linux-mm, Andi Kleen, Wu, Fengguang, minchan.kim

On Thu, Aug 26, 2010 at 04:03:18AM +0800, Andrew Morton wrote:
> On Fri, 6 Aug 2010 11:08:05 +0800
> Shaohua Li <shaohua.li@intel.com> wrote:
> 
> > Subject: mm: batch activate_page() to reduce lock contention
> > 
> > The zone->lru_lock is heavily contented in workload where activate_page()
> > is frequently used. We could do batch activate_page() to reduce the lock
> > contention. The batched pages will be added into zone list when the pool
> > is full or page reclaim is trying to drain them.
> > 
> > For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes,
> > processes shared map to the file. Each process read access the whole file and
> > then exit. The process exit will do unmap_vmas() and cause a lot of
> > activate_page() call. In such workload, we saw about 58% total time reduction
> > with below patch.
> > 
> 
> Am still not happy that this bloats swap.o by 144 bytes in an
> allnoconfig build for something which is only relevant to SMP builds.
> 
> > index 3ce7bc3..744883f 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -172,28 +172,93 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> >  		memcg_reclaim_stat->recent_rotated[file]++;
> >  }
> >  
> > -/*
> > - * FIXME: speed this up?
> > - */
> > -void activate_page(struct page *page)
> > +static void __activate_page(struct page *page, void *arg)
> >  {
> > -	struct zone *zone = page_zone(page);
> > -
> > -	spin_lock_irq(&zone->lru_lock);
> >  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> > +		struct zone *zone = page_zone(page);
> >  		int file = page_is_file_cache(page);
> >  		int lru = page_lru_base_type(page);
> > +
> >  		del_page_from_lru_list(zone, page, lru);
> >  
> >  		SetPageActive(page);
> >  		lru += LRU_ACTIVE;
> >  		add_page_to_lru_list(zone, page, lru);
> > -		__count_vm_event(PGACTIVATE);
> >  
> > +		__count_vm_event(PGACTIVATE);
> >  		update_page_reclaim_stat(zone, page, file, 1);
> >  	}
> > +}
> > +
> > +static void pagevec_lru_move_fn(struct pagevec *pvec,
> > +				void (*move_fn)(struct page *page, void *arg),
> > +				void *arg)
> > +{
> > +	struct zone *last_zone = NULL;
> > +	int i, j;
> > +	DECLARE_BITMAP(pages_done, PAGEVEC_SIZE);
> > +
> > +	bitmap_zero(pages_done, PAGEVEC_SIZE);
> > +	for (i = 0; i < pagevec_count(pvec); i++) {
> > +		if (test_bit(i, pages_done))
> > +			continue;
> > +
> > +		if (last_zone)
> > +			spin_unlock_irq(&last_zone->lru_lock);
> > +		last_zone = page_zone(pvec->pages[i]);
> > +		spin_lock_irq(&last_zone->lru_lock);
> > +
> > +		for (j = i; j < pagevec_count(pvec); j++) {
> > +			struct page *page = pvec->pages[j];
> > +
> > +			if (last_zone != page_zone(page))
> > +				continue;
> > +			(*move_fn)(page, arg);
> > +			__set_bit(j, pages_done);
> > +		}
> > +	}
> > +	if (last_zone)
> > +		spin_unlock_irq(&last_zone->lru_lock);
> > +	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
> > +	pagevec_reinit(pvec);
> > +}
> 
> This function is pretty bizarre.  It really really needs some comments
> explaining what it's doing and most especially *why* it's doing it.
> 
> It's a potential O(n*nr_zones) search (I think)!  We demand proof that
> it's worthwhile!
> 
> Yes, if the pagevec is filled with pages from different zones then it
> will reduce the locking frequency.  But in the common case where the
> pagevec has pages all from the same zone, or has contiguous runs of
> pages from different zones then all that extra bitmap fiddling gained
> us nothing.
> 
> (I think the search could be made more efficient by advancing `i' when
> we first see last_zone!=page_zone(page), but that'd just make the code
> even worse).
Thanks for pointing this out. Then we can simplify things a little bit;
the 144-byte footprint comes from this too, so we can remove it.
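
Roughly, the simplified pagevec_lru_move_fn() would go back to the single-pass,
relock-on-zone-change loop that ____pagevec_lru_add() already uses, with the
bitmap dropped entirely. An untested sketch (not the final patch):

static void pagevec_lru_move_fn(struct pagevec *pvec,
				void (*move_fn)(struct page *page, void *arg),
				void *arg)
{
	int i;
	struct zone *zone = NULL;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct zone *pagezone = page_zone(page);

		/* only drop/retake the lru_lock when the zone changes */
		if (pagezone != zone) {
			if (zone)
				spin_unlock_irq(&zone->lru_lock);
			zone = pagezone;
			spin_lock_irq(&zone->lru_lock);
		}
		(*move_fn)(page, arg);
	}
	if (zone)
		spin_unlock_irq(&zone->lru_lock);
	release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
	pagevec_reinit(pvec);
}

This keeps the lock batching for runs of pages from the same zone, which is
the common case, and is a single O(n) pass over the pagevec.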

> 
> There's a downside/risk to this code.  A billion years ago I found
> that it was pretty important that if we're going to batch pages in this
> manner, it's important that ALL pages be batched via the same means. 
> If 99% of the pages go through the pagevec and 1% of pages bypass the
> pagevec, the LRU order gets scrambled and we can end up causing
> additional disk seeks when the time comes to write things out.  The
> effect was measurable.
> 
> And lo, putback_lru_pages() (at least) bypasses your new pagevecs,
> potentially scrambling the LRU ordering.  Admittedly, if we're putting
> back unreclaimable pages in there, the LRU is probably already pretty
> scrambled.  But that's just a guess.
OK, we can either drain the pagevecs in putback_lru_pages() or add active
pages to the new pagevecs.

> Even if that is addressed, we still scramble the LRU to some extent
> simply because the pagevecs are per-cpu.  We already do that to some
> extent when shrink_inactive_list() snips a batch of pages off the LRU
> for processing.  To what extent this matters and to what extent your
> new activate_page() worsens this is also unknown.
OK, this is possible. Any suggestion for a benchmark I can run to verify
whether this is a real problem?

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-08-26  7:59           ` Shaohua Li
@ 2010-08-26 21:30             ` Andrew Morton
  2010-08-27  8:17               ` Shaohua Li
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2010-08-26 21:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: KOSAKI Motohiro, linux-mm, Andi Kleen, Wu, Fengguang, minchan.kim

On Thu, 26 Aug 2010 15:59:10 +0800
Shaohua Li <shaohua.li@intel.com> wrote:

> On Thu, Aug 26, 2010 at 04:03:18AM +0800, Andrew Morton wrote:
> > On Fri, 6 Aug 2010 11:08:05 +0800
> > Shaohua Li <shaohua.li@intel.com> wrote:
> > 
> > > Subject: mm: batch activate_page() to reduce lock contention
> > 
> ...
>
> > This function is pretty bizarre.  It really really needs some comments
> > explaining what it's doing and most especially *why* it's doing it.
> > 
> > It's a potential O(n*nr_zones) search (I think)!  We demand proof that
> > it's worthwhile!
> > 
> > Yes, if the pagevec is filled with pages from different zones then it
> > will reduce the locking frequency.  But in the common case where the
> > pagevec has pages all from the same zone, or has contiguous runs of
> > pages from different zones then all that extra bitmap fiddling gained
> > us nothing.
> > 
> > (I think the search could be made more efficient by advancing `i' when
> > we first see last_zone!=page_zone(page), but that'd just make the code
> > even worse).
> Thanks for pointing this out. Then we can simplify things a little bit;
> the 144-byte footprint comes from this too, so we can remove it.

ok..

> > 
> > There's a downside/risk to this code.  A billion years ago I found
> > that it was pretty important that if we're going to batch pages in this
> > manner, it's important that ALL pages be batched via the same means. 
> > If 99% of the pages go through the pagevec and 1% of pages bypass the
> > pagevec, the LRU order gets scrambled and we can end up causing
> > additional disk seeks when the time comes to write things out.  The
> > effect was measurable.
> > 
> > And lo, putback_lru_pages() (at least) bypasses your new pagevecs,
> > potentially scrambling the LRU ordering.  Admittedly, if we're putting
> > back unreclaimable pages in there, the LRU is probably already pretty
> > scrambled.  But that's just a guess.
> OK, we can either drain the pagevecs in putback_lru_pages() or add active
> pages to the new pagevecs.

The latter I guess?

> > Even if that is addressed, we still scramble the LRU to some extent
> > simply because the pagevecs are per-cpu.  We already do that to some
> > extent when shrink_inactive_list() snips a batch of pages off the LRU
> > for processing.  To what extent this matters and to what extent your
> > new activate_page() worsens this is also unknown.
> OK, this is possible. Any suggestion for a benchmark I can run to verify
> whether this is a real problem?

Any test which does significant amounts of writeback off the LRU.

And...  we don't do significant amounts of writeback off the LRU any
more.  We used to, long ago.  Then we broke it and started to do lots
more.  Then we changed other things and the current state of play is
that Mel hasn't been able to find any workload which does much at all,
and there are moves afoot to eliminate writeback-off-LRU altogether.

So I expect this won't actually be a problem.  Until we change things again ;)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-08-26 21:30             ` Andrew Morton
@ 2010-08-27  8:17               ` Shaohua Li
  2010-09-03 21:12                 ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Shaohua Li @ 2010-08-27  8:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, linux-mm, Andi Kleen, Wu, Fengguang, minchan.kim

On Fri, 2010-08-27 at 05:30 +0800, Andrew Morton wrote:
> On Thu, 26 Aug 2010 15:59:10 +0800
> Shaohua Li <shaohua.li@intel.com> wrote:
> 
> > On Thu, Aug 26, 2010 at 04:03:18AM +0800, Andrew Morton wrote:
> > > On Fri, 6 Aug 2010 11:08:05 +0800
> > > Shaohua Li <shaohua.li@intel.com> wrote:
> > > 
> > > > Subject: mm: batch activate_page() to reduce lock contention
> > > 
> > ...
> >
> > > This function is pretty bizarre.  It really really needs some comments
> > > explaining what it's doing and most especially *why* it's doing it.
> > > 
> > > It's a potential O(n*nr_zones) search (I think)!  We demand proof that
> > > it's worthwhile!
> > > 
> > > Yes, if the pagevec is filled with pages from different zones then it
> > > will reduce the locking frequency.  But in the common case where the
> > > pagevec has pages all from the same zone, or has contiguous runs of
> > > pages from different zones then all that extra bitmap fiddling gained
> > > us nothing.
> > > 
> > > (I think the search could be made more efficient by advancing `i' when
> > > we first see last_zone!=page_zone(page), but that'd just make the code
> > > even worse).
> > Thanks for pointing this out. Then we can simplify things a little bit;
> > the 144-byte footprint comes from this too, so we can remove it.
> 
> ok..
> 
> > > 
> > > There's a downside/risk to this code.  A billion years ago I found
> > > that it was pretty important that if we're going to batch pages in this
> > > manner, it's important that ALL pages be batched via the same means. 
> > > If 99% of the pages go through the pagevec and 1% of pages bypass the
> > > pagevec, the LRU order gets scrambled and we can end up causing
> > > additional disk seeks when the time comes to write things out.  The
> > > effect was measurable.
> > > 
> > > And lo, putback_lru_pages() (at least) bypasses your new pagevecs,
> > > potentially scrambling the LRU ordering.  Admittedly, if we're putting
> > > back unreclaimable pages in there, the LRU is probably already pretty
> > > scrambled.  But that's just a guess.
> > OK, we can either drain the pagevecs in putback_lru_pages() or add active
> > pages to the new pagevecs.
> 
> The latter I guess?
Hi,
it looks like the lru_add_pvecs pagevecs are bypassed in putback_lru_pages() too.
Can we assume the bypass has no obvious impact? Each pagevec stores 14 pages,
which should be less than 1/1000 of total memory on a typical system, so I
wonder whether we really need to handle the active-page pagevec bypass.
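
As a rough sanity check of that estimate (assuming 4 KB pages and counting
only the new per-CPU activate pagevec):

	64 CPUs * 14 pages * 4 KB = 3584 KB, about 3.5 MB in total,

which stays below 1/1000 of total memory once the machine has more than
roughly 3.5 GB of RAM; the existing per-CPU lru_add/rotate pagevecs pin a
small multiple of that.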

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC]mm: batch activate_page() to reduce lock contention
  2010-08-27  8:17               ` Shaohua Li
@ 2010-09-03 21:12                 ` Andrew Morton
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2010-09-03 21:12 UTC (permalink / raw)
  To: Shaohua Li
  Cc: KOSAKI Motohiro, linux-mm, Andi Kleen, Wu, Fengguang, minchan.kim

On Fri, 27 Aug 2010 16:17:50 +0800
Shaohua Li <shaohua.li@intel.com> wrote:

> On Fri, 2010-08-27 at 05:30 +0800, Andrew Morton wrote:
> > On Thu, 26 Aug 2010 15:59:10 +0800
> > Shaohua Li <shaohua.li@intel.com> wrote:
> > 
> > > On Thu, Aug 26, 2010 at 04:03:18AM +0800, Andrew Morton wrote:
> > > > On Fri, 6 Aug 2010 11:08:05 +0800
> > > > Shaohua Li <shaohua.li@intel.com> wrote:
> > > > 
> > > > > Subject: mm: batch activate_page() to reduce lock contention
> > > > 
> > > ...
> > >
> > > > This function is pretty bizarre.  It really really needs some comments
> > > > explaining what it's doing and most especially *why* it's doing it.
> > > > 
> > > > It's a potential O(n*nr_zones) search (I think)!  We demand proof that
> > > > it's worthwhile!
> > > > 
> > > > Yes, if the pagevec is filled with pages from different zones then it
> > > > will reduce the locking frequency.  But in the common case where the
> > > > pagevec has pages all from the same zone, or has contiguous runs of
> > > > pages from different zones then all that extra bitmap fiddling gained
> > > > us nothing.
> > > > 
> > > > (I think the search could be made more efficient by advancing `i' when
> > > > we first see last_zone!=page_zone(page), but that'd just make the code
> > > > even worse).
> > > Thanks for pointing this out. Then we can simplify things a little bit;
> > > the 144-byte footprint comes from this too, so we can remove it.
> > 
> > ok..
> > 
> > > > 
> > > > There's a downside/risk to this code.  A billion years ago I found
> > > > that it was pretty important that if we're going to batch pages in this
> > > > manner, it's important that ALL pages be batched via the same means. 
> > > > If 99% of the pages go through the pagevec and 1% of pages bypass the
> > > > pagevec, the LRU order gets scrambled and we can end up causing
> > > > additional disk seeks when the time comes to write things out.  The
> > > > effect was measurable.
> > > > 
> > > > And lo, putback_lru_pages() (at least) bypasses your new pagevecs,
> > > > potentially scrambling the LRU ordering.  Admittedly, if we're putting
> > > > back unreclaimable pages in there, the LRU is probably already pretty
> > > > scrambled.  But that's just a guess.
> > > OK, we can either drain the pagevecs in putback_lru_pages() or add active
> > > pages to the new pagevecs.
> > 
> > The latter I guess?
> Hi,
> it looks like the lru_add_pvecs pagevecs are bypassed in putback_lru_pages() too.
> Can we assume the bypass has no obvious impact? Each pagevec stores 14 pages,
> which should be less than 1/1000 of total memory on a typical system, so I
> wonder whether we really need to handle the active-page pagevec bypass.

I think it would be best to always use the batched API.  Just from a
cleanliness point of view: send all the pages through the same path,
through the same official API rather than occasionally bypassing it.

Unless there's some real downside to doing it that way?


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread

Thread overview: 20+ messages
2010-07-20  7:18 [RFC]mm: batch activate_page() to reduce lock contention Shaohua Li
2010-07-21 16:06 ` Minchan Kim
2010-07-22  0:27   ` Shaohua Li
2010-07-22  1:08     ` Minchan Kim
2010-07-22  5:17       ` Shaohua Li
2010-07-22 12:28         ` Minchan Kim
2010-07-23  8:12         ` Wu Fengguang
2010-07-23  8:14           ` Wu Fengguang
2010-07-22 23:49 ` Andrew Morton
2010-07-23 15:10 ` KOSAKI Motohiro
2010-07-23 15:25   ` Andi Kleen
2010-07-23 18:06     ` KOSAKI Motohiro
2010-07-26  5:08   ` Shaohua Li
2010-08-05 21:07     ` Andrew Morton
2010-08-06  3:08       ` Shaohua Li
2010-08-25 20:03         ` Andrew Morton
2010-08-26  7:59           ` Shaohua Li
2010-08-26 21:30             ` Andrew Morton
2010-08-27  8:17               ` Shaohua Li
2010-09-03 21:12                 ` Andrew Morton
