linux-kernel.vger.kernel.org archive mirror
* [PATCH]: 1/4 batch mark_page_accessed()
@ 2004-11-21 15:44 Nikita Danilov
  2004-11-21 21:12 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Nikita Danilov @ 2004-11-21 15:44 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Andrew Morton, Linux MM Mailing List

Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
a page to be marked accessed is placed into a per-cpu pagevec
(page_accessed_pvec). When the pagevec fills up, all pages are processed in a
batch.

This is supposed to decrease contention on zone->lru_lock.

(Patch is for 2.6.10-rc2)

Signed-off-by: Nikita Danilov <nikita@clusterfs.com>

 mm/swap.c |   47 ++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 40 insertions(+), 7 deletions(-)

diff -puN mm/swap.c~batch-mark_page_accessed mm/swap.c
--- bk-linux/mm/swap.c~batch-mark_page_accessed	2004-11-21 17:01:02.061618792 +0300
+++ bk-linux-nikita/mm/swap.c	2004-11-21 17:01:02.063618488 +0300
@@ -113,6 +113,39 @@ void fastcall activate_page(struct page 
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+static void __pagevec_mark_accessed(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				local_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			local_lock_irq(&zone->lru_lock);
+		}
+		if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+			del_page_from_inactive_list(zone, page);
+			SetPageActive(page);
+			add_page_to_active_list(zone, page);
+			inc_page_state(pgactivate);
+			ClearPageReferenced(page);
+		} else if (!PageReferenced(page)) {
+			SetPageReferenced(page);
+		}
+	}
+	if (zone)
+		local_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+static DEFINE_PER_CPU(struct pagevec, page_accessed_pvec) = { 0, };
+
 /*
  * Mark a page as having seen activity.
  *
@@ -122,14 +155,14 @@ void fastcall activate_page(struct page 
  */
 void fastcall mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
-		activate_page(page);
-		ClearPageReferenced(page);
-	} else if (!PageReferenced(page)) {
-		SetPageReferenced(page);
-	}
-}
+	struct pagevec *pvec;
 
+	pvec = &get_cpu_var(page_accessed_pvec);
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_mark_accessed(pvec);
+	put_cpu_var(page_accessed_pvec);
+}
 EXPORT_SYMBOL(mark_page_accessed);
 
 /**



* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-21 15:44 [PATCH]: 1/4 batch mark_page_accessed() Nikita Danilov
@ 2004-11-21 21:12 ` Andrew Morton
  2004-11-24 10:40   ` Marcelo Tosatti
  2004-11-24 16:32 ` Marcelo Tosatti
  2004-11-26 18:58 ` Marcelo Tosatti
  2 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2004-11-21 21:12 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux-Kernel, AKPM, linux-mm

Nikita Danilov <nikita@clusterfs.com> wrote:
>
> Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
>  page to be marked accessed is placed into per-cpu pagevec
>  (page_accessed_pvec). When pagevec is filled up, all pages are processed in a
>  batch.
> 
>  This is supposed to decrease contention on zone->lru_lock.

Looks sane, although it does add more atomic ops (the extra
get_page/put_page).  Some benchmarks would be nice to have.



* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-21 21:12 ` Andrew Morton
@ 2004-11-24 10:40   ` Marcelo Tosatti
  0 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2004-11-24 10:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nikita Danilov, Linux-Kernel, linux-mm

On Sun, Nov 21, 2004 at 01:12:50PM -0800, Andrew Morton wrote:
> Nikita Danilov <nikita@clusterfs.com> wrote:
> >
> > Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
> >  page to be marked accessed is placed into per-cpu pagevec
> >  (page_accessed_pvec). When pagevec is filled up, all pages are processed in a
> >  batch.
> > 
> >  This is supposed to decrease contention on zone->lru_lock.
> 
> Looks sane, although it does add more atomic ops (the extra
> get_page/put_page).  Some benchmarks would be nice to have.

I'll run STP benchmarks as soon as STP is working again. 


* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-21 15:44 [PATCH]: 1/4 batch mark_page_accessed() Nikita Danilov
  2004-11-21 21:12 ` Andrew Morton
@ 2004-11-24 16:32 ` Marcelo Tosatti
  2004-11-24 21:53   ` Nikita Danilov
  2004-11-26 18:58 ` Marcelo Tosatti
  2 siblings, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2004-11-24 16:32 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Linux Kernel Mailing List, Andrew Morton, Linux MM Mailing List


Hi Nikita,

On Sun, Nov 21, 2004 at 06:44:04PM +0300, Nikita Danilov wrote:
> Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
> page to be marked accessed is placed into per-cpu pagevec
> (page_accessed_pvec). When pagevec is filled up, all pages are processed in a
> batch.
> 
> This is supposed to decrease contention on zone->lru_lock.
> 
> (Patch is for 2.6.10-rc2)
> 
> Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
> 
>  mm/swap.c |   47 ++++++++++++++++++++++++++++++++++++++++-------
>  1 files changed, 40 insertions(+), 7 deletions(-)
> 
> diff -puN mm/swap.c~batch-mark_page_accessed mm/swap.c
> --- bk-linux/mm/swap.c~batch-mark_page_accessed	2004-11-21 17:01:02.061618792 +0300
> +++ bk-linux-nikita/mm/swap.c	2004-11-21 17:01:02.063618488 +0300
> @@ -113,6 +113,39 @@ void fastcall activate_page(struct page 
>  	spin_unlock_irq(&zone->lru_lock);
>  }
>  
> +static void __pagevec_mark_accessed(struct pagevec *pvec)
> +{
> +	int i;
> +	struct zone *zone = NULL;
> +
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +		struct zone *pagezone = page_zone(page);
> +
> +		if (pagezone != zone) {
> +			if (zone)
> +				local_unlock_irq(&zone->lru_lock);

You surely meant spin_{un}lock_irq and not local_{un}lock_irq.

Started the STP tests on 4way/8way boxes.

> +			zone = pagezone;
> +			local_lock_irq(&zone->lru_lock);
> +		}
> +		if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
> +			del_page_from_inactive_list(zone, page);
> +			SetPageActive(page);
> +			add_page_to_active_list(zone, page);
> +			inc_page_state(pgactivate);
> +			ClearPageReferenced(page);
> +		} else if (!PageReferenced(page)) {
> +			SetPageReferenced(page);
> +		}
> +	}
> +	if (zone)
> +		local_unlock_irq(&zone->lru_lock);
> +	release_pages(pvec->pages, pvec->nr, pvec->cold);
> +	pagevec_reinit(pvec);
> +}
> +
> +static DEFINE_PER_CPU(struct pagevec, page_accessed_pvec) = { 0, };
> +




* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-24 16:32 ` Marcelo Tosatti
@ 2004-11-24 21:53   ` Nikita Danilov
  0 siblings, 0 replies; 15+ messages in thread
From: Nikita Danilov @ 2004-11-24 21:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linux Kernel Mailing List, Andrew Morton, Linux MM Mailing List

Marcelo Tosatti writes:
 > 
 > Hi Nikita,

Hello, Marcelo,

 > 

[...]

 > > +		if (pagezone != zone) {
 > > +			if (zone)
 > > +				local_unlock_irq(&zone->lru_lock);
 > 
 > You surely meant spin_{un}lock_irq and not local{un}lock_irq.

Oh, you are right. The local_lock_* functions manipulate the "local wait"
spin-lock variety that was introduced by some other patch. The
batch-mark_page_accessed patch worked only because all references
to the local_lock_* functions were removed by the pvec-cleanup patch.

Another proof of the obvious fact that manually coded pagevec iteration
is evil. :)
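For completeness, here is how the loop should read with the plain spinlock
calls - a sketch against the 2.6.10-rc2 helpers already used in the patch
above, not a tested replacement:

static void __pagevec_mark_accessed(struct pagevec *pvec)
{
	int i;
	struct zone *zone = NULL;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct zone *pagezone = page_zone(page);

		/* Take zone->lru_lock lazily: only drop and re-take it
		 * when the zone actually changes between pages. */
		if (pagezone != zone) {
			if (zone)
				spin_unlock_irq(&zone->lru_lock);
			zone = pagezone;
			spin_lock_irq(&zone->lru_lock);
		}
		if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
			del_page_from_inactive_list(zone, page);
			SetPageActive(page);
			add_page_to_active_list(zone, page);
			inc_page_state(pgactivate);
			ClearPageReferenced(page);
		} else if (!PageReferenced(page)) {
			SetPageReferenced(page);
		}
	}
	if (zone)
		spin_unlock_irq(&zone->lru_lock);
	/* Drop the references taken in mark_page_accessed() and reset
	 * the pagevec for reuse. */
	release_pages(pvec->pages, pvec->nr, pvec->cold);
	pagevec_reinit(pvec);
}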

 > 
 > Started the STP tests on 4way/8way boxes.

Great.

Nikita.


* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-21 15:44 [PATCH]: 1/4 batch mark_page_accessed() Nikita Danilov
  2004-11-21 21:12 ` Andrew Morton
  2004-11-24 16:32 ` Marcelo Tosatti
@ 2004-11-26 18:58 ` Marcelo Tosatti
  2004-11-27  0:37   ` Nick Piggin
  2004-11-27 10:41   ` Nikita Danilov
  2 siblings, 2 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2004-11-26 18:58 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Linux Kernel Mailing List, Andrew Morton, Linux MM Mailing List

On Sun, Nov 21, 2004 at 06:44:04PM +0300, Nikita Danilov wrote:
> Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
> page to be marked accessed is placed into per-cpu pagevec
> (page_accessed_pvec). When pagevec is filled up, all pages are processed in a
> batch.
> 
> This is supposed to decrease contention on zone->lru_lock.

Here are the STP 8way results:

8way:

reaim default (database IO intensive load), increases performance _significantly_:
------------------------------------------
kernel: patch-2.6.10-rc2
Peak load Test: Maximum Jobs per Minute 8491.96 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 8326.23 (average of 3 runs)

kernel: nikita-b2
Peak load Test: Maximum Jobs per Minute 9039.56 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 8325.09 (average of 3 runs)


reaim -w compute (compute intensive load), decreases performance:
-----------------------------------------
kernel: patch-2.6.10-rc2
Peak load Test: Maximum Jobs per Minute 9591.82 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 9359.76 (average of 3 runs)

kernel: nikita-b2
Peak load Test: Maximum Jobs per Minute 9533.34 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 9324.25 (average of 3 runs)

kernbench 

Decreases performance significantly (most notably at -j4), probably due to
the additional atomic operations, as noted by Andrew:

kernel: nikita-b2                               kernel: patch-2.6.10-rc2
Host: stp8-002                                  Host: stp8-003

Average Optimal -j 32 Load Run:                 Average Optimal -j 32 Load Run:
Elapsed Time 130                                Elapsed Time 129.562
User Time 872.816                               User Time 871.898
System Time 88.978                              System Time 87.346
Percent CPU 739.2                               Percent CPU 739.8
Context Switches 35111.4                        Context Switches 34973.2
Sleeps 28182.6                                  Sleeps 28465.2

Average Maximal -j Load Run:                    Average Maximal -j Load Run:
Elapsed Time 128.862                            Elapsed Time 128.334
User Time 868.234                               User Time 867.702
System Time 86.888                              System Time 85.318
Percent CPU 740.6                               Percent CPU 742.2
Context Switches 27278.2                        Context Switches 27210
Sleeps 19889                                    Sleeps 19898.4

Average Half Load -j 4 Run:                     Average Half Load -j 4 Run:
Elapsed Time 274.916                            Elapsed Time 245.026
User Time 833.63                                User Time 832.34
System Time 73.704                              System Time 73.41
Percent CPU 335.8                               Percent CPU 373.6
Context Switches 12984.8                        Context Switches 13427.4
Sleeps 21459.2                                  Sleeps 21642




* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-26 18:58 ` Marcelo Tosatti
@ 2004-11-27  0:37   ` Nick Piggin
  2004-11-30 16:29     ` Marcelo Tosatti
  2004-11-27 10:41   ` Nikita Danilov
  1 sibling, 1 reply; 15+ messages in thread
From: Nick Piggin @ 2004-11-27  0:37 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Nikita Danilov, Linux Kernel Mailing List, Andrew Morton,
	Linux MM Mailing List

Marcelo Tosatti wrote:
> On Sun, Nov 21, 2004 at 06:44:04PM +0300, Nikita Danilov wrote:
> 
>>Batch mark_page_accessed() (a la lru_cache_add() and lru_cache_add_active()):
>>page to be marked accessed is placed into per-cpu pagevec
>>(page_accessed_pvec). When pagevec is filled up, all pages are processed in a
>>batch.
>>
>>This is supposed to decrease contention on zone->lru_lock.
> 
> 
> Here are the STP 8way results:
> 
> 8way:
> 

...

> kernbench 
> 
> Decreases performance significantly (on -j4 more notably), probably due to 
> the additional atomic operations as noted by Andrew:
> 
> kernel: nikita-b2                               kernel: patch-2.6.10-rc2
> Host: stp8-002                                  Host: stp8-003
> 

...

> 
> Average Half Load -j 4 Run:                     Average Half Load -j 4 Run:
> Elapsed Time 274.916                            Elapsed Time 245.026
> User Time 833.63                                User Time 832.34
> System Time 73.704                              System Time 73.41
> Percent CPU 335.8                               Percent CPU 373.6
> Context Switches 12984.8                        Context Switches 13427.4
> Sleeps 21459.2                                  Sleeps 21642

Do you think it may be a CPU scheduling or disk/fs artifact?
Neither user nor system time is significantly worse, while the vanilla
kernel is using a lot more of the CPUs' power (i.e. waiting for IO less,
or becoming idle less often due to CPU scheduler balancing).

Aside: under-load conditions like this are actually something the CPU
scheduler doesn't currently handle brilliantly. I attribute this to the
fact that most "performance tests" load it up as much as possible.
I am (on and off) looking at improving performance in these conditions,
and am making some inroads.


* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-27 10:41   ` Nikita Danilov
@ 2004-11-27  8:19     ` Marcelo Tosatti
  0 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2004-11-27  8:19 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Linux Kernel Mailing List, Andrew Morton, Linux MM Mailing List

On Sat, Nov 27, 2004 at 01:41:58PM +0300, Nikita Danilov wrote:
> Marcelo Tosatti writes:
> 
> [...]
> 
>  > 
>  > Average Half Load -j 4 Run:                     Average Half Load -j 4 Run:
>  > Elapsed Time 274.916                            Elapsed Time 245.026
>  > User Time 833.63                                User Time 832.34
>  > System Time 73.704                              System Time 73.41
>  > Percent CPU 335.8                               Percent CPU 373.6
>  > Context Switches 12984.8                        Context Switches 13427.4
>  > Sleeps 21459.2                                  Sleeps 21642
> 
> What is "System Time" here? Does it include CPU consumption by, say,
> kswapd?

I don't think "System Time" includes kswapd - only time spent inside
the kernel, including page reclamation efforts.

> If not, then one may conjecture that mark_page_accessed batching
> modifies ordering of pages on zone "LRU" lists in a way that measurably
> changes efficiency of VM scanning, and mostly affects process that scans
> most---kswapd.

I agree.


* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-26 18:58 ` Marcelo Tosatti
  2004-11-27  0:37   ` Nick Piggin
@ 2004-11-27 10:41   ` Nikita Danilov
  2004-11-27  8:19     ` Marcelo Tosatti
  1 sibling, 1 reply; 15+ messages in thread
From: Nikita Danilov @ 2004-11-27 10:41 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linux Kernel Mailing List, Andrew Morton, Linux MM Mailing List

Marcelo Tosatti writes:

[...]

 > 
 > Average Half Load -j 4 Run:                     Average Half Load -j 4 Run:
 > Elapsed Time 274.916                            Elapsed Time 245.026
 > User Time 833.63                                User Time 832.34
 > System Time 73.704                              System Time 73.41
 > Percent CPU 335.8                               Percent CPU 373.6
 > Context Switches 12984.8                        Context Switches 13427.4
 > Sleeps 21459.2                                  Sleeps 21642

What is "System Time" here? Does in include CPU consumption by, say,
kswapd? If not, then one may conjecture that mark_page_accessed batching
modifies ordering of pages on zone "LRU" lists in a way that measurably
changes efficiency of VM scanning, and mostly affects process that scans
most---kswapd.

Nikita.



* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-27  0:37   ` Nick Piggin
@ 2004-11-30 16:29     ` Marcelo Tosatti
  2004-12-01  1:33       ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2004-11-30 16:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nikita Danilov, Linux Kernel Mailing List, Andrew Morton,
	Linux MM Mailing List

On Sat, Nov 27, 2004 at 11:37:17AM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> >On Sun, Nov 21, 2004 at 06:44:04PM +0300, Nikita Danilov wrote:
> >
> >>Batch mark_page_accessed() (a la lru_cache_add() and 
> >>lru_cache_add_active()):
> >>page to be marked accessed is placed into per-cpu pagevec
> >>(page_accessed_pvec). When pagevec is filled up, all pages are processed 
> >>in a
> >>batch.
> >>
> >>This is supposed to decrease contention on zone->lru_lock.
> >
> >
> >Here are the STP 8way results:
> >
> >8way:
> >
> 
> ...
> 
> >kernbench 
> >
> >Decreases performance significantly (on -j4 more notably), probably due to 
> >the additional atomic operations as noted by Andrew:
> >
> >kernel: nikita-b2                               kernel: patch-2.6.10-rc2
> >Host: stp8-002                                  Host: stp8-003
> >
> 
> ...
> 
> >
> >Average Half Load -j 4 Run:                     Average Half Load -j 4 Run:
> >Elapsed Time 274.916                            Elapsed Time 245.026
> >User Time 833.63                                User Time 832.34
> >System Time 73.704                              System Time 73.41
> >Percent CPU 335.8                               Percent CPU 373.6
> >Context Switches 12984.8                        Context Switches 13427.4
> >Sleeps 21459.2                                  Sleeps 21642
> 
> Do you think looks like it may be a CPU scheduling or disk/fs artifact?
> Neither user nor system time are significantly worse, while the vanilla
> kernel is using a lot more of the CPUs' power (ie waiting for IO less,
> or becoming idle less often due to CPU scheduler balancing).

Nick,

I do not think it is a disk/fs artifact.

The ordering of LRU pages should be enhanced with respect to locality:
with the mark_page_accessed batching you group together a task's accessed
pages and move them to the active list at once.

You maintain better locality ordering, while decreasing the precision of
aging/temporal locality.

This should enhance disk writeout performance.

On the other hand, without batching you mix the locality up in LRU - the LRU becomes 
more precise in terms of "LRU aging", but less ordered in terms of sequential 
access pattern.

The disk-IO-intensive reaim shows a very significant gain from the batching,
probably due to the enhanced LRU ordering (what Nikita says).

The slowdown is probably due to the additional atomic_inc by page_cache_get(). 

Is there no way to avoid that page_cache_get() there (and in lru_cache_add() also)?

> Aside: under-load conditions like this is actually something where the
> CPU scheduler doesn't do brilliantly at currently. I attribute this to
> probably most "performance tests" loading it up as much as possible.
> I am (on and off) looking at improving performance in these conditions,
> and am making some inroads. 




* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-12-01  1:33       ` Andrew Morton
@ 2004-11-30 22:57         ` Marcelo Tosatti
  2004-12-01 12:23         ` Nikita Danilov
  1 sibling, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2004-11-30 22:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, nikita, Linux-Kernel, linux-mm

On Tue, Nov 30, 2004 at 05:33:23PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > Because the ordering of LRU pages should be enhanced in respect to locality, 
> >  with the mark_page_accessed batching you group together tasks accessed pages 
> >  and move them at once to the active list. 
> > 
> >  You maintain better locality ordering, while decreasing the precision of aging/
> >  temporal locality.
> > 
> >  Which should enhance disk writeout performance.
> 
> I'll buy that explanation.  Although I'm a bit sceptical that it is
> measurable.

It's just a theory that makes sense to me, but yes, we need to measure it.

> Was that particular workload actually performing significant amounts of
> writeout in vmscan.c?  (We should have direct+kswapd counters for that, but
> we don't.  /proc/vmstat:pgrotated will give us an idea).

I strongly believe so - it's a memory-hungry benchmark. I'll collect status later
this week or next.

> >  On the other hand, without batching you mix the locality up in LRU - the LRU becomes 
> >  more precise in terms of "LRU aging", but less ordered in terms of sequential 
> >  access pattern.
> > 
> >  The disk IO intensive reaim has very significant gain from the batching, its
> >  probably due to the enhanced LRU ordering (what Nikita says).
> > 
> >  The slowdown is probably due to the additional atomic_inc by page_cache_get(). 
> > 
> >  Is there no way to avoid such page_cache_get there (and in lru_cache_add also)?
> 
> Not really.  The page is only in the pagevec at that time - if someone does
> a put_page() on it the page will be freed for real, and will then be
> spilled onto the LRU.  Messy. 

We could handle such a situation at allocation time and during vmscan - maybe it's doable.
Just pondering; maybe it's indeed too messy to even ponder.



* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-11-30 16:29     ` Marcelo Tosatti
@ 2004-12-01  1:33       ` Andrew Morton
  2004-11-30 22:57         ` Marcelo Tosatti
  2004-12-01 12:23         ` Nikita Danilov
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2004-12-01  1:33 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: nickpiggin, nikita, Linux-Kernel, AKPM, linux-mm

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> Because the ordering of LRU pages should be enhanced in respect to locality, 
>  with the mark_page_accessed batching you group together tasks accessed pages 
>  and move them at once to the active list. 
> 
>  You maintain better locality ordering, while decreasing the precision of aging/
>  temporal locality.
> 
>  Which should enhance disk writeout performance.

I'll buy that explanation.  Although I'm a bit sceptical that it is
measurable.

Was that particular workload actually performing significant amounts of
writeout in vmscan.c?  (We should have direct+kswapd counters for that, but
we don't.  /proc/vmstat:pgrotated will give us an idea).


>  On the other hand, without batching you mix the locality up in LRU - the LRU becomes 
>  more precise in terms of "LRU aging", but less ordered in terms of sequential 
>  access pattern.
> 
>  The disk IO intensive reaim has very significant gain from the batching, its
>  probably due to the enhanced LRU ordering (what Nikita says).
> 
>  The slowdown is probably due to the additional atomic_inc by page_cache_get(). 
> 
>  Is there no way to avoid such page_cache_get there (and in lru_cache_add also)?

Not really.  The page is only in the pagevec at that time - if someone does
a put_page() on it the page will be freed for real, and will then be
spilled onto the LRU.  Messy.


* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-12-01  1:33       ` Andrew Morton
  2004-11-30 22:57         ` Marcelo Tosatti
@ 2004-12-01 12:23         ` Nikita Danilov
  2004-12-01 18:58           ` Marcelo Tosatti
  1 sibling, 1 reply; 15+ messages in thread
From: Nikita Danilov @ 2004-12-01 12:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marcelo Tosatti, nickpiggin, Linux-Kernel, linux-mm

Andrew Morton writes:
 > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
 > >
 > > Because the ordering of LRU pages should be enhanced in respect to locality, 
 > >  with the mark_page_accessed batching you group together tasks accessed pages 
 > >  and move them at once to the active list. 
 > > 
 > >  You maintain better locality ordering, while decreasing the precision of aging/
 > >  temporal locality.
 > > 
 > >  Which should enhance disk writeout performance.
 > 
 > I'll buy that explanation.  Although I'm a bit sceptical that it is
 > measurable.

cluster-pageout.patch, which was sent together with
mark_page_accessed-batching.patch, has a roughly similar effect: page-out
is done in file order, ignoring the local LRU order at the end of the inactive
list. It did improve performance in a page-out-intensive micro-benchmark:

Average number of microseconds it takes to dirty 1GB of a
16-times-larger-than-RAM ext3 file mmapped in 1GB chunks:

without-patch:   average:    74188417.156250
               deviation:    10538258.613280

   with-patch:   average:    69449001.583333
               deviation:    12621756.615280
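
The benchmark itself was not posted here; a minimal userspace sketch of the
kind of measurement described above could look like the following (the file
path, the single fixed chunk, and the 4K page stride are assumptions):

/*
 * Hypothetical reconstruction: time how long it takes to dirty one
 * 1GB mmap()ed chunk of a large ext3 file.  Not the benchmark that
 * was actually run.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK	(1UL << 30)	/* 1GB chunk */
#define PAGESZ	4096UL		/* assumed page size */

int main(void)
{
	struct timeval start, end;
	size_t off;
	char *p;
	int fd;

	fd = open("/mnt/ext3/bigfile", O_RDWR);	/* assumed path */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	gettimeofday(&start, NULL);

	/* Dirty every page of the chunk; with the file much larger
	 * than RAM this forces continuous page-out. */
	for (off = 0; off < CHUNK; off += PAGESZ)
		p[off] = 1;

	gettimeofday(&end, NULL);
	printf("%ld usec\n",
	       (long)((end.tv_sec - start.tv_sec) * 1000000L +
		      (end.tv_usec - start.tv_usec)));

	munmap(p, CHUNK);
	close(fd);
	return 0;
}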

 > 
 > Was that particular workload actually performing significant amounts of
 > writeout in vmscan.c?  (We should have direct+kswapd counters for that, but
 > we don't.  /proc/vmstat:pgrotated will give us an idea).
 > 
 > 
 > >  On the other hand, without batching you mix the locality up in LRU - the LRU becomes 
 > >  more precise in terms of "LRU aging", but less ordered in terms of sequential 
 > >  access pattern.
 > > 
 > >  The disk IO intensive reaim has very significant gain from the batching, its
 > >  probably due to the enhanced LRU ordering (what Nikita says).
 > > 
 > >  The slowdown is probably due to the additional atomic_inc by page_cache_get(). 
 > > 
 > >  Is there no way to avoid such page_cache_get there (and in lru_cache_add also)?
 > 
 > Not really.  The page is only in the pagevec at that time - if someone does
 > a put_page() on it the page will be freed for real, and will then be
 > spilled onto the LRU.  Messy.

I don't think that the atomic_inc will be particularly
costly. generic_file_{write,read}() call find_get_page() just before
calling mark_page_accessed(), so the cache line with the page reference
counter is most likely still exclusively owned by this CPU.
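
A simplified sketch of that caller pattern, shaped after the 2.6-era read
path in mm/filemap.c (not a verbatim excerpt; the helper name is made up
for illustration):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

static void touch_one_page(struct address_space *mapping, unsigned long index)
{
	struct page *page;

	/* find_get_page() already bumps the page reference count with
	 * an atomic op, pulling its cache line onto this CPU. */
	page = find_get_page(mapping, index);
	if (!page)
		return;

	/* The extra page_cache_get() inside the batched
	 * mark_page_accessed() therefore hits a cache line this CPU
	 * most likely still owns exclusively. */
	mark_page_accessed(page);

	/* ... copy data to/from the page ... */

	page_cache_release(page);
}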

Nikita.


* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-12-01 12:23         ` Nikita Danilov
@ 2004-12-01 18:58           ` Marcelo Tosatti
  2004-12-02  1:59             ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2004-12-01 18:58 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Andrew Morton, nickpiggin, Linux-Kernel, linux-mm


<snip>

>  > >  On the other hand, without batching you mix the locality up in LRU - the LRU becomes 
>  > >  more precise in terms of "LRU aging", but less ordered in terms of sequential 
>  > >  access pattern.
>  > > 
>  > >  The disk IO intensive reaim has very significant gain from the batching, its
>  > >  probably due to the enhanced LRU ordering (what Nikita says).
>  > > 
>  > >  The slowdown is probably due to the additional atomic_inc by page_cache_get(). 
>  > > 
>  > >  Is there no way to avoid such page_cache_get there (and in lru_cache_add also)?
>  > 
>  > Not really.  The page is only in the pagevec at that time - if someone does
>  > a put_page() on it the page will be freed for real, and will then be
>  > spilled onto the LRU.  Messy.
> 
> I don't think that atomic_inc will be particularly
> costly. generic_file_{write,read}() call find_get_page() just before
> calling mark_page_accessed(), so cache-line with page reference counter
> is most likely still exclusive owned by this CPU. 

Assuming that is true - what could cause the slowdown? 

There are only benefits from the mark_page_accessed batching; I can't
see any drawbacks. Do you?


* Re: [PATCH]: 1/4 batch mark_page_accessed()
  2004-12-01 18:58           ` Marcelo Tosatti
@ 2004-12-02  1:59             ` Andrew Morton
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2004-12-02  1:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: nikita, nickpiggin, Linux-Kernel, linux-mm

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> > I don't think that atomic_inc will be particularly
>  > costly. generic_file_{write,read}() call find_get_page() just before
>  > calling mark_page_accessed(), so cache-line with page reference counter
>  > is most likely still exclusive owned by this CPU. 
> 
>  Assuming that is true - what could cause the slowdown? 

It isn't true.  Atomic ops have a considerable overhead, and this is
unrelated to cache misses.  This is especially true of P4s.

Now, that overhead may be justified.  Needs more study.

