* [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
@ 2009-04-29 23:25 Mathieu Desnoyers
  2009-04-29 23:56 ` Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-29 23:25 UTC (permalink / raw)
  To: Linus Torvalds, akpm, Nick Piggin
  Cc: Ingo Molnar, KOSAKI Motohiro, Peter Zijlstra, thomas.pi,
	Yuriy Lalym, linux-kernel, ltt-dev

Basically, the following execution:

dd if=/dev/zero of=/tmp/testfile

will slowly fill _all_ available RAM without taking memory pressure
into account.

This is because the dirty page accounting is incorrect in
redirty_page_for_writepage().

This patch adds the missing dirty page accounting in redirty_page_for_writepage().
This should fix a _lot_ of issues involving machines becoming slow under heavy
write I/O. No surprise: eventually the system starts swapping.

This applies to Linux kernel 2.6.30-rc2.

The /proc/meminfo picture before applying this patch, taken after filling my
memory with the dd run, was:

MemTotal:       16433732 kB
MemFree:        10919700 kB
Buffers:           12492 kB
Cached:          5262508 kB
SwapCached:            0 kB
Active:            37096 kB
Inactive:        5254384 kB
Active(anon):      16716 kB
Inactive(anon):        0 kB
Active(file):      20380 kB
Inactive(file):  5254384 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      19535024 kB
SwapFree:       19535024 kB
Dirty:           2125956 kB
Writeback:         50476 kB
AnonPages:         16660 kB
Mapped:             9560 kB
Slab:             189692 kB
SReclaimable:     166688 kB
SUnreclaim:        23004 kB
PageTables:         3396 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    27751888 kB
Committed_AS:      53904 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       10764 kB
VmallocChunk:   34359726963 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        3456 kB
DirectMap2M:    16773120 kB

After applying my patch, the same test case steadily leaves between 8
and 500 MB of RAM free once the steady state (memory pressure) is reached.

MemTotal:       16433732 kB
MemFree:           85144 kB
Buffers:           23148 kB
Cached:         15766280 kB
SwapCached:            0 kB
Active:            51500 kB
Inactive:       15755140 kB
Active(anon):      15540 kB
Inactive(anon):     1824 kB
Active(file):      35960 kB
Inactive(file): 15753316 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      19535024 kB
SwapFree:       19535024 kB
Dirty:           2501644 kB
Writeback:         33280 kB
AnonPages:         17280 kB
Mapped:             9272 kB
Slab:             505524 kB
SReclaimable:     485596 kB
SUnreclaim:        19928 kB
PageTables:         3396 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    27751888 kB
Committed_AS:      54508 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       10764 kB
VmallocChunk:   34359726715 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        3456 kB
DirectMap2M:    16773120 kB

The pressure pattern I see with the patch applied (16GB RAM total):

- Inactive(file) fills up to 15.7GB.
- Dirty fills up to 1.7GB.
- Writeback varies between 0 and 600MB.

sync() behavior:

- Dirty drops to ~6MB.
- Writeback increases to 1.6GB, then shrinks down to ~0MB.

References:
This insanely huge
http://bugzilla.kernel.org/show_bug.cgi?id=12309
[Bug 12309] Large I/O operations result in slow performance and high iowait times
(yes, I've been in CC all along)

Special thanks to Linus Torvalds and Nick Piggin and Thomas Pi for their
suggestions on previous patch iterations.

Special thanks to the LTTng community, which helped me get LTTng to its
current usability level. It has been tremendously useful in understanding
these problematic I/O workloads and generating fio test cases.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: akpm@linux-foundation.org
CC: Nick Piggin <nickpiggin@yahoo.com.au>
CC: Ingo Molnar <mingo@elte.hu>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: thomas.pi@arcor.dea
CC: Yuriy Lalym <ylalym@gmail.com>
---
 mm/page-writeback.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-2.6-lttng/mm/page-writeback.c
===================================================================
--- linux-2.6-lttng.orig/mm/page-writeback.c	2009-04-29 18:14:48.000000000 -0400
+++ linux-2.6-lttng/mm/page-writeback.c	2009-04-29 18:23:59.000000000 -0400
@@ -1237,6 +1237,12 @@ int __set_page_dirty_nobuffers(struct pa
 		if (!mapping)
 			return 1;
 
+		/*
+		 * Take care of setting back page accounting correctly.
+		 */
+		inc_zone_page_state(page, NR_FILE_DIRTY);
+		inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+
 		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-29 23:25 [PATCH] Fix dirty page accounting in redirty_page_for_writepage() Mathieu Desnoyers
@ 2009-04-29 23:56 ` Mathieu Desnoyers
  2009-04-29 23:59 ` Andrew Morton
  2009-04-30  0:06 ` Linus Torvalds
  2 siblings, 0 replies; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-29 23:56 UTC (permalink / raw)
  To: Linus Torvalds, akpm, Nick Piggin
  Cc: Ingo Molnar, KOSAKI Motohiro, Peter Zijlstra, thomas.pi,
	Yuriy Lalym, linux-kernel, ltt-dev

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> Basically, the following execution :
> 
> dd if=/dev/zero of=/tmp/testfile
> 
> will slowly fill _all_ ram available without taking into account memory
> pressure.
> 
> This is because the dirty page accounting is incorrect in
> redirty_page_for_writepage.
> 
> This patch adds missing dirty page accounting in redirty_page_for_writepage().
> This should fix a _lot_ of issues involving machines becoming slow under heavy
> write I/O. No surprise : eventually the system starts swapping.
> 
> Linux kernel 2.6.30-rc2
> 
> The /proc/meminfo picture I had before applying this patch after filling my
> memory with the dd execution was :
> 
> MemTotal:       16433732 kB
> MemFree:        10919700 kB

Darn, I did not take this meminfo snapshot at the appropriate moment.

I actually have to double-check whether 2.6.30-rc still shows the bogus
behavior I identified in the 2.6.28-2.6.29 days. Then I'll check with
earlier 2.6.29.x. I know there has been some improvement on the ext3
side since then. I'll come back when I have that information.

Sorry.

Mathieu

> [...]

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-29 23:25 [PATCH] Fix dirty page accounting in redirty_page_for_writepage() Mathieu Desnoyers
  2009-04-29 23:56 ` Mathieu Desnoyers
@ 2009-04-29 23:59 ` Andrew Morton
  2009-04-30  2:34   ` Mathieu Desnoyers
  2009-04-30  0:06 ` Linus Torvalds
  2 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2009-04-29 23:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: torvalds, nickpiggin, mingo, kosaki.motohiro, a.p.zijlstra,
	thomas.pi, ylalym, linux-kernel, ltt-dev

On Wed, 29 Apr 2009 19:25:46 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Basically, the following execution :
> 
> dd if=/dev/zero of=/tmp/testfile
> 
> will slowly fill _all_ ram available without taking into account memory
> pressure.
> 
> This is because the dirty page accounting is incorrect in
> redirty_page_for_writepage.
> 
> This patch adds missing dirty page accounting in redirty_page_for_writepage().

The patch changes __set_page_dirty_nobuffers(), not
redirty_page_for_writepage().

__set_page_dirty_nobuffers() has a huge number of callers.

> --- linux-2.6-lttng.orig/mm/page-writeback.c	2009-04-29 18:14:48.000000000 -0400
> +++ linux-2.6-lttng/mm/page-writeback.c	2009-04-29 18:23:59.000000000 -0400
> @@ -1237,6 +1237,12 @@ int __set_page_dirty_nobuffers(struct pa
>  		if (!mapping)
>  			return 1;
>  
> +		/*
> +		 * Take care of setting back page accounting correctly.
> +		 */
> +		inc_zone_page_state(page, NR_FILE_DIRTY);
> +		inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> +
>  		spin_lock_irq(&mapping->tree_lock);
>  		mapping2 = page_mapping(page);
>  		if (mapping2) { /* Race with truncate? */
> 

But __set_page_dirty_nobuffers() calls account_page_dirtied(), which
already does the above two operations.  AFAICT we're now
double-accounting.

Now, it's possible that the accounting goes wrong very occasionally in
the "/* Race with truncate?  */" case.  If the truncate path clears the
page's dirty bit then it will decrement the dirty-page accounting, but
this code path will fail to perform the increment of the dirty-page
accounting.  IOW, once this function has set PG_Dirty, it is committed
to altering some or all of the page-dirty accounting.

But AFAICT your test case will not trigger the race-with-truncate anyway?

Can you determine at approximately what frequency (pages-per-second)
this accounting leak is occurring in your test?



* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-29 23:25 [PATCH] Fix dirty page accounting in redirty_page_for_writepage() Mathieu Desnoyers
  2009-04-29 23:56 ` Mathieu Desnoyers
  2009-04-29 23:59 ` Andrew Morton
@ 2009-04-30  0:06 ` Linus Torvalds
  2009-04-30  2:43   ` Mathieu Desnoyers
  2 siblings, 1 reply; 62+ messages in thread
From: Linus Torvalds @ 2009-04-30  0:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Nick Piggin, Ingo Molnar, KOSAKI Motohiro,
	Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev



On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
>
> This patch adds missing dirty page accounting in redirty_page_for_writepage().
> This should fix a _lot_ of issues involving machines becoming slow under heavy
> write I/O. No surprise : eventually the system starts swapping.

That patch (and description) is odd.

The patch actually adds the dirty page accounting not to 
redirty_page_for_writepage(), but to __set_page_dirty_nobuffers().

And __set_page_dirty_nobuffers() will later (just a few lines down) call 
down to account_page_dirtied(), which in turn does all that 
same accounting (assuming the "mapping" is marked to account for dirty).

So the description seems to be wrong, but so does the patch. 

Did you attach the wrong patch (explaining both problems)?

Or if the patch is what you really meant to do, then you need to fix your 
explanation, and also explain why the double-dirty accounting is a good 
idea.

Or is the real problem perhaps that your /tmp is ramdisk, and not marked 
to do dirty accounting?

		Linus


* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-29 23:59 ` Andrew Morton
@ 2009-04-30  2:34   ` Mathieu Desnoyers
  0 siblings, 0 replies; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-30  2:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, nickpiggin, mingo, kosaki.motohiro, a.p.zijlstra,
	thomas.pi, ylalym, linux-kernel, ltt-dev

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Wed, 29 Apr 2009 19:25:46 -0400
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Basically, the following execution :
> > 
> > dd if=/dev/zero of=/tmp/testfile
> > 
> > will slowly fill _all_ ram available without taking into account memory
> > pressure.
> > 
> > This is because the dirty page accounting is incorrect in
> > redirty_page_for_writepage.
> > 
> > This patch adds missing dirty page accounting in redirty_page_for_writepage().
> 
> The patch changes __set_page_dirty_nobuffers(), not
> redirty_page_for_writepage().
> 
> __set_page_dirty_nobuffers() has a huge number of callers.
> 

Right.

> > --- linux-2.6-lttng.orig/mm/page-writeback.c	2009-04-29 18:14:48.000000000 -0400
> > +++ linux-2.6-lttng/mm/page-writeback.c	2009-04-29 18:23:59.000000000 -0400
> > @@ -1237,6 +1237,12 @@ int __set_page_dirty_nobuffers(struct pa
> >  		if (!mapping)
> >  			return 1;
> >  
> > +		/*
> > +		 * Take care of setting back page accounting correctly.
> > +		 */
> > +		inc_zone_page_state(page, NR_FILE_DIRTY);
> > +		inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > +
> >  		spin_lock_irq(&mapping->tree_lock);
> >  		mapping2 = page_mapping(page);
> >  		if (mapping2) { /* Race with truncate? */
> > 
> 
> But __set_page_dirty_nobuffers() calls account_page_dirtied(), which
> already does the above two operations.  afacit we're now
> double-accounting.
> 

Yes, you are right.

> Now, it's possible that the accounting goes wrong very occasionally in
> the "/* Race with truncate?  */" case.  If the truncate path clears the
> page's dirty bit then it will decrement the dirty-page accounting, but
> this code path will fail to perform the increment of the dirty-page
> accounting.  IOW, once this function has set PG_Dirty, it is committed
> to altering some or all of the page-dirty accounting.
> 
> But afacit your test case will not trigger the race-with-truncate anyway?
> 
> Can you determine at approximately what frequency (pages-per-second)
> this accounting leak is occurring in your test?
> 

0 per minute actually. I've tried adding a printk when the

if (mapping2) {

} else {
  <--
}

case is hit, and it never triggered in my tests.

I am currently trying to figure out if I can reproduce the OOM problems
I had experienced with 2.6.29-rc3. I am investigating memory accounting
by turning the memory accounting code into a slow cache-line bouncing
version and by adding assertions that the per-zone global counters never
go below zero. Unbalanced accounting could have some nasty long-term
effects on memory pressure accounting.

But so far the memory accounting code looks solid. It's my bad then. I
cannot reproduce the behavior I noticed with 2.6.29-rc3, so I guess we
should consider this a non-issue (or code 9 if you prefer). ;)

Thanks for looking into this.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30  0:06 ` Linus Torvalds
@ 2009-04-30  2:43   ` Mathieu Desnoyers
  2009-04-30  6:21     ` Ingo Molnar
  2009-04-30 13:22     ` Christoph Lameter
  0 siblings, 2 replies; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-30  2:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Nick Piggin, Ingo Molnar, KOSAKI Motohiro,
	Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Christoph Lameter

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> >
> > This patch adds missing dirty page accounting in redirty_page_for_writepage().
> > This should fix a _lot_ of issues involving machines becoming slow under heavy
> > write I/O. No surprise : eventually the system starts swapping.
> 
> That patch (and description) is odd.
> 
> The patch actually adds the dirty page accounting not to 
> redirty_page_for_writepage(), but to __set_page_dirty_nobuffers().
> 
> And __set_page_dirty_nobuffers() will later (just a few lines down) call 
> down to account_page_dirtied(), which in turn does all that 
> same accounting (assuming the "mapping" is marked to account for dirty.
> 
> So the description seems to be wrong, but so does the patch. 
> 
> Did you attach the wrong patch (explaining both problems)?
> 

Nope, I was just trying to get back into the OOM problem I experienced
when testing the "heavy I/O latency" bug. But it looks like I can't
reproduce it anyway.

> Or if the patch is what you really meant to do, then you need to fix your 
> explanation, and also explain why the double-dirty accounting is a good 
> idea.

Nope, I just "messed this one up" completely. :-)

> 
> Or is the real problem perhaps that your /tmp is ramdisk, and not marked 
> to do dirty accounting?

Nope, I don't think so. /tmp is on an ext3 partition, on RAID-1 SATA
disks.

And thanks for the review! This exercise only convinced me that the
kernel memory accounting works as expected. All this gave me the chance
to have a good look at the memory accounting code. We could probably
benefit from Christoph Lameter's cpu ops (using segment registers to
address per-cpu variables with atomic inc/dec) in there. Or at least,
removing interrupt disabling by using preempt disable and local_t
variables for the per-cpu counters could bring some benefit.

Mathieu

> 
> 		Linus

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30  2:43   ` Mathieu Desnoyers
@ 2009-04-30  6:21     ` Ingo Molnar
  2009-04-30  6:33       ` [ltt-dev] " Mathieu Desnoyers
  2009-05-03  2:40       ` Tejun Heo
  2009-04-30 13:22     ` Christoph Lameter
  1 sibling, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30  6:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, Nick Piggin, KOSAKI Motohiro,
	Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Christoph Lameter


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> And thanks for the review! This excercise only convinced me that 
> the kernel memory accounting works as expected. All this gave me 
> the chance to have a good look at the memory accounting code. We 
> could probably benefit of Christoph Lameter's cpu ops (using 
> segment registers to address per-cpu variables with atomic 
> inc/dec) in there. Or at least removing interrupt disabling by 
> using preempt disable and local_t variables for the per-cpu 
> counters could bring some benefit.

Note, optimized per cpu ops are already implemented upstream, by 
Tejun Heo's percpu patches in .30:

 #define percpu_read(var)	percpu_from_op("mov", per_cpu__##var)
 #define percpu_write(var, val)	percpu_to_op("mov", per_cpu__##var, val)
 #define percpu_add(var, val)	percpu_to_op("add", per_cpu__##var, val)
 #define percpu_sub(var, val)	percpu_to_op("sub", per_cpu__##var, val)
 #define percpu_and(var, val)	percpu_to_op("and", per_cpu__##var, val)
 #define percpu_or(var, val)	percpu_to_op("or", per_cpu__##var, val)
 #define percpu_xor(var, val)	percpu_to_op("xor", per_cpu__##var, val)

See:

  6dbde35: percpu: add optimized generic percpu accessors

From the changelog:

    [...]
    The advantage is that for example to read a local percpu variable,
    instead of this sequence:
    
     return __get_cpu_var(var);
    
     ffffffff8102ca2b:  48 8b 14 fd 80 09 74    mov    -0x7e8bf680(,%rdi,8),%rdx
     ffffffff8102ca32:  81
     ffffffff8102ca33:  48 c7 c0 d8 59 00 00    mov    $0x59d8,%rax
     ffffffff8102ca3a:  48 8b 04 10             mov    (%rax,%rdx,1),%rax
    
    We can get a single instruction by using the optimized variants:
    
     return percpu_read(var);
    
     ffffffff8102ca3f:  65 48 8b 05 91 8f fd    mov    %gs:0x7efd8f91(%rip),%rax
    [...]

So if you want to make use of it, percpu_add()/percpu_sub() would be 
the place to start.

	Ingo


* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30  6:21     ` Ingo Molnar
@ 2009-04-30  6:33       ` Mathieu Desnoyers
  2009-04-30  6:50         ` Ingo Molnar
  2009-05-03  2:40       ` Tejun Heo
  1 sibling, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-30  6:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Andrew Morton, thomas.pi,
	Linus Torvalds, Christoph Lameter

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > And thanks for the review! This excercise only convinced me that 
> > the kernel memory accounting works as expected. All this gave me 
> > the chance to have a good look at the memory accounting code. We 
> > could probably benefit of Christoph Lameter's cpu ops (using 
> > segment registers to address per-cpu variables with atomic 
> > inc/dec) in there. Or at least removing interrupt disabling by 
> > using preempt disable and local_t variables for the per-cpu 
> > counters could bring some benefit.
> 
> Note, optimized per cpu ops are already implemented upstream, by 
> Tejun Heo's percpu patches in .30:
> 
>  #define percpu_read(var)	percpu_from_op("mov", per_cpu__##var)
>  #define percpu_write(var, val)	percpu_to_op("mov", per_cpu__##var, val)
>  #define percpu_add(var, val)	percpu_to_op("add", per_cpu__##var, val)
>  #define percpu_sub(var, val)	percpu_to_op("sub", per_cpu__##var, val)
>  #define percpu_and(var, val)	percpu_to_op("and", per_cpu__##var, val)
>  #define percpu_or(var, val)	percpu_to_op("or", per_cpu__##var, val)
>  #define percpu_xor(var, val)	percpu_to_op("xor", per_cpu__##var, val)
> 
> See:
> 
>   6dbde35: percpu: add optimized generic percpu accessors
> 
> From the changelog:
> 
>     [...]
> 
> So if you want to make use of it, percpu_add()/percpu_sub() would be 
> the place to start.
> 

Great !

I see, however, that it's only guaranteed to be atomic wrt preemption.
What would be even better would be to have atomic ops wrt local irqs
(as local.h does) available in this percpu flavor. That way we could
have interrupt- and nmi-safe per-cpu counters without even needing to
disable preemption.

In terms of counters, except maybe for tri-values for some
architectures, I don't see how we could manage synchronization in a
better way.

Mathieu

> 	Ingo
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30  6:33       ` [ltt-dev] " Mathieu Desnoyers
@ 2009-04-30  6:50         ` Ingo Molnar
  2009-04-30 13:38           ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30  6:50 UTC (permalink / raw)
  To: Mathieu Desnoyers, Tejun Heo
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Andrew Morton, thomas.pi,
	Linus Torvalds, Christoph Lameter


* Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:

> * Ingo Molnar (mingo@elte.hu) wrote:
> > [...]
> 
> Great !
> 
> I see however that it's only guaranteed to be atomic wrt preemption.

That's really only true for the non-x86 fallback defines. If we so 
decide, we could make the fallbacks in asm-generic/percpu.h irq-safe 
...

> What would be even better would be to have the atomic ops wrt local irqs
> (as local.h does) available in this percpu flavor. By doing this, we
> could have interrupt and nmi-safe per-cpu counters, without even the
> need to disable preemption.

nmi-safe isn't a big issue (we have no NMI code that interacts with 
MM counters) - and we could make them irq-safe by fixing the 
wrapper. (And on x86 they are NMI-safe too.)

	Ingo


* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30  2:43   ` Mathieu Desnoyers
  2009-04-30  6:21     ` Ingo Molnar
@ 2009-04-30 13:22     ` Christoph Lameter
  2009-04-30 13:38       ` Ingo Molnar
  1 sibling, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 13:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, Nick Piggin, Ingo Molnar,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev

On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:

> to have a good look at the memory accounting code. We could probably
> benefit of Christoph Lameter's cpu ops (using segment registers to
> address per-cpu variables with atomic inc/dec) in there. Or at least
> removing interrupt disabling by using preempt disable and local_t
> variables for the per-cpu counters could bring some benefit.

Guess we are ready for atomic per cpu ops now that the new per cpu
allocator is in? Segment register issues with the PDA are also solved
right?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30  6:50         ` Ingo Molnar
@ 2009-04-30 13:38           ` Christoph Lameter
  2009-04-30 14:10             ` Ingo Molnar
  2009-04-30 14:12             ` Mathieu Desnoyers
  0 siblings, 2 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 13:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Tejun Heo, Nick Piggin, Peter Zijlstra,
	Yuriy Lalym, Linux Kernel Mailing List, ltt-dev, Andrew Morton,
	thomas.pi, Linus Torvalds

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> > I see however that it's only guaranteed to be atomic wrt preemption.
>
> That's really only true for the non-x86 fallback defines. If we so
> decide, we could make the fallbacks in asm-generic/percpu.h irq-safe

The fallbacks have different semantics and therefore we cannot rely on
irq safeness in the core code when using the x86 cpu ops.

> nmi-safe isn't a big issue (we have no NMI code that interacts with
> MM counters) - and we could make them irq-safe by fixing the
> wrapper. (and on x86 they are NMI-safe too.)

There are also contexts in which you already are preempt-safe and where the
per cpu ops do not need to go through the preemption hoops.

This means it would be best to have 3 variants for 3 different contexts in
the core code:

1. Need irq safety
2. Need preempt safety
3. We know the operation is safe because preemption has already been
disabled or irqs are already disabled.

The 3 variants on x86 generate the same instructions. On other platforms
they would need to be able to fallback in various way depending on the
availability of instructions that are atomic vs. preempt or irqs.

http://thread.gmane.org/gmane.linux.kernel.cross-arch/1124
http://lwn.net/Articles/284526/


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 13:22     ` Christoph Lameter
@ 2009-04-30 13:38       ` Ingo Molnar
  2009-04-30 13:40         ` Christoph Lameter
  2009-04-30 13:50         ` Mathieu Desnoyers
  0 siblings, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 13:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev


* Christoph Lameter <cl@linux.com> wrote:

> On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> 
> > to have a good look at the memory accounting code. We could 
> > probably benefit of Christoph Lameter's cpu ops (using segment 
> > registers to address per-cpu variables with atomic inc/dec) in 
> > there. Or at least removing interrupt disabling by using preempt 
> > disable and local_t variables for the per-cpu counters could 
> > bring some benefit.
> 
> Guess we are ready for atomic per cpu ops now that the new per cpu 
> allocator is in? Segment register issues with the PDA are also 
> solved right?

it's all done, implemented and upstream already. You are a bit late 
to the party ;-)

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 13:38       ` Ingo Molnar
@ 2009-04-30 13:40         ` Christoph Lameter
  2009-04-30 14:14           ` Ingo Molnar
  2009-04-30 13:50         ` Mathieu Desnoyers
  1 sibling, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 13:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> it's all done, implemented and upstream already. You are a bit late
> to the party ;-)

Just looked over it. Yep, now we just need to fix the things that were
messed up.




^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 13:38       ` Ingo Molnar
  2009-04-30 13:40         ` Christoph Lameter
@ 2009-04-30 13:50         ` Mathieu Desnoyers
  2009-04-30 13:55           ` Christoph Lameter
  2009-04-30 14:32           ` Ingo Molnar
  1 sibling, 2 replies; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-30 13:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Christoph Lameter <cl@linux.com> wrote:
> 
> > On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> > 
> > > to have a good look at the memory accounting code. We could 
> > > probably benefit of Christoph Lameter's cpu ops (using segment 
> > > registers to address per-cpu variables with atomic inc/dec) in 
> > > there. Or at least removing interrupt disabling by using preempt 
> > > disable and local_t variables for the per-cpu counters could 
> > > bring some benefit.
> > 
> > Guess we are ready for atomic per cpu ops now that the new per cpu 
> > allocator is in? Segment register issues with the PDA are also 
> > solved right?
> 
> it's all done, implemented and upstream already. You are a bit late 
> to the party ;-)
> 
> 	Ingo

Or way too early, depending on the point of view. :-)

e.g.
http://lkml.org/lkml/2008/5/30/3

I think Christoph deserves credits for pioneering this area with fresh
ideas.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 13:50         ` Mathieu Desnoyers
@ 2009-04-30 13:55           ` Christoph Lameter
  2009-04-30 14:32           ` Ingo Molnar
  1 sibling, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 13:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev

On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:

> I think Christoph deserves credits for pioneering this area with fresh
> ideas.

Thanks. I like ideas, but not so much the follow-through on them (which
also got a bit problematic in this case because I was changing companies
while this was going on). The work that Tejun has done on this is quite
good as far as I can see.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 13:38           ` Christoph Lameter
@ 2009-04-30 14:10             ` Ingo Molnar
  2009-04-30 14:12             ` Mathieu Desnoyers
  1 sibling, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 14:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Tejun Heo, Nick Piggin, Peter Zijlstra,
	Yuriy Lalym, Linux Kernel Mailing List, ltt-dev, Andrew Morton,
	thomas.pi, Linus Torvalds


* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> > > I see however that it's only guaranteed to be atomic wrt preemption.
> >
> > That's really only true for the non-x86 fallback defines. If we so
> > decide, we could make the fallbacks in asm-generic/percpu.h irq-safe
> 
> The fallbacks have different semantics and therefore we cannot 
> rely on irq safeness in the core code when using the x86 cpu ops.

Well it's irq and preempt safe on x86.

It's preempt-safe on other architectures - but the fallback is not 
irq-safe on other architectures. That is remedied easily via the 
patch below. (Note: totally untested)

	Ingo

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1581ff2..6b3984a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -139,17 +139,23 @@ static inline void free_percpu(void *p)
 #ifndef percpu_read
 # define percpu_read(var)						\
   ({									\
+	unsigned long flags;						\
 	typeof(per_cpu_var(var)) __tmp_var__;				\
-	__tmp_var__ = get_cpu_var(var);					\
-	put_cpu_var(var);						\
+									\
+	local_irq_save(flags);						\
+	__tmp_var__ = __get_cpu_var(var);				\
+	local_irq_restore(flags);					\
 	__tmp_var__;							\
   })
 #endif
 
 #define __percpu_generic_to_op(var, val, op)				\
 do {									\
-	get_cpu_var(var) op val;					\
-	put_cpu_var(var);						\
+	unsigned long flags;						\
+									\
+	local_irq_save(flags);						\
+	__get_cpu_var(var) op val;					\
+	local_irq_restore(flags);					\
 } while (0)
 
 #ifndef percpu_write

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 13:38           ` Christoph Lameter
  2009-04-30 14:10             ` Ingo Molnar
@ 2009-04-30 14:12             ` Mathieu Desnoyers
  2009-04-30 14:12               ` Christoph Lameter
  1 sibling, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-30 14:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Nick Piggin, Peter Zijlstra,
	Linux Kernel Mailing List, Yuriy Lalym, Tejun Heo, ltt-dev,
	Andrew Morton, thomas.pi, Linus Torvalds

* Christoph Lameter (cl@linux.com) wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> > > I see however that it's only guaranteed to be atomic wrt preemption.
> >
> > That's really only true for the non-x86 fallback defines. If we so
> > decide, we could make the fallbacks in asm-generic/percpu.h irq-safe
> 
> The fallbacks have different semantics and therefore we cannot rely on
> irq safeness in the core code when using the x86 cpu ops.
> 
> > nmi-safe isn't a big issue (we have no NMI code that interacts with
> > MM counters) - and we could make them irq-safe by fixing the
> > wrapper. (and on x86 they are NMI-safe too.)
> 
> There are also contexts in which you already are preempt-safe and where the
> per cpu ops do not need to go through the preemption hoops.
> 
> This means it would be best to have 3 variants for 3 different contexts in
> the core code:
> 
> 1. Need irq safety
> 2. Need preempt safety
> 3. We know the operation is safe because preemption has already been
> disabled or irqs are already disabled.
> 
> The 3 variants on x86 generate the same instructions. On other platforms
> they would need to be able to fallback in various way depending on the
> availability of instructions that are atomic vs. preempt or irqs.
> 

The problem here, as we figured out a while ago when working on the
atomic slub patches, is that if we have the following code :

local_irq_save
var++
var++
local_irq_restore

that we would like to turn into irq-safe percpu variant with this
semantic :

percpu_add_irqsafe(var)
percpu_add_irqsafe(var)

We are generating two irq save/restore in the fallback, which will be
slow.

However, we could do the following trick :

percpu_irqsave(flags);
percpu_add_irq(var);
percpu_add_irq(var);
percpu_irqrestore(flags);

And we could require that percpu_*_irq operations are put within an
irq-safe section. The fallback would disable interrupts, but
arch-specific irq-safe atomic implementations would replace this by
nops.

And if interrupts are already disabled, percpu_add_irq could be used
directly. There is no need to duplicate the primitives (no
_percpu_add_irq() needed). Same could apply to preempt-safety :

percpu_preempt_disable();
percpu_add(var);
percpu_add(var);
percpu_preempt_enable();

The requirement on percpu_add would be that it is called within a
percpu_preempt_disable/percpu_preempt_enable section, or with preemption
already known to be disabled.

Same thing could apply to bh. But I don't see any difference between
percpu_add_bh and percpu_add_irq, except maybe on architectures which
would use tri-values :

percpu_bh_disable();
percpu_add_bh(var);
percpu_add_bh(var);
percpu_bh_enable();

Thoughts ?

Mathieu

> http://thread.gmane.org/gmane.linux.kernel.cross-arch/1124
> http://lwn.net/Articles/284526/
> 
> 
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:12             ` Mathieu Desnoyers
@ 2009-04-30 14:12               ` Christoph Lameter
  2009-04-30 19:41                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 14:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Nick Piggin, Peter Zijlstra,
	Linux Kernel Mailing List, Yuriy Lalym, Tejun Heo, ltt-dev,
	Andrew Morton, thomas.pi, Linus Torvalds

On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:

> > The 3 variants on x86 generate the same instructions. On other platforms
> > they would need to be able to fallback in various way depending on the
> > availability of instructions that are atomic vs. preempt or irqs.
> >
>
> The problem here, as we figured out a while ago when working on the
> atomic slub patches, is that if we have the following code :
>
> local_irq_save
> var++
> var++
> local_irq_restore
>
> that we would like to turn into irq-safe percpu variant with this
> semantic :
>
> percpu_add_irqsafe(var)
> percpu_add_irqsafe(var)
>
> We are generating two irq save/restore in the fallback, which will be
> slow.
>
> However, we could do the following trick :
>
> percpu_irqsave(flags);
> percpu_add_irq(var);
> percpu_add_irq(var);
> percpu_irqrestore(flags);

Hmmm. I do not remember any of those double ops in the patches that I did a
while back for this. It does not make sense either because atomic per cpu
ops are only atomic for a single instruction. You are trying to extend
that so that multiple "atomic" instructions are now atomic.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 13:40         ` Christoph Lameter
@ 2009-04-30 14:14           ` Ingo Molnar
  2009-04-30 14:15             ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 14:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> > it's all done, implemented and upstream already. You are a bit late
> > to the party ;-)
> 
> Just looked over it. Yep, now we just need to fix the things that 
> were messed up.

Could we please skip over the lengthy flamewar and bad-mouthing of 
other people's work and go straight to the constructive portion of 
the discussion? ;-)

The patch below makes the fallback/slowpath irq safe.

	Ingo

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1581ff2..6b3984a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -139,17 +139,23 @@ static inline void free_percpu(void *p)
 #ifndef percpu_read
 # define percpu_read(var)						\
   ({									\
+	unsigned long flags;						\
 	typeof(per_cpu_var(var)) __tmp_var__;				\
-	__tmp_var__ = get_cpu_var(var);					\
-	put_cpu_var(var);						\
+									\
+	local_irq_save(flags);						\
+	__tmp_var__ = __get_cpu_var(var);				\
+	local_irq_restore(flags);					\
 	__tmp_var__;							\
   })
 #endif
 
 #define __percpu_generic_to_op(var, val, op)				\
 do {									\
-	get_cpu_var(var) op val;					\
-	put_cpu_var(var);						\
+	unsigned long flags;						\
+									\
+	local_irq_save(flags);						\
+	__get_cpu_var(var) op val;					\
+	local_irq_restore(flags);					\
 } while (0)
 
 #ifndef percpu_write

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:14           ` Ingo Molnar
@ 2009-04-30 14:15             ` Christoph Lameter
  2009-04-30 14:38               ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 14:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> The patch below makes the fallback/slowpath irq safe.

Yes, but sometimes you are already irq-safe, and such a fallback would
create significant irq enable/disable and stack operation overhead for
architectures that are using the fallback.

I think we really need another __xxx op here. Especially since these
operations are often in critical code paths.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 13:50         ` Mathieu Desnoyers
  2009-04-30 13:55           ` Christoph Lameter
@ 2009-04-30 14:32           ` Ingo Molnar
  2009-04-30 14:42             ` Christoph Lameter
  2009-04-30 16:03             ` [ltt-dev] " Mathieu Desnoyers
  1 sibling, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 14:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Christoph Lameter, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Christoph Lameter <cl@linux.com> wrote:
> > 
> > > On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> > > 
> > > > to have a good look at the memory accounting code. We could 
> > > > probably benefit of Christoph Lameter's cpu ops (using segment 
> > > > registers to address per-cpu variables with atomic inc/dec) in 
> > > > there. Or at least removing interrupt disabling by using preempt 
> > > > disable and local_t variables for the per-cpu counters could 
> > > > bring some benefit.
> > > 
> > > Guess we are ready for atomic per cpu ops now that the new per cpu 
> > > allocator is in? Segment register issues with the PDA are also 
> > > solved right?
> > 
> > it's all done, implemented and upstream already. You are a bit late 
> > to the party ;-)
> > 
> > 	Ingo
> 
> Or way too early, depending on the point of view. :-)
> 
> e.g.
> http://lkml.org/lkml/2008/5/30/3
> 
> I think Christoph deserves credits for pioneering this area with fresh
> ideas.

Ok, i didn't want to go there - but let me correct this version of 
history.

Christoph's zero-based x86 percpu patches were incomplete and never 
worked reliably - Christoph unfortunately never addressed the 
bugs/crashes i reported. (and Mike Travis and I injected quite a 
bit of testing into it) There were two failed attempts to productize 
them and the patches just bitrotted for more than a year.

Tejun on the other hand fixed those problems (four of Christoph's 
patches survived more or less and were credited to Christoph) and 
did more than 50 highly delicate patches of far larger complexity to 
solve the _whole_ problem range - within a two months timeframe.

Ideas and half-done patches covering <10% of the work needed are not 
enough. Being able to implement it and productize it is the real 
deal, in my book.

Thanks goes to Christoph (and Rusty) for coming up with the idea, 
but it would be manifestly unfair to not send 90% of the kudos to 
Tejun for turning it all into reality and fixing all the other 
problems and redesigning almost all the x86 percpu code in the 
process! ;-)

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:15             ` Christoph Lameter
@ 2009-04-30 14:38               ` Ingo Molnar
  2009-04-30 14:45                 ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 14:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> > The patch below makes the fallback/slowpath irq safe.
> 
> Yes, but sometimes you are already irq-safe, and such a fallback
> would create significant irq enable/disable and stack operation
> overhead for architectures that are using the fallback.

It's a fallback slowpath - non-x86 architectures should still fill 
in a real implementation of course.

> I think we really need another __xxx op here. Especially since 
> these operations are often in critical code paths.

That's a recipe for fragility: using __xxx will still be 
irq-safe on x86, and 95% of the testing is done on x86, so this 
opens up the path to non-x86 bugs.

So we first have to see the list of architectures that _cannot_ 
implement an irq-safe op here via a single machine instruction.
x86, ia64 and powerpc should be fine.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:32           ` Ingo Molnar
@ 2009-04-30 14:42             ` Christoph Lameter
  2009-04-30 14:59               ` Ingo Molnar
  2009-04-30 16:03             ` [ltt-dev] " Mathieu Desnoyers
  1 sibling, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 14:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> Christoph's zero-based x86 percpu patches were incomplete and never
> worked reliably - Christoph unfortunately never addressed the
> bugs/crashes i reported. (and Mike Travis and me injected quite a
> bit of testing into it) There were two failed attempts to productize
> them and the patches just bitrotted for more than a year.

Sorry not my issue. Mike took over that work on the x86 arch.

I am fine with how things developed. The core patches fell off the
priority list after my job change. We just need to get the things that
are not right straightened out now.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:38               ` Ingo Molnar
@ 2009-04-30 14:45                 ` Christoph Lameter
  2009-04-30 15:01                   ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 14:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> > Yes, but sometimes you are already irq-safe, and such a fallback
> > would create significant irq enable/disable and stack operation
> > overhead for architectures that are using the fallback.
>
> It's a fallback slowpath - non-x86 architectures should still fill
> in a real implementation of course.

Arch code cannot provide an effective implementation, since it
always has to assume that interrupts need to be disabled, if we stay with
the current implementation.

> So we first have to see the list of architectures that _cannot_
> implement an irq-safe op here via a single machine instruction.
> x86, ia64 and powerpc should be fine.

Look at ia64, sparc, s390, powerpc. They can fall back to atomic ops, but
those are very inefficient on some of these platforms. Since these are
performance critical, they will need to be optimized depending on the
context of their use in the core.





^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:42             ` Christoph Lameter
@ 2009-04-30 14:59               ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 14:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> I am fine with how things developed. The core patches fell off the 
> priority list after job change. Just get the things that are not 
> right now straightened out.

Regarding the __xxx ops i'm on two minds really. If there's a 
significant architecture that really cannot live without them (if
it does not have the proper machine instructions _and_ local irq 
disabling is unreasonably expensive) then i guess we could (and 
should) add them.

Otherwise we should just try to be as simple as possible.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:45                 ` Christoph Lameter
@ 2009-04-30 15:01                   ` Ingo Molnar
  2009-04-30 15:25                     ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> > > Yes, but sometimes you are already irq-safe, and such a fallback
> > > would create significant irq enable/disable and stack operation
> > > overhead for architectures that are using the fallback.
> >
> > It's a fallback slowpath - non-x86 architectures should still fill
> > in a real implementation of course.
> 
> Arch code cannot provide an effective implementation, since it
> always has to assume that interrupts need to be disabled, if we stay with
> the current implementation.
> 
> > So we first have to see the list of architectures that _cannot_
> > implement an irq-safe op here via a single machine instruction.
> > x86, ia64 and powerpc should be fine.
> 
> Look at ia64, sparc, s390, powerpc. They can fall back to atomic
> ops, but those are very inefficient on some of these platforms.
> Since these are performance critical, they will need to be
> optimized depending on the context of their use in the core.

Could you cite a specific example / situation where you'd use __xxx 
ops?

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 15:01                   ` Ingo Molnar
@ 2009-04-30 15:25                     ` Christoph Lameter
  2009-04-30 15:42                       ` Ingo Molnar
                                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 15:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> Could you cite a specific example / situation where you'd use __xxx
> ops?

Well the cpu alloc v3 patchset (already cited) included numerous
locations. Sure wish we could move much of the core stats over to percpu
ops. Here is a list of some stats patches that do not need irq disable.

http://article.gmane.org/gmane.linux.kernel.cross-arch/1128
http://article.gmane.org/gmane.linux.kernel.cross-arch/1132
http://article.gmane.org/gmane.linux.kernel.cross-arch/1134
http://article.gmane.org/gmane.linux.kernel.cross-arch/1138
http://article.gmane.org/gmane.linux.kernel.cross-arch/1139
http://article.gmane.org/gmane.linux.kernel.cross-arch/1145 VM stats
http://article.gmane.org/gmane.linux.kernel.cross-arch/1160 NFS stats
http://article.gmane.org/gmane.linux.kernel.cross-arch/1161 Genhd stats
http://article.gmane.org/gmane.linux.kernel.cross-arch/1163 Socket inuse counter
http://article.gmane.org/gmane.linux.kernel.cross-arch/1164 SRCU


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 15:25                     ` Christoph Lameter
@ 2009-04-30 15:42                       ` Ingo Molnar
  2009-04-30 15:44                         ` Christoph Lameter
  2009-04-30 16:13                         ` Linus Torvalds
  2009-04-30 15:54                       ` Ingo Molnar
  2009-04-30 16:00                       ` Ingo Molnar
  2 siblings, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 15:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> > Could you cite a specific example / situation where you'd use __xxx
> > ops?
> 
> Well the cpu alloc v3 patchset (already cited) included numerous 
> locations. Sure wish we could move much of the core stats over to 
> percpu ops. Here is a list of some stats patches that do not need 
> irq disable.
> 
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1128
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1132
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1134
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1138
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1139
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1145 VM stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1160 NFS stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1161 Genhd stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1163 Socket inuse counter
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1164 SRCU

nice. Do these all get called with irqs off, all the time?

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 15:42                       ` Ingo Molnar
@ 2009-04-30 15:44                         ` Christoph Lameter
  2009-04-30 16:06                           ` Ingo Molnar
  2009-04-30 16:13                         ` Linus Torvalds
  1 sibling, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 15:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> nice. Do these all get called with irqs off, all the time?

In most cases, yes. The VM stats patch includes an example where we may not
care too much about an event counter missing a beat once in a while on
platforms not supporting atomic per cpu ops. I know this affects ia64. The
cost of an atomic operation for an event counter update (which would have
avoided the potential of a concurrent update) was not justifiable.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 15:25                     ` Christoph Lameter
  2009-04-30 15:42                       ` Ingo Molnar
@ 2009-04-30 15:54                       ` Ingo Molnar
  2009-04-30 16:00                       ` Ingo Molnar
  2 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 15:54 UTC (permalink / raw)
  To: Christoph Lameter, Eric Dumazet
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> http://article.gmane.org/gmane.linux.kernel.cross-arch/1163 Socket inuse counter

btw., this one is already converted upstream, via:

  4e69489: socket: use percpu_add() while updating sockets_in_use

done by Eric Dumazet.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 15:25                     ` Christoph Lameter
  2009-04-30 15:42                       ` Ingo Molnar
  2009-04-30 15:54                       ` Ingo Molnar
@ 2009-04-30 16:00                       ` Ingo Molnar
  2009-04-30 16:08                         ` Christoph Lameter
  2 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 16:00 UTC (permalink / raw)
  To: Christoph Lameter, Eric Dumazet
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> http://article.gmane.org/gmane.linux.kernel.cross-arch/1128
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1132
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1134
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1138
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1139
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1145 VM stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1160 NFS stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1161 Genhd stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1164 SRCU

The new percpu APIs could be used in most of these places already, 
straight away. This is a really good TODO list for places to 
enhance.

Then a second set of patches could convert percpu_add() / etc. uses 
to __percpu_add() ... but that should be done by those architectures 
that need it (and to the extent they need it), because it's not 
really testable on x86.

I don't really like the PER_CPU / CPU_INC etc. type of all-capitals 
APIs you introduced in the patches above:


+		__CPU_INC(bt->sequence);
+	CPU_FREE(bt->sequence);

was there any strong reason to go outside the well-established 
percpu_* name space and call these primitives as if they were 
macros?

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:32           ` Ingo Molnar
  2009-04-30 14:42             ` Christoph Lameter
@ 2009-04-30 16:03             ` Mathieu Desnoyers
  1 sibling, 0 replies; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-30 16:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo, Christoph Lameter,
	thomas.pi, Linus Torvalds, Andrew Morton

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Ingo Molnar (mingo@elte.hu) wrote:
> > > 
> > > * Christoph Lameter <cl@linux.com> wrote:
> > > 
> > > > On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> > > > 
> > > > > to have a good look at the memory accounting code. We could 
> > > > > probably benefit of Christoph Lameter's cpu ops (using segment 
> > > > > registers to address per-cpu variables with atomic inc/dec) in 
> > > > > there. Or at least removing interrupt disabling by using preempt 
> > > > > disable and local_t variables for the per-cpu counters could 
> > > > > bring some benefit.
> > > > 
> > > > Guess we are ready for atomic per cpu ops now that the new per cpu 
> > > > allocator is in? Segment register issues with the PDA are also 
> > > > solved right?
> > > 
> > > it's all done, implemented and upstream already. You are a bit late 
> > > to the party ;-)
> > > 
> > > 	Ingo
> > 
> > Or way too early, depending on the point of view. :-)
> > 
> > e.g.
> > http://lkml.org/lkml/2008/5/30/3
> > 
> > I think Christoph deserves credit for pioneering this area with fresh
> > ideas.
> 
> OK, I didn't want to go there - but let me correct this version of 
> history.
> 
> Christoph's zero-based x86 percpu patches were incomplete and never 
> worked reliably - Christoph unfortunately never addressed the 
> bugs/crashes I reported (and Mike Travis and I injected quite a 
> bit of testing into it). There were two failed attempts to productize 
> them and the patches just bitrotted for more than a year.
> 
> Tejun on the other hand fixed those problems (four of Christoph's 
> patches survived more or less and were credited to Christoph) and 
> did more than 50 highly delicate patches of far larger complexity to 
> solve the _whole_ problem range - within a two-month timeframe.
> 
> Ideas and half-done patches covering <10% of the work needed are not 
> enough. Being able to implement it and productize it is the real 
> deal, in my book.
> 
> Thanks goes to Christoph (and Rusty) for coming up with the idea, 
> but it would be manifestly unfair to not send 90% of the kudos to 
> Tejun for turning it all into reality and fixing all the other 
> problems and redesigning almost all the x86 percpu code in the 
> process! ;-)
> 

Then thanks to Christoph, Rusty and Tejun for their respective efforts.
My goal is simply to recognise the credit each of them deserves.

Mathieu

> 	Ingo
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 15:44                         ` Christoph Lameter
@ 2009-04-30 16:06                           ` Ingo Molnar
  2009-04-30 16:11                             ` Christoph Lameter
  2009-04-30 16:16                             ` Linus Torvalds
  0 siblings, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 16:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> > nice. Do these all get called with irqs off, all the time?
> 
> In most cases, yes. The VMstat patch includes an example where we 
> may not care too much about an event counter missing a beat once in 
> a while on platforms not supporting atomic per-cpu ops. I know 
> this affects IA64. The cost of an atomic operation for an event 
> counter update (which would have avoided the potential for a 
> concurrent update) was not justifiable.

when you say "atomics", do you mean the classic meaning of atomics? 
Because there are no classic atomics involved. This is the 
before/after disassembly from Eric's commit 4e69489a0:

before:

c0436274:       b8 01 00 00 00          mov    $0x1,%eax
c0436279:       e8 42 40 df ff          call   c022a2c0 <add_preempt_count>
c043627e:       bb 20 4f 6a c0          mov    $0xc06a4f20,%ebx
c0436283:       e8 18 ca f0 ff          call   c0342ca0 <debug_smp_processor_id>
c0436288:       03 1c 85 60 4a 65 c0    add    -0x3f9ab5a0(,%eax,4),%ebx
c043628f:       ff 03                   incl   (%ebx)
c0436291:       b8 01 00 00 00          mov    $0x1,%eax
c0436296:       e8 75 3f df ff          call   c022a210 <sub_preempt_count>
c043629b:       89 e0                   mov    %esp,%eax
c043629d:       25 00 e0 ff ff          and    $0xffffe000,%eax
c04362a2:       f6 40 08 08             testb  $0x8,0x8(%eax)
c04362a6:       75 07                   jne    c04362af <sock_alloc+0x7f>
c04362a8:       8d 46 d8                lea    -0x28(%esi),%eax
c04362ab:       5b                      pop    %ebx
c04362ac:       5e                      pop    %esi
c04362ad:       c9                      leave
c04362ae:       c3                      ret
c04362af:       e8 cc 5d 09 00          call   c04cc080 <preempt_schedule>
c04362b4:       8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
c04362b8:       eb ee                   jmp    c04362a8 <sock_alloc+0x78>

after:

c0436275:   64 83 05 20 5f 6a c0    addl   $0x1,%fs:0xc06a5f20

There are no atomic instructions at all - the counters here are only 
accessed locally. They are local-irq-atomic, but not 
cacheline-atomic.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 16:00                       ` Ingo Molnar
@ 2009-04-30 16:08                         ` Christoph Lameter
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 16:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Mathieu Desnoyers, Linus Torvalds, Andrew Morton,
	Nick Piggin, KOSAKI Motohiro, Peter Zijlstra, thomas.pi,
	Yuriy Lalym, Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> The new percpu APIs could be used in most of these places already,
> straight away. This is a really good TODO list for places to
> enhance.

Please look at the full list in the cpu alloc v3 patchset and not only
at those that I listed here.

> Then a second set of patches could convert percpu_add() / etc. uses
> to __percpu_add() ... but that should be done by those architectures
> that need it (and to the extent they need it), because it's not
> really testable on x86.

OK, so we convert it and wait until the arch maintainers complain? I
definitely know that there is an IA64 issue with vm statistics.

> I dont really like the PER_CPU / CPU_INC etc. type of all-capitals
> APIs you introduced in the patches above:

I know. Patches would have to be redone against whatever API we agree on.

>
> +		__CPU_INC(bt->sequence);
> +	CPU_FREE(bt->sequence);
>
> was there any strong reason to go outside the well-established
> percpu_* name space and call these primitives as if they were
> macros?

They are macros and may do weird things with the variables. This goes back
to our disagreement last year on caps/lower case. I still think this kind
of preprocessor magic should be uppercase.

The reason not to use the percpu_* names was that they were x86
arch-specific (and thus not generally available) and did not differentiate
in terms of the irq/preemption context.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 16:06                           ` Ingo Molnar
@ 2009-04-30 16:11                             ` Christoph Lameter
  2009-04-30 16:16                             ` Linus Torvalds
  1 sibling, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 16:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> > may not care too much about a event counter missing a beat once in
> > a while for platforms not supporting atomic per cpu ops. I know
> > this affects IA64. The cost of an atomic operations for an event
> > counter update (which would have avoided the potential of a
> > concurrent update) was not justifiable.
>
> when you say "atomics", do you mean the classic meaning of atomics?
> Because there are no classic atomics involved. This is the
> before/after disassembly from Eric's commit 4e69489a0:

The fallback for IA64 would be to use full (classic) atomic operations
(fetchadd) instead of the fast irq-atomic per-cpu ops available on x86.

> c0436275:   64 83 05 20 5f 6a c0    addl   $0x1,%fs:0xc06a5f20
>
> There's no atomic instructions at all - the counters here are only
> accessed locally. They are local-irq-atomic, but not
> cacheline-atomic.

Right, but that is not available on IA64. So for the fallback one must
choose between manually disabling interrupts and then incrementing the
counter (a long code sequence) and a classic atomic operation.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 15:42                       ` Ingo Molnar
  2009-04-30 15:44                         ` Christoph Lameter
@ 2009-04-30 16:13                         ` Linus Torvalds
  1 sibling, 0 replies; 62+ messages in thread
From: Linus Torvalds @ 2009-04-30 16:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo



On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> nice. Do these all get called with irqs off, all the time?

There are lots of non-irq-off cases where non-atomic counters are safe. A 
couple of reasons:

 - counters that just don't care enough. Some statistics are very 
   important to always be exact, others are "helpful" but performance is 
   more important than being exactly right.

 - counters simply not accessed (or at least changed) from interrupts at 
   all. This is very common for some cases, notably "user event" stats. 
   They may need preemption support, but nothing fancier.

 - counters that are "idempotent" wrt interrupts. For example, anything 
   that will always count up and then down again in a paired fashion 
   (think kernel lock counter, preemption counter etc) is inherently safe 
   as a per-cpu counter without needing any protection at all, since any 
   interrupts may _change_ them, but will always change them back, so even 
   if a non-atomic sequence gets interrupted, it doesn't matter.

In fact, I'd argue that for a per-cpu counter, the whole _point_ of the 
exercise is almost always performance, so locked sequences would be bad to 
assume. The fact that x86 can do "atomic" per-cpu accesses with basically 
zero cost (by virtue of its rmw memory ops) is unusual.

Most other architectures will have a very hard time making such counters 
cheap, since for them it's either "irq atomicity" (expensive irq disable or 
store-conditional) or "SMP atomicity" (the same fairly expensive 
store-conditional, or something even worse).

				Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 16:06                           ` Ingo Molnar
  2009-04-30 16:11                             ` Christoph Lameter
@ 2009-04-30 16:16                             ` Linus Torvalds
  2009-04-30 17:23                               ` Ingo Molnar
  1 sibling, 1 reply; 62+ messages in thread
From: Linus Torvalds @ 2009-04-30 16:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo



On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> c0436275:   64 83 05 20 5f 6a c0    addl   $0x1,%fs:0xc06a5f20
> 
> There are no atomic instructions at all - the counters here are only 
> accessed locally. They are local-irq-atomic, but not 
> cacheline-atomic.

On other architectures, you need the whole "disable preemption, 
load-locked, store-conditional, test-and-loop, enable preemption" thing. 

Or "disable interrupts, load, store, restore interrupts".

There really aren't very many architectures that can do almost 
unrestricted ALU ops in a single instruction (and thus automatically safe 
from preemption and interrupts).

			Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 16:16                             ` Linus Torvalds
@ 2009-04-30 17:23                               ` Ingo Molnar
  2009-04-30 18:07                                 ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-04-30 17:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> > 
> > c0436275:   64 83 05 20 5f 6a c0    addl   $0x1,%fs:0xc06a5f20
> > 
> > There are no atomic instructions at all - the counters here are 
> > only accessed locally. They are local-irq-atomic, but not 
> > cacheline-atomic.
> 
> On other architectures, you need the whole "disable preemption, 
> load-locked, store-conditional, test-and-loop, enable preemption" 
> thing.
> 
> Or "disable interrupts, load, store, restore interrupts".
> 
> There really aren't very many architectures that can do almost 
> unrestricted ALU ops in a single instruction (and thus 
> automatically safe from preemption and interrupts).

Maybe what we should do then is go back to the very first version of 
commit 6dbde35308: declaredly make percpu_arith_op() non-irq-atomic 
(and non-preempt-atomic) everywhere. The commit's internal changelog 
still says:

        * made generic percpu ops atomic against preemption

So we introduced preemption-safety in the v2 version of that commit.

This non-atomicity will 1) either not matter, 2) be irq-atomic by 
virtue of being within a critical section, or 3) be made atomic 
in the few remaining cases.

And maybe, at most, introduce an opt-in API: percpu_add_irqsafe().

Right?

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 17:23                               ` Ingo Molnar
@ 2009-04-30 18:07                                 ` Christoph Lameter
  2009-05-01 19:59                                   ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 18:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Thu, 30 Apr 2009, Ingo Molnar wrote:

> So we introduced preemption-safety in the v2 version of that commit.
>
> This non-atomicity will 1) either not matter 2) will be irq-atomic
> by virtue of being within a critical section 3) can be made atomic
> in the few remaining cases.
>
> And maybe, at most, introduce an opt-in API: percpu_add_irqsafe().

Plus percpu_add_preemptsafe().



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 14:12               ` Christoph Lameter
@ 2009-04-30 19:41                 ` Mathieu Desnoyers
  2009-04-30 20:17                   ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-30 19:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Nick Piggin, Peter Zijlstra,
	Linux Kernel Mailing List, Yuriy Lalym, Tejun Heo, ltt-dev,
	Andrew Morton, Linus Torvalds

* Christoph Lameter (cl@linux.com) wrote:
> On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:
> 
> > > The 3 variants on x86 generate the same instructions. On other platforms
> > > they would need to be able to fall back in various ways depending on the
> > > availability of instructions that are atomic vs. preemption or irqs.
> > >
> >
> > The problem here, as we did figure out a while ago with the atomic
> > slub we worked on a while ago, is that if we have the following code :
> >
> > local_irq_save
> > var++
> > var++
> > local_irq_restore
> >
> > that we would like to turn into irq-safe percpu variant with this
> > semantic :
> >
> > percpu_add_irqsafe(var)
> > percpu_add_irqsafe(var)
> >
> > We are generating two irq save/restore in the fallback, which will be
> > slow.
> >
> > However, we could do the following trick :
> >
> > percpu_irqsave(flags);
> > percpu_add_irq(var);
> > percpu_add_irq(var);
> > percpu_irqrestore(flags);
> 
> Hmmm. I do not remember any of those double ops in the patches that I did a
> while back for this. It does not make sense either, because atomic per-cpu
> ops are only atomic for a single instruction. You are trying to extend
> that so that multiple "atomic" instructions are atomic as a group.
> 

Hrm, not exactly. So I probably chose the naming of the primitives
poorly here if my idea seems unclear. Here is what I am trying to do:

On architectures with an irq-safe percpu_add:
- No need to disable interrupts at all.

On archs lacking such an irq-safe percpu_add:
- Disable interrupts only once for a whole sequence of percpu counter operations.

I tried to come up with an example in vmstat where multiple percpu ops
would be required, but I figured out that the code needs to be changed
to support percpu ops correctly. However, separating
percpu_irqsave/restore from percpu_add_return_irq lets us express
__inc_zone_state and inc_zone_state cleanly, which would be difficult
otherwise.

Let's assume we change the stat_threshold values in mm/vmstat.c so they
become powers of 2 (so we don't care when the u8 overflow occurs; the
counter becomes free-running). This model does not support
"overstep" (yet).

Then assume:

        u8 stat_threshold_mask = pcp->stat_threshold - 1;

mm/vmstat.c :

void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
        ... assuming p references percpu "u8" counters ...
        u8 p_new;

        p_new = percpu_add_return_irq(p, 1);

        if (unlikely(!(p_new & pcp->stat_threshold_mask)))
                zone_page_state_add(pcp->stat_threshold, zone, item);
}


void inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
        unsigned long flags;

        /*
         * Disabling interrupts _only_ on architectures lacking atomic
         * percpu_*_irq ops.
         */
        percpu_irqsave(flags);
        __inc_zone_state(zone, item);
        percpu_irqrestore(flags);
}

void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
        ... assuming p references percpu "u8" counters ...
        u8 p_new;

        p_new = percpu_sub_return_irq(p, 1);

        if (unlikely(!(p_new & pcp->stat_threshold_mask)))
                zone_page_state_add(-(pcp->stat_threshold), zone, item);
}

void dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
        unsigned long flags;

        /*
         * Disabling interrupts _only_ on architectures lacking atomic
         * percpu_*_irq ops.
         */
        percpu_irqsave(flags);
        __dec_zone_state(zone, item);
        percpu_irqrestore(flags);
}

void __mod_zone_state(struct zone *zone, enum zone_stat_item item, long delta)
{
        ... assuming p references percpu "u8" counters ...

        u8 p_new;
        long overflow_delta;

        p_new = percpu_add_return_irq(p, delta);

        /*
         * We must count the number of threshold overflows generated by
         * "delta". I know, this looks rather odd.
         */
        overflow_delta = ((long)p_new & ~(long)pcp->stat_threshold_mask)
                         - (((long)p_new - delta)
                           & ~(long)pcp->stat_threshold_mask);

        if (unlikely(abs(overflow_delta) > pcp->stat_threshold_mask))
                zone_page_state_add(overflow_delta, zone, item);
}

void mod_zone_state(struct zone *zone, enum zone_stat_item item, long delta)
{
        unsigned long flags;

        /*
         * Disabling interrupts _only_ on architectures lacking atomic
         * percpu_*_irq ops.
         */
        percpu_irqsave(flags);
        __mod_zone_state(zone, item, delta);
        percpu_irqrestore(flags);
}

Note that all the fast-path would execute with preemption enabled if the
architecture supports irqsave percpu atomic ops.

So as we can see, if per-cpu ops are used on _different_ atomic counters,
a sequence of multiple percpu ops may be required. However, in the
vmstat case, given that the version currently in mainline uses a sequence
of operations on the same variable, this requires re-engineering the
structure, because otherwise races with preemption would occur.

Disclaimer: the code above has been written in an email client and may
not compile/work/etc.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 19:41                 ` Mathieu Desnoyers
@ 2009-04-30 20:17                   ` Christoph Lameter
  2009-04-30 21:17                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-04-30 20:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Nick Piggin, Peter Zijlstra,
	Linux Kernel Mailing List, Yuriy Lalym, Tejun Heo, ltt-dev,
	Andrew Morton, Linus Torvalds

On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:

> On architectures with irq-safe percpu_add :
> - No need to disable interrupts at all

They may have other options, like falling back to classic atomic ops.

> Let's assume we change the stat_threshold values in mm/vmstat.c so they
> become power of 2 (so we don't care when the u8 overflow occurs so it
> becomes a free running counter). This model does not support
> "overstep" (yet).

The problem with such an approach is that the counters may wrap if
sufficiently many processors are in a ZVC update. Let's say a new interrupt
that increments the same counter occurs while irqs are held off. When
interrupts are re-enabled, we increment again from the new interrupt.
We then check for overflow only after the new interrupt that modified the
counter has completed. The differentials must be constant while the ZVC
update functions are running, and with your modifications that is no longer
guaranteed.

> Note that all the fast-path would execute with preemption enabled if the
> architecture supports irqsave percpu atomic ops.

What would work is to use a percpu cmpxchg for the ZVC updates.
I wrote a patch like that over a year ago. There is no percpu_cmpxchg
coming with the percpu ops as of today so we'd need to add that first.

Not sure if this is a performance win though. Complexity increases
somewhat.

---
 include/linux/vmstat.h |   17 ++++--
 mm/vmstat.c            |  122 ++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 127 insertions(+), 12 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2007-11-07 18:36:03.000000000 -0800
+++ linux-2.6/include/linux/vmstat.h	2007-11-07 18:38:27.000000000 -0800
@@ -202,15 +202,22 @@ extern void inc_zone_state(struct zone *
 void __mod_zone_page_state(struct zone *, enum zone_stat_item item, int);
 void __inc_zone_page_state(struct page *, enum zone_stat_item);
 void __dec_zone_page_state(struct page *, enum zone_stat_item);
+void __inc_zone_state(struct zone *, enum zone_stat_item);
+void __dec_zone_state(struct zone *, enum zone_stat_item);

+#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#define inc_zone_page_state __inc_zone_page_state
+#define dec_zone_page_state __dec_zone_page_state
+#define mod_zone_page_state __mod_zone_page_state
+#define inc_zone_state __inc_zone_state
+#define dec_zone_state __dec_zone_state
+#else
 void mod_zone_page_state(struct zone *, enum zone_stat_item, int);
 void inc_zone_page_state(struct page *, enum zone_stat_item);
 void dec_zone_page_state(struct page *, enum zone_stat_item);
-
-extern void inc_zone_state(struct zone *, enum zone_stat_item);
-extern void __inc_zone_state(struct zone *, enum zone_stat_item);
-extern void dec_zone_state(struct zone *, enum zone_stat_item);
-extern void __dec_zone_state(struct zone *, enum zone_stat_item);
+void inc_zone_state(struct zone *, enum zone_stat_item);
+void dec_zone_state(struct zone *, enum zone_stat_item);
+#endif

 void refresh_cpu_vm_stats(int);
 #else /* CONFIG_SMP */
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2007-11-07 17:20:16.000000000 -0800
+++ linux-2.6/mm/vmstat.c	2007-11-07 18:55:57.000000000 -0800
@@ -151,8 +151,109 @@ static void refresh_zone_stat_thresholds
 	}
 }

+#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+				int delta)
+{
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
+	s8 old;
+	unsigned long new;
+	unsigned long add;
+
+	do {
+		add = 0;
+		old = *p;
+		new = old + delta;
+
+		if (unlikely(new > pcp->stat_threshold ||
+				new < -pcp->stat_threshold)) {
+			add = new;
+			new = 0;
+		}
+
+	} while (cmpxchg_local(p, old, new) != old);
+
+	if (add)
+		zone_page_state_add(add, zone, item);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+/*
+ * Optimized increment and decrement functions implemented using
+ * cmpxchg_local. These do not require interrupts to be disabled
+ * and enabled.
+ */
+void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
+	int add;
+	unsigned long old;
+	unsigned long new;
+
+	do {
+		add = 0;
+		old = *p;
+		new = old + 1;
+
+		if (unlikely(new > pcp->stat_threshold)) {
+			add = new + pcp->stat_threshold / 2;
+			new = -(pcp->stat_threshold / 2);
+		}
+	} while (cmpxchg_local(p, old, new) != old);
+
+	if (add)
+		zone_page_state_add(add, zone, item);
+}
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	__inc_zone_state(page_zone(page), item);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	s8 *p = pcp->vm_stat_diff + item;
+	int sub;
+	unsigned long old;
+	unsigned long new;
+
+	do {
+		sub = 0;
+		old = *p;
+		new = old - 1;
+
+		if (unlikely(new < - pcp->stat_threshold)) {
+			sub = new - pcp->stat_threshold / 2;
+			new = pcp->stat_threshold / 2;
+		}
+	} while (cmpxchg_local(p, old, new) != old);
+
+	if (sub)
+		zone_page_state_add(sub, zone, item);
+}
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	__dec_zone_state(page_zone(page), item);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void sync_diff(struct zone *zone, struct per_cpu_pageset *p, int i)
+{
+	/*
+	 * xchg_local() would be useful here but that does not exist.
+	 */
+	zone_page_state_add(xchg(&p->vm_stat_diff[i], 0), zone, i);
+}
+
+#else /* CONFIG_FAST_CMPXCHG_LOCAL */
+
 /*
- * For use when we know that interrupts are disabled.
+ * Functions that do not rely on cmpxchg_local
  */
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
@@ -281,6 +382,17 @@ void dec_zone_page_state(struct page *pa
 }
 EXPORT_SYMBOL(dec_zone_page_state);

+static inline void sync_diff(struct zone *zone, struct per_cpu_pageset *p, int i)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	zone_page_state_add(p->vm_stat_diff[i], zone, i);
+	p->vm_stat_diff[i] = 0;
+	local_irq_restore(flags);
+}
+#endif /* !CONFIG_FAST_CMPXCHG_LOCAL */
+
 /*
  * Update the zone counters for one cpu.
  *
@@ -299,7 +411,6 @@ void refresh_cpu_vm_stats(int cpu)
 {
 	struct zone *zone;
 	int i;
-	unsigned long flags;

 	for_each_zone(zone) {
 		struct per_cpu_pageset *p;
@@ -311,15 +422,12 @@ void refresh_cpu_vm_stats(int cpu)

 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
-				local_irq_save(flags);
-				zone_page_state_add(p->vm_stat_diff[i],
-					zone, i);
-				p->vm_stat_diff[i] = 0;
+				sync_diff(zone, p, i);
+
 #ifdef CONFIG_NUMA
 				/* 3 seconds idle till flush */
 				p->expire = 3;
 #endif
-				local_irq_restore(flags);
 			}
 #ifdef CONFIG_NUMA
 		/*

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 20:17                   ` Christoph Lameter
@ 2009-04-30 21:17                     ` Mathieu Desnoyers
  2009-05-01 13:44                       ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-04-30 21:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo, Ingo Molnar,
	Linus Torvalds, Andrew Morton

* Christoph Lameter (cl@linux.com) wrote:
> On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:
> 
> > On architectures with irq-safe percpu_add :
> > - No need to disable interrupts at all
> 
> They may have other options like fall back to classic atomic ops.
> 
> > Let's assume we change the stat_threshold values in mm/vmstat.c so they
> > become power of 2 (so we don't care when the u8 overflow occurs so it
> > becomes a free running counter). This model does not support
> > "overstep" (yet).
> 
> The problem with such an approach is that the counters may wrap if
> sufficiently many processors are in a ZVC update.

By ZVC update, you mean Zone ... Counter update ? (which code exactly ?)

>  Let's say a new interrupt
> is occurring while irq is held off that increments the same counter. When
> interrupts are reenabled, we increment again from the new interrupt.
> We then check for overflow only after the new interrupt that modified the
> counter has completed. The differentials must be constant while the ZVC
> update functions are running and with your modifications that is no longer
> guaranteed.
> 

Hrm, I must admit I'm not sure I follow how your reasoning applies to my
code. I am using a percpu_add_return_irq() exactly for this reason : it
only ever touches the percpu variable once and atomically. The test for
overflow is done on the value returned by percpu_add_return_irq().

Therefore, an interrupt scenario that would be close to what I
understand from your concerns would be :

* Thread A

inc_zone_page_state()
  p_ret = percpu_add_return(p, 1); (let's suppose this increment
                                    overflows the threshold, therefore
                                    (p_ret & mask) == 0)

----> interrupt comes in, preempts the current thread, execution in a
      different thread context (* Thread B) :

     inc_zone_page_state()
       p_ret = percpu_add_return(p, 1);  ((p_ret & mask) == 1)
       if (!(p_ret & mask))
         increment global zone count. (not executed)

----> interrupt comes in, preempts the current thread, execution back to
      the original thread context (Thread A), on the same or on a
      different CPU :

  if (!(p_ret & mask))
    increment global zone count.   -----> will therefore increment the
                                          global zone count only after
                                          scheduling back the original
                                          thread.

So I guess what you say here is that if Thread B is preempted for too
long, we will have to wait until it gets scheduled back before the
global count is incremented. Do we really need such degree of precision
for those counters ?

(I fear I'm not understanding your concern fully though)
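
For illustration, the scenario above can be modeled sequentially in
user-space C. "Thread A" obtains its return value from the atomic percpu
add, is interrupted before testing it, "Thread B" runs a complete
increment, then A resumes. The threshold (4) and all names here are
illustrative stand-ins for the example, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define THRESHOLD 4u		/* power of two, made up for the example */
#define MASK (THRESHOLD - 1)

static uint8_t diff;		/* stands in for the percpu differential */
static long global_count;	/* stands in for the global zone counter */

/* Models percpu_add_return(p, 1): one atomic touch of the percpu var. */
static uint8_t percpu_add_return_sim(void)
{
	return ++diff;
}

/* The deferred half of inc_zone_page_state(): test the returned value. */
static void finish_update(uint8_t p_ret)
{
	if (!(p_ret & MASK))
		global_count += THRESHOLD;
}
```

Even with B's increment interleaved between A's add and A's test, the
global counter is bumped exactly once, just later than usual.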

> > Note that all the fast-path would execute with preemption enabled if the
> > architecture supports irqsave percpu atomic ops.
> 
> What would work is to use a percpu cmpxchg for the ZVC updates.
> I wrote a patch like that over a year ago. There is no percpu_cmpxchg
> coming with the percpu ops as of today so we'd need to add that first.
> 
> Not sure if this is a performance win though. Complexity increases
> somewhat.
> 

hrm, yes, the patch below clearly does not win the low-complexity
contest. :)

Mathieu

> ---
>  include/linux/vmstat.h |   17 ++++--
>  mm/vmstat.c            |  122 ++++++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 127 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h	2007-11-07 18:36:03.000000000 -0800
> +++ linux-2.6/include/linux/vmstat.h	2007-11-07 18:38:27.000000000 -0800
> @@ -202,15 +202,22 @@ extern void inc_zone_state(struct zone *
>  void __mod_zone_page_state(struct zone *, enum zone_stat_item item, int);
>  void __inc_zone_page_state(struct page *, enum zone_stat_item);
>  void __dec_zone_page_state(struct page *, enum zone_stat_item);
> +void __inc_zone_state(struct zone *, enum zone_stat_item);
> +void __dec_zone_state(struct zone *, enum zone_stat_item);
> 
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +#define inc_zone_page_state __inc_zone_page_state
> +#define dec_zone_page_state __dec_zone_page_state
> +#define mod_zone_page_state __mod_zone_page_state
> +#define inc_zone_state __inc_zone_state
> +#define dec_zone_state __dec_zone_state
> +#else
>  void mod_zone_page_state(struct zone *, enum zone_stat_item, int);
>  void inc_zone_page_state(struct page *, enum zone_stat_item);
>  void dec_zone_page_state(struct page *, enum zone_stat_item);
> -
> -extern void inc_zone_state(struct zone *, enum zone_stat_item);
> -extern void __inc_zone_state(struct zone *, enum zone_stat_item);
> -extern void dec_zone_state(struct zone *, enum zone_stat_item);
> -extern void __dec_zone_state(struct zone *, enum zone_stat_item);
> +void inc_zone_state(struct zone *, enum zone_stat_item);
> +void dec_zone_state(struct zone *, enum zone_stat_item);
> +#endif
> 
>  void refresh_cpu_vm_stats(int);
>  #else /* CONFIG_SMP */
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2007-11-07 17:20:16.000000000 -0800
> +++ linux-2.6/mm/vmstat.c	2007-11-07 18:55:57.000000000 -0800
> @@ -151,8 +151,109 @@ static void refresh_zone_stat_thresholds
>  	}
>  }
> 
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> +				int delta)
> +{
> +	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	s8 *p = pcp->vm_stat_diff + item;
> +	s8 old;
> +	unsigned long new;
> +	unsigned long add;
> +
> +	do {
> +		add = 0;
> +		old = *p;
> +		new = old + delta;
> +
> +		if (unlikely(new > pcp->stat_threshold ||
> +				new < -pcp->stat_threshold)) {
> +			add = new;
> +			new = 0;
> +		}
> +
> +	} while (cmpxchg_local(p, old, new) != old);
> +
> +	if (add)
> +		zone_page_state_add(add, zone, item);
> +}
> +EXPORT_SYMBOL(__mod_zone_page_state);
> +
> +/*
> + * Optimized increment and decrement functions implemented using
> + * cmpxchg_local. These do not require interrupts to be disabled
> + * and enabled.
> + */
> +void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> +{
> +	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	s8 *p = pcp->vm_stat_diff + item;
> +	int add;
> +	unsigned long old;
> +	unsigned long new;
> +
> +	do {
> +		add = 0;
> +		old = *p;
> +		new = old + 1;
> +
> +		if (unlikely(new > pcp->stat_threshold)) {
> +			add = new + pcp->stat_threshold / 2;
> +			new = -(pcp->stat_threshold / 2);
> +		}
> +	} while (cmpxchg_local(p, old, new) != old);
> +
> +	if (add)
> +		zone_page_state_add(add, zone, item);
> +}
> +
> +void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
> +{
> +	__inc_zone_state(page_zone(page), item);
> +}
> +EXPORT_SYMBOL(__inc_zone_page_state);
> +
> +void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
> +{
> +	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	s8 *p = pcp->vm_stat_diff + item;
> +	int sub;
> +	unsigned long old;
> +	unsigned long new;
> +
> +	do {
> +		sub = 0;
> +		old = *p;
> +		new = old - 1;
> +
> +		if (unlikely(new < - pcp->stat_threshold)) {
> +			sub = new - pcp->stat_threshold / 2;
> +			new = pcp->stat_threshold / 2;
> +		}
> +	} while (cmpxchg_local(p, old, new) != old);
> +
> +	if (sub)
> +		zone_page_state_add(sub, zone, item);
> +}
> +
> +void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
> +{
> +	__dec_zone_state(page_zone(page), item);
> +}
> +EXPORT_SYMBOL(__dec_zone_page_state);
> +
> +static inline void sync_diff(struct zone *zone, struct per_cpu_pageset *p, int i)
> +{
> +	/*
> +	 * xchg_local() would be useful here but that does not exist.
> +	 */
> +	zone_page_state_add(xchg(&p->vm_stat_diff[i], 0), zone, i);
> +}
> +
> +#else /* CONFIG_FAST_CMPXCHG_LOCAL */
> +
>  /*
> - * For use when we know that interrupts are disabled.
> + * Functions that do not rely on cmpxchg_local
>   */
>  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  				int delta)
> @@ -281,6 +382,17 @@ void dec_zone_page_state(struct page *pa
>  }
>  EXPORT_SYMBOL(dec_zone_page_state);
> 
> +static inline void sync_diff(struct zone *zone, struct per_cpu_pageset *p, int i)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	zone_page_state_add(p->vm_stat_diff[i], zone, i);
> +	p->vm_stat_diff[i] = 0;
> +	local_irq_restore(flags);
> +}
> +#endif /* !CONFIG_FAST_CMPXCHG_LOCAL */
> +
>  /*
>   * Update the zone counters for one cpu.
>   *
> @@ -299,7 +411,6 @@ void refresh_cpu_vm_stats(int cpu)
>  {
>  	struct zone *zone;
>  	int i;
> -	unsigned long flags;
> 
>  	for_each_zone(zone) {
>  		struct per_cpu_pageset *p;
> @@ -311,15 +422,12 @@ void refresh_cpu_vm_stats(int cpu)
> 
>  		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>  			if (p->vm_stat_diff[i]) {
> -				local_irq_save(flags);
> -				zone_page_state_add(p->vm_stat_diff[i],
> -					zone, i);
> -				p->vm_stat_diff[i] = 0;
> +				sync_diff(zone, p, i);
> +
>  #ifdef CONFIG_NUMA
>  				/* 3 seconds idle till flush */
>  				p->expire = 3;
>  #endif
> -				local_irq_restore(flags);
>  			}
>  #ifdef CONFIG_NUMA
>  		/*
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 21:17                     ` Mathieu Desnoyers
@ 2009-05-01 13:44                       ` Christoph Lameter
  2009-05-01 19:21                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-05-01 13:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo, Ingo Molnar,
	Linus Torvalds, Andrew Morton

On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:

> By ZVC update, you mean Zone ... Counter update ? (which code exactly ?)

The code that you were modifying in vmstat.c.

> Hrm, I must admit I'm not sure I follow how your reasoning applies to my
> code. I am using a percpu_add_return_irq() exactly for this reason : it
> only ever touches the percpu variable once and atomically. The test for
> overflow is done on the value returned by percpu_add_return_irq().

If the percpu differential goes over a certain boundary then the
differential would be updated twice.

> Therefore, an interrupt scenario that would be close to what I
> understand from your concerns would be :
>
> * Thread A
>
> inc_zone_page_state()
>   p_ret = percpu_add_return(p, 1); (let's suppose this increment
>                                     overflows the threshold, therefore
>                                     (p_ret & mask) == 0)
>
> ----> interrupt comes in, preempts the current thread, execution in a
>       different thread context (* Thread B) :
>
>      inc_zone_page_state()
>        p_ret = percpu_add_return(p, 1);  ((p_ret & mask) == 1)
>        if (!(p_ret & mask))
>          increment global zone count. (not executed)
>
> ----> interrupt comes in, preempts the current thread, execution back to
>       the original thread context (Thread A), on the same or on a
>       different CPU :
>
>   if (!(p_ret & mask))
>     increment global zone count.   -----> will therefore increment the
>                                           global zone count only after
>                                           scheduling back the original
>                                           thread.
>
> So I guess what you say here is that if Thread B is preempted for too
> long, we will have to wait until it gets scheduled back before the
> global count is incremented. Do we really need such degree of precision
> for those counters ?
>
> (I fear I'm not understanding your concern fully though)

Inc_zone_page_state modifies the differential which is u8 and can easily
overflow.

Hmmm. But if you check for overflow to zero this way it may work without
the need for cmpxchg. But if you rely on overflow then we only update the
global count after 256 counts on the percpu differential. The tuning of
the accuracy of the counter won't work anymore. The global counter could
become wildly inaccurate with a lot of processors.




* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 13:44                       ` Christoph Lameter
@ 2009-05-01 19:21                         ` Mathieu Desnoyers
  2009-05-01 19:31                           ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-05-01 19:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo, Ingo Molnar,
	Linus Torvalds, Andrew Morton

* Christoph Lameter (cl@linux.com) wrote:
> On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:
> 
> > By ZVC update, you mean Zone ... Counter update ? (which code exactly ?)
> 
> The code that you were modifying in vmstat.c.
> 
> > Hrm, I must admit I'm not sure I follow how your reasoning applies to my
> > code. I am using a percpu_add_return_irq() exactly for this reason : it
> > only ever touches the percpu variable once and atomically. The test for
> > overflow is done on the value returned by percpu_add_return_irq().
> 
> If the percpu differential goes over a certain boundary then the
> differential would be updated twice.
> 

Not with my approach, which tests for == 0, as you point out below,

> > Therefore, an interrupt scenario that would be close to what I
> > understand from your concerns would be :
> >
> > * Thread A
> >
> > inc_zone_page_state()
> >   p_ret = percpu_add_return(p, 1); (let's suppose this increment
> >                                     overflows the threshold, therefore
> >                                     (p_ret & mask) == 0)
> >
> > ----> interrupt comes in, preempts the current thread, execution in a
> >       different thread context (* Thread B) :
> >
> >      inc_zone_page_state()
> >        p_ret = percpu_add_return(p, 1);  ((p_ret & mask) == 1)
> >        if (!(p_ret & mask))
> >          increment global zone count. (not executed)
> >
> > ----> interrupt comes in, preempts the current thread, execution back to
> >       the original thread context (Thread A), on the same or on a
> >       different CPU :
> >
> >   if (!(p_ret & mask))
> >     increment global zone count.   -----> will therefore increment the
> >                                           global zone count only after
> >                                           scheduling back the original
> >                                           thread.
> >
> > So I guess what you say here is that if Thread B is preempted for too
> > long, we will have to wait until it gets scheduled back before the
> > global count is incremented. Do we really need such degree of precision
> > for those counters ?
> >
> > (I fear I'm not understanding your concern fully though)
> 
> Inc_zone_page_state modifies the differential which is u8 and can easily
> overflow.
> 
> Hmmm. But if you check for overflow to zero this way it may work without
> the need for cmpxchg. But if you rely on overflow then we only update the
> global count after 256 counts on the percpu differential. The tuning of
> the accuracy of the counter won't work anymore. The global counter could
> become wildly inaccurate with a lot of processors.
> 

I see that we are getting on the same page here. Good :) About the
overflow :

What I do here is to let those u8 counters increment as free-running
counters. Yes, they will periodically overflow the 8 bits. But I don't
rely on this for counting the number of increments we need between
global counter updates : I use the bitmask taken from the threshold
value (which is now required to be a power of two) to detect 0, 1, 2, 3,
4, 5, 6 or 7-bit counter overflow. Therefore we can still have the kind
of granularity currently provided. The only limitation is that we have
to use powers of two for the threshold, so we end up counting in power
of two modulo, which will be unaffected by the u8 overflow.
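
A small user-space sketch of this scheme (the threshold value of 32 and
the names are illustrative, not taken from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Free-running u8 differential with a power-of-two threshold: every
 * THRESHOLD-th increment has its low bits equal to zero, and that test
 * keeps working across the u8 wraparound. */

#define THRESHOLD 32u
#define MASK (THRESHOLD - 1)

static uint8_t vm_stat_diff;	/* free-running percpu differential */
static long global_count;	/* global zone counter */

static void inc_zone_state_sketch(void)
{
	uint8_t ret = ++vm_stat_diff;	/* percpu_add_return(p, 1) */

	if (!(ret & MASK))
		global_count += THRESHOLD;
}
```

After any number of increments the global counter lags the true count by
less than THRESHOLD, no matter how often the u8 has wrapped.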

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 19:21                         ` Mathieu Desnoyers
@ 2009-05-01 19:31                           ` Christoph Lameter
  2009-05-01 20:24                             ` Mathieu Desnoyers
  2009-05-02 21:01                             ` Mathieu Desnoyers
  0 siblings, 2 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-05-01 19:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo, Ingo Molnar,
	Linus Torvalds, Andrew Morton

On Fri, 1 May 2009, Mathieu Desnoyers wrote:

> What I do here is to let those u8 counters increment as free-running
> counters. Yes, they will periodically overflow the 8 bits. But I don't
> rely on this for counting the number of increments we need between
> global counter updates : I use the bitmask taken from the threshold
> value (which is now required to be a power of two) to detect 0, 1, 2, 3,
> 4, 5, 6 or 7-bit counter overflow. Therefore we can still have the kind
> of granularity currently provided. The only limitation is that we have
> to use powers of two for the threshold, so we end up counting in power
> of two modulo, which will be unaffected by the u8 overflow.

Ack. Got it. Looks good.



* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30 18:07                                 ` Christoph Lameter
@ 2009-05-01 19:59                                   ` Ingo Molnar
  2009-05-01 20:35                                     ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-05-01 19:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> 
> > So we introduced preemption-safety in the v2 version of that commit.
> >
> > This non-atomicity will 1) either not matter 2) will be 
> > irq-atomic by virtue of being within a critical section 3) can 
> > be made atomic in the few remaining cases.
> >
> > And maybe, at most, introduce an opt-in API: 
> > percpu_add_irqsafe().
> 
> Plus percpu_add_preemptsafe().

Ok - that would be easy. Would you be interested in doing that and 
would you be interested in resurrecting your old patches? They are 
genuinely useful IMHO.

	Ingo


* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 19:31                           ` Christoph Lameter
@ 2009-05-01 20:24                             ` Mathieu Desnoyers
  2009-05-01 20:28                               ` Christoph Lameter
  2009-05-02 21:01                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-05-01 20:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, Linux Kernel Mailing List,
	Yuriy Lalym, ltt-dev, Tejun Heo, Ingo Molnar, Linus Torvalds,
	Andrew Morton

* Christoph Lameter (cl@linux.com) wrote:
> On Fri, 1 May 2009, Mathieu Desnoyers wrote:
> 
> > What I do here is to let those u8 counters increment as free-running
> > counters. Yes, they will periodically overflow the 8 bits. But I don't
> > rely on this for counting the number of increments we need between
> > global counter updates : I use the bitmask taken from the threshold
> > value (which is now required to be a power of two) to detect 0, 1, 2, 3,
> > 4, 5, 6 or 7-bit counter overflow. Therefore we can still have the kind
> > of granularity currently provided. The only limitation is that we have
> > to use powers of two for the threshold, so we end up counting in power
> > of two modulo, which will be unaffected by the u8 overflow.
> 
> Ack. Got it. Looks good.
> 

Super ! :)

So, back to my original point : do you agree on the usefulness of
separating fallback irq-disabling from the per-cpu atomic construct ?

e.g. :

__inc_zone_state
  percpu_add_return_irq(var);

inc_zone_state
  percpu_irqsave(flags);
  __inc_zone_state()
  percpu_irqrestore(flags);

This would require that percpu_add_return_irq should always be called
either in :
  - irq disabled code paths
  - in code paths surrounded by percpu_irqsave/restore.

In this example :

x86 would map :

percpu_irqsave/restore to "nothing".
percpu_add_return_irq to xadd instruction. It is irq-safe by design.

Other architectures (fallback) would map

percpu_irqsave/restore to local_irq_save/restore.
percpu_add_return_irq to var += value; return var;
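
In user-space pseudo-kernel C, the proposed mapping could look roughly
like the following. Everything here is a sketch of the proposal, not
existing kernel API; the irq state is modeled by a flag and the arch
symbol is made up:

```c
#include <assert.h>

/* On an arch with an irq-safe percpu add (x86 xadd),
 * percpu_irqsave/restore would compile to nothing; on the fallback,
 * they map to local_irq_save/restore, modeled here by a flag. */

static int irqs_disabled;	/* stand-in for the hardware irq state */
static long percpu_var;

#ifdef ARCH_HAS_IRQSAFE_PERCPU_ADD
#define percpu_irqsave(flags)		((void)(flags))
#define percpu_irqrestore(flags)	((void)(flags))
#else
#define percpu_irqsave(flags)		do { (void)(flags); irqs_disabled = 1; } while (0)
#define percpu_irqrestore(flags)	do { (void)(flags); irqs_disabled = 0; } while (0)
#endif

/* Must run either with irqs off or inside percpu_irqsave/restore. */
static long percpu_add_return_irq(long *var, long value)
{
	*var += value;
	return *var;
}
```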

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 20:24                             ` Mathieu Desnoyers
@ 2009-05-01 20:28                               ` Christoph Lameter
  2009-05-01 20:43                                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-05-01 20:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Peter Zijlstra, Linux Kernel Mailing List,
	Yuriy Lalym, ltt-dev, Tejun Heo, Ingo Molnar, Linus Torvalds,
	Andrew Morton

On Fri, 1 May 2009, Mathieu Desnoyers wrote:

> So, back to my original point : do you agree on the usefulness of
> separating fallback irq-disabling from the per-cpu atomic construct ?

No. Percpu operations are used for statistics and are like atomic
operations. Aggregation of these leads to a can of worms that we better
leave unopened.

> x86 would map :
>
> percpu_irqsave/restore to "nothing".
> percpu_add_return_irq to xadd instruction. It is irq-safe by design.
>
> Other architectures (fallback) would map
>
> percpu_irqsave/restore to local_irq_save/restore.
> percpu_add_return_irq to var += value; return var;

Shudder.... We have explored those types of macros before (while doing
fastpath optimization for SLUB) and it significantly increases the
complexity. People may add additional instructions in between and now
interrupts could be on or off depending on the architecture. Sometimes
percpu_irqsave does nothing. Very difficult to ensure that the usage is
correct.

And we have barely any usage case for such macros.






* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 19:59                                   ` Ingo Molnar
@ 2009-05-01 20:35                                     ` Christoph Lameter
  2009-05-01 21:07                                       ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-05-01 20:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Fri, 1 May 2009, Ingo Molnar wrote:

> > Plus percpu_add_preemptsafe().
>
> Ok - that would be easy. Would you be interested in doing that and
> would you be interested in resurrecting your old patches? They are
> genuinely useful IMHO.

I see but it may take some time for me to get that finished (probably 2
weeks or so). Don't feel that you have to merge my old patches or a variant
thereof.


* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 20:43                                 ` Mathieu Desnoyers
@ 2009-05-01 20:42                                   ` Christoph Lameter
  2009-05-01 21:19                                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-05-01 20:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Peter Zijlstra, Linux Kernel Mailing List,
	Yuriy Lalym, ltt-dev, Tejun Heo, Ingo Molnar, Linus Torvalds,
	Andrew Morton

On Fri, 1 May 2009, Mathieu Desnoyers wrote:

> Then do you have a better idea on how to deal with
> __inc_zone_state/inc_zone_state without duplicating the code and without
> adding useless local_irq_save/restore on x86 ?

We are already using __inc_zone_state to avoid local_irq_save/restore.



* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 20:28                               ` Christoph Lameter
@ 2009-05-01 20:43                                 ` Mathieu Desnoyers
  2009-05-01 20:42                                   ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-05-01 20:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, Linux Kernel Mailing List,
	Yuriy Lalym, ltt-dev, Tejun Heo, Ingo Molnar, Linus Torvalds,
	Andrew Morton

* Christoph Lameter (cl@linux.com) wrote:
> On Fri, 1 May 2009, Mathieu Desnoyers wrote:
> 
> > So, back to my original point : do you agree on the usefulness of
> > separating fallback irq-disabling from the per-cpu atomic construct ?
> 
> No. Percpu operations are used for statistics and are like atomic
> operations. Aggregation of these leads to a can of worms that we better
> leave unopened.
> 
> > x86 would map :
> >
> > percpu_irqsave/restore to "nothing".
> > percpu_add_return_irq to xadd instruction. It is irq-safe by design.
> >
> > Other architectures (fallback) would map
> >
> > percpu_irqsave/restore to local_irq_save/restore.
> > percpu_add_return_irq to var += value; return var;
> 
> Shudder.... We have explored those types of macros before (while doing
> fastpath optimization for SLUB) and it significantly increases the
> complexity. People may add additional instructions in between and now
> interrupts could be on or off depending on the architecture. Sometimes
> percpu_irqsave does nothing. Very difficult to ensure that the usage is
> correct.
> 
> And we have barely any usage case for such macros.
> 

Then do you have a better idea on how to deal with
__inc_zone_state/inc_zone_state without duplicating the code and without
adding useless local_irq_save/restore on x86 ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 20:35                                     ` Christoph Lameter
@ 2009-05-01 21:07                                       ` Ingo Molnar
  2009-05-02  3:06                                         ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-05-01 21:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> On Fri, 1 May 2009, Ingo Molnar wrote:
> 
> > > Plus percpu_add_preemptsafe().
> >
> > Ok - that would be easy. Would you be interested in doing that 
> > and would you be interested in resurrecting your old patches? 
> > They are genuinely useful IMHO.
> 
> I see but it may take some time for me to get that finished 
> (probably 2 weeks or so). Don't feel that you have to merge my old
> patches or a variant thereof.

it's your baby really and i don't want to interfere in any negative
way. The patches look fairly complete and you did all the hard work 
with them already.

Can help out with review and testing (as always) but i suspect it's 
more appropriate for -mm to carry them in the end, because of the MM 
and tree-wide impact, and because the fine differences between those 
APIs will only be truly visible on non-x86.

( There's also Tejun's latest improvements of the dynamic allocator
  that he made based on your feedback - are those in any tree yet? )

	Ingo


* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 20:42                                   ` Christoph Lameter
@ 2009-05-01 21:19                                     ` Mathieu Desnoyers
  2009-05-02  3:00                                       ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-05-01 21:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo, Ingo Molnar,
	Linus Torvalds, Andrew Morton

* Christoph Lameter (cl@linux.com) wrote:
> On Fri, 1 May 2009, Mathieu Desnoyers wrote:
> 
> > Then do you have a better idea on how to deal with
> > __inc_zone_state/inc_zone_state without duplicating the code and without
> > adding useless local_irq_save/restore on x86 ?
> 
> We are already using __inc_zone_state to avoid local_irq_save/restore.
> 

Then, if I understand you correctly, you propose :

void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
        ... assuming p references percpu "u8" counters ...
        u8 p_new;

        p_new = percpu_add_return(p, 1);

        if (unlikely(!(p_new & pcp->stat_threshold_mask)))
                zone_page_state_add(pcp->stat_threshold, zone, item);
}

void inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
        ... assuming p references percpu "u8" counters ...
        u8 p_new;

        p_new = percpu_add_return_irqsafe(p, 1);

        if (unlikely(!(p_new & pcp->stat_threshold_mask)))
                zone_page_state_add(pcp->stat_threshold, zone, item);
}

(therefore opting for code duplication)

Am I correct ?

It can indeed be argued to be more straightforward than adding irq
disabling primitives which behave differently depending on the
architecture. And it certainly makes sense as long as duplicated
functions are small enough and as long as we never touch more than one
counter in the same function.

A compromise would be to include the appropriate irq disabling or
preempt disabling in the "standard version" of these functions, but
create underscore prefixed versions which would need
percpu_local_irqsave and friends to be done separately. Therefore, the
standard usage would be easy to deal with and review, and it would still
allow using the underscore-prefixed version when code would otherwise
have to be duplicated or when multiple counters must be updated at once.

And the rule would be simple enough :

- percpu_add_return_irqsafe is safe wrt interrupts.

- _always_ disable interrupts or use percpu_local_irqsave/restore
  around __percpu_add_return_irqsafe().

For the example above, this would let us write :

void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
        ... assuming p references percpu "u8" counters ...
        u8 p_new;

        p_new = __percpu_add_return_irqsafe(p, 1);

        if (unlikely(!(p_new & pcp->stat_threshold_mask)))
                zone_page_state_add(pcp->stat_threshold, zone, item);
}

void inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
	unsigned long flags;

	percpu_local_irqsave(flags);
	__inc_zone_state(zone, item);
	percpu_local_irqrestore(flags);
}

Which is more compact and does not duplicate code.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 21:19                                     ` Mathieu Desnoyers
@ 2009-05-02  3:00                                       ` Christoph Lameter
  2009-05-02  7:01                                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-05-02  3:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo, Ingo Molnar,
	Linus Torvalds, Andrew Morton

On Fri, 1 May 2009, Mathieu Desnoyers wrote:

> Then, if I understand you correctly, you propose :
>
> void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> {
>         ... assuming p references percpu "u8" counters ...
>         u8 p_new;
>
>         p_new = percpu_add_return(p, 1);
>
>         if (unlikely(!(p_new & pcp->stat_threshold_mask)))
>                 zone_page_state_add(pcp->stat_threshold, zone, item);
> }
>
> void inc_zone_state(struct zone *zone, enum zone_stat_item item)
> {
>         ... assuming p references percpu "u8" counters ...
>         u8 p_new;
>
>         p_new = percpu_add_return_irqsafe(p, 1);
>
>         if (unlikely(!(p_new & pcp->stat_threshold_mask)))
>                 zone_page_state_add(pcp->stat_threshold, zone, item);
> }
>
> (therefore opting for code duplication)
>
> Am I correct ?

Well __inc_zone_state is fine by itself. inc_zone_state will currently
disable irqs. But we can do it your way and duplicate the code.

> void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> {
>         ... assuming p references percpu "u8" counters ...
>         u8 p_new;
>
>         p_new = __percpu_add_return_irqsafe(p, 1);
>
>         if (unlikely(!(p_new & pcp->stat_threshold_mask)))
>                 zone_page_state_add(pcp->stat_threshold, zone, item);
> }
>
> void inc_zone_state(struct zone *zone, enum zone_stat_item item)
> {
> 	unsigned long flags;
>
> 	percpu_local_irqsave(flags);
> 	__inc_zone_state(zone, item);
> 	percpu_local_irqrestore(flags);
> }
>
> Which is more compact and does not duplicate code.

This is almost like the current code. But lets avoid percpu_local_irqs
etc if we can.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 21:07                                       ` Ingo Molnar
@ 2009-05-02  3:06                                         ` Christoph Lameter
  2009-05-02  9:03                                           ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Lameter @ 2009-05-02  3:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Fri, 1 May 2009, Ingo Molnar wrote:

> > I see but it may take some time for me to get that finished
> > (probably 2 weeks or so). Dont feel that you have to merge my old
> > patches or a variant thereof.
>
> it's your baby really and i dont want to interfere in any negative
> way. The patches look fairly complete and you did all the hard work
> with them already.

Ah I remember, you drowned that baby when you asked me to do lowercase
macros. That plus the job change made me lose most of my drive for the
patchset. No employer anymore that wants the stuff... Then there is the
scheduler regression and the network latency issues that prevent my new
project from running kernels after 2.6.22. The time that I have will mostly be
spent on that in the coming weeks. Cannot really push much into newer
kernels until we are able to run upstream kernels without significant loss
in performance.

We just tried 2.6.9. And hey performance is better. Looks like going back
in kernel versions is a way to improve performance. Simpler kernel,
lower cache footprint and it becomes faster. And the app is simply
doing network I/O and some number crunching. Sigh.

> Can help out with review and testing (as always) but i suspect it's
> more appropriate for -mm to carry them in the end, because of the MM
> and tree-wide impact, and because the fine differences between those
> APIs will only be truly visible on non-x86.

I'd be glad to review. But this is not far up in the priority list given
the unresolved regressions.

> ( There's also Tejun's latest improvements of the dynamic allocator
>   that he made based on your feedback - are those in any tree yet? )

No idea. Do not see them upstream.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-02  3:00                                       ` Christoph Lameter
@ 2009-05-02  7:01                                         ` Mathieu Desnoyers
  0 siblings, 0 replies; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-05-02  7:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo, Ingo Molnar,
	Linus Torvalds, Andrew Morton

* Christoph Lameter (cl@linux.com) wrote:
> On Fri, 1 May 2009, Mathieu Desnoyers wrote:
> 
> > Then, if I understand you correctly, you propose :
> >
> > void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> > {
> >         ... assuming p references percpu "u8" counters ...
> >         u8 p_new;
> >
> >         p_new = percpu_add_return(p, 1);
> >
> >         if (unlikely(!(p_new & pcp->stat_threshold_mask)))
> >                 zone_page_state_add(pcp->stat_threshold, zone, item);
> > }
> >
> > void inc_zone_state(struct zone *zone, enum zone_stat_item item)
> > {
> >         ... assuming p references percpu "u8" counters ...
> >         u8 p_new;
> >
> >         p_new = percpu_add_return_irqsafe(p, 1);
> >
> >         if (unlikely(!(p_new & pcp->stat_threshold_mask)))
> >                 zone_page_state_add(pcp->stat_threshold, zone, item);
> > }
> >
> > (therefore opting for code duplication)
> >
> > Am I correct ?
> 
> Well __inc_zone_state is fine by itself. inc_zone_state will currently
> disable irqs. But we can do it your way and duplicate the code.
> 

OK, I think I see the source of disagreement here. Leaving
local_irq_save in place in inc_zone_state as you propose will degrade
performance significantly. x86 is not the only architecture that is
slow at disabling interrupts.

Let me get my hands on the numbers I've prepared for my upcoming paper
about tracer locking... here they are. These are in cycles.

Architecture                  Speedup           CAS        Interrupts
                        (cli+sti)/(local CAS)  local sync   sti   cli

AMD Athlon(tm)64 X2            4.57              7    17     17    15
Intel Core2                    6.33              6    30     20    18
Intel Xeon E5405               5.25              8    20     20    22
PowerPC G5                     4.00              1     2      3     1
PowerPC POWER6 4.2 GHz         1.77              9    17     14     2
Itanium 2 (single and
  dual-core, 1.6GHz)           1.33              3     3      2     2
UltraSPARC-IIIi (1.2GHz) (a)   0.64            0.394 0.394  0.094 0.159

        (a) in system bus clock cycles

I've also got numbers for atomic add return... it's usually a bit faster
than local CAS, except on architectures where atomic add return must be
emulated by a CAS (it's then very slightly slower). But in any case,
disabling/enabling interrupts is _WAY_ slower than local CAS or local
add return.

> > void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> > {
> >         ... assuming p references percpu "u8" counters ...
> >         u8 p_new;
> >
> >         p_new = __percpu_add_return_irqsafe(p, 1);
> >
> >         if (unlikely(!(p_new & pcp->stat_threshold_mask)))
> >                 zone_page_state_add(pcp->stat_threshold, zone, item);
> > }
> >
> > void inc_zone_state(struct zone *zone, enum zone_stat_item item)
> > {
> > 	unsigned long flags;
> >
> > 	percpu_local_irqsave(flags);
> > 	__inc_zone_state(zone, item);
> > 	percpu_local_irqrestore(flags);
> > }
> >
> > Which is more compact and does not duplicate code.
> 
> This is almost like the current code. But lets avoid percpu_local_irqs
> etc if we can.

If we want to get good performance out of those percpu ops, we have to
leave interrupts enabled. It's a factor ~5 slowdown otherwise on x86. So
the choice really comes down to : either we duplicate code, or we create
those percpu_local_irqsave-like primitives. BTW if the name is bad, we
may have to just find something better.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-02  3:06                                         ` Christoph Lameter
@ 2009-05-02  9:03                                           ` Ingo Molnar
  2009-05-04 14:48                                             ` Christoph Lameter
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-05-02  9:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo


* Christoph Lameter <cl@linux.com> wrote:

> On Fri, 1 May 2009, Ingo Molnar wrote:
> 
> > > I see but it may take some time for me to get that finished
> > > (probably 2 weeks or so). Dont feel that you have to merge my old
> > > patches or a variant thereof.
> >
> > it's your baby really and i dont want to interfere in any negative
> > way. The patches look fairly complete and you did all the hard work
> > with them already.
> 
> Ah I remember, you drowned that baby when you asked me to do 
> lowercase macros. [...]

heh, had i not asked for that during review i would have been 
drowned by Linus later on. So it was a basic survival reflex really 
;-)

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-01 19:31                           ` Christoph Lameter
  2009-05-01 20:24                             ` Mathieu Desnoyers
@ 2009-05-02 21:01                             ` Mathieu Desnoyers
  2009-05-04 14:08                               ` Christoph Lameter
  1 sibling, 1 reply; 62+ messages in thread
From: Mathieu Desnoyers @ 2009-05-02 21:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, Linux Kernel Mailing List,
	Yuriy Lalym, ltt-dev, Tejun Heo, Ingo Molnar, Linus Torvalds,
	Andrew Morton

* Christoph Lameter (cl@linux.com) wrote:
> On Fri, 1 May 2009, Mathieu Desnoyers wrote:
> 
> > What I do here is to let those u8 counters increment as free-running
> > counters. Yes, they will periodically overflow the 8 bits. But I don't
> > rely on this for counting the number of increments we need between
> > global counter updates : I use the bitmask taken from the threshold
> > value (which is now required to be a power of two) to detect 0, 1, 2, 3,
> > 4, 5, 6 or 7-bit counter overflow. Therefore we can still have the kind
> > of granularity currently provided. The only limitation is that we have
> > to use powers of two for the threshold, so we end up counting in power
> > of two modulo, which will be unaffected by the u8 overflow.
> 
> Ack. Got it. Looks good.
> 

Modifying mmzone.h "struct per_cpu_pageset" so it uses percpu dynamic
allocation seems to be the far-reaching part of the modification (of a
subsystem I'm not completely familiar with, including NUMA special
cases). Is there any patch already doing this kind of modification
floating around ?

I'd be glad to give a try at some percpu_add_return counters experiments
if percpu struct per_cpu_pageset allocation happens to be already
available.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-04-30  6:21     ` Ingo Molnar
  2009-04-30  6:33       ` [ltt-dev] " Mathieu Desnoyers
@ 2009-05-03  2:40       ` Tejun Heo
  2009-05-04 14:10         ` Christoph Lameter
  1 sibling, 1 reply; 62+ messages in thread
From: Tejun Heo @ 2009-05-03  2:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Christoph Lameter

Hello, Ingo.

Ingo Molnar wrote:
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
>> And thanks for the review! This exercise only convinced me that 
>> the kernel memory accounting works as expected. All this gave me 
>> the chance to have a good look at the memory accounting code. We 
>> could probably benefit of Christoph Lameter's cpu ops (using 
>> segment registers to address per-cpu variables with atomic 
>> inc/dec) in there. Or at least removing interrupt disabling by 
>> using preempt disable and local_t variables for the per-cpu 
>> counters could bring some benefit.
> 
> Note, optimized per cpu ops are already implemented upstream, by 
> Tejun Heo's percpu patches in .30:
> 
>  #define percpu_read(var)	percpu_from_op("mov", per_cpu__##var)
>  #define percpu_write(var, val)	percpu_to_op("mov", per_cpu__##var, val)
>  #define percpu_add(var, val)	percpu_to_op("add", per_cpu__##var, val)
>  #define percpu_sub(var, val)	percpu_to_op("sub", per_cpu__##var, val)
>  #define percpu_and(var, val)	percpu_to_op("and", per_cpu__##var, val)
>  #define percpu_or(var, val)	percpu_to_op("or", per_cpu__##var, val)
>  #define percpu_xor(var, val)	percpu_to_op("xor", per_cpu__##var, val)
> 
> See:
> 
>   6dbde35: percpu: add optimized generic percpu accessors

One problem I have with the above api is that those take the variable
symbol not pointer to it, so they can't be used with dynamic
variables.  The api needs major revisions anyway regarding atomicity
and I was planning on getting back to it once all archs have been
converted, hoping that it wouldn't be used widely before that.
But then again, it's not like changing the percpu api and its users is a
difficult thing to do, and there are a lot of benefits in testing how
things work as soon as possible.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [ltt-dev] [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-02 21:01                             ` Mathieu Desnoyers
@ 2009-05-04 14:08                               ` Christoph Lameter
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-05-04 14:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Peter Zijlstra, Linux Kernel Mailing List,
	Yuriy Lalym, ltt-dev, Tejun Heo, Ingo Molnar, Linus Torvalds,
	Andrew Morton

On Sat, 2 May 2009, Mathieu Desnoyers wrote:

> Modifying mmzone.h "struct per_cpu_pageset" so it uses percpu dynamic
> allocation seems to be the far-reaching part of the modification (of a
> subsystem I'm not completely familiar with, including NUMA special
> cases). Is there any patch already doing this kind of modification
> floating around ?

Yes. It was part of the original cpu alloc patchset. One reason to do cpu
alloc was to clean up the current problem of having to allocate for
NR_CPUS in the page allocator. That in turn would result in the ability to
use the fast percpu ops to optimize the allocator paths. Hopefully we can
end up with page allocator alloc/free operations that do not need to
disable interrupts.

> I'd be glad to give a try at some percpu_add_return counters experiments
> if percpu struct per_cpu_pageset allocation happens to be already
> available.

Look at the cpu alloc patchset following the link that I posted.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-03  2:40       ` Tejun Heo
@ 2009-05-04 14:10         ` Christoph Lameter
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-05-04 14:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mathieu Desnoyers, Linus Torvalds, Andrew Morton,
	Nick Piggin, KOSAKI Motohiro, Peter Zijlstra, thomas.pi,
	Yuriy Lalym, Linux Kernel Mailing List, ltt-dev

On Sun, 3 May 2009, Tejun Heo wrote:

> One problem I have with the above api is that those take the variable
> symbol not pointer to it, so they can't be used with dynamic
> variables.  The api needs major revisions anyway regarding atomicity

Thanks. I missed making that point.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Fix dirty page accounting in redirty_page_for_writepage()
  2009-05-02  9:03                                           ` Ingo Molnar
@ 2009-05-04 14:48                                             ` Christoph Lameter
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Lameter @ 2009-05-04 14:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, Nick Piggin,
	KOSAKI Motohiro, Peter Zijlstra, thomas.pi, Yuriy Lalym,
	Linux Kernel Mailing List, ltt-dev, Tejun Heo

On Sat, 2 May 2009, Ingo Molnar wrote:

> heh, had i not asked for that during review i would have been
> drowned by Linus later on. So it was a basic survival reflex really

There seemed to be basic agreement before on upper case being the
right solution. You were not the first to bring the issue up.

It is a matter of weighing the shouting effect of uppercase against the
alert it gives that wild things are going on in the percpu macros.


^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2009-05-04 14:59 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-29 23:25 [PATCH] Fix dirty page accounting in redirty_page_for_writepage() Mathieu Desnoyers
2009-04-29 23:56 ` Mathieu Desnoyers
2009-04-29 23:59 ` Andrew Morton
2009-04-30  2:34   ` Mathieu Desnoyers
2009-04-30  0:06 ` Linus Torvalds
2009-04-30  2:43   ` Mathieu Desnoyers
2009-04-30  6:21     ` Ingo Molnar
2009-04-30  6:33       ` [ltt-dev] " Mathieu Desnoyers
2009-04-30  6:50         ` Ingo Molnar
2009-04-30 13:38           ` Christoph Lameter
2009-04-30 14:10             ` Ingo Molnar
2009-04-30 14:12             ` Mathieu Desnoyers
2009-04-30 14:12               ` Christoph Lameter
2009-04-30 19:41                 ` Mathieu Desnoyers
2009-04-30 20:17                   ` Christoph Lameter
2009-04-30 21:17                     ` Mathieu Desnoyers
2009-05-01 13:44                       ` Christoph Lameter
2009-05-01 19:21                         ` Mathieu Desnoyers
2009-05-01 19:31                           ` Christoph Lameter
2009-05-01 20:24                             ` Mathieu Desnoyers
2009-05-01 20:28                               ` Christoph Lameter
2009-05-01 20:43                                 ` Mathieu Desnoyers
2009-05-01 20:42                                   ` Christoph Lameter
2009-05-01 21:19                                     ` Mathieu Desnoyers
2009-05-02  3:00                                       ` Christoph Lameter
2009-05-02  7:01                                         ` Mathieu Desnoyers
2009-05-02 21:01                             ` Mathieu Desnoyers
2009-05-04 14:08                               ` Christoph Lameter
2009-05-03  2:40       ` Tejun Heo
2009-05-04 14:10         ` Christoph Lameter
2009-04-30 13:22     ` Christoph Lameter
2009-04-30 13:38       ` Ingo Molnar
2009-04-30 13:40         ` Christoph Lameter
2009-04-30 14:14           ` Ingo Molnar
2009-04-30 14:15             ` Christoph Lameter
2009-04-30 14:38               ` Ingo Molnar
2009-04-30 14:45                 ` Christoph Lameter
2009-04-30 15:01                   ` Ingo Molnar
2009-04-30 15:25                     ` Christoph Lameter
2009-04-30 15:42                       ` Ingo Molnar
2009-04-30 15:44                         ` Christoph Lameter
2009-04-30 16:06                           ` Ingo Molnar
2009-04-30 16:11                             ` Christoph Lameter
2009-04-30 16:16                             ` Linus Torvalds
2009-04-30 17:23                               ` Ingo Molnar
2009-04-30 18:07                                 ` Christoph Lameter
2009-05-01 19:59                                   ` Ingo Molnar
2009-05-01 20:35                                     ` Christoph Lameter
2009-05-01 21:07                                       ` Ingo Molnar
2009-05-02  3:06                                         ` Christoph Lameter
2009-05-02  9:03                                           ` Ingo Molnar
2009-05-04 14:48                                             ` Christoph Lameter
2009-04-30 16:13                         ` Linus Torvalds
2009-04-30 15:54                       ` Ingo Molnar
2009-04-30 16:00                       ` Ingo Molnar
2009-04-30 16:08                         ` Christoph Lameter
2009-04-30 13:50         ` Mathieu Desnoyers
2009-04-30 13:55           ` Christoph Lameter
2009-04-30 14:32           ` Ingo Molnar
2009-04-30 14:42             ` Christoph Lameter
2009-04-30 14:59               ` Ingo Molnar
2009-04-30 16:03             ` [ltt-dev] " Mathieu Desnoyers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.