* [patch 0/5] mm: per-zone dirty limiting
@ 2011-07-25 20:19 ` Johannes Weiner
  0 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-07-25 20:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

Hello!

Writing back single file pages during reclaim exhibits bad IO
patterns, but we can't just stop doing that before the VM has other
means to ensure the pages in a zone are reclaimable.

Over time there have been several suggestions to at least do
write-around of pages in inode proximity when the need arises to
clean pages under memory pressure.  But even that would interrupt
writeback from the flushers, without any guarantee that the nearby
inode pages are even sitting in the same troubled zone.

Part of the reason why dirty pages reach the end of the LRU lists in
the first place is that the dirty limits are a global restriction,
while most systems have more than one LRU list, and those lists
differ in size.  Multiple nodes have multiple zones, each with its
own file lists, but there is nothing to balance the dirty pages
between those lists except for reclaim writing them out as it
encounters them.

With around 4G of RAM, an x86_64 machine of mine has a DMA32 zone of
a bit over 3G, a Normal zone of 500M, and a DMA zone of 15M.

A linear writer can quickly fill up the Normal zone and then the
DMA32 zone, initially throttled only by the global dirty limit.  The
flushers catch up, the zones are now mostly full of clean pages, and
memory reclaim kicks in on subsequent allocations.  The pages it
frees from the Normal zone are quickly refilled with dirty pages
(unthrottled, as the much bigger DMA32 zone allows for a huge number
of dirty pages compared to the Normal zone).  As there are also anon
and active file pages in the Normal zone, it is quite likely that a
significant amount of its inactive file pages end up dirty [ the
numbers below read foo=zone(global) ]:

reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=112313(821289) active=9942(10039) isolated=27(27) dirty=59709(146944) writeback=739(4017)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111102(806876) active=9925(10022) isolated=32(32) dirty=72125(146914) writeback=957(3972)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110493(803374) active=9871(9978) isolated=32(32) dirty=57274(146618) writeback=4088(4088)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111957(806559) active=9871(9978) isolated=32(32) dirty=65125(147329) writeback=456(3866)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110601(803978) active=9860(9973) isolated=27(27) dirty=63792(146590) writeback=61(4276)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111786(804032) active=9860(9973) isolated=0(64) dirty=64310(146998) writeback=1282(3847)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111643(805651) active=9860(9982) isolated=32(32) dirty=63778(147217) writeback=1127(4156)
reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111678(804709) active=9859(10112) isolated=27(27) dirty=81673(148224) writeback=29(4233)

[ These prints occur only once per reclaim invocation, so the actual
->writepage calls are more frequent than the timestamp may suggest. ]

In the scenario without the Normal zone, first the DMA32 zone fills
up, then the DMA zone.  When reclaim kicks in, it is presented with a
DMA zone whose inactive pages are all dirty -- and dirtied most
recently at that, so the flushers really have abysmal chances of
making any headway:

reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=776(430813) active=2(2931) isolated=32(32) dirty=814(68649) writeback=0(18765)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(430344) active=2(2931) isolated=32(32) dirty=764(67790) writeback=0(17146)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=729(430838) active=2(2931) isolated=32(32) dirty=293(65303) writeback=468(20122)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=757(431181) active=2(2931) isolated=32(32) dirty=63(68851) writeback=731(15926)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=758(432808) active=2(2931) isolated=32(32) dirty=645(64106) writeback=0(19666)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(431018) active=2(2931) isolated=32(32) dirty=740(65770) writeback=10(17907)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=697(430467) active=2(2931) isolated=32(32) dirty=743(63757) writeback=0(18826)
reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=693(430951) active=2(2931) isolated=32(32) dirty=626(54529) writeback=91(16198)

The idea behind this patch set is to take the ratio that the global
dirty limits bear to the globally dirtyable memory and apply that
same proportion to each individual zone.  The allocator then ensures
that pages allocated for being written to in the page cache are
distributed across zones such that every zone always has enough
clean pages to begin with.
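
For the common ratio-based configuration, the per-zone check boils
down to something like the following sketch (condensed from the
zone_dirtyable_memory()/zone_dirty_limits()/zone_dirty_ok() code in
patch 4, which also handles vm_dirty_bytes and the background
threshold; the _sketch name is only used here for illustration):

	/*
	 * Sketch: a zone's dirty limit is the global dirty ratio
	 * applied to that zone's dirtyable memory, i.e. its free
	 * plus reclaimable pages.
	 */
	static bool zone_dirty_ok_sketch(struct zone *zone)
	{
		unsigned long zone_memory = zone_dirtyable_memory(zone);
		unsigned long thresh = zone_memory * vm_dirty_ratio / 100;

		return zone_page_state(zone, NR_FILE_DIRTY) +
		       zone_page_state(zone, NR_UNSTABLE_NFS) +
		       zone_page_state(zone, NR_WRITEBACK) <= thresh;
	}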

I am not yet really satisfied, as it is neither particularly
orthogonal to nor well integrated with the rest of the writeback
throttling, and it has rough edges here and there, but test results
do look rather promising so far:

--- Copying 8G to fuse-ntfs on USB stick in 4G machine

3.0:

 Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):

       140,671,831 cache-misses             #      4.923 M/sec   ( +-   0.198% )  (scaled from 82.80%)
       726,265,014 cache-references         #     25.417 M/sec   ( +-   1.104% )  (scaled from 83.06%)
       144,092,383 branch-misses            #      4.157 %       ( +-   0.493% )  (scaled from 83.17%)
     3,466,608,296 branches                 #    121.319 M/sec   ( +-   0.421% )  (scaled from 67.89%)
    17,882,351,343 instructions             #      0.417 IPC     ( +-   0.457% )  (scaled from 84.73%)
    42,848,633,897 cycles                   #   1499.554 M/sec   ( +-   0.604% )  (scaled from 83.08%)
               236 page-faults              #      0.000 M/sec   ( +-   0.323% )
             8,026 CPU-migrations           #      0.000 M/sec   ( +-   6.291% )
         2,372,358 context-switches         #      0.083 M/sec   ( +-   0.003% )
      28574.255540 task-clock-msecs         #      0.031 CPUs    ( +-   0.409% )

      912.625436885  seconds time elapsed   ( +-   3.851% )

 nr_vmscan_write 667839

3.0-per-zone-dirty:

 Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):

       140,791,501 cache-misses             #      3.887 M/sec   ( +-   0.186% )  (scaled from 83.09%)
       816,474,193 cache-references         #     22.540 M/sec   ( +-   0.923% )  (scaled from 83.16%)
       154,500,577 branch-misses            #      4.302 %       ( +-   0.495% )  (scaled from 83.15%)
     3,591,344,338 branches                 #     99.143 M/sec   ( +-   0.402% )  (scaled from 67.32%)
    18,713,190,183 instructions             #      0.338 IPC     ( +-   0.448% )  (scaled from 83.96%)
    55,285,320,107 cycles                   #   1526.208 M/sec   ( +-   0.588% )  (scaled from 83.28%)
               237 page-faults              #      0.000 M/sec   ( +-   0.302% )
            28,028 CPU-migrations           #      0.001 M/sec   ( +-   3.070% )
         2,369,897 context-switches         #      0.065 M/sec   ( +-   0.006% )
      36223.970238 task-clock-msecs         #      0.060 CPUs    ( +-   1.062% )

      605.909769823  seconds time elapsed   ( +-   0.783% )

 nr_vmscan_write 0

That cuts the copy time by about a third (912.6s -> 605.9s), which
amounts to roughly 50% higher throughput, and reclaim no longer
interferes with writeback at all (nr_vmscan_write drops from 667839
to 0).

As not every other allocation has to reclaim from a Normal zone full
of dirty pages anymore, the patched kernel is also more responsive in
general during the copy.

I am also running fs_mark on XFS on a 2G machine, but the final
results are not in yet.  The preliminary results appear to be in this
ballpark:

--- fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))

3.0:

real    20m43.901s
user    0m8.988s
sys     0m58.227s
nr_vmscan_write 3347

3.0-per-zone-dirty:

real    20m8.012s
user    0m8.862s
sys     1m2.585s
nr_vmscan_write 161

Patch #1 is more or less an unrelated fix that subsequent patches
depend upon as they modify the same code.  It should go upstream
immediately, me thinks.

#2 and #3 are boring cleanup, guess they can go in right away as well.

#4 adds per-zone dirty throttling for __GFP_WRITE allocators, and #5
passes __GFP_WRITE from the grab_cache_page* functions in the hope of
catching most writers and no readers; I haven't checked all sites
yet.
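
For reference, the two hooks themselves are small.  Condensed from
patches 4 and 5 below: the allocator's zonelist scan skips zones that
are over their dirty limit, and grab_cache_page() tags its
allocations as write-bound:

	/* mm/page_alloc.c, zonelist scan (patch 4): */
	if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
		goto this_zone_full;

	/* include/linux/pagemap.h, grab_cache_page() (patch 5): */
	return find_or_create_page(mapping, index,
				   mapping_gfp_mask(mapping) | __GFP_WRITE);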

Discuss! :-)

 include/linux/gfp.h       |    4 +-
 include/linux/pagemap.h   |    6 +-
 include/linux/writeback.h |    5 +-
 mm/filemap.c              |    8 +-
 mm/page-writeback.c       |  225 ++++++++++++++++++++++++++++++--------------
 mm/page_alloc.c           |   27 ++++++
 6 files changed, 196 insertions(+), 79 deletions(-)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [patch 1/5] mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE
  2011-07-25 20:19 ` Johannes Weiner
@ 2011-07-25 20:19   ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-07-25 20:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

From: Johannes Weiner <hannes@cmpxchg.org>

__GFP_OTHER_NODE is used for NUMA allocations on behalf of other
nodes.  It's supposed to be passed through from the page allocator to
zone_statistics(), but it never gets there as gfp_allowed_mask is not
wide enough and masks out the flag early in the allocation path.

The result is an accounting glitch where successful NUMA allocations
by-agent are not properly attributed as local.

Increase __GFP_BITS_SHIFT so that it includes __GFP_OTHER_NODE.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/gfp.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index cb40892..3a76faf 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -92,7 +92,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 23	/* Room for 23 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 24	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [patch 2/5] mm: writeback: make determine_dirtyable_memory static again
  2011-07-25 20:19 ` Johannes Weiner
@ 2011-07-25 20:19   ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-07-25 20:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

From: Johannes Weiner <hannes@cmpxchg.org>

The tracing ring-buffer used this function briefly, but not anymore.
Make it local to the writeback code again.

Also, move the function so that no forward declaration needs to be
reintroduced.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/writeback.h |    2 -
 mm/page-writeback.c       |   85 ++++++++++++++++++++++-----------------------
 2 files changed, 42 insertions(+), 45 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 17e7ccc..8c63f3a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -105,8 +105,6 @@ extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
 
-extern unsigned long determine_dirtyable_memory(void);
-
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 31f6988..a4de005 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -111,6 +111,48 @@ EXPORT_SYMBOL(laptop_mode);
 
 /* End of sysctl-exported parameters */
 
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+	int node;
+	unsigned long x = 0;
+
+	for_each_node_state(node, N_HIGH_MEMORY) {
+		struct zone *z =
+			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+
+		x += zone_page_state(z, NR_FREE_PAGES) +
+		     zone_reclaimable_pages(z);
+	}
+	/*
+	 * Make sure that the number of highmem pages is never larger
+	 * than the number of the total dirtyable memory. This can only
+	 * occur in very strange VM situations but we want to make sure
+	 * that this does not occur.
+	 */
+	return min(x, total);
+#else
+	return 0;
+#endif
+}
+
+/**
+ * determine_dirtyable_memory - amount of memory that may be used
+ *
+ * Returns the numebr of pages that can currently be freed and used
+ * by the kernel for direct mappings.
+ */
+static unsigned long determine_dirtyable_memory(void)
+{
+	unsigned long x;
+
+	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+
+	if (!vm_highmem_is_dirtyable)
+		x -= highmem_dirtyable_memory(x);
+
+	return x + 1;	/* Ensure that we never return 0 */
+}
 
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
@@ -354,49 +396,6 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
  * clamping level.
  */
 
-static unsigned long highmem_dirtyable_memory(unsigned long total)
-{
-#ifdef CONFIG_HIGHMEM
-	int node;
-	unsigned long x = 0;
-
-	for_each_node_state(node, N_HIGH_MEMORY) {
-		struct zone *z =
-			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
-
-		x += zone_page_state(z, NR_FREE_PAGES) +
-		     zone_reclaimable_pages(z);
-	}
-	/*
-	 * Make sure that the number of highmem pages is never larger
-	 * than the number of the total dirtyable memory. This can only
-	 * occur in very strange VM situations but we want to make sure
-	 * that this does not occur.
-	 */
-	return min(x, total);
-#else
-	return 0;
-#endif
-}
-
-/**
- * determine_dirtyable_memory - amount of memory that may be used
- *
- * Returns the numebr of pages that can currently be freed and used
- * by the kernel for direct mappings.
- */
-unsigned long determine_dirtyable_memory(void)
-{
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
-
-	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
-}
-
 /*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [patch 3/5] mm: writeback: remove seriously stale comment on dirty limits
  2011-07-25 20:19 ` Johannes Weiner
@ 2011-07-25 20:19   ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-07-25 20:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

From: Johannes Weiner <hannes@cmpxchg.org>

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page-writeback.c |   18 ------------------
 1 files changed, 0 insertions(+), 18 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a4de005..41dc871 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -379,24 +379,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
 /*
- * Work out the current dirty-memory clamping and background writeout
- * thresholds.
- *
- * The main aim here is to lower them aggressively if there is a lot of mapped
- * memory around.  To avoid stressing page reclaim with lots of unreclaimable
- * pages.  It is better to clamp down on writers than to start swapping, and
- * performing lots of scanning.
- *
- * We only allow 1/2 of the currently-unmapped memory to be dirtied.
- *
- * We don't permit the clamping level to fall below 5% - that is getting rather
- * excessive.
- *
- * We make sure that the background writeout level is below the adjusted
- * clamping level.
- */
-
-/*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
  * Calculate the dirty thresholds based on sysctl parameters
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-07-25 20:19 ` Johannes Weiner
@ 2011-07-25 20:19   ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-07-25 20:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

From: Johannes Weiner <hannes@cmpxchg.org>

Allow allocators to pass __GFP_WRITE when they know in advance that
the allocated page will be written to and become dirty soon.

The page allocator will then attempt to distribute those allocations
across zones, such that no single zone will end up full of dirty and
thus more or less unreclaimable pages.

The global dirty limits are put in proportion to the respective zone's
amount of dirtyable memory and the allocation denied when the limit of
that zone is reached.

Before the allocation fails, the allocator slowpath has a stage before
compaction and reclaim, where the flusher threads are kicked and the
allocator ultimately has to wait for writeback if still none of the
zones has become eligible for allocation again in the meantime.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/gfp.h       |    4 +-
 include/linux/writeback.h |    3 +
 mm/page-writeback.c       |  132 +++++++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c           |   27 +++++++++
 4 files changed, 149 insertions(+), 17 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..78d5338 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -36,6 +36,7 @@ struct vm_area_struct;
 #endif
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
+#define ___GFP_WRITE		0x1000000u
 
 /*
  * GFP bitmasks..
@@ -85,6 +86,7 @@ struct vm_area_struct;
 
 #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
+#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Will be dirtied soon */
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
@@ -92,7 +94,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 24	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8c63f3a..9312e25 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -93,6 +93,9 @@ void laptop_mode_timer_fn(unsigned long data);
 static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
+bool zone_dirty_ok(struct zone *zone);
+void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
+			    nodemask_t *nodemask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 41dc871..ce673ec 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -154,6 +154,18 @@ static unsigned long determine_dirtyable_memory(void)
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long zone_dirtyable_memory(struct zone *zone)
+{
+	unsigned long x = 1; /* Ensure that we never return 0 */
+
+	if (is_highmem(zone) && !vm_highmem_is_dirtyable)
+		return x;
+
+	x += zone_page_state(zone, NR_FREE_PAGES);
+	x += zone_reclaimable_pages(zone);
+	return x;
+}
+
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
  *
@@ -378,6 +390,24 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
+static void sanitize_dirty_limits(unsigned long *pbackground,
+				  unsigned long *pdirty)
+{
+	unsigned long background = *pbackground;
+	unsigned long dirty = *pdirty;
+	struct task_struct *tsk;
+
+	if (background >= dirty)
+		background = dirty / 2;
+	tsk = current;
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		background += background / 4;
+		dirty += dirty / 4;
+	}
+	*pbackground = background;
+	*pdirty = dirty;
+}
+
 /*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
@@ -389,33 +419,52 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
  */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
-	unsigned long background;
-	unsigned long dirty;
 	unsigned long uninitialized_var(available_memory);
-	struct task_struct *tsk;
 
 	if (!vm_dirty_bytes || !dirty_background_bytes)
 		available_memory = determine_dirtyable_memory();
 
 	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+		*pdirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
 	else
-		dirty = (vm_dirty_ratio * available_memory) / 100;
+		*pdirty = vm_dirty_ratio * available_memory / 100;
 
 	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+		*pbackground = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
+		*pbackground = dirty_background_ratio * available_memory / 100;
 
-	if (background >= dirty)
-		background = dirty / 2;
-	tsk = current;
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
-		background += background / 4;
-		dirty += dirty / 4;
-	}
-	*pbackground = background;
-	*pdirty = dirty;
+	sanitize_dirty_limits(pbackground, pdirty);
+}
+
+static void zone_dirty_limits(struct zone *zone, unsigned long *pbackground,
+			      unsigned long *pdirty)
+{
+	unsigned long uninitialized_var(global_memory);
+	unsigned long zone_memory;
+
+	zone_memory = zone_dirtyable_memory(zone);
+
+	if (!vm_dirty_bytes || !dirty_background_bytes)
+		global_memory = determine_dirtyable_memory();
+
+	if (vm_dirty_bytes) {
+		unsigned long dirty_pages;
+
+		dirty_pages = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+		*pdirty = zone_memory * dirty_pages / global_memory;
+	} else
+		*pdirty = zone_memory * vm_dirty_ratio / 100;
+
+	if (dirty_background_bytes) {
+		unsigned long dirty_pages;
+
+		dirty_pages = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+		*pbackground = zone_memory * dirty_pages / global_memory;
+	} else
+		*pbackground = zone_memory * dirty_background_ratio / 100;
+
+	sanitize_dirty_limits(pbackground, pdirty);
 }
 
 /*
@@ -661,6 +710,57 @@ void throttle_vm_writeout(gfp_t gfp_mask)
         }
 }
 
+bool zone_dirty_ok(struct zone *zone)
+{
+	unsigned long background_thresh, dirty_thresh;
+	unsigned long nr_reclaimable, nr_writeback;
+
+	zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
+
+	nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
+		zone_page_state(zone, NR_UNSTABLE_NFS);
+	nr_writeback = zone_page_state(zone, NR_WRITEBACK);
+
+	return nr_reclaimable + nr_writeback <= dirty_thresh;
+}
+
+void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
+			    nodemask_t *nodemask)
+{
+	unsigned int nr_exceeded = 0;
+	unsigned int nr_zones = 0;
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask),
+					nodemask) {
+		unsigned long background_thresh, dirty_thresh;
+		unsigned long nr_reclaimable, nr_writeback;
+
+		nr_zones++;
+
+		zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
+
+		nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
+			zone_page_state(zone, NR_UNSTABLE_NFS);
+		nr_writeback = zone_page_state(zone, NR_WRITEBACK);
+
+		if (nr_reclaimable + nr_writeback <= background_thresh)
+			continue;
+
+		if (nr_reclaimable > nr_writeback)
+			wakeup_flusher_threads(nr_reclaimable - nr_writeback);
+
+		if (nr_reclaimable + nr_writeback <= dirty_thresh)
+			continue;
+
+		nr_exceeded++;
+	}
+
+	if (nr_zones == nr_exceeded)
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+}
+
 /*
  * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e8985a..1fac154 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1666,6 +1666,9 @@ zonelist_scan:
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
+		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+			goto this_zone_full;
+
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
@@ -1863,6 +1866,22 @@ out:
 	return page;
 }
 
+static struct page *
+__alloc_pages_writeback(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, enum zone_type high_zoneidx,
+			nodemask_t *nodemask, int alloc_flags,
+			struct zone *preferred_zone, int migratetype)
+{
+	if (!(gfp_mask & __GFP_WRITE))
+		return NULL;
+
+	try_to_writeback_pages(zonelist, gfp_mask, nodemask);
+
+	return get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
+				      high_zoneidx, alloc_flags,
+				      preferred_zone, migratetype);
+}
+
 #ifdef CONFIG_COMPACTION
 /* Try memory compaction for high-order allocations before reclaim */
 static struct page *
@@ -2135,6 +2154,14 @@ rebalance:
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
+	/* Try writing back pages if per-zone dirty limits are reached */
+	page = __alloc_pages_writeback(gfp_mask, order, zonelist,
+				       high_zoneidx, nodemask,
+				       alloc_flags, preferred_zone,
+				       migratetype);
+	if (page)
+		goto got_pg;
+
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [patch 5/5] mm: filemap: horrid hack to pass __GFP_WRITE for most page cache writers
  2011-07-25 20:19 ` Johannes Weiner
@ 2011-07-25 20:19   ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-07-25 20:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

From: Johannes Weiner <hannes@cmpxchg.org>

This makes every page allocation that results from grabbing a single
page in the page cache pass __GFP_WRITE.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/pagemap.h |    6 ++++--
 mm/filemap.c            |    8 ++++++--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716875e..3355c9b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -244,12 +244,14 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
 
 /*
- * Returns locked page at given index in given cache, creating it if needed.
+ * Returns locked page at given index in given cache, creating it if
+ * needed.  XXX: Assumes that page will be dirtied soon!
  */
 static inline struct page *grab_cache_page(struct address_space *mapping,
 								pgoff_t index)
 {
-	return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
+	return find_or_create_page(mapping, index,
+				   mapping_gfp_mask(mapping) | __GFP_WRITE);
 }
 
 extern struct page * grab_cache_page_nowait(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index a8251a8..e315d46 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1030,6 +1030,7 @@ struct page *
 grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
 	struct page *page = find_get_page(mapping, index);
+	gfp_t gfp_mask;
 
 	if (page) {
 		if (trylock_page(page))
@@ -1037,7 +1038,8 @@ grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+	gfp_mask = (mapping_gfp_mask(mapping) | __GFP_WRITE) & ~__GFP_FS;
+	page = __page_cache_alloc(gfp_mask);
 	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
@@ -2330,6 +2332,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
 	int status;
+	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
 	if (flags & AOP_FLAG_NOFS)
@@ -2339,7 +2342,8 @@ repeat:
 	if (page)
 		goto found;
 
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
+	gfp_mask = (mapping_gfp_mask(mapping) | __GFP_WRITE) & ~gfp_notmask;
+	page = __page_cache_alloc(gfp_mask);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-25 20:37     ` Andi Kleen
  -1 siblings, 0 replies; 64+ messages in thread
From: Andi Kleen @ 2011-07-25 20:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara,
	linux-kernel

> The global dirty limits are put in proportion to the respective zone's
> amount of dirtyable memory and the allocation denied when the limit of
> that zone is reached.
> 
> Before the allocation fails, the allocator slowpath has a stage before
> compaction and reclaim, where the flusher threads are kicked and the
> allocator ultimately has to wait for writeback if still none of the
> zones has become eligible for allocation again in the meantime.
> 

I don't really like this. It seems wrong to make memory
placement depend on dirtiness.

Just try to explain it to some system administrator or tuner: her
head will explode, and for good reason.

On the other hand I like doing round-robin in filemap by default
(I think that is what your patch essentially does).  We should have
made this the default long ago. It avoids most of the
"IO fills up local node" problems people run into all the time.

So I would rather just change the default in filemap allocation.

That's also easy to explain.
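
[ For illustration, a minimal sketch of what making that the default
  could look like: always use the cpuset memory-spread rotor in
  __page_cache_alloc() instead of only when the cpuset asks for it.
  cpuset_mem_spread_node() and alloc_pages_exact_node() are existing
  helpers of that era, but the exact shape below is an assumption,
  not code from this thread. ]

	struct page *__page_cache_alloc(gfp_t gfp)
	{
		/* round-robin page cache pages across allowed nodes */
		int n = cpuset_mem_spread_node();

		return alloc_pages_exact_node(n, gfp, 0);
	}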

BTW the original argument why this wasn't done was that it may
be a problem on extremely large systems, but I think it's reasonable
to let these oddballs change their defaults instead of letting
everyone else handle them.

-Andi

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 1/5] mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-25 20:52     ` Andi Kleen
  -1 siblings, 0 replies; 64+ messages in thread
From: Andi Kleen @ 2011-07-25 20:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara,
	linux-kernel

On Mon, Jul 25, 2011 at 10:19:15PM +0200, Johannes Weiner wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> __GFP_OTHER_NODE is used for NUMA allocations on behalf of other
> nodes.  It's supposed to be passed through from the page allocator to
> zone_statistics(), but it never gets there as gfp_allowed_mask is not
> wide enough and masks out the flag early in the allocation path.
> 
> The result is an accounting glitch where successful NUMA allocations
> by-agent are not properly attributed as local.
> 
> Increase __GFP_BITS_SHIFT so that it includes __GFP_OTHER_NODE.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Andi Kleen <ak@linux.intel.com>

-Andi

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 1/5] mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-25 22:56     ` Minchan Kim
  -1 siblings, 0 replies; 64+ messages in thread
From: Minchan Kim @ 2011-07-25 22:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Jan Kara, Andi Kleen,
	linux-kernel

On Tue, Jul 26, 2011 at 5:19 AM, Johannes Weiner <jweiner@redhat.com> wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
>
> __GFP_OTHER_NODE is used for NUMA allocations on behalf of other
> nodes.  It's supposed to be passed through from the page allocator to
> zone_statistics(), but it never gets there as gfp_allowed_mask is not
> wide enough and masks out the flag early in the allocation path.
>
> The result is an accounting glitch where successful NUMA allocations
> by-agent are not properly attributed as local.
>
> Increase __GFP_BITS_SHIFT so that it includes __GFP_OTHER_NODE.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Nice catch.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-07-25 20:37     ` Andi Kleen
@ 2011-07-25 23:40       ` Minchan Kim
  -1 siblings, 0 replies; 64+ messages in thread
From: Minchan Kim @ 2011-07-25 23:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Johannes Weiner, linux-mm, Dave Chinner, Christoph Hellwig,
	Mel Gorman, Andrew Morton, Wu Fengguang, Rik van Riel, Jan Kara,
	linux-kernel

Hi Andi,

On Tue, Jul 26, 2011 at 5:37 AM, Andi Kleen <ak@linux.intel.com> wrote:
>> The global dirty limits are put in proportion to the respective zone's
>> amount of dirtyable memory and the allocation denied when the limit of
>> that zone is reached.
>>
>> Before the allocation fails, the allocator slowpath has a stage before
>> compaction and reclaim, where the flusher threads are kicked and the
>> allocator ultimately has to wait for writeback if still none of the
>> zones has become eligible for allocation again in the meantime.
>>
>
> I don't really like this. It seems wrong to make memory
> placement depend on dirtyness.
>
> Just try to explain it to some system administrator or tuner: her
> head will explode and for good reasons.
>
> On the other hand I like doing round-robin in filemap by default
> (I think that is what your patch essentially does)
> We should have made  this default long ago. It avoids most of the
> "IO fills up local node" problems people run into all the time.
>
> So I would rather just change the default in filemap allocation.
>
> That's also easy to explain.

Just out of curiosity:
why do you want to consider only filemap allocation, and not IO (i.e.,
filemap + sys_[read/write]) allocation?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 0/5] mm: per-zone dirty limiting
  2011-07-25 20:19 ` Johannes Weiner
@ 2011-07-26  0:16   ` Minchan Kim
  -1 siblings, 0 replies; 64+ messages in thread
From: Minchan Kim @ 2011-07-26  0:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Jan Kara, Andi Kleen,
	linux-kernel

On Tue, Jul 26, 2011 at 5:19 AM, Johannes Weiner <jweiner@redhat.com> wrote:
> Hello!
>
> Writing back single file pages during reclaim exhibits bad IO
> patterns, but we can't just stop doing that before the VM has other
> means to ensure the pages in a zone are reclaimable.
>
> Over time there were several suggestions of at least doing
> write-around of the pages in inode-proximity when the need arises to
> clean pages during memory pressure.  But even that would interrupt
> writeback from the flushers, without any guarantees that the nearby
> inode-pages are even sitting on the same troubled zone.
>
> The reason why dirty pages reach the end of LRU lists in the first
> place is in part because the dirty limits are a global restriction
> while most systems have more than one LRU list that are different in
> size.  Multiple nodes have multiple zones have multiple file lists but
> at the same time there is nothing to balance the dirty pages between
> the lists except for reclaim writing them out upon encounter.
>
> With around 4G of RAM, a x86_64 machine of mine has a DMA32 zone of a
> bit over 3G, a Normal zone of 500M, and a DMA zone of 15M.
>
> A linear writer can quickly fill up the Normal zone, then the DMA32
> zone, throttled by the dirty limit initially.  The flushers catch up,
> the zones are now mostly full of clean pages and memory reclaim kicks
> in on subsequent allocations.  The pages it frees from the Normal zone
> are quickly filled with dirty pages (unthrottled, as the much bigger
> DMA32 zone allows for a huge number of dirty pages in comparison to
> the Normal zone).  As there are also anon and active file pages on the
> Normal zone, it is not unlikely that a significant amount of its
> inactive file pages are now dirty [ foo=zone(global) ]:
>
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=112313(821289) active=9942(10039) isolated=27(27) dirty=59709(146944) writeback=739(4017)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111102(806876) active=9925(10022) isolated=32(32) dirty=72125(146914) writeback=957(3972)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110493(803374) active=9871(9978) isolated=32(32) dirty=57274(146618) writeback=4088(4088)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111957(806559) active=9871(9978) isolated=32(32) dirty=65125(147329) writeback=456(3866)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110601(803978) active=9860(9973) isolated=27(27) dirty=63792(146590) writeback=61(4276)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111786(804032) active=9860(9973) isolated=0(64) dirty=64310(146998) writeback=1282(3847)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111643(805651) active=9860(9982) isolated=32(32) dirty=63778(147217) writeback=1127(4156)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111678(804709) active=9859(10112) isolated=27(27) dirty=81673(148224) writeback=29(4233)
>
> [ These prints occur only once per reclaim invocation, so the actual
> ->writepage calls are more frequent than the timestamp may suggest. ]
>
> In the scenario without the Normal zone, first the DMA32 zone fills
> up, then the DMA zone.  When reclaim kicks in, it is presented with a
> DMA zone whose inactive pages are all dirty -- and dirtied most
> recently at that, so the flushers really had abysmal chances at making
> some headway:
>
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=776(430813) active=2(2931) isolated=32(32) dirty=814(68649) writeback=0(18765)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(430344) active=2(2931) isolated=32(32) dirty=764(67790) writeback=0(17146)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=729(430838) active=2(2931) isolated=32(32) dirty=293(65303) writeback=468(20122)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=757(431181) active=2(2931) isolated=32(32) dirty=63(68851) writeback=731(15926)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=758(432808) active=2(2931) isolated=32(32) dirty=645(64106) writeback=0(19666)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(431018) active=2(2931) isolated=32(32) dirty=740(65770) writeback=10(17907)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=697(430467) active=2(2931) isolated=32(32) dirty=743(63757) writeback=0(18826)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=693(430951) active=2(2931) isolated=32(32) dirty=626(54529) writeback=91(16198)
>
> The idea behind this patch set is to take the ratio the global dirty
> limits have to the global memory state and put it into proportion to
> the individual zone.  The allocator ensures that pages allocated for
> being written to in the page cache are distributed across zones such
> that there are always enough clean pages on a zone to begin with.
>
> I am not yet really satisfied as it's not really orthogonal or
> integrated with the other writeback throttling much, and has rough
> edges here and there, but test results do look rather promising so
> far:
>
> --- Copying 8G to fuse-ntfs on USB stick in 4G machine
>
> 3.0:
>
>  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
>
>       140,671,831 cache-misses             #      4.923 M/sec   ( +-   0.198% )  (scaled from 82.80%)
>       726,265,014 cache-references         #     25.417 M/sec   ( +-   1.104% )  (scaled from 83.06%)
>       144,092,383 branch-misses            #      4.157 %       ( +-   0.493% )  (scaled from 83.17%)
>     3,466,608,296 branches                 #    121.319 M/sec   ( +-   0.421% )  (scaled from 67.89%)
>    17,882,351,343 instructions             #      0.417 IPC     ( +-   0.457% )  (scaled from 84.73%)
>    42,848,633,897 cycles                   #   1499.554 M/sec   ( +-   0.604% )  (scaled from 83.08%)
>               236 page-faults              #      0.000 M/sec   ( +-   0.323% )
>             8,026 CPU-migrations           #      0.000 M/sec   ( +-   6.291% )
>         2,372,358 context-switches         #      0.083 M/sec   ( +-   0.003% )
>      28574.255540 task-clock-msecs         #      0.031 CPUs    ( +-   0.409% )
>
>      912.625436885  seconds time elapsed   ( +-   3.851% )
>
>  nr_vmscan_write 667839
>
> 3.0-per-zone-dirty:
>
>  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
>
>       140,791,501 cache-misses             #      3.887 M/sec   ( +-   0.186% )  (scaled from 83.09%)
>       816,474,193 cache-references         #     22.540 M/sec   ( +-   0.923% )  (scaled from 83.16%)
>       154,500,577 branch-misses            #      4.302 %       ( +-   0.495% )  (scaled from 83.15%)
>     3,591,344,338 branches                 #     99.143 M/sec   ( +-   0.402% )  (scaled from 67.32%)
>    18,713,190,183 instructions             #      0.338 IPC     ( +-   0.448% )  (scaled from 83.96%)
>    55,285,320,107 cycles                   #   1526.208 M/sec   ( +-   0.588% )  (scaled from 83.28%)
>               237 page-faults              #      0.000 M/sec   ( +-   0.302% )
>            28,028 CPU-migrations           #      0.001 M/sec   ( +-   3.070% )
>         2,369,897 context-switches         #      0.065 M/sec   ( +-   0.006% )
>      36223.970238 task-clock-msecs         #      0.060 CPUs    ( +-   1.062% )
>
>      605.909769823  seconds time elapsed   ( +-   0.783% )
>
>  nr_vmscan_write 0
>
> That's an increase of throughput by 30% and no writeback interference
> from reclaim.
>
> As not every other allocation has to reclaim from a Normal zone full
> of dirty pages anymore, the patched kernel is also more responsive in
> general during the copy.
>
> I am also running fs_mark on XFS on a 2G machine, but the final
> results are not in yet.  The preliminary results appear to be in this
> ballpark:
>
> --- fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))
>
> 3.0:
>
> real    20m43.901s
> user    0m8.988s
> sys     0m58.227s
> nr_vmscan_write 3347
>
> 3.0-per-zone-dirty:
>
> real    20m8.012s
> user    0m8.862s
> sys     1m2.585s
> nr_vmscan_write 161
>
> Patch #1 is more or less an unrelated fix that subsequent patches
> depend upon as they modify the same code.  It should go upstream
> immediately, me thinks.
>
> #2 and #3 are boring cleanup, guess they can go in right away as well.
>
> #4 adds per-zone dirty throttling for __GFP_WRITE allocators, #5
> passes __GFP_WRITE from the grab_cache_page* functions in the hope to
> get most writers and no readers; I haven't checked all sites yet.
>
> Discuss! :-)
>
>  include/linux/gfp.h       |    4 +-
>  include/linux/pagemap.h   |    6 +-
>  include/linux/writeback.h |    5 +-
>  mm/filemap.c              |    8 +-
>  mm/page-writeback.c       |  225 ++++++++++++++++++++++++++++++--------------
>  mm/page_alloc.c           |   27 ++++++
>  6 files changed, 196 insertions(+), 79 deletions(-)
>
>

IMHO, this looks promising!
I like *round-robin* allocation like this, although there are still
problems to be solved.
My concern is that it's a rather big change, so we need a lot of
testing and time in various environments to find the edge cases.

Actually, I had an idea that the VM shouldn't write out a dirty page
(even when it is a reclaim victim) if other fallback zones have enough
free pages, because the root problem is a small LRU zone that doesn't
get enough time to activate/reference its pages.  It's unfair.  The
high zone is also the first target for most user allocations, so it is
hit harder than the other zones.
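
[ A rough sketch of that alternative -- skip writeback from reclaim as
  long as some other populated zone still has free pages above its
  high watermark.  The helpers used here exist, but where such a check
  would hook into vmscan is an assumption: ]

	static bool other_zone_has_free_pages(struct zone *victim)
	{
		struct zone *zone;

		for_each_populated_zone(zone) {
			if (zone == victim)
				continue;
			if (zone_watermark_ok(zone, 0, high_wmark_pages(zone),
					      0, 0))
				return true;	/* let the flushers clean it */
		}
		return false;
	}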

But your solution is simpler if we can use __GFP_WRITE well.
Although it has problems at the moment, we can solve them step by step, I think.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 1/5] mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-26 13:51     ` Mel Gorman
  -1 siblings, 0 replies; 64+ messages in thread
From: Mel Gorman @ 2011-07-26 13:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Mon, Jul 25, 2011 at 10:19:15PM +0200, Johannes Weiner wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> __GFP_OTHER_NODE is used for NUMA allocations on behalf of other
> nodes.  It's supposed to be passed through from the page allocator to
> zone_statistics(), but it never gets there as gfp_allowed_mask is not
> wide enough and masks out the flag early in the allocation path.
> 
> The result is an accounting glitch where successful NUMA allocations
> by-agent are not properly attributed as local.
> 
> Increase __GFP_BITS_SHIFT so that it includes __GFP_OTHER_NODE.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

You're right, this should be merged separately.

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 2/5] mm: writeback: make determine_dirtyable_memory static again
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-26 13:53     ` Mel Gorman
  -1 siblings, 0 replies; 64+ messages in thread
From: Mel Gorman @ 2011-07-26 13:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Mon, Jul 25, 2011 at 10:19:16PM +0200, Johannes Weiner wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> The tracing ring-buffer used this function briefly, but not anymore.
> Make it local to the writeback code again.
> 
> Also, move the function so that no forward declaration needs to be
> reintroduced.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-26 14:42     ` Mel Gorman
  -1 siblings, 0 replies; 64+ messages in thread
From: Mel Gorman @ 2011-07-26 14:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Mon, Jul 25, 2011 at 10:19:18PM +0200, Johannes Weiner wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> Allow allocators to pass __GFP_WRITE when they know in advance that
> the allocated page will be written to and become dirty soon.
> 
> The page allocator will then attempt to distribute those allocations
> across zones, such that no single zone will end up full of dirty and
> thus more or less unreclaimable pages.
> 

On 32-bit, this idea increases lowmem pressure. Ordinarily, this is
only a problem when the higher zone is really large and management
structures can only be allocated from the lower zones. Granted, it is
rare that this is the case, but in the last 6 months I've seen at
least one bug report that could be attributed to lowmem pressure
(a 24G x86 machine).

A brief explanation as to why this is not a problem may be needed.

> The global dirty limits are put in proportion to the respective zone's
> amount of dirtyable memory and the allocation denied when the limit of
> that zone is reached.
> 

What are the risks of a process stalling on dirty pages in a high zone
that is very small (e.g. 64M)?

> Before the allocation fails, the allocator slowpath has a stage before
> compaction and reclaim, where the flusher threads are kicked and the
> allocator ultimately has to wait for writeback if still none of the
> zones has become eligible for allocation again in the meantime.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/gfp.h       |    4 +-
>  include/linux/writeback.h |    3 +
>  mm/page-writeback.c       |  132 +++++++++++++++++++++++++++++++++++++++------
>  mm/page_alloc.c           |   27 +++++++++
>  4 files changed, 149 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 3a76faf..78d5338 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -36,6 +36,7 @@ struct vm_area_struct;
>  #endif
>  #define ___GFP_NO_KSWAPD	0x400000u
>  #define ___GFP_OTHER_NODE	0x800000u
> +#define ___GFP_WRITE		0x1000000u
>  
>  /*
>   * GFP bitmasks..
> @@ -85,6 +86,7 @@ struct vm_area_struct;
>  
>  #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> +#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Will be dirtied soon */
>  

/* May be dirtied soon */ :)

>  /*
>   * This may seem redundant, but it's a way of annotating false positives vs.
> @@ -92,7 +94,7 @@ struct vm_area_struct;
>   */
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  
> -#define __GFP_BITS_SHIFT 24	/* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
>  /* This equals 0, but use constants in case they ever change */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 8c63f3a..9312e25 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -93,6 +93,9 @@ void laptop_mode_timer_fn(unsigned long data);
>  static inline void laptop_sync_completion(void) { }
>  #endif
>  void throttle_vm_writeout(gfp_t gfp_mask);
> +bool zone_dirty_ok(struct zone *zone);
> +void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
> +			    nodemask_t *nodemask);
>  
>  /* These are exported to sysctl. */
>  extern int dirty_background_ratio;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 41dc871..ce673ec 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -154,6 +154,18 @@ static unsigned long determine_dirtyable_memory(void)
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long zone_dirtyable_memory(struct zone *zone)
> +{

Terse comment there :)

> +	unsigned long x = 1; /* Ensure that we never return 0 */
> +
> +	if (is_highmem(zone) && !vm_highmem_is_dirtyable)
> +		return x;
> +
> +	x += zone_page_state(zone, NR_FREE_PAGES);
> +	x += zone_reclaimable_pages(zone);
> +	return x;
> +}

It's very similar to determine_dirtyable_memory().  It would be
preferable if they shared a core function of some sort, even if that
was implemented by an "if (zone == NULL)" check.  Otherwise, these
will get out of sync eventually.
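
[ For illustration, one possible shape of such a shared helper; the
  counters follow the quoted patch and the 3.0-era
  determine_dirtyable_memory(), the rest is an assumption: ]

	static unsigned long dirtyable_memory(struct zone *zone)
	{
		unsigned long x = 1;	/* never return 0 */

		if (zone) {
			if (is_highmem(zone) && !vm_highmem_is_dirtyable)
				return x;
			x += zone_page_state(zone, NR_FREE_PAGES);
			x += zone_reclaimable_pages(zone);
		} else {
			x += global_page_state(NR_FREE_PAGES);
			x += global_reclaimable_pages();
			if (!vm_highmem_is_dirtyable)
				x -= highmem_dirtyable_memory(x);
		}
		return x;
	}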

> +
>  /*
>   * Scale the writeback cache size proportional to the relative writeout speeds.
>   *
> @@ -378,6 +390,24 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
>  }
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>  
> +static void sanitize_dirty_limits(unsigned long *pbackground,
> +				  unsigned long *pdirty)
> +{

Maybe a small comment saying to look at the comment in
global_dirty_limits() to see what this is doing and why.

sanitize feels like an odd name to me. The arguments are not
"objectionable" in some way that needs to be corrected.
scale_dirty_limits maybe?

> +	unsigned long background = *pbackground;
> +	unsigned long dirty = *pdirty;
> +	struct task_struct *tsk;
> +
> +	if (background >= dirty)
> +		background = dirty / 2;
> +	tsk = current;
> +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> +		background += background / 4;
> +		dirty += dirty / 4;
> +	}
> +	*pbackground = background;
> +	*pdirty = dirty;
> +}
> +
>  /*
>   * global_dirty_limits - background-writeback and dirty-throttling thresholds
>   *
> @@ -389,33 +419,52 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
>   */
>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>  {
> -	unsigned long background;
> -	unsigned long dirty;
>  	unsigned long uninitialized_var(available_memory);
> -	struct task_struct *tsk;
>  
>  	if (!vm_dirty_bytes || !dirty_background_bytes)
>  		available_memory = determine_dirtyable_memory();
>  
>  	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +		*pdirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
>  	else
> -		dirty = (vm_dirty_ratio * available_memory) / 100;
> +		*pdirty = vm_dirty_ratio * available_memory / 100;
>  
>  	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +		*pbackground = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> +		*pbackground = dirty_background_ratio * available_memory / 100;
>  
> -	if (background >= dirty)
> -		background = dirty / 2;
> -	tsk = current;
> -	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> -		background += background / 4;
> -		dirty += dirty / 4;
> -	}
> -	*pbackground = background;
> -	*pdirty = dirty;
> +	sanitize_dirty_limits(pbackground, pdirty);
> +}
> +
> +static void zone_dirty_limits(struct zone *zone, unsigned long *pbackground,
> +			      unsigned long *pdirty)
> +{
> +	unsigned long uninitialized_var(global_memory);
> +	unsigned long zone_memory;
> +
> +	zone_memory = zone_dirtyable_memory(zone);
> +
> +	if (!vm_dirty_bytes || !dirty_background_bytes)
> +		global_memory = determine_dirtyable_memory();
> +
> +	if (vm_dirty_bytes) {
> +		unsigned long dirty_pages;
> +
> +		dirty_pages = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +		*pdirty = zone_memory * dirty_pages / global_memory;
> +	} else
> +		*pdirty = zone_memory * vm_dirty_ratio / 100;
> +
> +	if (dirty_background_bytes) {
> +		unsigned long dirty_pages;
> +
> +		dirty_pages = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +		*pbackground = zone_memory * dirty_pages / global_memory;
> +	} else
> +		*pbackground = zone_memory * dirty_background_ratio / 100;
> +
> +	sanitize_dirty_limits(pbackground, pdirty);
>  }
>  

Ok, seems straightforward enough. For the *_bytes case, the number of
allowed dirty pages is scaled by the zone's share of global dirtyable
memory; otherwise the ratios are applied directly to the zone.
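
[ As a worked example of that scaling (numbers made up): with
  vm_dirty_ratio = 20 and a zone holding 500M of dirtyable memory, the
  zone's dirty threshold comes out at about 100M; with vm_dirty_bytes
  = 400M and 4G of global dirtyable memory, the same zone gets its
  500M/4G share of that budget, i.e. roughly 50M. ]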

>  /*
> @@ -661,6 +710,57 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>          }
>  }
>  
> +bool zone_dirty_ok(struct zone *zone)
> +{
> +	unsigned long background_thresh, dirty_thresh;
> +	unsigned long nr_reclaimable, nr_writeback;
> +
> +	zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
> +
> +	nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
> +		zone_page_state(zone, NR_UNSTABLE_NFS);
> +	nr_writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> +	return nr_reclaimable + nr_writeback <= dirty_thresh;
> +}
> +
> +void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
> +			    nodemask_t *nodemask)
> +{
> +	unsigned int nr_exceeded = 0;
> +	unsigned int nr_zones = 0;
> +	struct zoneref *z;
> +	struct zone *zone;
> +
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask),
> +					nodemask) {
> +		unsigned long background_thresh, dirty_thresh;
> +		unsigned long nr_reclaimable, nr_writeback;
> +
> +		nr_zones++;
> +
> +		zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
> +
> +		nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
> +			zone_page_state(zone, NR_UNSTABLE_NFS);
> +		nr_writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> +		if (nr_reclaimable + nr_writeback <= background_thresh)
> +			continue;
> +
> +		if (nr_reclaimable > nr_writeback)
> +			wakeup_flusher_threads(nr_reclaimable - nr_writeback);
> +

This is a potential mess. wakeup_flusher_threads() ultimately
calls "work = kzalloc(sizeof(*work), GFP_ATOMIC)" from the page
allocator. Under enough pressure, particularly if the machine has
very little memory, you may see this spewing out warning messages
which ironically will have to be written to syslog, dirtying more
pages.  I know I've made the same mistake at least once by calling
wakeup_flusher_threads() from page reclaim.

It's also still not controlling where the pages are being
written from.  On a large enough NUMA machine, there is a risk that
wakeup_flusher_threads() will be called very frequently to write pages
from remote nodes that are not in trouble.
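
[ One way to cap that allocation frequency, sketched with the stock
  ratelimit helpers; the placement and interval are assumptions, and
  note it does nothing to make the writeback node-targeted: ]

	static DEFINE_RATELIMIT_STATE(flusher_wakeup_rs, HZ, 1);

	if (nr_reclaimable > nr_writeback &&
	    __ratelimit(&flusher_wakeup_rs))
		wakeup_flusher_threads(nr_reclaimable - nr_writeback);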

> +		if (nr_reclaimable + nr_writeback <= dirty_thresh)
> +			continue;
> +
> +		nr_exceeded++;
> +	}
> +
> +	if (nr_zones == nr_exceeded)
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +}
> +

So, you do a congestion wait but then potentially continue on even
though all the zones are still over their dirty limits.  Should this
be more like throttle_vm_writeout()?
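
[ i.e. something along these lines, looping until at least one zone in
  the zonelist drops back under its limit rather than waiting only
  once; zone_dirty_exceeded() is a made-up helper standing in for the
  checks above: ]

	for (;;) {
		bool all_exceeded = true;

		for_each_zone_zonelist_nodemask(zone, z, zonelist,
						gfp_zone(gfp_mask), nodemask)
			if (!zone_dirty_exceeded(zone))
				all_exceeded = false;

		if (!all_exceeded)
			break;
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	}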

>  /*
>   * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
>   */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4e8985a..1fac154 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1666,6 +1666,9 @@ zonelist_scan:
>  			!cpuset_zone_allowed_softwall(zone, gfp_mask))
>  				goto try_next_zone;
>  
> +		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
> +			goto this_zone_full;
> +

So this part needs to explain why using the lower zones does not
potentially cause lowmem pressure on 32-bit. It's not a show stopper
as such but it shouldn't be ignored either.

>  		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>  			unsigned long mark;
> @@ -1863,6 +1866,22 @@ out:
>  	return page;
>  }
>  
> +static struct page *
> +__alloc_pages_writeback(gfp_t gfp_mask, unsigned int order,
> +			struct zonelist *zonelist, enum zone_type high_zoneidx,
> +			nodemask_t *nodemask, int alloc_flags,
> +			struct zone *preferred_zone, int migratetype)
> +{
> +	if (!(gfp_mask & __GFP_WRITE))
> +		return NULL;
> +
> +	try_to_writeback_pages(zonelist, gfp_mask, nodemask);
> +
> +	return get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> +				      high_zoneidx, alloc_flags,
> +				      preferred_zone, migratetype);
> +}
> +
>  #ifdef CONFIG_COMPACTION
>  /* Try memory compaction for high-order allocations before reclaim */
>  static struct page *
> @@ -2135,6 +2154,14 @@ rebalance:
>  	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
>  		goto nopage;
>  
> +	/* Try writing back pages if per-zone dirty limits are reached */
> +	page = __alloc_pages_writeback(gfp_mask, order, zonelist,
> +				       high_zoneidx, nodemask,
> +				       alloc_flags, preferred_zone,
> +				       migratetype);
> +	if (page)
> +		goto got_pg;
> +

I like the general idea, but we are still not controlling where
pages are being written from, the potential lowmem pressure problem
needs to be addressed, and care needs to be taken with how frequently
wakeup_flusher_threads() is called, since it uses kmalloc.

I suspect the performance gain is being seen because the flusher
threads are woken earlier and more frequently, and write more
aggressively because wakeup_flusher_threads() passes in loads of
requests. As you are seeing a performance gain, that is interesting
in itself if it is true.

>  	/*
>  	 * Try direct compaction. The first pass is asynchronous. Subsequent
>  	 * attempts after direct reclaim are synchronous
> -- 
> 1.7.6
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
@ 2011-07-26 14:42     ` Mel Gorman
  0 siblings, 0 replies; 64+ messages in thread
From: Mel Gorman @ 2011-07-26 14:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Mon, Jul 25, 2011 at 10:19:18PM +0200, Johannes Weiner wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> Allow allocators to pass __GFP_WRITE when they know in advance that
> the allocated page will be written to and become dirty soon.
> 
> The page allocator will then attempt to distribute those allocations
> across zones, such that no single zone will end up full of dirty and
> thus more or less unreclaimable pages.
> 

On 32-bit, this idea increases lowmem pressure. Ordinarily, this is
only a problem when the higher zone is really large and management
structures can only be allocated from the lower zones. Granted, it is
rare that this is the case, but in the last 6 months I've seen at
least one bug report that could be attributed to lowmem pressure
(a 24G x86 machine).

A brief explanation as to why this is not a problem may be needed.

> The global dirty limits are put in proportion to the respective zone's
> amount of dirtyable memory and the allocation denied when the limit of
> that zone is reached.
> 

What are the risks of a process stalling on dirty pages in a high zone
that is very small (e.g. 64M)?

> Before the allocation fails, the allocator slowpath has a stage before
> compaction and reclaim, where the flusher threads are kicked and the
> allocator ultimately has to wait for writeback if still none of the
> zones has become eligible for allocation again in the meantime.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/gfp.h       |    4 +-
>  include/linux/writeback.h |    3 +
>  mm/page-writeback.c       |  132 +++++++++++++++++++++++++++++++++++++++------
>  mm/page_alloc.c           |   27 +++++++++
>  4 files changed, 149 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 3a76faf..78d5338 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -36,6 +36,7 @@ struct vm_area_struct;
>  #endif
>  #define ___GFP_NO_KSWAPD	0x400000u
>  #define ___GFP_OTHER_NODE	0x800000u
> +#define ___GFP_WRITE		0x1000000u
>  
>  /*
>   * GFP bitmasks..
> @@ -85,6 +86,7 @@ struct vm_area_struct;
>  
>  #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> +#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Will be dirtied soon */
>  

/* May be dirtied soon */ :)

>  /*
>   * This may seem redundant, but it's a way of annotating false positives vs.
> @@ -92,7 +94,7 @@ struct vm_area_struct;
>   */
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  
> -#define __GFP_BITS_SHIFT 24	/* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
>  /* This equals 0, but use constants in case they ever change */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 8c63f3a..9312e25 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -93,6 +93,9 @@ void laptop_mode_timer_fn(unsigned long data);
>  static inline void laptop_sync_completion(void) { }
>  #endif
>  void throttle_vm_writeout(gfp_t gfp_mask);
> +bool zone_dirty_ok(struct zone *zone);
> +void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
> +			    nodemask_t *nodemask);
>  
>  /* These are exported to sysctl. */
>  extern int dirty_background_ratio;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 41dc871..ce673ec 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -154,6 +154,18 @@ static unsigned long determine_dirtyable_memory(void)
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long zone_dirtyable_memory(struct zone *zone)
> +{

Terse comment there :)

> +	unsigned long x = 1; /* Ensure that we never return 0 */
> +
> +	if (is_highmem(zone) && !vm_highmem_is_dirtyable)
> +		return x;
> +
> +	x += zone_page_state(zone, NR_FREE_PAGES);
> +	x += zone_reclaimable_pages(zone);
> +	return x;
> +}

It's very similar to determine_dirtyable_memory(). It would be
preferable if they shared a core function of some sort, even if that
was implemented with an "if (zone == NULL)" special case. Otherwise,
these will get out of sync eventually.
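
Something along these lines is the shape I have in mind (untested
sketch; __dirtyable_memory() is a made-up name and the global case
simply delegates to the existing helper):

	static unsigned long __dirtyable_memory(struct zone *zone)
	{
		unsigned long x = 1;	/* never return 0 */

		if (!zone)	/* NULL means the whole machine */
			return determine_dirtyable_memory();

		if (is_highmem(zone) && !vm_highmem_is_dirtyable)
			return x;

		x += zone_page_state(zone, NR_FREE_PAGES);
		x += zone_reclaimable_pages(zone);
		return x;
	}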

> +
>  /*
>   * Scale the writeback cache size proportional to the relative writeout speeds.
>   *
> @@ -378,6 +390,24 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
>  }
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>  
> +static void sanitize_dirty_limits(unsigned long *pbackground,
> +				  unsigned long *pdirty)
> +{

Maybe add a small comment saying to look at the comment in
global_dirty_limits() to see what this is doing and why.

sanitize feels like an odd name to me. The arguments are not
"objectionable" in some way that needs to be corrected.
scale_dirty_limits() maybe?

> +	unsigned long background = *pbackground;
> +	unsigned long dirty = *pdirty;
> +	struct task_struct *tsk;
> +
> +	if (background >= dirty)
> +		background = dirty / 2;
> +	tsk = current;
> +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> +		background += background / 4;
> +		dirty += dirty / 4;
> +	}
> +	*pbackground = background;
> +	*pdirty = dirty;
> +}
> +
>  /*
>   * global_dirty_limits - background-writeback and dirty-throttling thresholds
>   *
> @@ -389,33 +419,52 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
>   */
>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>  {
> -	unsigned long background;
> -	unsigned long dirty;
>  	unsigned long uninitialized_var(available_memory);
> -	struct task_struct *tsk;
>  
>  	if (!vm_dirty_bytes || !dirty_background_bytes)
>  		available_memory = determine_dirtyable_memory();
>  
>  	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +		*pdirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
>  	else
> -		dirty = (vm_dirty_ratio * available_memory) / 100;
> +		*pdirty = vm_dirty_ratio * available_memory / 100;
>  
>  	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +		*pbackground = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> +		*pbackground = dirty_background_ratio * available_memory / 100;
>  
> -	if (background >= dirty)
> -		background = dirty / 2;
> -	tsk = current;
> -	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> -		background += background / 4;
> -		dirty += dirty / 4;
> -	}
> -	*pbackground = background;
> -	*pdirty = dirty;
> +	sanitize_dirty_limits(pbackground, pdirty);
> +}
> +
> +static void zone_dirty_limits(struct zone *zone, unsigned long *pbackground,
> +			      unsigned long *pdirty)
> +{
> +	unsigned long uninitialized_var(global_memory);
> +	unsigned long zone_memory;
> +
> +	zone_memory = zone_dirtyable_memory(zone);
> +
> +	if (!vm_dirty_bytes || !dirty_background_bytes)
> +		global_memory = determine_dirtyable_memory();
> +
> +	if (vm_dirty_bytes) {
> +		unsigned long dirty_pages;
> +
> +		dirty_pages = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +		*pdirty = zone_memory * dirty_pages / global_memory;
> +	} else
> +		*pdirty = zone_memory * vm_dirty_ratio / 100;
> +
> +	if (dirty_background_bytes) {
> +		unsigned long dirty_pages;
> +
> +		dirty_pages = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +		*pbackground = zone_memory * dirty_pages / global_memory;
> +	} else
> +		*pbackground = zone_memory * dirty_background_ratio / 100;
> +
> +	sanitize_dirty_limits(pbackground, pdirty);
>  }
>  

Ok, seems straightforward enough. For the *_bytes sysctls, the global
number of allowed dirty pages is scaled by the zone's share of the
global dirtyable memory; otherwise the *_ratio values are applied
directly to the zone's dirtyable memory.
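
To pick made-up numbers: with vm_dirty_bytes corresponding to 102400
pages (400MB at 4K pages), 1000000 pages of global dirtyable memory
and a zone holding 125000 of those, the zone's dirty threshold comes
out at 102400 * 125000 / 1000000 = 12800 pages (~50MB). With
vm_dirty_ratio=20 instead, the same zone gets 125000 * 20 / 100 =
25000 pages (~100MB).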

>  /*
> @@ -661,6 +710,57 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>          }
>  }
>  
> +bool zone_dirty_ok(struct zone *zone)
> +{
> +	unsigned long background_thresh, dirty_thresh;
> +	unsigned long nr_reclaimable, nr_writeback;
> +
> +	zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
> +
> +	nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
> +		zone_page_state(zone, NR_UNSTABLE_NFS);
> +	nr_writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> +	return nr_reclaimable + nr_writeback <= dirty_thresh;
> +}
> +
> +void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
> +			    nodemask_t *nodemask)
> +{
> +	unsigned int nr_exceeded = 0;
> +	unsigned int nr_zones = 0;
> +	struct zoneref *z;
> +	struct zone *zone;
> +
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask),
> +					nodemask) {
> +		unsigned long background_thresh, dirty_thresh;
> +		unsigned long nr_reclaimable, nr_writeback;
> +
> +		nr_zones++;
> +
> +		zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
> +
> +		nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
> +			zone_page_state(zone, NR_UNSTABLE_NFS);
> +		nr_writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> +		if (nr_reclaimable + nr_writeback <= background_thresh)
> +			continue;
> +
> +		if (nr_reclaimable > nr_writeback)
> +			wakeup_flusher_threads(nr_reclaimable - nr_writeback);
> +

This is a potential mess. wakeup_flusher_threads() ultimately
calls "work = kzalloc(sizeof(*work), GFP_ATOMIC)" from the page
allocator. Under enough pressure, particularly if the machine has
very little memory, you may see this spewing out warning messages
which, ironically, will have to be written to syslog, dirtying more
pages.  I know I've made the same mistake at least once by calling
wakeup_flusher_threads() from page reclaim.

It's also still not controlling where the pages are being
written from.  On a large enough NUMA machine, there is a risk that
wakeup_flusher_threads() will be called very frequently to write
pages from remote nodes that are not in trouble.
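
One way to keep the wakeup frequency down might be a crude ratelimit,
something like this (untested sketch; next_wakeup is a made-up static
and the unlocked access is racy, but harmlessly so):

	static unsigned long next_wakeup;

	if (nr_reclaimable > nr_writeback &&
	    time_after(jiffies, next_wakeup)) {
		next_wakeup = jiffies + HZ / 10;
		wakeup_flusher_threads(nr_reclaimable - nr_writeback);
	}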

> +		if (nr_reclaimable + nr_writeback <= dirty_thresh)
> +			continue;
> +
> +		nr_exceeded++;
> +	}
> +
> +	if (nr_zones == nr_exceeded)
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +}
> +

So, you call congestion_wait() but then potentially continue on even
though the zones are still over their dirty limits.  Should this be
more like throttle_vm_writeout()?
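
That is, something with this shape (untested sketch; zone_dirty_wait()
is a made-up name that mirrors the throttle_vm_writeout() loop on a
per-zone basis):

	static void zone_dirty_wait(struct zone *zone)
	{
		unsigned long background_thresh, dirty_thresh;

		for (;;) {
			zone_dirty_limits(zone, &background_thresh,
					  &dirty_thresh);

			if (zone_page_state(zone, NR_FILE_DIRTY) +
			    zone_page_state(zone, NR_UNSTABLE_NFS) +
			    zone_page_state(zone, NR_WRITEBACK) <= dirty_thresh)
				break;

			congestion_wait(BLK_RW_ASYNC, HZ/10);
		}
	}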

>  /*
>   * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
>   */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4e8985a..1fac154 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1666,6 +1666,9 @@ zonelist_scan:
>  			!cpuset_zone_allowed_softwall(zone, gfp_mask))
>  				goto try_next_zone;
>  
> +		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
> +			goto this_zone_full;
> +

So this part needs to explain why using the lower zones does not
potentially cause lowmem pressure on 32-bit. It's not a show stopper
as such but it shouldn't be ignored either.

>  		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>  			unsigned long mark;
> @@ -1863,6 +1866,22 @@ out:
>  	return page;
>  }
>  
> +static struct page *
> +__alloc_pages_writeback(gfp_t gfp_mask, unsigned int order,
> +			struct zonelist *zonelist, enum zone_type high_zoneidx,
> +			nodemask_t *nodemask, int alloc_flags,
> +			struct zone *preferred_zone, int migratetype)
> +{
> +	if (!(gfp_mask & __GFP_WRITE))
> +		return NULL;
> +
> +	try_to_writeback_pages(zonelist, gfp_mask, nodemask);
> +
> +	return get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> +				      high_zoneidx, alloc_flags,
> +				      preferred_zone, migratetype);
> +}
> +
>  #ifdef CONFIG_COMPACTION
>  /* Try memory compaction for high-order allocations before reclaim */
>  static struct page *
> @@ -2135,6 +2154,14 @@ rebalance:
>  	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
>  		goto nopage;
>  
> +	/* Try writing back pages if per-zone dirty limits are reached */
> +	page = __alloc_pages_writeback(gfp_mask, order, zonelist,
> +				       high_zoneidx, nodemask,
> +				       alloc_flags, preferred_zone,
> +				       migratetype);
> +	if (page)
> +		goto got_pg;
> +

I like the general idea, but we are still not controlling where
pages are being written from, the potential lowmem pressure problem
needs to be addressed, and care needs to be taken with how frequently
wakeup_flusher_threads() is called, given that it uses kmalloc.

I suspect the performance gain is being seen because the flusher
threads are woken earlier and more frequently, and write more
aggressively due to wakeup_flusher_threads() being passed loads of
requests. That you are seeing a performance gain at all is
interesting in itself, if it is true.

>  	/*
>  	 * Try direct compaction. The first pass is asynchronous. Subsequent
>  	 * attempts after direct reclaim are synchronous
> -- 
> 1.7.6
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 0/5] mm: per-zone dirty limiting
  2011-07-25 20:19 ` Johannes Weiner
@ 2011-07-26 15:47   ` Mel Gorman
  -1 siblings, 0 replies; 64+ messages in thread
From: Mel Gorman @ 2011-07-26 15:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Mon, Jul 25, 2011 at 10:19:14PM +0200, Johannes Weiner wrote:
> Hello!
> 
> Writing back single file pages during reclaim exhibits bad IO
> patterns, but we can't just stop doing that before the VM has other
> means to ensure the pages in a zone are reclaimable.
> 
> Over time there were several suggestions of at least doing
> write-around of the pages in inode-proximity when the need arises to
> clean pages during memory pressure.  But even that would interrupt
> writeback from the flushers, without any guarantees that the nearby
> inode-pages are even sitting on the same troubled zone.
> 
> The reason why dirty pages reach the end of LRU lists in the first
> place is in part because the dirty limits are a global restriction
> while most systems have more than one LRU list that are different in
> size. Multiple nodes have multiple zones have multiple file lists but
> at the same time there is nothing to balance the dirty pages between
> the lists except for reclaim writing them out upon encounter.
> 
> With around 4G of RAM, a x86_64 machine of mine has a DMA32 zone of a
> bit over 3G, a Normal zone of 500M, and a DMA zone of 15M.
> 
> A linear writer can quickly fill up the Normal zone, then the DMA32
> zone, throttled by the dirty limit initially.  The flushers catch up,
> the zones are now mostly full of clean pages and memory reclaim kicks
> in on subsequent allocations.  The pages it frees from the Normal zone
> are quickly filled with dirty pages (unthrottled, as the much bigger
> DMA32 zone allows for a huge number of dirty pages in comparison to
> the Normal zone).  As there are also anon and active file pages on the
> Normal zone, it is not unlikely that a significant amount of its
> inactive file pages are now dirty [ foo=zone(global) ]:
> 
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=112313(821289) active=9942(10039) isolated=27(27) dirty=59709(146944) writeback=739(4017)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111102(806876) active=9925(10022) isolated=32(32) dirty=72125(146914) writeback=957(3972)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110493(803374) active=9871(9978) isolated=32(32) dirty=57274(146618) writeback=4088(4088)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111957(806559) active=9871(9978) isolated=32(32) dirty=65125(147329) writeback=456(3866)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110601(803978) active=9860(9973) isolated=27(27) dirty=63792(146590) writeback=61(4276)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111786(804032) active=9860(9973) isolated=0(64) dirty=64310(146998) writeback=1282(3847)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111643(805651) active=9860(9982) isolated=32(32) dirty=63778(147217) writeback=1127(4156)
> reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111678(804709) active=9859(10112) isolated=27(27) dirty=81673(148224) writeback=29(4233)
> 
> [ These prints occur only once per reclaim invocation, so the actual
> ->writepage calls are more frequent than the timestamp may suggest. ]
> 
> In the scenario without the Normal zone, first the DMA32 zone fills
> up, then the DMA zone.  When reclaim kicks in, it is presented with a
> DMA zone whose inactive pages are all dirty -- and dirtied most
> recently at that, so the flushers really had abysmal chances at making
> some headway:
> 
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=776(430813) active=2(2931) isolated=32(32) dirty=814(68649) writeback=0(18765)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(430344) active=2(2931) isolated=32(32) dirty=764(67790) writeback=0(17146)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=729(430838) active=2(2931) isolated=32(32) dirty=293(65303) writeback=468(20122)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=757(431181) active=2(2931) isolated=32(32) dirty=63(68851) writeback=731(15926)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=758(432808) active=2(2931) isolated=32(32) dirty=645(64106) writeback=0(19666)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(431018) active=2(2931) isolated=32(32) dirty=740(65770) writeback=10(17907)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=697(430467) active=2(2931) isolated=32(32) dirty=743(63757) writeback=0(18826)
> reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=693(430951) active=2(2931) isolated=32(32) dirty=626(54529) writeback=91(16198)
> 

Patches 1-7 of the series "Reduce filesystem writeback from page
reclaim" should have been able to cope with this as well by marking
the dirty pages PageReclaim and continuing on. While it could still
take some time before ZONE_DMA is cleaned, it is very unlikely that
it is the preferred zone for allocation.

> The idea behind this patch set is to take the ratio the global dirty
> limits have to the global memory state and put it into proportion to
> the individual zone.  The allocator ensures that pages allocated for
> being written to in the page cache are distributed across zones such
> that there are always enough clean pages on a zone to begin with.
> 

Ok, I comment on potential lowmem pressure problems with this in the
patch itself.

> I am not yet really satisfied as it's not really orthogonal or
> integrated with the other writeback throttling much, and has rough
> edges here and there, but test results do look rather promising so
> far:
> 

I'd consider the idea behind this patchset to be independent of
patches 1-7 of the "Reduce filesystem writeback from page reclaim"
series, although it may also allow the application of patch 8 from
that series. Would you agree, or do you think the two series should
be mutually exclusive?

> --- Copying 8G to fuse-ntfs on USB stick in 4G machine
> 

Unusual choice of filesystem :) It'd also be worth testing ext3, ext4,
xfs and btrfs to make sure there are no surprises.

> 3.0:
> 
>  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> 
>        140,671,831 cache-misses             #      4.923 M/sec   ( +-   0.198% )  (scaled from 82.80%)
>        726,265,014 cache-references         #     25.417 M/sec   ( +-   1.104% )  (scaled from 83.06%)
>        144,092,383 branch-misses            #      4.157 %       ( +-   0.493% )  (scaled from 83.17%)
>      3,466,608,296 branches                 #    121.319 M/sec   ( +-   0.421% )  (scaled from 67.89%)
>     17,882,351,343 instructions             #      0.417 IPC     ( +-   0.457% )  (scaled from 84.73%)
>     42,848,633,897 cycles                   #   1499.554 M/sec   ( +-   0.604% )  (scaled from 83.08%)
>                236 page-faults              #      0.000 M/sec   ( +-   0.323% )
>              8,026 CPU-migrations           #      0.000 M/sec   ( +-   6.291% )
>          2,372,358 context-switches         #      0.083 M/sec   ( +-   0.003% )
>       28574.255540 task-clock-msecs         #      0.031 CPUs    ( +-   0.409% )
> 
>       912.625436885  seconds time elapsed   ( +-   3.851% )
> 
>  nr_vmscan_write 667839
> 
> 3.0-per-zone-dirty:
> 
>  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> 
>        140,791,501 cache-misses             #      3.887 M/sec   ( +-   0.186% )  (scaled from 83.09%)
>        816,474,193 cache-references         #     22.540 M/sec   ( +-   0.923% )  (scaled from 83.16%)
>        154,500,577 branch-misses            #      4.302 %       ( +-   0.495% )  (scaled from 83.15%)
>      3,591,344,338 branches                 #     99.143 M/sec   ( +-   0.402% )  (scaled from 67.32%)
>     18,713,190,183 instructions             #      0.338 IPC     ( +-   0.448% )  (scaled from 83.96%)
>     55,285,320,107 cycles                   #   1526.208 M/sec   ( +-   0.588% )  (scaled from 83.28%)
>                237 page-faults              #      0.000 M/sec   ( +-   0.302% )
>             28,028 CPU-migrations           #      0.001 M/sec   ( +-   3.070% )
>          2,369,897 context-switches         #      0.065 M/sec   ( +-   0.006% )
>       36223.970238 task-clock-msecs         #      0.060 CPUs    ( +-   1.062% )
> 
>       605.909769823  seconds time elapsed   ( +-   0.783% )
> 
>  nr_vmscan_write 0
> 

Very nice!

> That's an increase of throughput by 30% and no writeback interference
> from reclaim.
> 

Any idea how much dd was varying in performance on each run? I'd
still expect a gain but I've found dd to vary wildly at times even
if conv=fdatasync,fsync is specified.

> As not every other allocation has to reclaim from a Normal zone full
> of dirty pages anymore, the patched kernel is also more responsive in
> general during the copy.
> 
> I am also running fs_mark on XFS on a 2G machine, but the final
> results are not in yet.  The preliminary results appear to be in this
> ballpark:
> 
> --- fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))
> 
> 3.0:
> 
> real    20m43.901s
> user    0m8.988s
> sys     0m58.227s
> nr_vmscan_write 3347
> 
> 3.0-per-zone-dirty:
> 
> real    20m8.012s
> user    0m8.862s
> sys     1m2.585s
> nr_vmscan_write 161
> 

That's roughly a 2.8% gain. I was seeing about 4.2%, but I was testing
with mem=1G, not 2G, and there are a lot of factors at play.

> Patch #1 is more or less an unrelated fix that subsequent patches
> depend upon as they modify the same code.  It should go upstream
> immediately, me thinks.
> 

/me agrees

> #2 and #3 are boring cleanup, guess they can go in right away as well.
> 

Yeah, no harm.

> #4 adds per-zone dirty throttling for __GFP_WRITE allocators, #5
> passes __GFP_WRITE from the grab_cache_page* functions in the hope to
> get most writers and no readers; I haven't checked all sites yet.
> 
> Discuss! :-)
> 

I think the performance gain may be due to flusher threads simply
being more aggressive and I suspect it will have a smaller effect on
NUMA where the flushers could be cleaning pages on the wrong node.

That said, your figures are very promising, it is worth investigating,
and you should expand the number of filesystems tested. I did a quick
set of similar benchmarks locally; I only ran dd once, which is a
major flaw, but I wanted a quick look.

4 kernels were tested.

vanilla:	3.0
lesskswapd	Patches 1-7 from my series
perzonedirty	Your patches
lessks-pzdirty	Both

Backing storage was a USB key. Kernel was booted with mem=4608M to
get a 500M highest zone similar to yours.

SIMPLE WRITEBACK XFS
              simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
1                    526.83 ( 0.00%) 468.52 (12.45%) 542.05 (-2.81%) 464.42 (13.44%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)          7.27      7.34      7.69      7.96
Total Elapsed Time (seconds)                528.64    470.36    543.86    466.33

Direct pages scanned                             0         0         0         0
Direct pages reclaimed                           0         0         0         0
Kswapd pages scanned                       1058036   1167219   1060288   1169190
Kswapd pages reclaimed                      988591    979571    980278    981009
Kswapd efficiency                              93%       83%       92%       83%
Kswapd velocity                           2001.430  2481.544  1949.561  2507.216
Direct efficiency                             100%      100%      100%      100%
Direct velocity                              0.000     0.000     0.000     0.000
Percentage direct scans                         0%        0%        0%        0%
Page writes by reclaim                        4463      4587      4816      4910
Page reclaim invalidate                          0    145938         0    136510

Very few pages are being written back, so I suspect any difference in
performance is due to dd simply being very variable. I wasn't
running the monitoring that would tell me whether the "Page writes"
were file-backed or anonymous, but I assume they are file-backed.
Your patches did not seem to have much effect on the number of pages
written.

Note that direct reclaim is not triggered by this workload at all.

SIMPLE WRITEBACK EXT4
              simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
1                    369.80 ( 0.00%) 370.80 (-0.27%) 384.08 (-3.72%) 371.85 (-0.55%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)          7.62       7.7      8.05      7.86
Total Elapsed Time (seconds)                371.74    372.80    386.06    373.86

Direct pages scanned                             0         0         0         0
Direct pages reclaimed                           0         0         0         0
Kswapd pages scanned                       1169587   1186543   1167690   1180982
Kswapd pages reclaimed                      988154    987885    987220    987826
Kswapd efficiency                              84%       83%       84%       83%
Kswapd velocity                           3146.250  3182.787  3024.633  3158.888
Direct efficiency                             100%      100%      100%      100%
Direct velocity                              0.000     0.000     0.000     0.000
Percentage direct scans                         0%        0%        0%        0%
Page writes by reclaim                      141229      4714    141804      4608
Page writes skipped                              0         0         0         0
Page reclaim invalidate                          0    144009         0    144012
Slabs scanned                                 3712      3712      3712      3712

Not much different here from xfs, other than to note that
your patches do not hurt "Kswapd efficiency", as the scanning rates
remain more or less constant.

SIMPLE WRITEBACK EXT3
              simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
1                    1291.48 ( 0.00%) 1205.11 ( 7.17%) 1287.53 ( 0.31%) 1190.54 ( 8.48%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         11.01     11.04     11.44     11.39
Total Elapsed Time (seconds)               1295.44   1208.90   1293.81   1195.37

Direct pages scanned                             0         0         0         0
Direct pages reclaimed                           0         0         0         0
Kswapd pages scanned                       1073001   1183622   1065262   1179216
Kswapd pages reclaimed                      985900    985521    979727    979873
Kswapd efficiency                              91%       83%       91%       83%
Kswapd velocity                            828.291   979.090   823.353   986.486
Direct efficiency                             100%      100%      100%      100%
Direct velocity                              0.000     0.000     0.000     0.000
Percentage direct scans                         0%        0%        0%        0%
Page writes by reclaim                       13444      4664     13557      4928
Page writes skipped                              0         0         0         0
Page reclaim invalidate                          0    146167         0    146495

Other than noting that ext3 is *very* slow in comparison to xfs and
ext4, there was little of interest in this.

So I'm not seeing the same reduction in the number of pages written
back that you saw, and I'm not seeing the same performance gains
either. I wonder why that is; possibilities include your use of
fuse-ntfs, or maybe the speed of the USB disk you are using is a
factor.

As dd is variable, I'm rerunning the tests to do 4 iterations and
multiple memory sizes for just xfs and ext4 to see what falls out. It
should take about 14 hours to complete assuming nothing screws up.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 0/5] mm: per-zone dirty limiting
  2011-07-26 15:47   ` Mel Gorman
@ 2011-07-26 18:05     ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-07-26 18:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Tue, Jul 26, 2011 at 04:47:41PM +0100, Mel Gorman wrote:
> On Mon, Jul 25, 2011 at 10:19:14PM +0200, Johannes Weiner wrote:
> > Hello!
> > 
> > Writing back single file pages during reclaim exhibits bad IO
> > patterns, but we can't just stop doing that before the VM has other
> > means to ensure the pages in a zone are reclaimable.
> > 
> > Over time there were several suggestions of at least doing
> > write-around of the pages in inode-proximity when the need arises to
> > clean pages during memory pressure.  But even that would interrupt
> > writeback from the flushers, without any guarantees that the nearby
> > inode-pages are even sitting on the same troubled zone.
> > 
> > The reason why dirty pages reach the end of LRU lists in the first
> > place is in part because the dirty limits are a global restriction
> > while most systems have more than one LRU list that are different in
> > size. Multiple nodes have multiple zones have multiple file lists but
> > at the same time there is nothing to balance the dirty pages between
> > the lists except for reclaim writing them out upon encounter.
> > 
> > With around 4G of RAM, a x86_64 machine of mine has a DMA32 zone of a
> > bit over 3G, a Normal zone of 500M, and a DMA zone of 15M.
> > 
> > A linear writer can quickly fill up the Normal zone, then the DMA32
> > zone, throttled by the dirty limit initially.  The flushers catch up,
> > the zones are now mostly full of clean pages and memory reclaim kicks
> > in on subsequent allocations.  The pages it frees from the Normal zone
> > are quickly filled with dirty pages (unthrottled, as the much bigger
> > DMA32 zone allows for a huge number of dirty pages in comparison to
> > the Normal zone).  As there are also anon and active file pages on the
> > Normal zone, it is not unlikely that a significant amount of its
> > inactive file pages are now dirty [ foo=zone(global) ]:
> > 
> > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=112313(821289) active=9942(10039) isolated=27(27) dirty=59709(146944) writeback=739(4017)
> > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111102(806876) active=9925(10022) isolated=32(32) dirty=72125(146914) writeback=957(3972)
> > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110493(803374) active=9871(9978) isolated=32(32) dirty=57274(146618) writeback=4088(4088)
> > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111957(806559) active=9871(9978) isolated=32(32) dirty=65125(147329) writeback=456(3866)
> > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110601(803978) active=9860(9973) isolated=27(27) dirty=63792(146590) writeback=61(4276)
> > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111786(804032) active=9860(9973) isolated=0(64) dirty=64310(146998) writeback=1282(3847)
> > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111643(805651) active=9860(9982) isolated=32(32) dirty=63778(147217) writeback=1127(4156)
> > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111678(804709) active=9859(10112) isolated=27(27) dirty=81673(148224) writeback=29(4233)
> > 
> > [ These prints occur only once per reclaim invocation, so the actual
> > ->writepage calls are more frequent than the timestamp may suggest. ]
> > 
> > In the scenario without the Normal zone, first the DMA32 zone fills
> > up, then the DMA zone.  When reclaim kicks in, it is presented with a
> > DMA zone whose inactive pages are all dirty -- and dirtied most
> > recently at that, so the flushers really had abysmal chances at making
> > some headway:
> > 
> > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=776(430813) active=2(2931) isolated=32(32) dirty=814(68649) writeback=0(18765)
> > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(430344) active=2(2931) isolated=32(32) dirty=764(67790) writeback=0(17146)
> > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=729(430838) active=2(2931) isolated=32(32) dirty=293(65303) writeback=468(20122)
> > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=757(431181) active=2(2931) isolated=32(32) dirty=63(68851) writeback=731(15926)
> > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=758(432808) active=2(2931) isolated=32(32) dirty=645(64106) writeback=0(19666)
> > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(431018) active=2(2931) isolated=32(32) dirty=740(65770) writeback=10(17907)
> > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=697(430467) active=2(2931) isolated=32(32) dirty=743(63757) writeback=0(18826)
> > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=693(430951) active=2(2931) isolated=32(32) dirty=626(54529) writeback=91(16198)
> > 
> 
> Patches 1-7 of the series "Reduce filesystem writeback from page
> reclaim" should have been able to cope with this as well by marking
> the dirty pages PageReclaim and continuing on. While it could still
> take some time before ZONE_DMA is cleaned, it is very unlikely that
> it is the preferred zone for allocation.

My changes cannot fully prevent dirty pages from reaching the LRU
tail, so IMO we want your patches in any case (sorry I haven't replied
yet, but I went through them and they look good to me; Acks coming
up).  But this should reduce what reclaim has to skip and shuffle.

> > The idea behind this patch set is to take the ratio the global dirty
> > limits have to the global memory state and put it into proportion to
> > the individual zone.  The allocator ensures that pages allocated for
> > being written to in the page cache are distributed across zones such
> > that there are always enough clean pages on a zone to begin with.
> > 
> 
> Ok, I comment on potential lowmem pressure problems with this in the
> patch itself.
> 
> > I am not yet really satisfied as it's not really orthogonal or
> > integrated with the other writeback throttling much, and has rough
> > edges here and there, but test results do look rather promising so
> > far:
> > 
> 
> I'd consider that the idea behind this patchset is independent of
> patches 1-7 of the "Reduce filesystem writeback from page reclaim"
> series although it may also allow the application of patch 8 from
> that series. Would you agree or do you think the series should be
> mutually exclusive?

My patchset was triggered by patch 8 of your series, as I think we
cannot simply remove our only measure to stay on top of the dirty
pages from a per-zone perspective.

But I think your patches 1-7 and this series complement each other,
in that one series tries to keep the per-zone dirty pages at sane
levels and the other improves how we deal with the dirty pages that
still end up at the LRU tails.

> > --- Copying 8G to fuse-ntfs on USB stick in 4G machine
> > 
> 
> Unusual choice of filesystem :) It'd also be worth testing ext3, ext4,
> xfs and btrfs to make sure there are no surprises.

Yeah, testing has been really shallow so far as my test box is
occupied with the exclusive memcg lru stuff.

Also, this is the stick my TV has to be able to read from ;-)

> > 3.0:
> > 
> >  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> > 
> >        140,671,831 cache-misses             #      4.923 M/sec   ( +-   0.198% )  (scaled from 82.80%)
> >        726,265,014 cache-references         #     25.417 M/sec   ( +-   1.104% )  (scaled from 83.06%)
> >        144,092,383 branch-misses            #      4.157 %       ( +-   0.493% )  (scaled from 83.17%)
> >      3,466,608,296 branches                 #    121.319 M/sec   ( +-   0.421% )  (scaled from 67.89%)
> >     17,882,351,343 instructions             #      0.417 IPC     ( +-   0.457% )  (scaled from 84.73%)
> >     42,848,633,897 cycles                   #   1499.554 M/sec   ( +-   0.604% )  (scaled from 83.08%)
> >                236 page-faults              #      0.000 M/sec   ( +-   0.323% )
> >              8,026 CPU-migrations           #      0.000 M/sec   ( +-   6.291% )
> >          2,372,358 context-switches         #      0.083 M/sec   ( +-   0.003% )
> >       28574.255540 task-clock-msecs         #      0.031 CPUs    ( +-   0.409% )
> > 
> >       912.625436885  seconds time elapsed   ( +-   3.851% )
> > 
> >  nr_vmscan_write 667839
> > 
> > 3.0-per-zone-dirty:
> > 
> >  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> > 
> >        140,791,501 cache-misses             #      3.887 M/sec   ( +-   0.186% )  (scaled from 83.09%)
> >        816,474,193 cache-references         #     22.540 M/sec   ( +-   0.923% )  (scaled from 83.16%)
> >        154,500,577 branch-misses            #      4.302 %       ( +-   0.495% )  (scaled from 83.15%)
> >      3,591,344,338 branches                 #     99.143 M/sec   ( +-   0.402% )  (scaled from 67.32%)
> >     18,713,190,183 instructions             #      0.338 IPC     ( +-   0.448% )  (scaled from 83.96%)
> >     55,285,320,107 cycles                   #   1526.208 M/sec   ( +-   0.588% )  (scaled from 83.28%)
> >                237 page-faults              #      0.000 M/sec   ( +-   0.302% )
> >             28,028 CPU-migrations           #      0.001 M/sec   ( +-   3.070% )
> >          2,369,897 context-switches         #      0.065 M/sec   ( +-   0.006% )
> >       36223.970238 task-clock-msecs         #      0.060 CPUs    ( +-   1.062% )
> > 
> >       605.909769823  seconds time elapsed   ( +-   0.783% )
> > 
> >  nr_vmscan_write 0
> > 
> 
> Very nice!
> 
> > That's an increase of throughput by 30% and no writeback interference
> > from reclaim.
> > 
> 
> Any idea how much dd was varying in performance on each run? I'd
> still expect a gain but I've found dd to vary wildly at times even
> if conv=fdatasync,fsync is specified.

The fluctuation is in the figures after the 'seconds time elapsed'.
It is less than 1% for the six runs.

Or did you mean something else?

> > As not every other allocation has to reclaim from a Normal zone full
> > of dirty pages anymore, the patched kernel is also more responsive in
> > general during the copy.
> > 
> > I am also running fs_mark on XFS on a 2G machine, but the final
> > results are not in yet.  The preliminary results appear to be in this
> > ballpark:
> > 
> > --- fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))
> > 
> > 3.0:
> > 
> > real    20m43.901s
> > user    0m8.988s
> > sys     0m58.227s
> > nr_vmscan_write 3347
> > 
> > 3.0-per-zone-dirty:
> > 
> > real    20m8.012s
> > user    0m8.862s
> > sys     1m2.585s
> > nr_vmscan_write 161
> > 
> 
> Thats roughly a 2.8% gain. I was seeing about 4.2% but was testing with
> mem=1G, not 2G and there are a lot of factors at play.

[...]

> > #4 adds per-zone dirty throttling for __GFP_WRITE allocators, #5
> > passes __GFP_WRITE from the grab_cache_page* functions in the hope to
> > get most writers and no readers; I haven't checked all sites yet.
> > 
> > Discuss! :-)
> > 
> 
> I think the performance gain may be due to flusher threads simply
> being more aggressive and I suspect it will have a smaller effect on
> NUMA where the flushers could be cleaning pages on the wrong node.

I ran this same test with statistics (which I now realize should
probably become part of this series) and they indicated that the
flushers were not woken a single time from the new code.

All it did in this case was defer future-dirty pages from the Normal
zone to the DMA32 zone.

My understanding is that as the dirty pages are forcibly spread out
into the bigger zone, reclaim and flushers become less likely to step
on each other's toes.

> That said, your figures are very promising and it is worth
> an investigation and you should expand the number of filesystems
> tested. I did a quick set of similar benchmarks locally. I only ran
> dd once which is a major flaw but wanted to get a quick look.

Yeah, more testing is definitely going to happen on this.  I tried
other filesystems with one-shot runs as well, just to see if anything
stood out, but nothing conclusive.

> 4 kernels were tested.
> 
> vanilla:	3.0
> lesskswapd	Patches 1-7 from my series
> perzonedirty	Your patches
> lessks-pzdirty	Both
> 
> Backing storage was a USB key. Kernel was booted with mem=4608M to
> get a 500M highest zone similar to yours.

I think what I wrote was a bit misleading.  The zone size example was
taken from my desktop machine to simply point out the different zone
sizes in a simple UMA machine.  But I ran this test on my laptop,
where the Normal zone is ~880MB (226240 present pages).

The dirty_background_ratio is 10 and dirty_ratio is 20, btw; ISTR
that you had set them higher, and I expect that to be a factor.
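
Back-of-the-envelope, if those ratios were applied directly to the
zone's 226240 present pages (the series presumably bases this on the
zone's dirtyable memory, like the global limits, so the actual
thresholds will come out somewhat lower):

	226240 pages * 10% = ~22600 pages = ~88 MB  (background)
	226240 pages * 20% = ~45250 pages = ~177 MB (dirty limit)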

The dd throughput is ~14 MB/s on the pzd kernel.

> SIMPLE WRITEBACK XFS
>               simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
>                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
> 1                    526.83 ( 0.00%) 468.52 (12.45%) 542.05 (-2.81%) 464.42 (13.44%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)          7.27      7.34      7.69      7.96
> Total Elapsed Time (seconds)                528.64    470.36    543.86    466.33
> 
> Direct pages scanned                             0         0         0         0
> Direct pages reclaimed                           0         0         0         0
> Kswapd pages scanned                       1058036   1167219   1060288   1169190
> Kswapd pages reclaimed                      988591    979571    980278    981009
> Kswapd efficiency                              93%       83%       92%       83%
> Kswapd velocity                           2001.430  2481.544  1949.561  2507.216
> Direct efficiency                             100%      100%      100%      100%
> Direct velocity                              0.000     0.000     0.000     0.000
> Percentage direct scans                         0%        0%        0%        0%
> Page writes by reclaim                        4463      4587      4816      4910
> Page reclaim invalidate                          0    145938         0    136510
> 
> Very few pages are being written back so I suspect any difference in
> performance would be due to dd simply being very variable. I wasn't
> running the monitoring that would tell me if the "Page writes" were
> file-backed or anonymous but I assume they are file-backed. Your
> patches did not seem to have much effect on the number of pages
> written.

That's odd.  While it did not completely get rid of all file writes
from reclaim, it reduced them consistently in all my tests so far.

I don't have swap space on any of my machines, but I wouldn't expect
this to make a difference.

> Note that direct reclaim is not triggered by this workload at all.

Same here, not a single allocstall.

> SIMPLE WRITEBACK EXT4
>               simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
>                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
> 1                    369.80 ( 0.00%) 370.80 (-0.27%) 384.08 (-3.72%) 371.85 (-0.55%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)          7.62       7.7      8.05      7.86
> Total Elapsed Time (seconds)                371.74    372.80    386.06    373.86
> 
> Direct pages scanned                             0         0         0         0
> Direct pages reclaimed                           0         0         0         0
> Kswapd pages scanned                       1169587   1186543   1167690   1180982
> Kswapd pages reclaimed                      988154    987885    987220    987826
> Kswapd efficiency                              84%       83%       84%       83%
> Kswapd velocity                           3146.250  3182.787  3024.633  3158.888
> Direct efficiency                             100%      100%      100%      100%
> Direct velocity                              0.000     0.000     0.000     0.000
> Percentage direct scans                         0%        0%        0%        0%
> Page writes by reclaim                      141229      4714    141804      4608
> Page writes skipped                              0         0         0         0
> Page reclaim invalidate                          0    144009         0    144012
> Slabs scanned                                 3712      3712      3712      3712
> 
> Not much different here than what is in xfs other than to note that
> your patches do not hurt "Kswapd efficiency" as the scanning rates
> remain more or less constant.
> 
> SIMPLE WRITEBACK EXT3
>               simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
>                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
> 1                    1291.48 ( 0.00%) 1205.11 ( 7.17%) 1287.53 ( 0.31%) 1190.54 ( 8.48%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         11.01     11.04     11.44     11.39
> Total Elapsed Time (seconds)               1295.44   1208.90   1293.81   1195.37
> 
> Direct pages scanned                             0         0         0         0
> Direct pages reclaimed                           0         0         0         0
> Kswapd pages scanned                       1073001   1183622   1065262   1179216
> Kswapd pages reclaimed                      985900    985521    979727    979873
> Kswapd efficiency                              91%       83%       91%       83%
> Kswapd velocity                            828.291   979.090   823.353   986.486
> Direct efficiency                             100%      100%      100%      100%
> Direct velocity                              0.000     0.000     0.000     0.000
> Percentage direct scans                         0%        0%        0%        0%
> Page writes by reclaim                       13444      4664     13557      4928
> Page writes skipped                              0         0         0         0
> Page reclaim invalidate                          0    146167         0    146495
> 
> Other than noting that ext3 is *very* slow in comparison to xfs and
> ext4, there was little of interest in this.
> 
> So I'm not seeing the same reduction in number of pages written back
> as you saw and I'm not seeing the same performance gains either. I
> wonder why that is but possibilities include you using fuse-ntfs or
> maybe it's just the speed of the USB disk you are using that is a
> factor?

I will try out other filesystems here as well.

> As dd is variable, I'm rerunning the tests to do 4 iterations and
> multiple memory sizes for just xfs and ext4 to see what falls out. It
> should take about 14 hours to complete assuming nothing screws up.

Awesome, thanks!

---
From: Johannes Weiner <jweiner@redhat.com>
Subject: mm: per-zone dirty limit statistics

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 include/linux/vm_event_item.h |    4 ++++
 include/linux/vmstat.h        |    3 +++
 mm/page-writeback.c           |    8 ++++++--
 mm/page_alloc.c               |    4 +++-
 mm/vmstat.c                   |    4 ++++
 5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03b90cdc..6bfc604 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -58,6 +58,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_COLLAPSE_ALLOC_FAILED,
 		THP_SPLIT,
 #endif
+		FOR_ALL_ZONES(DIRTY_ALLOC_DENIED),
+		FOR_ALL_ZONES(DIRTY_WAKE_FLUSHERS),
+		DIRTY_WAIT_CONGESTION,
+
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index bcd942f..5926225 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -80,6 +80,9 @@ static inline void vm_events_fold_cpu(int cpu)
 
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 
+#define count_zone_vm_event(item, zone)		\
+	count_vm_event(item##_NORMAL - ZONE_NORMAL + zone_idx(zone))
+
 #define __count_zone_vm_events(item, zone, delta) \
 		__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
 		zone_idx(zone), delta)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ce673ec..0937382 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -748,8 +748,10 @@ void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
 		if (nr_reclaimable + nr_writeback <= background_thresh)
 			continue;
 
-		if (nr_reclaimable > nr_writeback)
+		if (nr_reclaimable > nr_writeback) {
+			count_zone_vm_event(DIRTY_WAKE_FLUSHERS, zone);
 			wakeup_flusher_threads(nr_reclaimable - nr_writeback);
+		}
 
 		if (nr_reclaimable + nr_writeback <= dirty_thresh)
 			continue;
@@ -757,8 +759,10 @@ void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
 		nr_exceeded++;
 	}
 
-	if (nr_zones == nr_exceeded)
+	if (nr_zones == nr_exceeded) {
+		count_vm_event(DIRTY_WAIT_CONGESTION);
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	}
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1fac154..5939a98 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1666,8 +1666,10 @@ zonelist_scan:
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
 
-		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone)) {
+			count_zone_vm_event(DIRTY_ALLOC_DENIED, zone);
 			goto this_zone_full;
+		}
 
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c18b7..d302a77 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -786,6 +786,10 @@ const char * const vmstat_text[] = {
 	"thp_split",
 #endif
 
+	TEXTS_FOR_ZONES("dirty_alloc_denied")
+	TEXTS_FOR_ZONES("dirty_wake_flushers")
+	"dirty_wait_congestion",
+
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS */
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [patch 0/5] mm: per-zone dirty limiting
  2011-07-26 18:05     ` Johannes Weiner
@ 2011-07-26 21:54       ` Mel Gorman
  -1 siblings, 0 replies; 64+ messages in thread
From: Mel Gorman @ 2011-07-26 21:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Tue, Jul 26, 2011 at 08:05:59PM +0200, Johannes Weiner wrote:
> On Tue, Jul 26, 2011 at 04:47:41PM +0100, Mel Gorman wrote:
> > On Mon, Jul 25, 2011 at 10:19:14PM +0200, Johannes Weiner wrote:
> > > Hello!
> > > 
> > > Writing back single file pages during reclaim exhibits bad IO
> > > patterns, but we can't just stop doing that before the VM has other
> > > means to ensure the pages in a zone are reclaimable.
> > > 
> > > Over time there were several suggestions of at least doing
> > > write-around of the pages in inode-proximity when the need arises to
> > > clean pages during memory pressure.  But even that would interrupt
> > > writeback from the flushers, without any guarantees that the nearby
> > > inode-pages are even sitting on the same troubled zone.
> > > 
> > > The reason why dirty pages reach the end of LRU lists in the first
> > > place is in part because the dirty limits are a global restriction
> > > while most systems have more than one LRU list that are different in
> > > size. Multiple nodes have multiple zones have multiple file lists but
> > > at the same time there is nothing to balance the dirty pages between
> > > the lists except for reclaim writing them out upon encounter.
> > > 
> > > With around 4G of RAM, a x86_64 machine of mine has a DMA32 zone of a
> > > bit over 3G, a Normal zone of 500M, and a DMA zone of 15M.
> > > 
> > > A linear writer can quickly fill up the Normal zone, then the DMA32
> > > zone, throttled by the dirty limit initially.  The flushers catch up,
> > > the zones are now mostly full of clean pages and memory reclaim kicks
> > > in on subsequent allocations.  The pages it frees from the Normal zone
> > > are quickly filled with dirty pages (unthrottled, as the much bigger
> > > DMA32 zone allows for a huge number of dirty pages in comparison to
> > > the Normal zone).  As there are also anon and active file pages on the
> > > Normal zone, it is not unlikely that a significant amount of its
> > > inactive file pages are now dirty [ foo=zone(global) ]:
> > > 
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=112313(821289) active=9942(10039) isolated=27(27) dirty=59709(146944) writeback=739(4017)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111102(806876) active=9925(10022) isolated=32(32) dirty=72125(146914) writeback=957(3972)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110493(803374) active=9871(9978) isolated=32(32) dirty=57274(146618) writeback=4088(4088)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111957(806559) active=9871(9978) isolated=32(32) dirty=65125(147329) writeback=456(3866)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110601(803978) active=9860(9973) isolated=27(27) dirty=63792(146590) writeback=61(4276)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111786(804032) active=9860(9973) isolated=0(64) dirty=64310(146998) writeback=1282(3847)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111643(805651) active=9860(9982) isolated=32(32) dirty=63778(147217) writeback=1127(4156)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111678(804709) active=9859(10112) isolated=27(27) dirty=81673(148224) writeback=29(4233)
> > > 
> > > [ These prints occur only once per reclaim invocation, so the actual
> > > ->writepage calls are more frequent than the timestamp may suggest. ]
> > > 
> > > In the scenario without the Normal zone, first the DMA32 zone fills
> > > up, then the DMA zone.  When reclaim kicks in, it is presented with a
> > > DMA zone whose inactive pages are all dirty -- and dirtied most
> > > recently at that, so the flushers really had abysmal chances at making
> > > some headway:
> > > 
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=776(430813) active=2(2931) isolated=32(32) dirty=814(68649) writeback=0(18765)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(430344) active=2(2931) isolated=32(32) dirty=764(67790) writeback=0(17146)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=729(430838) active=2(2931) isolated=32(32) dirty=293(65303) writeback=468(20122)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=757(431181) active=2(2931) isolated=32(32) dirty=63(68851) writeback=731(15926)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=758(432808) active=2(2931) isolated=32(32) dirty=645(64106) writeback=0(19666)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(431018) active=2(2931) isolated=32(32) dirty=740(65770) writeback=10(17907)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=697(430467) active=2(2931) isolated=32(32) dirty=743(63757) writeback=0(18826)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=693(430951) active=2(2931) isolated=32(32) dirty=626(54529) writeback=91(16198)
> > > 
> > 
> > Patches 1-7 of the series "Reduce filesystem writeback from page
> > reclaim" should have been able to cope with this as well by marking
> > the dirty pages PageReclaim and continuing on. While it could still
> > take some time before ZONE_DMA is cleaned, it is very unlikely that
> > it is the preferred zone for allocation.
> 
> My changes can not fully prevent dirty pages from reaching the LRU
> tail, so IMO we want your patches in any case (sorry I haven't replied
> yet, but I went through them and they look good to me.  Acks coming
> up).  But this should reduce what reclaim has to skip and shuffle.
> 

No need to be sorry, I guessed this work may be related so figured
you had at least seen them.  While I was reasonably sure the patches
were not mutually exclusive, there was no harm in checking you thought
the same.

> > > The idea behind this patch set is to take the ratio the global dirty
> > > limits have to the global memory state and put it into proportion to
> > > the individual zone.  The allocator ensures that pages allocated for
> > > being written to in the page cache are distributed across zones such
> > > that there are always enough clean pages on a zone to begin with.
> > > 
> > 
> > Ok, I comment on potential lowmem pressure problems with this in the
> > patch itself.
> > 
> > > I am not yet really satisfied as it's not really orthogonal or
> > > integrated with the other writeback throttling much, and has rough
> > > edges here and there, but test results do look rather promising so
> > > far:
> > > 
> > 
> > I'd consider that the idea behind this patchset is independent of
> > patches 1-7 of the "Reduce filesystem writeback from page reclaim"
> > series although it may also allow the application of patch 8 from
> > that series. Would you agree or do you think the series should be
> > mutually exclusive?
> 
> My patchset was triggered by patch 8 of your series, as I think we can
> not simply remove our only measure to stay on top of the dirty pages
> from a per-zone perspective.
> 

Agreed. I fully intend to drop patch 8 until there is a better way of
handling pages from a specific zone.

> But I think your patches 1-7 and this series complement each other in
> that one series tries to keep the dirty pages per-zone on sane levels
> and the other series improves how we deal with what dirty pages still
> end up at the lru tails.
> 

Agreed.

> > > --- Copying 8G to fuse-ntfs on USB stick in 4G machine
> > > 
> > 
> > Unusual choice of filesystem :) It'd also be worth testing ext3, ext4,
> > xfs and btrfs to make sure there are no surprises.
> 
> Yeah, testing has been really shallow so far as my test box is
> occupied with the exclusive memcg lru stuff.
> 
> Also, this is the stick my TV has to be able to read from ;-)
> 

heh, fair enough.

> > > 3.0:
> > > 
> > >  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> > > 
> > >        140,671,831 cache-misses             #      4.923 M/sec   ( +-   0.198% )  (scaled from 82.80%)
> > >        726,265,014 cache-references         #     25.417 M/sec   ( +-   1.104% )  (scaled from 83.06%)
> > >        144,092,383 branch-misses            #      4.157 %       ( +-   0.493% )  (scaled from 83.17%)
> > >      3,466,608,296 branches                 #    121.319 M/sec   ( +-   0.421% )  (scaled from 67.89%)
> > >     17,882,351,343 instructions             #      0.417 IPC     ( +-   0.457% )  (scaled from 84.73%)
> > >     42,848,633,897 cycles                   #   1499.554 M/sec   ( +-   0.604% )  (scaled from 83.08%)
> > >                236 page-faults              #      0.000 M/sec   ( +-   0.323% )
> > >              8,026 CPU-migrations           #      0.000 M/sec   ( +-   6.291% )
> > >          2,372,358 context-switches         #      0.083 M/sec   ( +-   0.003% )
> > >       28574.255540 task-clock-msecs         #      0.031 CPUs    ( +-   0.409% )
> > > 
> > >       912.625436885  seconds time elapsed   ( +-   3.851% )
> > > 
> > >  nr_vmscan_write 667839
> > > 
> > > 3.0-per-zone-dirty:
> > > 
> > >  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> > > 
> > >        140,791,501 cache-misses             #      3.887 M/sec   ( +-   0.186% )  (scaled from 83.09%)
> > >        816,474,193 cache-references         #     22.540 M/sec   ( +-   0.923% )  (scaled from 83.16%)
> > >        154,500,577 branch-misses            #      4.302 %       ( +-   0.495% )  (scaled from 83.15%)
> > >      3,591,344,338 branches                 #     99.143 M/sec   ( +-   0.402% )  (scaled from 67.32%)
> > >     18,713,190,183 instructions             #      0.338 IPC     ( +-   0.448% )  (scaled from 83.96%)
> > >     55,285,320,107 cycles                   #   1526.208 M/sec   ( +-   0.588% )  (scaled from 83.28%)
> > >                237 page-faults              #      0.000 M/sec   ( +-   0.302% )
> > >             28,028 CPU-migrations           #      0.001 M/sec   ( +-   3.070% )
> > >          2,369,897 context-switches         #      0.065 M/sec   ( +-   0.006% )
> > >       36223.970238 task-clock-msecs         #      0.060 CPUs    ( +-   1.062% )
> > > 
> > >       605.909769823  seconds time elapsed   ( +-   0.783% )
> > > 
> > >  nr_vmscan_write 0
> > > 
> > 
> > Very nice!
> > 
> > > That's an increase of throughput by 30% and no writeback interference
> > > from reclaim.
> > > 
> > 
> > Any idea how much dd was varying in performance on each run? I'd
> > still expect a gain but I've found dd to vary wildly at times even
> > if conv=fdatasync,fsync is specified.
> 
> The fluctuation is in the figures after the 'seconds time elapsed'.
> It is less than 1% for the six runs.
> 
> Or did you mean something else?
> 

No, this is what I meant. I sometimes see very large variances but
that is usually on machines that are also doing other work. At the
moment for these tests, I'm seeing variances of +/- 1.5% for XFS and
+/- 3.6% for ext4, which is acceptable.

> > > As not every other allocation has to reclaim from a Normal zone full
> > > of dirty pages anymore, the patched kernel is also more responsive in
> > > general during the copy.
> > > 
> > > I am also running fs_mark on XFS on a 2G machine, but the final
> > > results are not in yet.  The preliminary results appear to be in this
> > > ballpark:
> > > 
> > > --- fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))
> > > 
> > > 3.0:
> > > 
> > > real    20m43.901s
> > > user    0m8.988s
> > > sys     0m58.227s
> > > nr_vmscan_write 3347
> > > 
> > > 3.0-per-zone-dirty:
> > > 
> > > real    20m8.012s
> > > user    0m8.862s
> > > sys     1m2.585s
> > > nr_vmscan_write 161
> > > 
> > 
> > That's roughly a 2.8% gain. I was seeing about 4.2% but was testing with
> > mem=1G, not 2G and there are a lot of factors at play.
> 
> [...]
> 
> > > #4 adds per-zone dirty throttling for __GFP_WRITE allocators, #5
> > > passes __GFP_WRITE from the grab_cache_page* functions in the hope to
> > > get most writers and no readers; I haven't checked all sites yet.
> > > 
> > > Discuss! :-)
> > > 
> > 
> > I think the performance gain may be due to flusher threads simply
> > being more aggressive and I suspect it will have a smaller effect on
> > NUMA where the flushers could be cleaning pages on the wrong node.
> 
> I ran this same test with statistics (which I now realize should
> probably become part of this series) and they indicated that the
> flushers were not woken a single time from the new code.
> 

Scratch that theory so!

> All it did in this case was defer future-dirty pages from the Normal
> zone to the DMA32 zone.
> 
> My understanding is that as the dirty pages are forcibly spread out
> into the bigger zone, reclaim and flushers become less likely to step
> on each other's toes.
> 

It makes more sense although I am surprised I didn't see something
similar in the initial tests.

> > That said, your figures are very promising and it is worth
> > an investigation and you should expand the number of filesystems
> > tested. I did a quick set of similar benchmarks locally. I only ran
> > dd once which is a major flaw but wanted to get a quick look.
> 
> Yeah, more testing is definitely going to happen on this.  I tried
> other filesystems with one-shot runs as well, just to see if anything
> stood out, but nothing conclusive.
> 
> > 4 kernels were tested.
> > 
> > vanilla:	3.0
> > lesskswapd	Patches 1-7 from my series
> > perzonedirty	Your patches
> > lessks-pzdirty	Both
> > 
> > Backing storage was a USB key. Kernel was booted with mem=4608M to
> > get a 500M highest zone similar to yours.
> 
> I think what I wrote was a bit misleading.  The zone size example was
> taken from my desktop machine to simply point out the different zone
> sizes in a simple UMA machine.  But I ran this test on my laptop,
> where the Normal zone is ~880MB (226240 present pages).
> 

I don't think that would make a massive difference. At the moment, I'm
testing with mem=512M, mem=1024M and mem=4608M.

> The dirty_background_ratio is 10 and dirty_ratio is 20, btw; ISTR
> that you had set them higher, and I expect that to be a factor.
> 

I was testing with dirty_ratio=40 to make the writeback-from-reclaim
problem worse, so that is another important difference between the
tests.

> The dd throughput is ~14 MB/s on the pzd kernel.
> 
> > SIMPLE WRITEBACK XFS
> >               simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
> >                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
> > 1                    526.83 ( 0.00%) 468.52 (12.45%) 542.05 (-2.81%) 464.42 (13.44%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)          7.27      7.34      7.69      7.96
> > Total Elapsed Time (seconds)                528.64    470.36    543.86    466.33
> > 
> > Direct pages scanned                             0         0         0         0
> > Direct pages reclaimed                           0         0         0         0
> > Kswapd pages scanned                       1058036   1167219   1060288   1169190
> > Kswapd pages reclaimed                      988591    979571    980278    981009
> > Kswapd efficiency                              93%       83%       92%       83%
> > Kswapd velocity                           2001.430  2481.544  1949.561  2507.216
> > Direct efficiency                             100%      100%      100%      100%
> > Direct velocity                              0.000     0.000     0.000     0.000
> > Percentage direct scans                         0%        0%        0%        0%
> > Page writes by reclaim                        4463      4587      4816      4910
> > Page reclaim invalidate                          0    145938         0    136510
> > 
> > Very few pages are being written back so I suspect any difference in
> > performance would be due to dd simply being very variable. I wasn't
> > running the monitoring that would tell me if the "Page writes" were
> > file-backed or anonymous but I assume they are file-backed. Your
> > patches did not seem to have much effect on the number of pages
> > written.
> 
> That's odd.  While it did not completely get rid of all file writes
> from reclaim, it reduced them consistently in all my tests so far.
> 

Do you see the same if dirty_ratio==40?

> I don't have swap space on any of my machines, but I wouldn't expect
> this to make a difference.
> 

Having no swap would affect the ratio of slab to LRU pages that are
reclaimed by slab shrinkers. It also affects the ratio of anon/file
pages that are isolated from the LRUs based on the calculations in
get_scan_count(). Either would affect the results, although I'd
expect the reclaiming of anonymous pages (increasing major faults
and swapping) to make a bigger difference than the shrinkers in a
test case involving dd to a single file.
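
(For reference, the no-swap short-circuit I have in mind in
get_scan_count() looks roughly like this; paraphrased, not a verbatim
quote of the 3.0 code:)

	/* With no swap space, do not bother scanning anon pages */
	if (!sc->may_swap || nr_swap_pages <= 0) {
		fraction[0] = 0;	/* anon LRUs */
		fraction[1] = 1;	/* file LRUs */
		denominator = 1;
		goto out;
	}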

Do you see the same results if swap is enabled?

> <SNIP>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

> > >       36223.970238 task-clock-msecs         #      0.060 CPUs    ( +-   1.062% )
> > > 
> > >       605.909769823  seconds time elapsed   ( +-   0.783% )
> > > 
> > >  nr_vmscan_write 0
> > > 
> > 
> > Very nice!
> > 
> > > That's an increase of throughput by 30% and no writeback interference
> > > from reclaim.
> > > 
> > 
> > Any idea how much dd was varying in performance on each run? I'd
> > still expect a gain but I've found dd to vary wildly at times even
> > if conv=fdatasync,fsync is specified.
> 
> The fluctuation is in the figures after the 'seconds time elapsed'.
> It is less than 1% for the six runs.
> 
> Or did you mean something else?
> 

No, this is what I meant. I sometimes see very large variances but
that is usually on machines that are also doing other work. At the
moment, for these tests, I'm seeing variances of +/- 1.5% for XFS and
+/- 3.6% for ext4, which is acceptable.

> > > As not every other allocation has to reclaim from a Normal zone full
> > > of dirty pages anymore, the patched kernel is also more responsive in
> > > general during the copy.
> > > 
> > > I am also running fs_mark on XFS on a 2G machine, but the final
> > > results are not in yet.  The preliminary results appear to be in this
> > > ballpark:
> > > 
> > > --- fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))
> > > 
> > > 3.0:
> > > 
> > > real    20m43.901s
> > > user    0m8.988s
> > > sys     0m58.227s
> > > nr_vmscan_write 3347
> > > 
> > > 3.0-per-zone-dirty:
> > > 
> > > real    20m8.012s
> > > user    0m8.862s
> > > sys     1m2.585s
> > > nr_vmscan_write 161
> > > 
> > 
> > Thats roughly a 2.8% gain. I was seeing about 4.2% but was testing with
> > mem=1G, not 2G and there are a lot of factors at play.
> 
> [...]
> 
> > > #4 adds per-zone dirty throttling for __GFP_WRITE allocators, #5
> > > passes __GFP_WRITE from the grab_cache_page* functions in the hope to
> > > get most writers and no readers; I haven't checked all sites yet.
> > > 
> > > Discuss! :-)
> > > 
> > 
> > I think the performance gain may be due to flusher threads simply
> > being more aggressive and I suspect it will have a smaller effect on
> > NUMA where the flushers could be cleaning pages on the wrong node.
> 
> I ran this same test with statistics (which I now realize should
> probably become part of this series) and they indicated that the
> flushers were not woken a single time from the new code.
> 

Scratch that theory so!

> All it did in this case was defer future-dirty pages from the Normal
> zone to the DMA32 zone.
> 
> My understanding is that as the dirty pages are forcibly spread out
> into the bigger zone, reclaim and flushers become less likely to step
> on each other's toes.
> 

That makes more sense, although I am surprised I didn't see something
similar in the initial tests.
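
For my own reference, my mental model of the allocator-side check is
roughly the following. This is only a sketch of the idea with a
made-up helper name, not the code from your series:

/*
 * Sketch: scale the global dirty ratio down to an individual zone and
 * only place __GFP_WRITE allocations on zones that are still below
 * their share of the limit.
 */
static bool zone_under_dirty_limit(struct zone *zone)
{
	unsigned long dirtyable = zone_page_state(zone, NR_FREE_PAGES) +
				  zone_reclaimable_pages(zone);
	unsigned long limit = dirtyable * vm_dirty_ratio / 100;
	unsigned long dirty = zone_page_state(zone, NR_FILE_DIRTY) +
			      zone_page_state(zone, NR_WRITEBACK) +
			      zone_page_state(zone, NR_UNSTABLE_NFS);

	return dirty <= limit;
}

The zonelist walk would then skip over-limit zones for __GFP_WRITE
allocations and fall back to the next, bigger zone, which matches the
"defer future-dirty pages to DMA32" behaviour you describe.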

> > That said, your figures are very promising and it is worth
> > an investigation and you should expand the number of filesystems
> > tested. I did a quick set of similar benchmarks locally. I only ran
> > dd once which is a major flaw but wanted to get a quick look.
> 
> Yeah, more testing is definitely going to happen on this.  I tried
> other filesystems with one-shot runs as well, just to see if anything
> stood out, but nothing conclusive.
> 
> > 4 kernels were tested.
> > 
> > vanilla:	3.0
> > lesskswapd	Patches 1-7 from my series
> > perzonedirty	Your patches
> > lessks-pzdirty	Both
> > 
> > Backing storage was a USB key. Kernel was booted with mem=4608M to
> > get a 500M highest zone similar to yours.
> 
> I think what I wrote was a bit misleading.  The zone size example was
> taken from my desktop machine to simply point out the different zones
> sizes in a simple UMA machine.  But I ran this test on my laptop,
> where the Normal zone is ~880MB (226240 present pages).
> 

I don't think that would make a massive difference. At the moment, I'm
testing with mem=512M, mem=1024M and mem=4608M.

> The dirty_background_ratio is 10, dirty_ratio is 20, btw, ISTR that
> you had set them higher and I expect that to be a factor.
> 

I was testing with dirty_ratio=40 to make the writeback-from-reclaim
problem worse, so that is another important difference between the
tests.

> The dd throughput is ~14 MB/s on the pzd kernel.
> 
> > SIMPLE WRITEBACK XFS
> >               simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
> >                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
> > 1                    526.83 ( 0.00%) 468.52 (12.45%) 542.05 (-2.81%) 464.42 (13.44%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)          7.27      7.34      7.69      7.96
> > Total Elapsed Time (seconds)                528.64    470.36    543.86    466.33
> > 
> > Direct pages scanned                             0         0         0         0
> > Direct pages reclaimed                           0         0         0         0
> > Kswapd pages scanned                       1058036   1167219   1060288   1169190
> > Kswapd pages reclaimed                      988591    979571    980278    981009
> > Kswapd efficiency                              93%       83%       92%       83%
> > Kswapd velocity                           2001.430  2481.544  1949.561  2507.216
> > Direct efficiency                             100%      100%      100%      100%
> > Direct velocity                              0.000     0.000     0.000     0.000
> > Percentage direct scans                         0%        0%        0%        0%
> > Page writes by reclaim                        4463      4587      4816      4910
> > Page reclaim invalidate                          0    145938         0    136510
> > 
> > Very few pages are being written back so I suspect any difference in
> > performance would be due to dd simply being very variable. I wasn't
> > running the monitoring that would tell me if the "Page writes" were
> > file-backed or anonymous but I assume they are file-backed. Your
> > patches did not seem to have much affect on the number of pages
> > written.
> 
> That's odd.  While it did not completely get rid of all file writes
> from reclaim, it reduced them consistently in all my tests so far.
> 

Do you see the same if dirty_ratio==40?

> I don't have swap space on any of my machines, but I wouldn't expect
> this to make a difference.
> 

Having no swap affects the ratio of slab pages to LRU pages that are
reclaimed by the slab shrinkers. It also affects the ratio of anon to
file pages that are isolated from the LRUs, based on the calculations
in get_scan_count(). Either would affect the results, although I'd
expect the reclaiming of anonymous pages, with the resulting major
faults and swapping, to make a bigger difference than the shrinkers in
a test case involving dd to a single file.
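
Very roughly, and glossing over the recent_rotated/recent_scanned
weighting, the split amounts to something like this (sketch only, with
made-up names, not the real function):

/* Simplified sketch of the anon/file balancing in get_scan_count(). */
static void scan_split(int swappiness, int have_swap,
		       unsigned long *anon_weight, unsigned long *file_weight)
{
	if (!have_swap) {
		/* Without swap, the anon LRUs are not scanned at all. */
		*anon_weight = 0;
		*file_weight = 1;
		return;
	}

	/*
	 * With swap, scan pressure is split according to swappiness and,
	 * in the real code, further weighted by the rotate/scan ratios.
	 */
	*anon_weight = swappiness;
	*file_weight = 200 - swappiness;
}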

Do you see the same results if swap is enabled?

> <SNIP>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 1/5] mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-27 12:50     ` Michal Hocko
  -1 siblings, 0 replies; 64+ messages in thread
From: Michal Hocko @ 2011-07-27 12:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara,
	Andi Kleen, linux-kernel

On Mon 25-07-11 22:19:15, Johannes Weiner wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> __GFP_OTHER_NODE is used for NUMA allocations on behalf of other
> nodes.  It's supposed to be passed through from the page allocator to
> zone_statistics(), but it never gets there as gfp_allowed_mask is not
> wide enough and masks out the flag early in the allocation path.
> 
> The result is an accounting glitch where successful NUMA allocations
> by-agent are not properly attributed as local.
> 
> Increase __GFP_BITS_SHIFT so that it includes __GFP_OTHER_NODE.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>
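
Just to spell the masking out for the archives, here is a tiny
userspace illustration of the arithmetic; it assumes the flag sits in
bit 23 (0x800000) as in this tree:

#include <stdio.h>

#define GFP_OTHER_NODE	0x800000u		/* bit 23 */
#define BITS_MASK(n)	((1u << (n)) - 1)

int main(void)
{
	unsigned int gfp = GFP_OTHER_NODE;

	/* Old shift of 23: the mask is 0x7fffff and the flag is stripped. */
	printf("shift 23: %#x\n", gfp & BITS_MASK(23));	/* prints 0 */

	/* New shift of 24: the mask is 0xffffff and the flag survives. */
	printf("shift 24: %#x\n", gfp & BITS_MASK(24));	/* prints 0x800000 */

	return 0;
}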

> ---
>  include/linux/gfp.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index cb40892..3a76faf 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -92,7 +92,7 @@ struct vm_area_struct;
>   */
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  
> -#define __GFP_BITS_SHIFT 23	/* Room for 23 __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 24	/* Room for N __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
>  /* This equals 0, but use constants in case they ever change */
> -- 
> 1.7.6

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 2/5] mm: writeback: make determine_dirtyable_memory static again
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-27 12:59     ` Michal Hocko
  -1 siblings, 0 replies; 64+ messages in thread
From: Michal Hocko @ 2011-07-27 12:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara,
	Andi Kleen, linux-kernel

On Mon 25-07-11 22:19:16, Johannes Weiner wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> The tracing ring-buffer used this function briefly, but not anymore.
> Make it local to the writeback code again.
> 
> Also, move the function so that no forward declaration needs to be
> reintroduced.

git grep says that the only reference is from page-writeback.c and the
symbol is not exported to modules, so this looks correct.
Moving it up in the file certainly makes sense.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/writeback.h |    2 -
>  mm/page-writeback.c       |   85 ++++++++++++++++++++++-----------------------
>  2 files changed, 42 insertions(+), 45 deletions(-)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 17e7ccc..8c63f3a 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -105,8 +105,6 @@ extern int vm_highmem_is_dirtyable;
>  extern int block_dump;
>  extern int laptop_mode;
>  
> -extern unsigned long determine_dirtyable_memory(void);
> -
>  extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *lenp,
>  		loff_t *ppos);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 31f6988..a4de005 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -111,6 +111,48 @@ EXPORT_SYMBOL(laptop_mode);
>  
>  /* End of sysctl-exported parameters */
>  
> +static unsigned long highmem_dirtyable_memory(unsigned long total)
> +{
> +#ifdef CONFIG_HIGHMEM
> +	int node;
> +	unsigned long x = 0;
> +
> +	for_each_node_state(node, N_HIGH_MEMORY) {
> +		struct zone *z =
> +			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> +
> +		x += zone_page_state(z, NR_FREE_PAGES) +
> +		     zone_reclaimable_pages(z);
> +	}
> +	/*
> +	 * Make sure that the number of highmem pages is never larger
> +	 * than the number of the total dirtyable memory. This can only
> +	 * occur in very strange VM situations but we want to make sure
> +	 * that this does not occur.
> +	 */
> +	return min(x, total);
> +#else
> +	return 0;
> +#endif
> +}
> +
> +/**
> + * determine_dirtyable_memory - amount of memory that may be used
> + *
> + * Returns the numebr of pages that can currently be freed and used
> + * by the kernel for direct mappings.
> + */
> +static unsigned long determine_dirtyable_memory(void)
> +{
> +	unsigned long x;
> +
> +	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +
> +	if (!vm_highmem_is_dirtyable)
> +		x -= highmem_dirtyable_memory(x);
> +
> +	return x + 1;	/* Ensure that we never return 0 */
> +}
>  
>  /*
>   * Scale the writeback cache size proportional to the relative writeout speeds.
> @@ -354,49 +396,6 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
>   * clamping level.
>   */
>  
> -static unsigned long highmem_dirtyable_memory(unsigned long total)
> -{
> -#ifdef CONFIG_HIGHMEM
> -	int node;
> -	unsigned long x = 0;
> -
> -	for_each_node_state(node, N_HIGH_MEMORY) {
> -		struct zone *z =
> -			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> -
> -		x += zone_page_state(z, NR_FREE_PAGES) +
> -		     zone_reclaimable_pages(z);
> -	}
> -	/*
> -	 * Make sure that the number of highmem pages is never larger
> -	 * than the number of the total dirtyable memory. This can only
> -	 * occur in very strange VM situations but we want to make sure
> -	 * that this does not occur.
> -	 */
> -	return min(x, total);
> -#else
> -	return 0;
> -#endif
> -}
> -
> -/**
> - * determine_dirtyable_memory - amount of memory that may be used
> - *
> - * Returns the numebr of pages that can currently be freed and used
> - * by the kernel for direct mappings.
> - */
> -unsigned long determine_dirtyable_memory(void)
> -{
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> -
> -	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> -}
> -
>  /*
>   * global_dirty_limits - background-writeback and dirty-throttling thresholds
>   *
> -- 
> 1.7.6
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 3/5] mm: writeback: remove seriously stale comment on dirty limits
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-27 13:38     ` Michal Hocko
  -1 siblings, 0 replies; 64+ messages in thread
From: Michal Hocko @ 2011-07-27 13:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara,
	Andi Kleen, linux-kernel

On Mon 25-07-11 22:19:17, Johannes Weiner wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Indeed, that comment was badly outdated.

Reviewed-by: Michal Hocko <mhocko@suse.cz>
(if a Reviewed-by makes any sense for a comment removal like this)

> ---
>  mm/page-writeback.c |   18 ------------------
>  1 files changed, 0 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a4de005..41dc871 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -379,24 +379,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>  
>  /*
> - * Work out the current dirty-memory clamping and background writeout
> - * thresholds.
> - *
> - * The main aim here is to lower them aggressively if there is a lot of mapped
> - * memory around.  To avoid stressing page reclaim with lots of unreclaimable
> - * pages.  It is better to clamp down on writers than to start swapping, and
> - * performing lots of scanning.
> - *
> - * We only allow 1/2 of the currently-unmapped memory to be dirtied.
> - *
> - * We don't permit the clamping level to fall below 5% - that is getting rather
> - * excessive.
> - *
> - * We make sure that the background writeout level is below the adjusted
> - * clamping level.
> - */
> -
> -/*
>   * global_dirty_limits - background-writeback and dirty-throttling thresholds
>   *
>   * Calculate the dirty thresholds based on sysctl parameters
> -- 
> 1.7.6
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-07-27 14:24     ` Michal Hocko
  -1 siblings, 0 replies; 64+ messages in thread
From: Michal Hocko @ 2011-07-27 14:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara,
	Andi Kleen, linux-kernel

On Mon 25-07-11 22:19:18, Johannes Weiner wrote:
[...]
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 41dc871..ce673ec 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -378,6 +390,24 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
>  }
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>  
> +static void sanitize_dirty_limits(unsigned long *pbackground,
> +				  unsigned long *pdirty)
> +{
> +	unsigned long background = *pbackground;
> +	unsigned long dirty = *pdirty;
> +	struct task_struct *tsk;
> +
> +	if (background >= dirty)
> +		background = dirty / 2;
> +	tsk = current;
> +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> +		background += background / 4;
> +		dirty += dirty / 4;
> +	}
> +	*pbackground = background;
> +	*pdirty = dirty;
> +}
> +
>  /*
>   * global_dirty_limits - background-writeback and dirty-throttling thresholds
>   *
> @@ -389,33 +419,52 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
>   */
>  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>  {
> -	unsigned long background;
> -	unsigned long dirty;
>  	unsigned long uninitialized_var(available_memory);
> -	struct task_struct *tsk;
>  
>  	if (!vm_dirty_bytes || !dirty_background_bytes)
>  		available_memory = determine_dirtyable_memory();
>  
>  	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +		*pdirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
>  	else
> -		dirty = (vm_dirty_ratio * available_memory) / 100;
> +		*pdirty = vm_dirty_ratio * available_memory / 100;
>  
>  	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +		*pbackground = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> +		*pbackground = dirty_background_ratio * available_memory / 100;
>  
> -	if (background >= dirty)
> -		background = dirty / 2;
> -	tsk = current;
> -	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> -		background += background / 4;
> -		dirty += dirty / 4;
> -	}
> -	*pbackground = background;
> -	*pdirty = dirty;
> +	sanitize_dirty_limits(pbackground, pdirty);
> +}

Hmm, wouldn't the patch be a little bit easier to read if this
refactoring was split out into a separate (cleanup) patch?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 0/5] mm: per-zone dirty limiting
  2011-07-26 18:05     ` Johannes Weiner
@ 2011-07-29 11:05       ` Mel Gorman
  -1 siblings, 0 replies; 64+ messages in thread
From: Mel Gorman @ 2011-07-29 11:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Tue, Jul 26, 2011 at 08:05:59PM +0200, Johannes Weiner wrote:
> > As dd is variable, I'm rerunning the tests to do 4 iterations and
> > multiple memory sizes for just xfs and ext4 to see what falls out. It
> > should take about 14 hours to complete assuming nothing screws up.
> 
> Awesome, thanks!
> 

While they in fact took about 30 hours to complete, I only got around
to packaging them up now. Unfortunately the tests were incomplete as
I needed the machine back for another use, but the results that did
complete are at http://www.csn.ul.ie/~mel/postings/hnaz-20110729/

Look for the comparison.html files such as this one

http://www.csn.ul.ie/~mel/postings/hnaz-20110729/global-dhp-512M__writeback-reclaimdirty-ext3/hydra/comparison.html

I'm afraid I haven't looked through them in detail.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 0/5] mm: per-zone dirty limiting
  2011-07-29 11:05       ` Mel Gorman
@ 2011-08-02 12:17         ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-08-02 12:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Fri, Jul 29, 2011 at 12:05:10PM +0100, Mel Gorman wrote:
> On Tue, Jul 26, 2011 at 08:05:59PM +0200, Johannes Weiner wrote:
> > > As dd is variable, I'm rerunning the tests to do 4 iterations and
> > > multiple memory sizes for just xfs and ext4 to see what falls out. It
> > > should take about 14 hours to complete assuming nothing screws up.
> > 
> > Awesome, thanks!
> > 
> 
> While they in fact took about 30 hours to complete, I only got around
> to packaging them up now. Unfortuantely the tests were incomplete as
> I needed the machine back for another use but the results that did
> complete are at http://www.csn.ul.ie/~mel/postings/hnaz-20110729/
> 
> Look for the comparison.html files such as this one
> 
> http://www.csn.ul.ie/~mel/postings/hnaz-20110729/global-dhp-512M__writeback-reclaimdirty-ext3/hydra/comparison.html
> 
> I'm afraid I haven't looked through them in detail.

Mel, thanks a lot for running those tests, you shall be compensated in
finest brewery goods some time.

Here is an attempt:

	global-dhp-512M__writeback-reclaimdirty-xfs

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
1                    1054.54 ( 0.00%) 386.65 (172.74%) 375.60 (180.76%) 375.88 (180.55%)
                 +/-            1.41%            4.56%            3.09%            2.34%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         32.27     29.97     30.65     30.91
Total Elapsed Time (seconds)               4220.48   1548.84   1504.64   1505.79

MMTests Statistics: vmstat
Page Ins                                    720433    392017    317097    343849
Page Outs                                 27746435  27673017  27619134  27555437
Swap Ins                                    173563     94196     74844     81954
Swap Outs                                   115864    100264     86833     70904
Direct pages scanned                       3268014      7515         0      1008
Kswapd pages scanned                       5351371  12045948   7973273   7923387
Kswapd pages reclaimed                     3320848   6498700   6486754   6492607
Direct pages reclaimed                     3267145      7243         0      1008
Kswapd efficiency                              62%       53%       81%       81%
Kswapd velocity                           1267.953  7777.400  5299.123  5261.947
Direct efficiency                              99%       96%      100%      100%
Direct velocity                            774.323     4.852     0.000     0.669
Percentage direct scans                        37%        0%        0%        0%
Page writes by reclaim                      130541    100265     86833     70904
Page writes file                             14677         1         0         0
Page writes anon                            115864    100264     86833     70904
Page reclaim invalidate                          0   3120195         0         0
Slabs scanned                                 8448      8448      8576      8448
Direct inode steals                              0         0         0         0
Kswapd inode steals                           1828      1837      2056      1918
Kswapd skipped wait                              0         1         0         0
Compaction stalls                                2         0         0         0
Compaction success                               1         0         0         0
Compaction failures                              1         0         0         0
Compaction pages moved                           0         0         0         0
Compaction move failure                          0         0         0         0

While file writes from reclaim are prevented by both patch sets on
their own, perzonedirty decreases the number of anonymous pages
swapped out because reclaim is always able to make progress instead of
wasting its file scan budget on shuffling dirty pages.  With
lesskswapd added on top, swapping is throttled in reclaim by the ratio
of dirty pages to isolated pages.

The runtime improvements speak for both perzonedirty and
perzonedirty+lesskswapd.  Given the swap upside and increased reclaim
efficiency, the combination of both appears to be the most desirable.
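
(As a reminder of how the derived numbers are computed: efficiency is
pages reclaimed divided by pages scanned and velocity is pages scanned
divided by elapsed seconds, e.g. for perzonedirty above that is
6486754 / 7973273 ~= 81% and 7973273 / 1504.64s ~= 5299 pages/s.)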

	global-dhp-512M__writeback-reclaimdirty-ext3

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
1                    1762.23 ( 0.00%) 987.73 (78.41%) 983.82 (79.12%)
                 +/-            4.35%            2.24%            1.56%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         46.36     44.07        46
Total Elapsed Time (seconds)               7053.28   3956.60   3940.39

MMTests Statistics: vmstat
Page Ins                                    965236    661660    629972
Page Outs                                 27984332  27922904  27715628
Swap Ins                                    231181    158799    137341
Swap Outs                                   151395    142150     88644
Direct pages scanned                       2749884     11138   1315072
Kswapd pages scanned                       6340921  12591169   6599999
Kswapd pages reclaimed                     3915635   6576549   5264406
Direct pages reclaimed                     2749002     10877   1314842
Kswapd efficiency                              61%       52%       79%
Kswapd velocity                            899.003  3182.320  1674.961
Direct efficiency                              99%       97%       99%
Direct velocity                            389.873     2.815   333.742
Percentage direct scans                        30%        0%       16%
Page writes by reclaim                      620698    142155     88645
Page writes file                            469303         5         1
Page writes anon                            151395    142150     88644
Page reclaim invalidate                          0   3717819         0
Slabs scanned                                 8704      8576     33408
Direct inode steals                              0         0       466
Kswapd inode steals                           1872      2107      2115
Kswapd skipped wait                              0         1         0
Compaction stalls                                2         0         1
Compaction success                               1         0         0
Compaction failures                              1         0         1
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

perzonedirty has the highest reclaim efficiency, the lowest writeout
counts from reclaim, and the shortest runtime.

While file writes are practically gone with both lesskswapd and
perzonedirty on their own, the latter also reduces swapping by 40%.

I expect the combination of both series to have the best results here
as well.

	global-dhp-512M__writeback-reclaimdirty-ext4

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
1                    405.42 ( 0.00%) 410.48 (-1.23%) 401.77 ( 0.91%)
                 +/-            3.62%            4.45%            2.82%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         31.25      31.4     31.37
Total Elapsed Time (seconds)               1624.60   1644.56   1609.67

MMTests Statistics: vmstat
Page Ins                                    354364    403612    332812
Page Outs                                 27607792  27709096  27536412
Swap Ins                                     84065     96398     79219
Swap Outs                                    83096    108478     65342
Direct pages scanned                           112         0        56
Kswapd pages scanned                      12207898  12063862   7615377
Kswapd pages reclaimed                     6492490   6504947   6486946
Direct pages reclaimed                         112         0        56
Kswapd efficiency                              53%       53%       85%
Kswapd velocity                           7514.402  7335.617  4731.018
Direct efficiency                             100%      100%      100%
Direct velocity                              0.069     0.000     0.035
Percentage direct scans                         0%        0%        0%
Page writes by reclaim                     3076760    108483     65342
Page writes file                           2993664         5         0
Page writes anon                             83096    108478     65342
Page reclaim invalidate                          0   3291697         0
Slabs scanned                                 8448      8448      8448
Direct inode steals                              0         0         0
Kswapd inode steals                           1979      1993      1945
Kswapd skipped wait                              1         0         0
Compaction stalls                                0         0         0
Compaction success                               0         0         0
Compaction failures                              0         0         0
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

With lesskswapd, both runtime and swapouts increased.  My only guess
is that in this configuration, the writepage calls actually improve
things to a certain extent.

Otherwise, nothing stands out to me here, and the same as above
applies wrt runtime and reclaim efficiency being the best with
perzonedirty.

	global-dhp-1024M__writeback-reclaimdirty-ext3

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
1                    1291.74 ( 0.00%) 1034.56 (24.86%) 1023.04 (26.26%)
                 +/-            2.77%            1.98%            4.42%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         42.41     41.97     43.49
Total Elapsed Time (seconds)               5176.73   4142.26   4096.57

MMTests Statistics: vmstat
Page Ins                                     27856     24392     23292
Page Outs                                 27360416  27352736  27352700
Swap Ins                                         1         6         0
Swap Outs                                        2        39        32
Direct pages scanned                          5899         0         0
Kswapd pages scanned                       6500396   7948564   6014854
Kswapd pages reclaimed                     6008477   6012586   6013794
Direct pages reclaimed                        5899         0         0
Kswapd efficiency                              92%       75%       99%
Kswapd velocity                           1255.695  1918.895  1468.266
Direct efficiency                             100%      100%      100%
Direct velocity                              1.140     0.000     0.000
Percentage direct scans                         0%        0%        0%
Page writes by reclaim                      181091        39        32
Page writes file                            181089         0         0
Page writes anon                                 2        39        32
Page reclaim invalidate                          0   1843189         0
Slabs scanned                                 3840      3840      4096
Direct inode steals                              0         0         0
Kswapd inode steals                              0         0         0
Kswapd skipped wait                              0         0         0
Compaction stalls                                0         0         0
Compaction success                               0         0         0
Compaction failures                              0         0         0
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

Writes from reclaim are reduced to practically nothing by both
patchsets, but perzonedirty standalone wins in runtime and reclaim
efficiency.

	global-dhp-1024M__writeback-reclaimdirty-ext4

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
1                    434.46 ( 0.00%) 432.42 ( 0.47%) 429.15 ( 1.24%)
                 +/-            2.62%            2.15%            2.47%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         29.44     29.37     29.64
Total Elapsed Time (seconds)               1740.46   1732.34   1719.08

MMTests Statistics: vmstat
Page Ins                                     15216     14728     12936
Page Outs                                 27274352  27274144  27274236
Swap Ins                                        12         0         7
Swap Outs                                       13         0        29
Direct pages scanned                             0         0         0
Kswapd pages scanned                       8151970   7662106   5989819
Kswapd pages reclaimed                     5990667   5987919   5988646
Direct pages reclaimed                           0         0         0
Kswapd efficiency                              73%       78%       99%
Kswapd velocity                           4683.802  4422.980  3484.317
Direct efficiency                             100%      100%      100%
Direct velocity                              0.000     0.000     0.000
Percentage direct scans                         0%        0%        0%
Page writes by reclaim                     1889005         0        29
Page writes file                           1888992         0         0
Page writes anon                                13         0        29
Page reclaim invalidate                          0   1574594         0
Slabs scanned                                 3968      3840      3968
Direct inode steals                              0         0         0
Kswapd inode steals                              0         0         0
Kswapd skipped wait                              0         0         0
Compaction stalls                                0         0         0
Compaction success                               0         0         0
Compaction failures                              0         0         0
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

As with ext3, perzonedirty is best in overall runtime and reclaim
efficiency.

	global-dhp-1024M__writeback-reclaimdirty-xfs

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
1                    757.46 ( 0.00%) 387.51 (95.47%) 381.90 (98.34%)
                 +/-            3.03%            1.41%            1.13%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         28.68     27.86     29.25
Total Elapsed Time (seconds)               3032.05   1552.22   1529.82

MMTests Statistics: vmstat
Page Ins                                     23325     13801     13733
Page Outs                                 27277838  27271665  27272055
Swap Ins                                         1         0         0
Swap Outs                                       24         0        58
Direct pages scanned                         37729         0         0
Kswapd pages scanned                       6340969   7643093   5994387
Kswapd pages reclaimed                     5959043   5990117   5993349
Direct pages reclaimed                       37388         0         0
Kswapd efficiency                              93%       78%       99%
Kswapd velocity                           2091.314  4923.975  3918.361
Direct efficiency                              99%      100%      100%
Direct velocity                             12.443     0.000     0.000
Percentage direct scans                         0%        0%        0%
Page writes by reclaim                        7148         0        58
Page writes file                              7124         0         0
Page writes anon                                24         0        58
Page reclaim invalidate                          0   1552818         0
Slabs scanned                                 4224      3968      3968
Direct inode steals                              0         0         0
Kswapd inode steals                              0         0         0
Kswapd skipped wait                              0         0         0
Compaction stalls                                0         0         0
Compaction success                               0         0         0
Compaction failures                              0         0         0
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

As with ext3 and ext4, perzonedirty is best in overall runtime and
reclaim efficiency.
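
(A note on how I read the derived rows in these tables, taking the
perzonedirty column of the xfs table above as an example: "Kswapd
efficiency" looks like reclaimed/scanned, 5993349 / 5994387 ~= 99%,
and "Kswapd velocity" like pages scanned per second of elapsed time,
5994387 / 1529.82 ~= 3918.  So efficiency says how much of the
scanning was useful, and velocity how hard kswapd had to work for it.)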

	global-dhp-4608M__writeback-reclaimdirty-ext3

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
1                    1274.37 ( 0.00%) 1204.00 ( 5.84%) 1317.79 (-3.29%)
                 +/-            2.02%            2.03%            3.05%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         43.93      44.4     45.85
Total Elapsed Time (seconds)               5130.22   4824.17   5278.84

MMTests Statistics: vmstat
Page Ins                                     44004     43704     44492
Page Outs                                 27391592  27386240  27390108
Swap Ins                                      6968      5855      6091
Swap Outs                                     8846      8024      8065
Direct pages scanned                             0         0    115384
Kswapd pages scanned                       4234168   4656846   4105795
Kswapd pages reclaimed                     3899101   3893500   3776056
Direct pages reclaimed                           0         0    115347
Kswapd efficiency                              92%       83%       91%
Kswapd velocity                            825.338   965.315   777.784
Direct efficiency                             100%      100%       99%
Direct velocity                              0.000     0.000    21.858
Percentage direct scans                         0%        0%        2%
Page writes by reclaim                       42555      8024     40622
Page writes file                             33709         0     32557
Page writes anon                              8846      8024      8065
Page reclaim invalidate                          0    586463         0
Slabs scanned                                 3712      3840      3840
Direct inode steals                              0         0         0
Kswapd inode steals                              0         0         0
Kswapd skipped wait                              0         0         0
Compaction stalls                                0         0         0
Compaction success                               0         0         0
Compaction failures                              0         0         0
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

Here, perzonedirty fails to ensure enough clean pages in what I guess
is a small Normal zone sitting on top of the DMA32 zone.  The
(not-yet-optimized) per-zone dirty checks cost CPU time but do not
pay off, and reclaim still runs into dirty pages.

Mel, can you say how big exactly the Normal zone is with this setup?

My theory is that the closer (file_pages - dirty_pages) is to the high
watermark which kswapd tries to balance to, the more likely it is to
run into dirty pages.  And to my knowledge, these tests are run with a
non-standard 40% dirty ratio, which lowers the threshold at which
perzonedirty falls apart.  Per-zone dirty limits should probably take
the high watermark into account.
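
To make that concrete, here is a minimal sketch of what a
watermark-aware per-zone dirty limit could look like (toy code with
made-up names, not what the series currently does): the pages kswapd
has to keep free for the high watermark are never available to hold
dirty page cache, so subtract them before applying the ratio:

static unsigned long zone_dirty_limit(unsigned long free_pages,
                                      unsigned long file_pages,
                                      unsigned long high_wmark,
                                      unsigned int dirty_ratio)
{
        unsigned long dirtyable = free_pages + file_pages;

        /* Pages kswapd must keep free cannot hold dirty page cache */
        if (dirtyable <= high_wmark)
                return 0;
        dirtyable -= high_wmark;

        return dirtyable * dirty_ratio / 100;
}

With a 40% ratio and a zone whose file pages hover near the high
watermark, the allowed number of dirty pages would then shrink towards
zero instead of leaving kswapd to chew through dirty pages.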

This does not explain the regression to me, however, if the Normal
zone here is about the same size as the DMA32 zone in the 512M tests
above, for which perzonedirty was an unambiguous improvement.

What also makes me wonder is that something in perzonedirty makes
kswapd less efficient in the 4G tests, which is the opposite of the
effect it had in all other setups.  This increases direct reclaim
invocations against the preferred Normal zone.  The higher pressure
could also explain why reclaim rushes through the clean pages and runs
into dirty pages more quickly.

Does anyone have a theory about what might be going on here?

The tests with other filesystems on 4G memory look similarly bleak for
perzonedirty:

	global-dhp-4608M__writeback-reclaimdirty-ext4

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
1                    396.85 ( 0.00%) 437.61 (-9.31%) 404.65 (-1.93%)
                 +/-            13.10%            16.04%            16.35%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         30.46     30.52     32.28
Total Elapsed Time (seconds)               1591.42   1754.49   1622.63

MMTests Statistics: vmstat
Page Ins                                     37316     38984     36816
Page Outs                                 27304668  27305952  27307584
Swap Ins                                      6705      6728      6840
Swap Outs                                     7989      7911      8431
Direct pages scanned                             0         0         0
Kswapd pages scanned                       4627064   4644718   4618129
Kswapd pages reclaimed                     3883654   3891597   3878173
Direct pages reclaimed                           0         0         0
Kswapd efficiency                              83%       83%       83%
Kswapd velocity                           2907.507  2647.332  2846.076
Direct efficiency                             100%      100%      100%
Direct velocity                              0.000     0.000     0.000
Percentage direct scans                         0%        0%        0%
Page writes by reclaim                      586753      7911    588292
Page writes file                            578764         0    579861
Page writes anon                              7989      7911      8431
Page reclaim invalidate                          0    591028         0
Slabs scanned                                 3840      3840      4096
Direct inode steals                              0         0         0
Kswapd inode steals                              0         0         0
Kswapd skipped wait                              0         0         0
Compaction stalls                                0         0         0
Compaction success                               0         0         0
Compaction failures                              0         0         0
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

	global-dhp-4608M__writeback-reclaimdirty-xfs

SIMPLE WRITEBACK
              simple-writeback   writeback-3.0.0   writeback-3.0.0
                 3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
1                    531.54 ( 0.00%) 404.88 (31.28%) 546.32 (-2.71%)
                 +/-            1.77%            7.06%            1.01%
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         29.35     30.04     30.63
Total Elapsed Time (seconds)               2129.69   1623.11   2188.73

MMTests Statistics: vmstat
Page Ins                                     38329     37173     35117
Page Outs                                 27307040  27304636  27305927
Swap Ins                                      6469      6239      5138
Swap Outs                                     8292      8299      7934
Direct pages scanned                             0         0    117901
Kswapd pages scanned                       4197481   4630492   4060306
Kswapd pages reclaimed                     3880444   3882479   3767544
Direct pages reclaimed                           0         0    117872
Kswapd efficiency                              92%       83%       92%
Kswapd velocity                           1970.935  2852.852  1855.097
Direct efficiency                             100%      100%       99%
Direct velocity                              0.000     0.000    53.867
Percentage direct scans                         0%        0%        2%
Page writes by reclaim                        9667      8299      9249
Page writes file                              1375         0      1315
Page writes anon                              8292      8299      7934
Page reclaim invalidate                          0    575703         0
Slabs scanned                                 3840      3712      4352
Direct inode steals                              0         0         0
Kswapd inode steals                              0         0         0
Kswapd skipped wait                              0         0         0
Compaction stalls                                0         0         0
Compaction success                               0         0         0
Compaction failures                              0         0         0
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

I am doubly confused because I ran similar tests with 4G memory and
got contradictory results.  Will rerun those to make sure.

Comments?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 0/5] mm: per-zone dirty limiting
  2011-08-02 12:17         ` Johannes Weiner
@ 2011-08-03 13:18           ` Mel Gorman
  -1 siblings, 0 replies; 64+ messages in thread
From: Mel Gorman @ 2011-08-03 13:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Tue, Aug 02, 2011 at 02:17:33PM +0200, Johannes Weiner wrote:
> On Fri, Jul 29, 2011 at 12:05:10PM +0100, Mel Gorman wrote:
> > On Tue, Jul 26, 2011 at 08:05:59PM +0200, Johannes Weiner wrote:
> > > > As dd is variable, I'm rerunning the tests to do 4 iterations and
> > > > multiple memory sizes for just xfs and ext4 to see what falls out. It
> > > > should take about 14 hours to complete assuming nothing screws up.
> > > 
> > > Awesome, thanks!
> > > 
> > 
> > While they in fact took about 30 hours to complete, I only got around
> > to packaging them up now. Unfortuantely the tests were incomplete as
> > I needed the machine back for another use but the results that did
> > complete are at http://www.csn.ul.ie/~mel/postings/hnaz-20110729/
> > 
> > Look for the comparison.html files such as this one
> > 
> > http://www.csn.ul.ie/~mel/postings/hnaz-20110729/global-dhp-512M__writeback-reclaimdirty-ext3/hydra/comparison.html
> > 
> > I'm afraid I haven't looked through them in detail.
> 
> Mel, thanks a lot for running those tests, you shall be compensated in
> finest brewery goods some time.
> 

Sweet.

> Here is an attempt:
> 
> 	global-dhp-512M__writeback-reclaimdirty-xfs
> 
> SIMPLE WRITEBACK
>               simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
>                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
> 1                    1054.54 ( 0.00%) 386.65 (172.74%) 375.60 (180.76%) 375.88 (180.55%)
>                  +/-            1.41%            4.56%            3.09%            2.34%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         32.27     29.97     30.65     30.91
> Total Elapsed Time (seconds)               4220.48   1548.84   1504.64   1505.79
> 
> MMTests Statistics: vmstat
> Page Ins                                    720433    392017    317097    343849
> Page Outs                                 27746435  27673017  27619134  27555437
> Swap Ins                                    173563     94196     74844     81954
> Swap Outs                                   115864    100264     86833     70904
> Direct pages scanned                       3268014      7515         0      1008
> Kswapd pages scanned                       5351371  12045948   7973273   7923387
> Kswapd pages reclaimed                     3320848   6498700   6486754   6492607
> Direct pages reclaimed                     3267145      7243         0      1008
> Kswapd efficiency                              62%       53%       81%       81%
> Kswapd velocity                           1267.953  7777.400  5299.123  5261.947
> Direct efficiency                              99%       96%      100%      100%
> Direct velocity                            774.323     4.852     0.000     0.669
> Percentage direct scans                        37%        0%        0%        0%
> Page writes by reclaim                      130541    100265     86833     70904
> Page writes file                             14677         1         0         0
> Page writes anon                            115864    100264     86833     70904
> Page reclaim invalidate                          0   3120195         0         0
> Slabs scanned                                 8448      8448      8576      8448
> Direct inode steals                              0         0         0         0
> Kswapd inode steals                           1828      1837      2056      1918
> Kswapd skipped wait                              0         1         0         0
> Compaction stalls                                2         0         0         0
> Compaction success                               1         0         0         0
> Compaction failures                              1         0         0         0
> Compaction pages moved                           0         0         0         0
> Compaction move failure                          0         0         0         0
> 
> While file writes from reclaim are prevented by both patches on their
> own, perzonedirty decreases the amount of anonymous pages swapped out
> because reclaim is always able to make progress instead of wasting its
> file scan budget on shuffling dirty pages. 

Good observation, and it's related to the usual problem of balancing
multiple LRU lists and the consequences that can have. I had wondered
whether it was worth moving dirty pages marked PageReclaim to a
separate LRU list, but worried that young clean file pages would then
be reclaimed before old anonymous pages.

> With lesskswapd in
> addition, swapping is throttled in reclaim by the ratio of dirty pages
> to isolated pages.
> 
> The runtime improvements speak for both perzonedirty and
> perzonedirty+lesskswapd.  Given the swap upside and increased reclaim
> efficiency, the combination of both appears to be the most desirable.
> 
> 	global-dhp-512M__writeback-reclaimdirty-ext3
> 

Agreed.

> SIMPLE WRITEBACK
>               simple-writeback   writeback-3.0.0   writeback-3.0.0
>                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
> 1                    1762.23 ( 0.00%) 987.73 (78.41%) 983.82 (79.12%)
>                  +/-            4.35%            2.24%            1.56%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         46.36     44.07        46
> Total Elapsed Time (seconds)               7053.28   3956.60   3940.39
> 
> MMTests Statistics: vmstat
> Page Ins                                    965236    661660    629972
> Page Outs                                 27984332  27922904  27715628
> Swap Ins                                    231181    158799    137341
> Swap Outs                                   151395    142150     88644
> Direct pages scanned                       2749884     11138   1315072
> Kswapd pages scanned                       6340921  12591169   6599999
> Kswapd pages reclaimed                     3915635   6576549   5264406
> Direct pages reclaimed                     2749002     10877   1314842
> Kswapd efficiency                              61%       52%       79%
> Kswapd velocity                            899.003  3182.320  1674.961
> Direct efficiency                              99%       97%       99%
> Direct velocity                            389.873     2.815   333.742
> Percentage direct scans                        30%        0%       16%
> Page writes by reclaim                      620698    142155     88645
> Page writes file                            469303         5         1
> Page writes anon                            151395    142150     88644
> Page reclaim invalidate                          0   3717819         0
> Slabs scanned                                 8704      8576     33408
> Direct inode steals                              0         0       466
> Kswapd inode steals                           1872      2107      2115
> Kswapd skipped wait                              0         1         0
> Compaction stalls                                2         0         1
> Compaction success                               1         0         0
> Compaction failures                              1         0         1
> Compaction pages moved                           0         0         0
> Compaction move failure                          0         0         0
> 
> perzonedirty the highest reclaim efficiencies, the lowest writeout
> counts from reclaim, and the shortest runtime.
> 
> While file writes are practically gone with both lesskswapd and
> perzonedirty on their own, the latter also reduces swapping by 40%.
> 

Similar observation as before - fewer anonymous pages are being
reclaimed. This should also have a positive effect when writing to a
USB stick, avoiding disruption of running applications.

I do note that there were a large number of pages direct reclaimed,
though. It'd be worth keeping an eye on stall times there, be it due
to congestion or to page allocator latency.

> I expect the combination of both series to have the best results here
> as well.
> 

Quite likely. I regret the combination tests did not have a chance to
run but I'm sure there will be more than one revision.

> 	global-dhp-512M__writeback-reclaimdirty-ext4
> 
> SIMPLE WRITEBACK
>               simple-writeback   writeback-3.0.0   writeback-3.0.0
>                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
> 1                    405.42 ( 0.00%) 410.48 (-1.23%) 401.77 ( 0.91%)
>                  +/-            3.62%            4.45%            2.82%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         31.25      31.4     31.37
> Total Elapsed Time (seconds)               1624.60   1644.56   1609.67
> 
> MMTests Statistics: vmstat
> Page Ins                                    354364    403612    332812
> Page Outs                                 27607792  27709096  27536412
> Swap Ins                                     84065     96398     79219
> Swap Outs                                    83096    108478     65342
> Direct pages scanned                           112         0        56
> Kswapd pages scanned                      12207898  12063862   7615377
> Kswapd pages reclaimed                     6492490   6504947   6486946
> Direct pages reclaimed                         112         0        56
> Kswapd efficiency                              53%       53%       85%
> Kswapd velocity                           7514.402  7335.617  4731.018
> Direct efficiency                             100%      100%      100%
> Direct velocity                              0.069     0.000     0.035
> Percentage direct scans                         0%        0%        0%
> Page writes by reclaim                     3076760    108483     65342
> Page writes file                           2993664         5         0
> Page writes anon                             83096    108478     65342
> Page reclaim invalidate                          0   3291697         0
> Slabs scanned                                 8448      8448      8448
> Direct inode steals                              0         0         0
> Kswapd inode steals                           1979      1993      1945
> Kswapd skipped wait                              1         0         0
> Compaction stalls                                0         0         0
> Compaction success                               0         0         0
> Compaction failures                              0         0         0
> Compaction pages moved                           0         0         0
> Compaction move failure                          0         0         0
> 
> With lesskswapd, both runtime and swapouts increased.  My only guess
> is that in this configuration, the writepage calls actually improve
> things to a certain extent.
> 

A possible explanation is that file pages are being skipped but still
accounted for as scanned. shrink_zone() is called more as a result and
the anonymous lists are being shrunk more relative to the file lists.
One way to test the theory would be to not count dirty pages marked
PageReclaim as scanned.
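
As a throwaway illustration of what that accounting change would mean
(toy code with made-up names, not based on the real shrink_page_list()
loop), the only difference is whether a skipped dirty page bumps the
scanned counter that later feeds the anon/file scan balancing:

struct toy_page { int dirty; int reclaim; }; /* stand-ins for PageDirty/PageReclaim */

static unsigned long count_scanned(const struct toy_page *pages, unsigned long n)
{
        unsigned long nr_scanned = 0;
        unsigned long i;

        for (i = 0; i < n; i++) {
                /* dirty page already tagged for writeback-then-reclaim */
                if (pages[i].dirty && pages[i].reclaim)
                        continue;       /* skipped, so not counted as scanned */
                nr_scanned++;
        }
        return nr_scanned;
}

If the skipped pages no longer inflate nr_scanned, the file lists stop
looking like they are under heavy but fruitless pressure, and the
anonymous lists should see proportionally less scanning.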

> Otherwise, nothing stands out to me here, and the same as above
> applies wrt runtime and reclaim efficiency being the best with
> perzonedirty.
> 
> 	global-dhp-1024M__writeback-reclaimdirty-ext3
> 
> SIMPLE WRITEBACK
>               simple-writeback   writeback-3.0.0   writeback-3.0.0
>                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
> 1                    1291.74 ( 0.00%) 1034.56 (24.86%) 1023.04 (26.26%)
>                  +/-            2.77%            1.98%            4.42%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         42.41     41.97     43.49
> Total Elapsed Time (seconds)               5176.73   4142.26   4096.57
> 
> MMTests Statistics: vmstat
> Page Ins                                     27856     24392     23292
> Page Outs                                 27360416  27352736  27352700
> Swap Ins                                         1         6         0
> Swap Outs                                        2        39        32
> Direct pages scanned                          5899         0         0
> Kswapd pages scanned                       6500396   7948564   6014854
> Kswapd pages reclaimed                     6008477   6012586   6013794
> Direct pages reclaimed                        5899         0         0
> Kswapd efficiency                              92%       75%       99%
> Kswapd velocity                           1255.695  1918.895  1468.266
> Direct efficiency                             100%      100%      100%
> Direct velocity                              1.140     0.000     0.000
> Percentage direct scans                         0%        0%        0%
> Page writes by reclaim                      181091        39        32
> Page writes file                            181089         0         0
> Page writes anon                                 2        39        32
> Page reclaim invalidate                          0   1843189         0
> Slabs scanned                                 3840      3840      4096
> Direct inode steals                              0         0         0
> Kswapd inode steals                              0         0         0
> Kswapd skipped wait                              0         0         0
> Compaction stalls                                0         0         0
> Compaction success                               0         0         0
> Compaction failures                              0         0         0
> Compaction pages moved                           0         0         0
> Compaction move failure                          0         0         0
> 
> Writes from reclaim are reduced to practically nothing by both
> patchsets, but perzonedirty standalone wins in runtime and reclaim
> efficiency.
> 

Yep, the figures do support the patchset being brought to completion,
assuming issues like lowmem pressure and any risk associated with
using wakeup_flusher_threads can be ironed out.

> 	global-dhp-1024M__writeback-reclaimdirty-ext4
> 
> <SNIP, looks good>
> 
> 	global-dhp-4608M__writeback-reclaimdirty-ext3
> 
> SIMPLE WRITEBACK
>               simple-writeback   writeback-3.0.0   writeback-3.0.0
>                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1
> 1                    1274.37 ( 0.00%) 1204.00 ( 5.84%) 1317.79 (-3.29%)
>                  +/-            2.02%            2.03%            3.05%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         43.93      44.4     45.85
> Total Elapsed Time (seconds)               5130.22   4824.17   5278.84
> 
> MMTests Statistics: vmstat
> Page Ins                                     44004     43704     44492
> Page Outs                                 27391592  27386240  27390108
> Swap Ins                                      6968      5855      6091
> Swap Outs                                     8846      8024      8065
> Direct pages scanned                             0         0    115384
> Kswapd pages scanned                       4234168   4656846   4105795
> Kswapd pages reclaimed                     3899101   3893500   3776056
> Direct pages reclaimed                           0         0    115347
> Kswapd efficiency                              92%       83%       91%
> Kswapd velocity                            825.338   965.315   777.784
> Direct efficiency                             100%      100%       99%
> Direct velocity                              0.000     0.000    21.858
> Percentage direct scans                         0%        0%        2%
> Page writes by reclaim                       42555      8024     40622
> Page writes file                             33709         0     32557
> Page writes anon                              8846      8024      8065
> Page reclaim invalidate                          0    586463         0
> Slabs scanned                                 3712      3840      3840
> Direct inode steals                              0         0         0
> Kswapd inode steals                              0         0         0
> Kswapd skipped wait                              0         0         0
> Compaction stalls                                0         0         0
> Compaction success                               0         0         0
> Compaction failures                              0         0         0
> Compaction pages moved                           0         0         0
> Compaction move failure                          0         0         0
> 
> Here, perzonedirty fails to ensure enough clean pages in what I guess
> is a small Normal zone on top of the DMA32 zone.  The
> (not-yet-optimized) per-zone dirty checks cost CPU time but they do
> not pay off and dirty pages are still encountered by reclaim.
> 
> Mel, can you say how big exactly the Normal zone is with this setup?
> 

Normal zone == 129280 pages == 505M. DMA32 is 701976 pages or
2742M. Admittedly not small enough to cause the worst of the problems
related to a small uppermost zone, but enough to cause a lot of direct
reclaim activity with plenty of file writeback.
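
(For reference, with 4 KiB pages that is 129280 * 4 KiB = 505 MiB for
Normal versus 701976 * 4 KiB ~= 2742 MiB for DMA32, so the Normal zone
is a bit under a fifth the size of DMA32.)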

> My theory is that the closer (file_pages - dirty_pages) is to the high
> watermark which kswapd tries to balance to, the more likely it is to
> run into dirty pages.  And to my knowledge, these tests are run with a
> non-standard 40% dirty ratio, which lowers the threshold at which
> perzonedirty falls apart.  Per-zone dirty limits should probably take
> the high watermark into account.
> 

That would appear sensible. The choice of a 40% dirty ratio is
deliberate. My understanding is that a number of IO-intensive servers
have the dirty ratio tuned to this value. In bug reports I've seen for
distro kernels related to IO slowdowns, it seemed to be a common
choice; I suspect it's tuned to this because it used to be the
default. Of course, 40% also made the writeback problem worse, so the
effect of the patches is easier to see.

> This does not explain the regression to me, however, if the Normal
> zone here is about the same size as the DMA32 zone in the 512M tests
> above, for which perzonedirty was an unambiguous improvement.
> 

The Normal zone is not the same size as DMA32, so scratch that.

Note that the slowdown here is small. The vanilla kernel finishes
in 1274.37 +/- 2.04%. Your patched result is 1317.79 +/- 3.05%, so
there is some overlap. kswapd is less aggressive and direct reclaim is
used more, which might be sufficient to explain the slowdown. An
avenue of investigation is why kswapd is reclaiming so much less. It
can't be just the use of writepage, or the vanilla kernel would show
similar scan and reclaim rates.

> What makes me wonder, is that in addition, something in perzonedirty
> makes kswapd less efficient in the 4G tests, which is the opposite
> effect it had in all other setups.  This increases direct reclaim
> invocations against the preferred Normal zone.  The higher pressure
> could also explain why reclaim rushes through the clean pages and runs
> into dirty pages quicker.
> 
> Does anyone have a theory about what might be going on here?
> 

This is tenuous at best and I confess I have not thought deeply
about it, but it could be due to the relative age of the pages in the
highest zone.

In the vanilla kernel, the Normal zone gets filled with dirty pages
first and then the lower zones get used up until the dirty ratio is
hit and the flusher threads get woken. Because the highest zone also
has the oldest pages and presumably the oldest inodes, that zone gets
fully cleaned by the flusher. The pattern is "fill the zone with dirty
pages, use the lower zones, highest zone gets fully cleaned, reclaimed
and refilled with dirty pages, repeat".

In the patched kernel, lower zones are used when the dirty limits of a
zone are met and the flusher threads are woken to clean a small number
of pages but not the full zone. Reclaim takes the clean pages and they
get replaced with younger dirty pages. Over time, the highest zone
becomes a mix of old and young dirty pages. The flusher threads run,
but instead of cleaning the highest zone first, they clean a mix of
pages across all the zones. If this were the case, kswapd would end
up writing more pages from the higher zones and stalling as a result.

A further problem could be that direct reclaimers are hitting that new
congestion_wait(). Unfortunately, I was not running with stats enabled
to see what the congestion figures looked like.

> The tests with other filesystems on 4G memory look similarly bleak for
> perzonedirty:
> 
> 	global-dhp-4608M__writeback-reclaimdirty-ext4
> 
> <SNIP>
> 
> I am doubly confused because I ran similar tests with 4G memory and
> got contradicting results.  Will rerun those to make sure.
> 
> Comments?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-07-25 23:40       ` Minchan Kim
@ 2011-08-03 19:06         ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-08-03 19:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andi Kleen, linux-mm, Dave Chinner, Christoph Hellwig,
	Mel Gorman, Andrew Morton, Wu Fengguang, Rik van Riel, Jan Kara,
	linux-kernel

On Tue, Jul 26, 2011 at 08:40:59AM +0900, Minchan Kim wrote:
> Hi Andi,
> 
> On Tue, Jul 26, 2011 at 5:37 AM, Andi Kleen <ak@linux.intel.com> wrote:
> >> The global dirty limits are put in proportion to the respective zone's
> >> amount of dirtyable memory and the allocation denied when the limit of
> >> that zone is reached.
> >>
> >> Before the allocation fails, the allocator slowpath has a stage before
> >> compaction and reclaim, where the flusher threads are kicked and the
> >> allocator ultimately has to wait for writeback if still none of the
> >> zones has become eligible for allocation again in the meantime.
> >>
> >
> > I don't really like this. It seems wrong to make memory
> > placement depend on dirtyness.
> >
> > Just try to explain it to some system administrator or tuner: her
> > head will explode and for good reasons.
> >
> > On the other hand I like doing round-robin in filemap by default
> > (I think that is what your patch essentially does)
> > We should have made  this default long ago. It avoids most of the
> > "IO fills up local node" problems people run into all the time.
> >
> > So I would rather just change the default in filemap allocation.

It's not a problem that exists solely on the node level; it exists on
the zone level as well.  Round-robin over the nodes does not fix the
problem that a small zone can fill up with dirty pages before the
global dirty limit kicks in.
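
As a back-of-the-envelope illustration, take the zone sizes Mel
reported for his 4G machine and a 40% dirty ratio, treating the full
zone sizes as dirtyable memory for simplicity (numbers only, not
kernel code):

        unsigned long dma32  = 2742;    /* MB */
        unsigned long normal = 505;     /* MB */
        unsigned long global_limit = (dma32 + normal) * 40 / 100;  /* ~1298 MB */
        unsigned long zone_limit   = normal * 40 / 100;            /* ~202 MB  */
        /*
         * The global limit alone never throttles writers before the 505MB
         * Normal zone is completely dirty; a per-zone limit caps the dirty
         * pages in that zone at roughly 202MB.
         */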

> Just out of curiosity.
> Why do you want to consider only filemap allocation, not IO(ie,
> filemap + sys_[read/write]) allocation?

I guess Andi was referring to the page cache (mapping file offsets to
pages), rather than mmaps (mapping virtual addresses to pages).

mm/filemap.c::__page_cache_alloc()
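
To make the intended usage concrete, a write-side page cache
allocation would opt in along these lines (just a sketch, not
necessarily how the series wires up the call sites):

        gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_WRITE;
        struct page *page = __page_cache_alloc(gfp);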

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-07-26 14:42     ` Mel Gorman
@ 2011-08-03 20:21       ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-08-03 20:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On Tue, Jul 26, 2011 at 03:42:42PM +0100, Mel Gorman wrote:
> On Mon, Jul 25, 2011 at 10:19:18PM +0200, Johannes Weiner wrote:
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > Allow allocators to pass __GFP_WRITE when they know in advance that
> > the allocated page will be written to and become dirty soon.
> > 
> > The page allocator will then attempt to distribute those allocations
> > across zones, such that no single zone will end up full of dirty and
> > thus more or less unreclaimable pages.
> > 
> 
> On 32-bit, this idea increases lowmem pressure. Ordinarily, this is
> only a problem when the higher zone is really large and management
> structures can only be allocated from the lower zones. Granted,
> it is rare this is the case but in the last 6 months, I've seen at
> least one bug report that could be attributed to lowmem pressure
> (24G x86 machine).
> 
> A brief explanation as to why this is not a problem may be needed.

Only lowmem is considered dirtyable memory by default, so more
highmem does not mean more dirty pages.  If the highmem zone is equal
to or bigger than the lowmem zones, the amount of dirtyable memory
(dirty_ratio * lowmem) can still be placed completely into the highmem
zone, since that is no more than dirty_ratio * highmem - if the
gfp_mask allows for it.
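
To put rough numbers on it (purely illustrative, not taken from any of
the test machines): a 32-bit box with ~900M of lowmem, 3G of highmem
and dirty_ratio=20 gives

        unsigned long lowmem    = 900;                  /* MB */
        unsigned long highmem   = 3072;                 /* MB */
        unsigned long dirtyable = lowmem * 20 / 100;    /* ~180 MB */
        /*
         * The ~180MB of allowed dirty page cache fits into the 3G highmem
         * zone with plenty of room to spare, so excluding highmem from the
         * dirtyable total does not keep dirty pages out of highmem itself.
         */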

For this patchset, I (blindly) copied this highmem exclusion also when
it comes to allocation placement, with the result that no highmem page
is allowed for __GFP_WRITE.  I need to fix this.

But generally, this patchset rather protects the lower zones.  As the
higher zones fill up with pages that are first dirty, then clean,
subsequent allocations for soon-to-be-dirty pages fill up the lower
zones.  The per-zone dirty limit prevents that and forces the
allocator to reclaim the clean pages of the higher zone(s) instead.
This was observable with the DMA zone during testing, for example,
which had consistently fewer dirty pages in it on the patched kernel.

> > The global dirty limits are put in proportion to the respective zone's
> > amount of dirtyable memory and the allocation denied when the limit of
> > that zone is reached.
> > 
> 
> What are the risks of a process stalling on dirty pages in a high zone
> that is very small (e.g. 64M) ?

It will fall back to the lower zones.  I should have added that...

The allocation will only stall if all considered zones reached their
dirty limits.  At this point, the implementation basically bangs its
head against the wall until it passes out, hoping the flushers catch
up in the meantime.  There might be some space for improvement.

> > @@ -85,6 +86,7 @@ struct vm_area_struct;
> >  
> >  #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
> >  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> > +#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Will be dirtied soon */
> >  
> 
> /* May be dirtied soon */ :)

Right :)

> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 41dc871..ce673ec 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -154,6 +154,18 @@ static unsigned long determine_dirtyable_memory(void)
> >  	return x + 1;	/* Ensure that we never return 0 */
> >  }
> >  
> > +static unsigned long zone_dirtyable_memory(struct zone *zone)
> > +{
> 
> Terse comment there :)

I tried to write more but was forced to balance dirty laundry.

"Document interfaces" is on the todo list for the next version,
though.

> > +	unsigned long x = 1; /* Ensure that we never return 0 */
> > +
> > +	if (is_highmem(zone) && !vm_highmem_is_dirtyable)
> > +		return x;
> > +
> > +	x += zone_page_state(zone, NR_FREE_PAGES);
> > +	x += zone_reclaimable_pages(zone);
> > +	return x;
> > +}
> 
> It's very similar to determine_dirtyable_memory(). Would be preferable
> if they shared a core function of some sort even if that was implemented
> by "if (zone == NULL)". Otherwise, these will get out of sync
> eventually.

That makes sense, I'll do that.
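
Rough, untested sketch of what that could look like (the name
__dirtyable_memory() is made up; the global branch just mirrors what
determine_dirtyable_memory() does today):

        static unsigned long __dirtyable_memory(struct zone *zone)
        {
                unsigned long x = 1;    /* Ensure that we never return 0 */

                if (zone) {
                        if (is_highmem(zone) && !vm_highmem_is_dirtyable)
                                return x;
                        x += zone_page_state(zone, NR_FREE_PAGES);
                        x += zone_reclaimable_pages(zone);
                        return x;
                }
                x += global_page_state(NR_FREE_PAGES);
                x += global_reclaimable_pages();
                if (!vm_highmem_is_dirtyable)
                        x -= highmem_dirtyable_memory(x);
                return x;
        }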

> > @@ -378,6 +390,24 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
> >  }
> >  EXPORT_SYMBOL(bdi_set_max_ratio);
> >  
> > +static void sanitize_dirty_limits(unsigned long *pbackground,
> > +				  unsigned long *pdirty)
> > +{
> 
> Maybe a small comment saying to look at the comment in
> global_dirty_limits() to see what this is doing and why.
> 
> sanitize feels like an odd name to me. The arguments are not
> "objectionable" in some way that needs to be corrected.
> scale_dirty_limits maybe?

The background limit is kind of sanitized if it exceeds the foreground
limit.  But yeah, the name sucks given that this is not all the
function does.

I'll just go with scale_dirty_limits().
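
i.e. essentially a rename plus the comment you asked for (sketch only,
body unchanged from the posted patch):

        /*
         * Scale the limits for the current task; see global_dirty_limits()
         * for how the raw background/dirty numbers are derived.
         */
        static void scale_dirty_limits(unsigned long *pbackground,
                                       unsigned long *pdirty)
        {
                unsigned long background = *pbackground;
                unsigned long dirty = *pdirty;
                struct task_struct *tsk;

                if (background >= dirty)
                        background = dirty / 2;
                tsk = current;
                if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
                        background += background / 4;
                        dirty += dirty / 4;
                }
                *pbackground = background;
                *pdirty = dirty;
        }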

> > @@ -661,6 +710,57 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >          }
> >  }
> >  
> > +bool zone_dirty_ok(struct zone *zone)
> > +{
> > +	unsigned long background_thresh, dirty_thresh;
> > +	unsigned long nr_reclaimable, nr_writeback;
> > +
> > +	zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
> > +
> > +	nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
> > +		zone_page_state(zone, NR_UNSTABLE_NFS);
> > +	nr_writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > +	return nr_reclaimable + nr_writeback <= dirty_thresh;
> > +}
> > +
> > +void try_to_writeback_pages(struct zonelist *zonelist, gfp_t gfp_mask,
> > +			    nodemask_t *nodemask)
> > +{
> > +	unsigned int nr_exceeded = 0;
> > +	unsigned int nr_zones = 0;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask),
> > +					nodemask) {
> > +		unsigned long background_thresh, dirty_thresh;
> > +		unsigned long nr_reclaimable, nr_writeback;
> > +
> > +		nr_zones++;
> > +
> > +		zone_dirty_limits(zone, &background_thresh, &dirty_thresh);
> > +
> > +		nr_reclaimable = zone_page_state(zone, NR_FILE_DIRTY) +
> > +			zone_page_state(zone, NR_UNSTABLE_NFS);
> > +		nr_writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > +		if (nr_reclaimable + nr_writeback <= background_thresh)
> > +			continue;
> > +
> > +		if (nr_reclaimable > nr_writeback)
> > +			wakeup_flusher_threads(nr_reclaimable - nr_writeback);
> > +
> 
> This is a potential mess. wakeup_flusher_threads() ultimately
> calls "work = kzalloc(sizeof(*work), GFP_ATOMIC)" from the page
> allocator. Under enough pressure, particularly if the machine has
> very little memory, you may see this spewing out warning messages
> which ironically will have to be written to syslog dirtying more
> pages.  I know I've made the same mistake at least once by calling
> wakeup_flusher_threads() from page reclaim.

Oops.  Actually, I chose to do this as I remembered your patches
trying to add calls like this.

The problem really is that I have no better idea what to do if all
considered zones exceed their dirty limit.

I think it would be much better to do nothing more here than check
the global dirty limit and wait for some writeback, then try other
means of reclaim.  If fallback to other nodes is allowed and all
dirtyable zones have been considered, the global dirty limit MUST be
exceeded as well, so writeback is already happening.  If a specific
node is requested that has reached its per-zone dirty limits, there
are likely clean pages around to reclaim.
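
In code, that alternative would be something like this (illustrative
only, name made up; essentially a global-limit-only check instead of
the per-zone loop):

        static void global_dirty_throttle(void)
        {
                unsigned long background_thresh, dirty_thresh;
                unsigned long nr_dirty;

                global_dirty_limits(&background_thresh, &dirty_thresh);
                nr_dirty = global_page_state(NR_FILE_DIRTY) +
                           global_page_state(NR_UNSTABLE_NFS) +
                           global_page_state(NR_WRITEBACK);
                /*
                 * Only wait if the global limit is exceeded; in that case
                 * the flushers are already at work and do not need another
                 * kick from the allocator.
                 */
                if (nr_dirty > dirty_thresh)
                        congestion_wait(BLK_RW_ASYNC, HZ/10);
        }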

> It's also still not controlling where the pages are being
> written from.  On a large enough NUMA machine, there is a risk that
> wakeup_flusher_threads() will be called very frequently to write pages
> from remote nodes that are not in trouble.

> > +		if (nr_reclaimable + nr_writeback <= dirty_thresh)
> > +			continue;
> > +
> > +		nr_exceeded++;
> > +	}
> > +
> > +	if (nr_zones == nr_exceeded)
> > +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +}
> > +
> 
> So, you congestion wait but then potentially continue on even
> though it is still over the dirty limits.  Should this be more like
> throttle_vm_writeout()?

I need to think about what to do here in general a bit more.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4e8985a..1fac154 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1666,6 +1666,9 @@ zonelist_scan:
> >  			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> >  				goto try_next_zone;
> >  
> > +		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
> > +			goto this_zone_full;
> > +
> 
> So this part needs to explain why using the lower zones does not
> potentially cause lowmem pressure on 32-bit. It's not a show stopper
> as such but it shouldn't be ignored either.

Agreed.

> > @@ -2135,6 +2154,14 @@ rebalance:
> >  	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> >  		goto nopage;
> >  
> > +	/* Try writing back pages if per-zone dirty limits are reached */
> > +	page = __alloc_pages_writeback(gfp_mask, order, zonelist,
> > +				       high_zoneidx, nodemask,
> > +				       alloc_flags, preferred_zone,
> > +				       migratetype);
> > +	if (page)
> > +		goto got_pg;
> > +
> 
> I like the general idea but we are still not controlling where
> pages are being written from, the potential lowmem pressure problem
> needs to be addressed and care needs to be taken with the frequency
> wakeup_flusher_threads is called due to it using kmalloc.
> 
> I suspect where the performance gain is being seen is due to
> the flusher threads being woken earlier, more frequently and are
> aggressively writing due to wakeup_flusher_threads() passing in loads
> of requests. As you are seeing a performance gain, that is interesting
> in itself if it is true.

As written in another email, the flushers are never woken through this
code in the tests.  The benefits really come from keeping enough clean
pages in the zones and reducing reclaim latencies.

Thanks for your input.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-07-27 14:24     ` Michal Hocko
@ 2011-08-03 20:25       ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-08-03 20:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara,
	Andi Kleen, linux-kernel

On Wed, Jul 27, 2011 at 04:24:05PM +0200, Michal Hocko wrote:
> On Mon 25-07-11 22:19:18, Johannes Weiner wrote:
> [...]
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 41dc871..ce673ec 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -378,6 +390,24 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
> >  }
> >  EXPORT_SYMBOL(bdi_set_max_ratio);
> >  
> > +static void sanitize_dirty_limits(unsigned long *pbackground,
> > +				  unsigned long *pdirty)
> > +{
> > +	unsigned long background = *pbackground;
> > +	unsigned long dirty = *pdirty;
> > +	struct task_struct *tsk;
> > +
> > +	if (background >= dirty)
> > +		background = dirty / 2;
> > +	tsk = current;
> > +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> > +		background += background / 4;
> > +		dirty += dirty / 4;
> > +	}
> > +	*pbackground = background;
> > +	*pdirty = dirty;
> > +}
> > +
> >  /*
> >   * global_dirty_limits - background-writeback and dirty-throttling thresholds
> >   *
> > @@ -389,33 +419,52 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
> >   */
> >  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> >  {
> > -	unsigned long background;
> > -	unsigned long dirty;
> >  	unsigned long uninitialized_var(available_memory);
> > -	struct task_struct *tsk;
> >  
> >  	if (!vm_dirty_bytes || !dirty_background_bytes)
> >  		available_memory = determine_dirtyable_memory();
> >  
> >  	if (vm_dirty_bytes)
> > -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > +		*pdirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> >  	else
> > -		dirty = (vm_dirty_ratio * available_memory) / 100;
> > +		*pdirty = vm_dirty_ratio * available_memory / 100;
> >  
> >  	if (dirty_background_bytes)
> > -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> > +		*pbackground = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> >  	else
> > -		background = (dirty_background_ratio * available_memory) / 100;
> > +		*pbackground = dirty_background_ratio * available_memory / 100;
> >  
> > -	if (background >= dirty)
> > -		background = dirty / 2;
> > -	tsk = current;
> > -	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> > -		background += background / 4;
> > -		dirty += dirty / 4;
> > -	}
> > -	*pbackground = background;
> > -	*pdirty = dirty;
> > +	sanitize_dirty_limits(pbackground, pdirty);
> > +}
> 
> Hmm, wouldn't be the patch little bit easier to read if this was
> outside in a separate (cleanup) one?

I didn't find it hard to read.  But I wrote it, so... :)

Will split it out in the next round.  Thanks, Michal.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone dirty limits
  2011-08-03 20:25       ` Johannes Weiner
@ 2011-08-04  7:27         ` Michal Hocko
  -1 siblings, 0 replies; 64+ messages in thread
From: Michal Hocko @ 2011-08-04  7:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara,
	Andi Kleen, linux-kernel

On Wed 03-08-11 22:25:00, Johannes Weiner wrote:
> On Wed, Jul 27, 2011 at 04:24:05PM +0200, Michal Hocko wrote:
> > On Mon 25-07-11 22:19:18, Johannes Weiner wrote:
> > [...]
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index 41dc871..ce673ec 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -378,6 +390,24 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
> > >  }
> > >  EXPORT_SYMBOL(bdi_set_max_ratio);
> > >  
> > > +static void sanitize_dirty_limits(unsigned long *pbackground,
> > > +				  unsigned long *pdirty)
> > > +{
> > > +	unsigned long background = *pbackground;
> > > +	unsigned long dirty = *pdirty;
> > > +	struct task_struct *tsk;
> > > +
> > > +	if (background >= dirty)
> > > +		background = dirty / 2;
> > > +	tsk = current;
> > > +	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> > > +		background += background / 4;
> > > +		dirty += dirty / 4;
> > > +	}
> > > +	*pbackground = background;
> > > +	*pdirty = dirty;
> > > +}
> > > +
> > >  /*
> > >   * global_dirty_limits - background-writeback and dirty-throttling thresholds
> > >   *
> > > @@ -389,33 +419,52 @@ EXPORT_SYMBOL(bdi_set_max_ratio);
> > >   */
> > >  void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> > >  {
> > > -	unsigned long background;
> > > -	unsigned long dirty;
> > >  	unsigned long uninitialized_var(available_memory);
> > > -	struct task_struct *tsk;
> > >  
> > >  	if (!vm_dirty_bytes || !dirty_background_bytes)
> > >  		available_memory = determine_dirtyable_memory();
> > >  
> > >  	if (vm_dirty_bytes)
> > > -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > > +		*pdirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > >  	else
> > > -		dirty = (vm_dirty_ratio * available_memory) / 100;
> > > +		*pdirty = vm_dirty_ratio * available_memory / 100;
> > >  
> > >  	if (dirty_background_bytes)
> > > -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> > > +		*pbackground = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> > >  	else
> > > -		background = (dirty_background_ratio * available_memory) / 100;
> > > +		*pbackground = dirty_background_ratio * available_memory / 100;
> > >  
> > > -	if (background >= dirty)
> > > -		background = dirty / 2;
> > > -	tsk = current;
> > > -	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> > > -		background += background / 4;
> > > -		dirty += dirty / 4;
> > > -	}
> > > -	*pbackground = background;
> > > -	*pdirty = dirty;
> > > +	sanitize_dirty_limits(pbackground, pdirty);
> > > +}
> > 
> > Hmm, wouldn't be the patch little bit easier to read if this was
> > outside in a separate (cleanup) one?
> 
> I didn't find it hard to read.  But I wrote it, so... :)
> 
> Will split it out in the next round.  Thanks, Michal.

Thanks

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 1/5] mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-08-05 14:16     ` Rik van Riel
  -1 siblings, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2011-08-05 14:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On 07/25/2011 04:19 PM, Johannes Weiner wrote:
> From: Johannes Weiner<hannes@cmpxchg.org>
>
> __GFP_OTHER_NODE is used for NUMA allocations on behalf of other
> nodes.  It's supposed to be passed through from the page allocator to
> zone_statistics(), but it never gets there as gfp_allowed_mask is not
> wide enough and masks out the flag early in the allocation path.
>
> The result is an accounting glitch where successful NUMA allocations
> by-agent are not properly attributed as local.
>
> Increase __GFP_BITS_SHIFT so that it includes __GFP_OTHER_NODE.
>
> Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>
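
For illustration, the kind of change described in the quoted log looks
roughly like the sketch below; the bit value and the mask width are
assumptions for the example, not taken from the actual patch:

	/* Standalone sketch, not the kernel's gfp.h: if the flag's bit sits
	 * at or above __GFP_BITS_SHIFT, gfp_allowed_mask (derived from
	 * __GFP_BITS_MASK) clears it before zone_statistics() ever sees it. */
	typedef unsigned int gfp_t;			/* stand-in for the kernel type */

	#define ___GFP_OTHER_NODE	0x800000u	/* assumed bit position (bit 23) */
	#define __GFP_OTHER_NODE	((gfp_t)___GFP_OTHER_NODE)

	#define __GFP_BITS_SHIFT	24		/* widened so the mask now covers the flag */
	#define __GFP_BITS_MASK		((gfp_t)((1 << __GFP_BITS_SHIFT) - 1))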

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 2/5] mm: writeback: make determine_dirtyable_memory static again
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-08-05 14:38     ` Rik van Riel
  -1 siblings, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2011-08-05 14:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On 07/25/2011 04:19 PM, Johannes Weiner wrote:
> From: Johannes Weiner<hannes@cmpxchg.org>
>
> The tracing ring-buffer used this function briefly, but not anymore.
> Make it local to the writeback code again.
>
> Also, move the function so that no forward declaration needs to be
> reintroduced.
>
> Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 3/5] mm: writeback: remove seriously stale comment on dirty limits
  2011-07-25 20:19   ` Johannes Weiner
@ 2011-08-05 14:45     ` Rik van Riel
  -1 siblings, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2011-08-05 14:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Andrew Morton, Wu Fengguang, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

On 07/25/2011 04:19 PM, Johannes Weiner wrote:
> From: Johannes Weiner<hannes@cmpxchg.org>
>
> Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [patch 0/5] mm: per-zone dirty limiting
  2011-08-03 13:18           ` Mel Gorman
@ 2011-09-20 12:19             ` Johannes Weiner
  -1 siblings, 0 replies; 64+ messages in thread
From: Johannes Weiner @ 2011-09-20 12:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, Rik van Riel, Minchan Kim, Jan Kara, Andi Kleen,
	linux-kernel

Hi, sorry for the long delay,

On Wed, Aug 03, 2011 at 02:18:11PM +0100, Mel Gorman wrote:
> On Tue, Aug 02, 2011 at 02:17:33PM +0200, Johannes Weiner wrote:
> > My theory is that the closer (file_pages - dirty_pages) is to the high
> > watermark which kswapd tries to balance to, the more likely it is to
> > run into dirty pages.  And to my knowledge, these tests are run with a
> > non-standard 40% dirty ratio, which lowers the threshold at which
> > perzonedirty falls apart.  Per-zone dirty limits should probably take
> > the high watermark into account.
> > 
> 
> That would appear sensible. The choice of 40% dirty ratio is deliberate.
> My understanding is a number of servers that are IO intensive will have
> dirty ratio tuned to this value. On bug reports I've seen for distro
> kernels related to IO slowdowns, it seemed to be a common choice. I
> suspect it's tuned to this because it used to be the old default. Of
> course, 40% also made the writeback problem worse so the effect of the
> patches is easier to see.

Agreed.

It was just meant as an observation/possible explanation for why this
might exacerbate adverse effects, no blaming, rest assured :)

I added a patch that excludes reserved pages from dirtyable memory and
file writes are now down to the occasional hundred pages once in ten
runs, even with a dirty ratio of 40%.  I even ran a test with 40%
background and 80% foreground limit for giggles and still no writeouts
from reclaim with this patch, so this was probably it.
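
A minimal sketch of the idea; the helper name and the use of the high
watermark as the "reserved" portion are illustrative assumptions, not
the actual patch:

	/* Sketch only: don't count the pages kswapd needs to keep free (here
	 * approximated by the high watermark) as dirtyable, so the per-zone
	 * dirty limit leaves reclaim some clean pages to work with. */
	static unsigned long zone_dirtyable_memory_sketch(struct zone *zone)
	{
		unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
		unsigned long file = zone_page_state(zone, NR_ACTIVE_FILE) +
				     zone_page_state(zone, NR_INACTIVE_FILE);
		unsigned long reserve = high_wmark_pages(zone);

		if (free + file <= reserve)
			return 0;
		return free + file - reserve;
	}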

> > What makes me wonder is that, in addition, something in perzonedirty
> > makes kswapd less efficient in the 4G tests, which is the opposite of
> > the effect it had in all other setups.  This increases direct reclaim
> > invocations against the preferred Normal zone.  The higher pressure
> > could also explain why reclaim rushes through the clean pages and runs
> > into dirty pages quicker.
> > 
> > Does anyone have a theory about what might be going on here?
> > 
> 
> This is tenuous at best and I confess I have not thought deeply
> about it but it could be due to the relative age of the pages in the
> highest zone.
> 
> In the vanilla kernel, the Normal zone gets filled with dirty pages
> first and then the lower zones get used up until dirty ratio when
> flusher threads get woken. Because the highest zone also has the
> oldest pages and presumably the oldest inodes, the zone gets fully
> cleaned by the flusher. The pattern is "fill zone with dirty pages,
> use lower zones, highest zone gets fully cleaned, reclaimed and refilled
> with dirty pages, repeat"
> 
> In the patched kernel, lower zones are used when the dirty limits of a
> zone are met and the flusher threads are woken to clean a small number
> of pages but not the full zone. Reclaim takes the clean pages and they
> get replaced with younger dirty pages. Over time, the highest zone
> becomes a mix of old and young dirty pages. The flusher threads run
> but instead of cleaning the highest zone first, it is cleaning a mix
> of pages from all the zones. If this was the case, kswapd would end
> up writing more pages from the higher zone and stalling as a result.
> 
> A further problem could be that direct reclaimers are hitting that new
> congestion_wait(). Unfortunately, I was not running with stats enabled
> to see what the congestion figures looked like.

The throttling could indeed uselessly force a NOFS allocation to wait
a bit without making progress, so kswapd could in turn get stuck
waiting on that allocator when calling into the fs.

I dropped the throttling completely for now and the zone dirty limits
are only applied in the allocator fast path to distribute allocations,
but not throttle/writeback anything.  The direct reclaim invocations
are no longer increased.
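
Roughly, the idea is a fragment like the one below in the zonelist
walk; zone_dirty_ok() is a placeholder name for a per-zone dirty-limit
check, not necessarily what the series ends up calling it:

	/* Sketch of the fast-path placement (not the actual hunk): a
	 * __GFP_WRITE allocation skips zones that already hold their share
	 * of dirty pages instead of throttling or writing anything back. */
	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) {
		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
			continue;	/* spread new dirty pages to other zones */
		page = buffered_rmqueue(preferred_zone, zone, order,
					gfp_mask, migratetype);
		if (page)
			break;
	}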

This still leaves the problem of allocations whose allowable zones
are, in sum, not big enough to trigger the global limit, but the
series is useful even without covering that case and we can handle
such situations in later patches.

Thanks for your input, Mel, I'll shortly send out the latest revision.

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2011-09-20 12:20 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-25 20:19 [patch 0/5] mm: per-zone dirty limiting Johannes Weiner
2011-07-25 20:19 ` Johannes Weiner
2011-07-25 20:19 ` [patch 1/5] mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-25 20:52   ` Andi Kleen
2011-07-25 20:52     ` Andi Kleen
2011-07-25 22:56   ` Minchan Kim
2011-07-25 22:56     ` Minchan Kim
2011-07-26 13:51   ` Mel Gorman
2011-07-26 13:51     ` Mel Gorman
2011-07-27 12:50   ` Michal Hocko
2011-07-27 12:50     ` Michal Hocko
2011-08-05 14:16   ` Rik van Riel
2011-08-05 14:16     ` Rik van Riel
2011-07-25 20:19 ` [patch 2/5] mm: writeback: make determine_dirtyable_memory static again Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-26 13:53   ` Mel Gorman
2011-07-26 13:53     ` Mel Gorman
2011-07-27 12:59   ` Michal Hocko
2011-07-27 12:59     ` Michal Hocko
2011-08-05 14:38   ` Rik van Riel
2011-08-05 14:38     ` Rik van Riel
2011-07-25 20:19 ` [patch 3/5] mm: writeback: remove seriously stale comment on dirty limits Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-27 13:38   ` Michal Hocko
2011-07-27 13:38     ` Michal Hocko
2011-08-05 14:45   ` Rik van Riel
2011-08-05 14:45     ` Rik van Riel
2011-07-25 20:19 ` [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone " Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-25 20:37   ` Andi Kleen
2011-07-25 20:37     ` Andi Kleen
2011-07-25 23:40     ` Minchan Kim
2011-07-25 23:40       ` Minchan Kim
2011-08-03 19:06       ` Johannes Weiner
2011-08-03 19:06         ` Johannes Weiner
2011-07-26 14:42   ` Mel Gorman
2011-07-26 14:42     ` Mel Gorman
2011-08-03 20:21     ` Johannes Weiner
2011-08-03 20:21       ` Johannes Weiner
2011-07-27 14:24   ` Michal Hocko
2011-07-27 14:24     ` Michal Hocko
2011-08-03 20:25     ` Johannes Weiner
2011-08-03 20:25       ` Johannes Weiner
2011-08-04  7:27       ` Michal Hocko
2011-08-04  7:27         ` Michal Hocko
2011-07-25 20:19 ` [patch 5/5] mm: filemap: horrid hack to pass __GFP_WRITE for most page cache writers Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-26  0:16 ` [patch 0/5] mm: per-zone dirty limiting Minchan Kim
2011-07-26  0:16   ` Minchan Kim
2011-07-26 15:47 ` Mel Gorman
2011-07-26 15:47   ` Mel Gorman
2011-07-26 18:05   ` Johannes Weiner
2011-07-26 18:05     ` Johannes Weiner
2011-07-26 21:54     ` Mel Gorman
2011-07-26 21:54       ` Mel Gorman
2011-07-29 11:05     ` Mel Gorman
2011-07-29 11:05       ` Mel Gorman
2011-08-02 12:17       ` Johannes Weiner
2011-08-02 12:17         ` Johannes Weiner
2011-08-03 13:18         ` Mel Gorman
2011-08-03 13:18           ` Mel Gorman
2011-09-20 12:19           ` Johannes Weiner
2011-09-20 12:19             ` Johannes Weiner
