* [patch 0/5] mm: per-zone dirty limits v3-resend
@ 2011-11-23 13:34 ` Johannes Weiner
0 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-11-23 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Rik van Riel, Minchan Kim, Michal Hocko,
Christoph Hellwig, Wu Fengguang, Dave Chinner, Jan Kara,
Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
This is a resend of version 3, rebased to v3.2-rc2. In addition to my
own tests - results in 3/5 - Wu Fengguang also ran tests of his own in
combination with the IO-less dirty throttling series, the results of
which can be found here:
http://article.gmane.org/gmane.comp.file-systems.ext4/28795
http://article.gmane.org/gmane.linux.kernel.mm/69648
Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to
reduce the likelihood of reclaim having to write back individual pages
from the LRU lists in order to make progress.
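In short: allocations that will dirty the page soon pass a new
__GFP_WRITE flag, and the page allocator then prefers zones that are
still within their dirty limit (see 3/5). The caller side is wired up
in mm/filemap.c and fs/btrfs/file.c; roughly, a write-bound page cache
allocation ends up looking like this (illustrative sketch only;
alloc_page_for_write() is a made-up helper and not part of the
series):

	/*
	 * Sketch: a page cache page that is about to be dirtied
	 * announces this to the allocator by adding __GFP_WRITE to
	 * its gfp mask.
	 */
	static struct page *alloc_page_for_write(struct address_space *mapping)
	{
		gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_WRITE;

		return __page_cache_alloc(gfp);
	}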
Please consider merging into 3.3.
fs/btrfs/file.c | 2 +-
include/linux/gfp.h | 4 +-
include/linux/mmzone.h | 6 +
include/linux/swap.h | 1 +
include/linux/writeback.h | 1 +
mm/filemap.c | 5 +-
mm/page-writeback.c | 290 +++++++++++++++++++++++++++++----------------
mm/page_alloc.c | 48 ++++++++
8 files changed, 251 insertions(+), 106 deletions(-)
* [patch 1/5] mm: exclude reserved pages from dirtyable memory
2011-11-23 13:34 ` Johannes Weiner
@ 2011-11-23 13:34 ` Johannes Weiner
0 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-11-23 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Rik van Riel, Minchan Kim, Michal Hocko,
Christoph Hellwig, Wu Fengguang, Dave Chinner, Jan Kara,
Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
From: Johannes Weiner <jweiner@redhat.com>
The amount of dirtyable pages should not include the full number of
free pages: there is a number of reserved pages that the page
allocator and kswapd always try to keep free.
The closer (reclaimable pages - dirty pages) is to the number of
reserved pages, the more likely it becomes for reclaim to run into
dirty pages:
+----------+ ---
| anon | |
+----------+ |
| | |
| | -- dirty limit new -- flusher new
| file | | |
| | | |
| | -- dirty limit old -- flusher old
| | |
+----------+ --- reclaim
| reserved |
+----------+
| kernel |
+----------+
This patch introduces a per-zone dirty reserve that takes both the
lowmem reserve as well as the high watermark of the zone into account,
and a global sum of those per-zone values that is subtracted from the
global amount of dirtyable pages. The lowmem reserve is unavailable
to page cache allocations and kswapd tries to keep the high watermark
free. We don't want to end up in a situation where reclaim has to
clean pages in order to balance zones.
Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.
But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages
is mostly much smaller in absolute numbers.
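For reference, the per-zone reserve is derived in
calculate_totalreserve_pages(); the hunk below only shows the tail of
that loop, so here is a sketch of the whole derivation (not the
literal diff):

	/*
	 * Sketch: a zone's dirty reserve is its largest lowmem reserve
	 * (memory held back for allocations constrained to lower
	 * zones) plus its high watermark (which kswapd tries to keep
	 * free), capped at the size of the zone.
	 */
	unsigned long max = 0;
	enum zone_type j;

	for (j = i; j < MAX_NR_ZONES; j++)
		if (zone->lowmem_reserve[j] > max)
			max = zone->lowmem_reserve[j];
	max += high_wmark_pages(zone);
	if (max > zone->present_pages)
		max = zone->present_pages;
	zone->dirty_balance_reserve = max;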
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 6 ++++++
include/linux/swap.h | 1 +
mm/page-writeback.c | 6 ++++--
mm/page_alloc.c | 19 +++++++++++++++++++
4 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 188cb2f..f395ad4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -317,6 +317,12 @@ struct zone {
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
+ /*
+ * This is a per-zone reserve of pages that should not be
+ * considered dirtyable memory.
+ */
+ unsigned long dirty_balance_reserve;
+
#ifdef CONFIG_NUMA
int node;
/*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e22e12..06061a7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -207,6 +207,7 @@ struct swap_list_t {
/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
extern unsigned long totalreserve_pages;
+extern unsigned long dirty_balance_reserve;
extern unsigned int nr_free_buffer_pages(void);
extern unsigned int nr_free_pagecache_pages(void);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a3278f0..562f691 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -327,7 +327,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
x += zone_page_state(z, NR_FREE_PAGES) +
- zone_reclaimable_pages(z);
+ zone_reclaimable_pages(z) -
+ zone->dirty_balance_reserve;
}
/*
* Make sure that the number of highmem pages is never larger
@@ -351,7 +352,8 @@ unsigned long determine_dirtyable_memory(void)
{
unsigned long x;
- x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+ x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
+ dirty_balance_reserve;
if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..d90af98 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);
unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
+/*
+ * When calculating the number of globally allowed dirty pages, there
+ * is a certain number of per-zone reserves that should not be
+ * considered dirtyable memory. This is the sum of those reserves
+ * over all existing zones that contribute dirtyable memory.
+ */
+unsigned long dirty_balance_reserve __read_mostly;
+
int percpu_pagelist_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
@@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages(void)
if (max > zone->present_pages)
max = zone->present_pages;
reserve_pages += max;
+ /*
+ * Lowmem reserves are not available to
+ * GFP_HIGHUSER page cache allocations and
+ * kswapd tries to balance zones to their high
+ * watermark. As a result, neither should be
+ * regarded as dirtyable memory, to prevent a
+ * situation where reclaim has to clean pages
+ * in order to balance the zones.
+ */
+ zone->dirty_balance_reserve = max;
}
}
+ dirty_balance_reserve = reserve_pages;
totalreserve_pages = reserve_pages;
}
--
1.7.6.4
* Re: [patch 1/5] mm: exclude reserved pages from dirtyable memory
2011-11-23 13:34 ` Johannes Weiner
@ 2011-11-30 0:20 ` Andrew Morton
0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2011-11-30 0:20 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, Rik van Riel, Minchan Kim, Michal Hocko,
Christoph Hellwig, Wu Fengguang, Dave Chinner, Jan Kara,
Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
On Wed, 23 Nov 2011 14:34:14 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:
> From: Johannes Weiner <jweiner@redhat.com>
>
> The amount of dirtyable pages should not include the full number of
> free pages: there is a number of reserved pages that the page
> allocator and kswapd always try to keep free.
>
> The closer (reclaimable pages - dirty pages) is to the number of
> reserved pages, the more likely it becomes for reclaim to run into
> dirty pages:
>
> +----------+ ---
> | anon | |
> +----------+ |
> | | |
> | | -- dirty limit new -- flusher new
> | file | | |
> | | | |
> | | -- dirty limit old -- flusher old
> | | |
> +----------+ --- reclaim
> | reserved |
> +----------+
> | kernel |
> +----------+
>
> This patch introduces a per-zone dirty reserve that takes both the
> lowmem reserve as well as the high watermark of the zone into account,
> and a global sum of those per-zone values that is subtracted from the
> global amount of dirtyable pages. The lowmem reserve is unavailable
> to page cache allocations and kswapd tries to keep the high watermark
> free. We don't want to end up in a situation where reclaim has to
> clean pages in order to balance zones.
>
> Not treating reserved pages as dirtyable on a global level is only a
> conceptual fix. In reality, dirty pages are not distributed equally
> across zones and reclaim runs into dirty pages on a regular basis.
>
> But it is important to get this right before tackling the problem on a
> per-zone level, where the distance between reclaim and the dirty pages
> is mostly much smaller in absolute numbers.
>
> ...
>
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -327,7 +327,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
>
> x += zone_page_state(z, NR_FREE_PAGES) +
> - zone_reclaimable_pages(z);
> + zone_reclaimable_pages(z) -
> + zone->dirty_balance_reserve;
Doesn't compile. s/zone/z/.
Which makes me suspect it wasn't tested on a highmem box. This is
rather worrisome, as highmem machines tend to have acute and unique
zone balancing issues.
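I.e., presumably the summation needs to read:

	x += zone_page_state(z, NR_FREE_PAGES) +
		zone_reclaimable_pages(z) -
		z->dirty_balance_reserve;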
* Re: [patch 1/5] mm: exclude reserved pages from dirtyable memory
2011-11-30 0:20 ` Andrew Morton
@ 2011-12-07 13:58 ` Johannes Weiner
0 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-12-07 13:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Rik van Riel, Minchan Kim, Michal Hocko,
Christoph Hellwig, Wu Fengguang, Dave Chinner, Jan Kara,
Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
On Tue, Nov 29, 2011 at 04:20:14PM -0800, Andrew Morton wrote:
> On Wed, 23 Nov 2011 14:34:14 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > From: Johannes Weiner <jweiner@redhat.com>
> >
> > The amount of dirtyable pages should not include the full number of
> > free pages: there is a number of reserved pages that the page
> > allocator and kswapd always try to keep free.
> >
> > The closer (reclaimable pages - dirty pages) is to the number of
> > reserved pages, the more likely it becomes for reclaim to run into
> > dirty pages:
> >
> > +----------+ ---
> > | anon | |
> > +----------+ |
> > | | |
> > | | -- dirty limit new -- flusher new
> > | file | | |
> > | | | |
> > | | -- dirty limit old -- flusher old
> > | | |
> > +----------+ --- reclaim
> > | reserved |
> > +----------+
> > | kernel |
> > +----------+
> >
> > This patch introduces a per-zone dirty reserve that takes both the
> > lowmem reserve as well as the high watermark of the zone into account,
> > and a global sum of those per-zone values that is subtracted from the
> > global amount of dirtyable pages. The lowmem reserve is unavailable
> > to page cache allocations and kswapd tries to keep the high watermark
> > free. We don't want to end up in a situation where reclaim has to
> > clean pages in order to balance zones.
> >
> > Not treating reserved pages as dirtyable on a global level is only a
> > conceptual fix. In reality, dirty pages are not distributed equally
> > across zones and reclaim runs into dirty pages on a regular basis.
> >
> > But it is important to get this right before tackling the problem on a
> > per-zone level, where the distance between reclaim and the dirty pages
> > is mostly much smaller in absolute numbers.
> >
> > ...
> >
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -327,7 +327,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> > &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> >
> > x += zone_page_state(z, NR_FREE_PAGES) +
> > - zone_reclaimable_pages(z);
> > + zone_reclaimable_pages(z) -
> > + zone->dirty_balance_reserve;
>
> Doesn't compile. s/zone/z/.
>
> Which makes me suspect it wasn't tested on a highmem box. This is
> rather worrisome, as highmem machines tend to have acute and unique
> zone balancing issues.
You are right, so I ran fs_mark on an x86 machine with 8GB and a
32-bit kernel.
fs_mark -S 0 -d work-01 -d work-02 -d work-03 -d work-04 -D 128 -N 128 -L 16 -n 512 -s 655360
This translates to 4 threads doing 16 iterations over a new set of 512
files each time, where each file is 640k in size, which adds up to 20G
of written data per run. The results are gathered over 5 runs. Data
are written to an ext4 on a standard consumer rotational disk.
The overall runtime for the loads was the same:
seconds
mean(stddev)
vanilla: 242.061(0.953)
patched: 242.726(1.714)
Allocation counts confirm that allocation placement does not change:
pgalloc_dma pgalloc_normal pgalloc_high
min|median|max
vanilla: 0.000|0.000|0.000 3733291.000|3742709.000|4034662.000 5189412.000|5202220.000|5208743.000
patched: 0.000|0.000|0.000 3716148.000|3733269.000|4032205.000 5212301.000|5216834.000|5227756.000
Kswapd in both kernels did the same amount of work in each zone over
the course of the workload; direct reclaim was never invoked:
pgscan_kswapd_dma pgscan_kswapd_normal pgscan_kswapd_high
min|median|max
vanilla: 0.000|0.000|0.000 109919.000|115773.000|117952.000 3235879.000|3246707.000|3255205.000
patched: 0.000|0.000|0.000 104169.000|114845.000|117657.000 3241327.000|3246835.000|3257843.000
pgsteal_dma pgsteal_normal pgsteal_high
min|median|max
vanilla: 0.000|0.000|0.000 109912.000|115766.000|117945.000 3235318.000|3246632.000|3255098.000
patched: 0.000|0.000|0.000 104163.000|114839.000|117651.000 3240765.000|3246760.000|3257768.000
and the distribution of scans over time was equivalent, with no new
hiccups or scan spikes:
pgscan_kswapd_dma/s pgscan_kswapd_normal/s pgscan_kswapd_high/s
min|median|max
vanilla: 0.000|0.000|0.000 0.000|144.000|2100.000 0.000|15582.500|44916.000
patched: 0.000|0.000|0.000 0.000|152.000|2058.000 0.000|15361.000|44453.000
pgsteal_dma/s pgsteal_normal/s pgsteal_high/s
min|median|max
vanilla: 0.000|0.000|0.000 0.000|144.000|2094.000 0.000|15582.500|44916.000
patched: 0.000|0.000|0.000 0.000|152.000|2058.000 0.000|15361.000|44453.000
fs_mark 1G
The same fs_mark load was run on the system limited to 1G memory
(booted with mem=1G), to have a highmem zone that is much smaller
than the rest of the system.
seconds
mean(stddev)
vanilla: 238.428(3.810)
patched: 241.392(0.221)
In this case, allocation placement did shift slightly towards lower
zones, to protect the tiny highmem zone from being unreclaimable due
to dirty pages:
pgalloc_dma pgalloc_normal pgalloc_high
min|median|max
vanilla: 20658.000|21863.000|23231.000 4017580.000|4023331.000|4038774.000 1057246.000|1076280.000|1083824.000
patched: 25403.000|27679.000|28556.000 4163538.000|4172116.000|4179151.000 917054.000| 922206.000| 933609.000
However, while there were more allocations in the DMA and Normal
zones in total, the individual utilization peaks of those zones were
actually reduced due to the smoother distribution:
DMA min nr_free_pages Normal min nr_free_pages HighMem min nr_free_pages
vanilla: 1244.000 14819.000 432.000
patched: 1337.000 14850.000 439.000
Keep in mind that the lower zones are only used more often for
allocation because they are providing dirtyable memory in this
scenario, i.e. they have space to spare.
With increasing lowmem usage for stuff that is truly lowmem, like
dcache and page tables, the amount of memory we consider dirtyable
(free pages + file pages) shrinks, so when highmem is not allowed to
take any more dirty pages, we will not thrash on the lower zones:
either they have space left or the dirtiers are already being
throttled in balance_dirty_pages().
Reclaim numbers suggest that kswapd can easily keep up with the
allocation frequency increase in the Normal zone. But for DMA, it
looks like the unpatched kernel flooded the zone with dirty pages
every once in a while, making it ineligible for allocations until
those pages were cleaned. Through better distribution, the patch
improves reclaim efficiency (reclaimed/scanned) from 32% to 100% for
DMA:
pgscan_kswapd_dma pgscan_kswapd_normal pgscan_kswapd_high
min|median|max
vanilla: 39734.000|41248.000|41965.000 3692050.000|3696209.000|3716653.000 970411.000|987483.000|991469.000
patched: 21204.000|23901.000|25141.000 3874782.000|3879125.000|3888302.000 793141.000|795631.000|803482.000
pgsteal_dma pgsteal_normal pgsteal_high
min|median|max
vanilla: 12932.000|14044.000|16957.000 3692025.000|3696183.000|3716626.000 966050.000|987386.000|991405.000
patched: 21204.000|23901.000|25141.000 3874771.000|3879095.000|3888284.000 792079.000|795572.000|803370.000
And the increased reclaim efficiency in the DMA zone indeed correlates
with the reduced likelihood of reclaim running into dirty pages:
DMA Normal Highmem
nr_vmscan_write nr_vmscan_immediate_reclaim
vanilla:
26.0 19614.0 0.0 0.0 1174.0 0.0
0.0 21737.0 0.0 1.0 0.0 0.0
0.0 22101.0 0.0 0.0 0.0 0.0
0.0 21906.0 0.0 0.0 0.0 0.0
0.0 21880.0 0.0 0.0 0.0 0.0
patched:
0.0 0.0 0.0 1.0 502.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0 0.0 0.0
* [patch 2/5] mm: writeback: cleanups in preparation for per-zone dirty limits
2011-11-23 13:34 ` Johannes Weiner
@ 2011-11-23 13:34 ` Johannes Weiner
0 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-11-23 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Rik van Riel, Minchan Kim, Michal Hocko,
Christoph Hellwig, Wu Fengguang, Dave Chinner, Jan Kara,
Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
From: Johannes Weiner <jweiner@redhat.com>
The next patch will introduce per-zone dirty limiting functions in
addition to the traditional global dirty limiting.
Rename determine_dirtyable_memory() to global_dirtyable_memory()
before adding the zone-specific version, and fix up its documentation.
Also, move the functions that determine the dirtyable memory and the
function that calculates the dirty limit based on it next to each
other, so that their relationship is more apparent and they can be
commented on as a group.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mel@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
mm/page-writeback.c | 210 +++++++++++++++++++++++++-------------------------
1 files changed, 105 insertions(+), 105 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 562f691..8856b7c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -131,6 +131,110 @@ static struct prop_descriptor vm_completions;
static struct prop_descriptor vm_dirties;
/*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around. To avoid stressing page reclaim with lots of unreclaimable
+ * pages. It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+ int node;
+ unsigned long x = 0;
+
+ for_each_node_state(node, N_HIGH_MEMORY) {
+ struct zone *z =
+ &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+
+ x += zone_page_state(z, NR_FREE_PAGES) +
+ zone_reclaimable_pages(z) -
+ zone->dirty_balance_reserve;
+ }
+ /*
+ * Make sure that the number of highmem pages is never larger
+ * than the number of the total dirtyable memory. This can only
+ * occur in very strange VM situations but we want to make sure
+ * that this does not occur.
+ */
+ return min(x, total);
+#else
+ return 0;
+#endif
+}
+
+/**
+ * global_dirtyable_memory - number of globally dirtyable pages
+ *
+ * Returns the global number of pages potentially available for dirty
+ * page cache. This is the base value for the global dirty limits.
+ */
+unsigned long global_dirtyable_memory(void)
+{
+ unsigned long x;
+
+ x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
+ dirty_balance_reserve;
+
+ if (!vm_highmem_is_dirtyable)
+ x -= highmem_dirtyable_memory(x);
+
+ return x + 1; /* Ensure that we never return 0 */
+}
+
+/*
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ *
+ * Calculate the dirty thresholds based on sysctl parameters
+ * - vm.dirty_background_ratio or vm.dirty_background_bytes
+ * - vm.dirty_ratio or vm.dirty_bytes
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
+{
+ unsigned long background;
+ unsigned long dirty;
+ unsigned long uninitialized_var(available_memory);
+ struct task_struct *tsk;
+
+ if (!vm_dirty_bytes || !dirty_background_bytes)
+ available_memory = global_dirtyable_memory();
+
+ if (vm_dirty_bytes)
+ dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+ else
+ dirty = (vm_dirty_ratio * available_memory) / 100;
+
+ if (dirty_background_bytes)
+ background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+ else
+ background = (dirty_background_ratio * available_memory) / 100;
+
+ if (background >= dirty)
+ background = dirty / 2;
+ tsk = current;
+ if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+ background += background / 4;
+ dirty += dirty / 4;
+ }
+ *pbackground = background;
+ *pdirty = dirty;
+ trace_global_dirty_state(background, dirty);
+}
+
+/*
* couple the period to the dirty_ratio:
*
* period/2 ~ roundup_pow_of_two(dirty limit)
@@ -142,7 +246,7 @@ static int calc_period_shift(void)
if (vm_dirty_bytes)
dirty_total = vm_dirty_bytes / PAGE_SIZE;
else
- dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
+ dirty_total = (vm_dirty_ratio * global_dirtyable_memory()) /
100;
return 2 + ilog2(dirty_total - 1);
}
@@ -298,69 +402,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
}
EXPORT_SYMBOL(bdi_set_max_ratio);
-/*
- * Work out the current dirty-memory clamping and background writeout
- * thresholds.
- *
- * The main aim here is to lower them aggressively if there is a lot of mapped
- * memory around. To avoid stressing page reclaim with lots of unreclaimable
- * pages. It is better to clamp down on writers than to start swapping, and
- * performing lots of scanning.
- *
- * We only allow 1/2 of the currently-unmapped memory to be dirtied.
- *
- * We don't permit the clamping level to fall below 5% - that is getting rather
- * excessive.
- *
- * We make sure that the background writeout level is below the adjusted
- * clamping level.
- */
-
-static unsigned long highmem_dirtyable_memory(unsigned long total)
-{
-#ifdef CONFIG_HIGHMEM
- int node;
- unsigned long x = 0;
-
- for_each_node_state(node, N_HIGH_MEMORY) {
- struct zone *z =
- &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
-
- x += zone_page_state(z, NR_FREE_PAGES) +
- zone_reclaimable_pages(z) -
- zone->dirty_balance_reserve;
- }
- /*
- * Make sure that the number of highmem pages is never larger
- * than the number of the total dirtyable memory. This can only
- * occur in very strange VM situations but we want to make sure
- * that this does not occur.
- */
- return min(x, total);
-#else
- return 0;
-#endif
-}
-
-/**
- * determine_dirtyable_memory - amount of memory that may be used
- *
- * Returns the numebr of pages that can currently be freed and used
- * by the kernel for direct mappings.
- */
-unsigned long determine_dirtyable_memory(void)
-{
- unsigned long x;
-
- x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
- dirty_balance_reserve;
-
- if (!vm_highmem_is_dirtyable)
- x -= highmem_dirtyable_memory(x);
-
- return x + 1; /* Ensure that we never return 0 */
-}
-
static unsigned long dirty_freerun_ceiling(unsigned long thresh,
unsigned long bg_thresh)
{
@@ -372,47 +413,6 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
return max(thresh, global_dirty_limit);
}
-/*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
- *
- * Calculate the dirty thresholds based on sysctl parameters
- * - vm.dirty_background_ratio or vm.dirty_background_bytes
- * - vm.dirty_ratio or vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
- */
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
-{
- unsigned long background;
- unsigned long dirty;
- unsigned long uninitialized_var(available_memory);
- struct task_struct *tsk;
-
- if (!vm_dirty_bytes || !dirty_background_bytes)
- available_memory = determine_dirtyable_memory();
-
- if (vm_dirty_bytes)
- dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
- else
- dirty = (vm_dirty_ratio * available_memory) / 100;
-
- if (dirty_background_bytes)
- background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
- else
- background = (dirty_background_ratio * available_memory) / 100;
-
- if (background >= dirty)
- background = dirty / 2;
- tsk = current;
- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
- background += background / 4;
- dirty += dirty / 4;
- }
- *pbackground = background;
- *pdirty = dirty;
- trace_global_dirty_state(background, dirty);
-}
-
/**
* bdi_dirty_limit - @bdi's share of dirty throttling threshold
* @bdi: the backing_dev_info to query
--
1.7.6.4
* [patch 3/5] mm: try to distribute dirty pages fairly across zones
2011-11-23 13:34 ` Johannes Weiner
@ 2011-11-23 13:34 ` Johannes Weiner
0 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-11-23 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Rik van Riel, Minchan Kim, Michal Hocko,
Christoph Hellwig, Wu Fengguang, Dave Chinner, Jan Kara,
Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
From: Johannes Weiner <jweiner@redhat.com>
The maximum number of dirty pages that exist in the system at any time
is determined by the number of pages considered dirtyable and a
user-configured percentage of those, or an absolute number in bytes.
This number of dirtyable pages is the sum of memory provided by all
the zones in the system minus their lowmem reserves and high
watermarks, so that the system can retain a healthy number of free
pages without having to reclaim dirty pages.
But there is a flaw in that we have a zoned page allocator which does
not care about the global state but rather the state of individual
memory zones. And right now there is nothing that prevents one zone
from filling up with dirty pages while other zones are spared, which
frequently leads to situations where kswapd, in order to restore the
watermark of free pages, does indeed have to write pages from that
zone's LRU list. This can interfere so badly with IO from the flusher
threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
requests from reclaim already, taking away the VM's only possibility
to keep such a zone balanced, aside from hoping the flushers will soon
clean pages from that zone.
Enter per-zone dirty limits. They are to a zone's dirtyable memory
what the global limit is to the global amount of dirtyable memory, and
try to make sure that no single zone receives more than its fair share
of the globally allowed dirty pages in the first place. As the number
of pages considered dirtyable excludes the zones' lowmem reserves and
high watermarks, the maximum number of dirty pages in a zone is such
that the zone can always be balanced without requiring page cleaning.
As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to
pass __GFP_WRITE when they know in advance that the page will be
written to and become dirty soon. The page allocator will then
attempt to allocate from the first zone of the zonelist - which on
NUMA is determined by the task's NUMA memory policy - that has not
exceeded its dirty limit.
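Roughly, the placement check added to get_page_from_freelist() looks
like this (the mm/page_alloc.c hunk is not reproduced in this excerpt;
sketch only):

	/*
	 * Sketch: while still in the fast path (ALLOC_WMARK_LOW), skip
	 * zones that already hold their share of dirty pages when the
	 * caller announced a write with __GFP_WRITE.  The slowpath
	 * drops the check so the allocation can still succeed, as
	 * explained below.
	 */
	if ((alloc_flags & ALLOC_WMARK_LOW) &&
	    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
		goto this_zone_full;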
At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case. With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations. Workloads
that previously could fit their dirty pages completely in the higher
zone may be forced to allocate from lower zones, but the number of
pages that "spill over" is itself limited by the lower zones' dirty
constraints, and thus unlikely to become a problem.
For now, the problem of unfair dirty page distribution remains for
NUMA configurations where the zones allowed for allocation are in sum
not big enough to trigger the global dirty limits, wake up the flusher
threads and remedy the situation. Because of this, an allocation that
could not succeed on any of the considered zones is allowed to ignore
the dirty limits before going into direct reclaim or even failing the
allocation, until a future patch changes the global dirty throttling
and flusher thread activation so that they take individual zone states
into account.
Test results
15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
seconds nr_vmscan_write
(stddev) min| median| max
xfs
vanilla: 549.747( 3.492) 0.000| 0.000| 0.000
patched: 550.996( 3.802) 0.000| 0.000| 0.000
fuse-ntfs
vanilla: 1183.094(53.178) 54349.000| 59341.000| 65163.000
patched: 558.049(17.914) 0.000| 0.000| 43.000
btrfs
vanilla: 573.679(14.015) 156657.000| 460178.000| 606926.000
patched: 563.365(11.368) 0.000| 0.000| 1362.000
ext4
vanilla: 561.197(15.782) 0.000|2725438.000|4143837.000
patched: 568.806(17.496) 0.000| 0.000| 0.000
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/gfp.h | 4 ++-
include/linux/writeback.h | 1 +
mm/page-writeback.c | 82 +++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 29 ++++++++++++++++
4 files changed, 115 insertions(+), 1 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..50efc7e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -36,6 +36,7 @@ struct vm_area_struct;
#endif
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
+#define ___GFP_WRITE 0x1000000u
/*
* GFP bitmasks..
@@ -85,6 +86,7 @@ struct vm_area_struct;
#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
+#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
/*
* This may seem redundant, but it's a way of annotating false positives vs.
@@ -92,7 +94,7 @@ struct vm_area_struct;
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
-#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a378c29..d172a90 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -124,6 +124,7 @@ void laptop_mode_timer_fn(unsigned long data);
static inline void laptop_sync_completion(void) { }
#endif
void throttle_vm_writeout(gfp_t gfp_mask);
+bool zone_dirty_ok(struct zone *zone);
extern unsigned long global_dirty_limit;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8856b7c..b173d97 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -148,6 +148,24 @@ static struct prop_descriptor vm_dirties;
* clamping level.
*/
+/*
+ * In a memory zone, there is a certain amount of pages we consider
+ * available for the page cache, which is essentially the number of
+ * free and reclaimable pages, minus some zone reserves to protect
+ * lowmem and the ability to uphold the zone's watermarks without
+ * requiring writeback.
+ *
+ * This number of dirtyable pages is the base value of which the
+ * user-configurable dirty ratio is the effective number of pages that
+ * are allowed to be actually dirtied. Per individual zone, or
+ * globally by using the sum of dirtyable pages over all zones.
+ *
+ * Because the user is allowed to specify the dirty limit globally as
+ * absolute number of bytes, calculating the per-zone dirty limit can
+ * require translating the configured limit into a percentage of
+ * global dirtyable memory first.
+ */
+
static unsigned long highmem_dirtyable_memory(unsigned long total)
{
#ifdef CONFIG_HIGHMEM
@@ -234,6 +252,70 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
trace_global_dirty_state(background, dirty);
}
+/**
+ * zone_dirtyable_memory - number of dirtyable pages in a zone
+ * @zone: the zone
+ *
+ * Returns the zone's number of pages potentially available for dirty
+ * page cache. This is the base value for the per-zone dirty limits.
+ */
+static unsigned long zone_dirtyable_memory(struct zone *zone)
+{
+ /*
+ * The effective global number of dirtyable pages may exclude
+ * highmem as a big-picture measure to keep the ratio between
+ * dirty memory and lowmem reasonable.
+ *
+ * But this function is purely about the individual zone and a
+ * highmem zone can hold its share of dirty pages, so we don't
+ * care about vm_highmem_is_dirtyable here.
+ */
+ return zone_page_state(zone, NR_FREE_PAGES) +
+ zone_reclaimable_pages(zone) -
+ zone->dirty_balance_reserve;
+}
+
+/**
+ * zone_dirty_limit - maximum number of dirty pages allowed in a zone
+ * @zone: the zone
+ *
+ * Returns the maximum number of dirty pages allowed in a zone, based
+ * on the zone's dirtyable memory.
+ */
+static unsigned long zone_dirty_limit(struct zone *zone)
+{
+ unsigned long zone_memory = zone_dirtyable_memory(zone);
+ struct task_struct *tsk = current;
+ unsigned long dirty;
+
+ if (vm_dirty_bytes)
+ dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
+ zone_memory / global_dirtyable_memory();
+ else
+ dirty = vm_dirty_ratio * zone_memory / 100;
+
+ if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+ dirty += dirty / 4;
+
+ return dirty;
+}
+
+/**
+ * zone_dirty_ok - tells whether a zone is within its dirty limits
+ * @zone: the zone to check
+ *
+ * Returns %true when the dirty pages in @zone are within the zone's
+ * dirty limit, %false if the limit is exceeded.
+ */
+bool zone_dirty_ok(struct zone *zone)
+{
+ unsigned long limit = zone_dirty_limit(zone);
+
+ return zone_page_state(zone, NR_FILE_DIRTY) +
+ zone_page_state(zone, NR_UNSTABLE_NFS) +
+ zone_page_state(zone, NR_WRITEBACK) <= limit;
+}
+
/*
* couple the period to the dirty_ratio:
*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d90af98..9cdf1a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1675,6 +1675,35 @@ zonelist_scan:
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
continue;
+ /*
+ * When allocating a page cache page for writing, we
+ * want to get it from a zone that is within its dirty
+ * limit, such that no single zone holds more than its
+ * proportional share of globally allowed dirty pages.
+ * The dirty limits take into account the zone's
+ * lowmem reserves and high watermark so that kswapd
+ * should be able to balance it without having to
+ * write pages from its LRU list.
+ *
+ * This may look like it could increase pressure on
+ * lower zones by failing allocations in higher zones
+ * before they are full. But the pages that do spill
+ * over are limited as the lower zones are protected
+ * by this very same mechanism. It should not become
+ * a practical burden to them.
+ *
+ * XXX: For now, allow allocations to potentially
+ * exceed the per-zone dirty limit in the slowpath
+ * (ALLOC_WMARK_LOW unset) before going into reclaim,
+ * which is important when on a NUMA setup the allowed
+ * zones are together not big enough to reach the
+ * global limit. The proper fix for these situations
+ * will require awareness of zones in the
+ * dirty-throttling and the flusher threads.
+ */
+ if ((alloc_flags & ALLOC_WMARK_LOW) &&
+ (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+ goto this_zone_full;
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
--
1.7.6.4
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
2011-11-23 13:34 ` Johannes Weiner
@ 2011-11-24 1:07 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-24 1:07 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Mel Gorman, Rik van Riel, Minchan Kim,
Michal Hocko, Christoph Hellwig, Wu Fengguang, Dave Chinner,
Jan Kara, Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
May I ask a question?
On Wed, 23 Nov 2011 14:34:16 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:
> + /*
> + * When allocating a page cache page for writing, we
> + * want to get it from a zone that is within its dirty
> + * limit, such that no single zone holds more than its
> + * proportional share of globally allowed dirty pages.
> + * The dirty limits take into account the zone's
> + * lowmem reserves and high watermark so that kswapd
> + * should be able to balance it without having to
> + * write pages from its LRU list.
> + *
> + * This may look like it could increase pressure on
> + * lower zones by failing allocations in higher zones
> + * before they are full. But the pages that do spill
> + * over are limited as the lower zones are protected
> + * by this very same mechanism. It should not become
> + * a practical burden to them.
> + *
> + * XXX: For now, allow allocations to potentially
> + * exceed the per-zone dirty limit in the slowpath
> + * (ALLOC_WMARK_LOW unset) before going into reclaim,
> + * which is important when on a NUMA setup the allowed
> + * zones are together not big enough to reach the
> + * global limit. The proper fix for these situations
> + * will require awareness of zones in the
> + * dirty-throttling and the flusher threads.
> + */
> + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> + (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
> + goto this_zone_full;
>
> BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
> if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
This will call
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
And this zone will be marked as full.
IIUC, zlc_clear_zones_full() is called only when direct reclaim ends.
So, if no one calls direct reclaim, the 'full' mark may never be cleared
even when the number of dirty pages goes down to a safe level?
I'm sorry if this has already been discussed.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
2011-11-24 1:07 ` KAMEZAWA Hiroyuki
@ 2011-11-24 13:11 ` Johannes Weiner
-1 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-11-24 13:11 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Mel Gorman, Rik van Riel, Minchan Kim,
Michal Hocko, Christoph Hellwig, Wu Fengguang, Dave Chinner,
Jan Kara, Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
On Thu, Nov 24, 2011 at 10:07:55AM +0900, KAMEZAWA Hiroyuki wrote:
>
>
> May I ask a question?
>
> On Wed, 23 Nov 2011 14:34:16 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
>
> > + /*
> > + * When allocating a page cache page for writing, we
> > + * want to get it from a zone that is within its dirty
> > + * limit, such that no single zone holds more than its
> > + * proportional share of globally allowed dirty pages.
> > + * The dirty limits take into account the zone's
> > + * lowmem reserves and high watermark so that kswapd
> > + * should be able to balance it without having to
> > + * write pages from its LRU list.
> > + *
> > + * This may look like it could increase pressure on
> > + * lower zones by failing allocations in higher zones
> > + * before they are full. But the pages that do spill
> > + * over are limited as the lower zones are protected
> > + * by this very same mechanism. It should not become
> > + * a practical burden to them.
> > + *
> > + * XXX: For now, allow allocations to potentially
> > + * exceed the per-zone dirty limit in the slowpath
> > + * (ALLOC_WMARK_LOW unset) before going into reclaim,
> > + * which is important when on a NUMA setup the allowed
> > + * zones are together not big enough to reach the
> > + * global limit. The proper fix for these situations
> > + * will require awareness of zones in the
> > + * dirty-throttling and the flusher threads.
> > + */
> > + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > + (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
> > + goto this_zone_full;
> >
> > BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
> > if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>
> This will call
>
> if (NUMA_BUILD)
> zlc_mark_zone_full(zonelist, z);
>
> And this zone will be marked as full.
>
> IIUC, zlc_clear_zones_full() is called only when direct reclaim ends.
> So, if no one calls direct reclaim, the 'full' mark may never be cleared
> even when the number of dirty pages goes down to a safe level?
> I'm sorry if this has already been discussed.
It does not remember which zones are marked full for longer than a
second - see zlc_setup() - and also ignores this information when an
iteration over the zonelist with the cache enabled came up
empty-handed.
I thought it would make sense to take advantage of the cache and save
the zone_dirty_ok() checks against ineligible zones too on subsequent
iterations.
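(For reference, the expiry lives in zlc_setup(); paraphrased from the
mm/page_alloc.c of that time rather than quoted verbatim, it does roughly

	if (time_after(jiffies, zlc->last_full_zap + HZ)) {
		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
		zlc->last_full_zap = jiffies;
	}

so all 'full' marks, including ones set because of the dirty limits, are
dropped wholesale about once a second.)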
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
2011-11-24 13:11 ` Johannes Weiner
@ 2011-11-25 1:00 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-25 1:00 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Mel Gorman, Rik van Riel, Minchan Kim,
Michal Hocko, Christoph Hellwig, Wu Fengguang, Dave Chinner,
Jan Kara, Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
On Thu, 24 Nov 2011 14:11:55 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, Nov 24, 2011 at 10:07:55AM +0900, KAMEZAWA Hiroyuki wrote:
> > > + 			goto this_zone_full;
> > >
> > > BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
> > > if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> >
> > This will call
> >
> > if (NUMA_BUILD)
> > zlc_mark_zone_full(zonelist, z);
> >
> > And this zone will be marked as full.
> >
> > IIUC, zlc_clear_zones_full() is called only when direct reclaim ends.
> > So, if no one calls direct reclaim, the 'full' mark may never be cleared
> > even when the number of dirty pages goes down to a safe level?
> > I'm sorry if this has already been discussed.
>
> It does not remember which zones are marked full for longer than a
> second - see zlc_setup() - and also ignores this information when an
> iteration over the zonelist with the cache enabled came up
> empty-handed.
>
Ah, thank you for the clarification.
Now I understand how zlc_active/did_zlc_setup/zlc_setup() work together... complicated ;)
Thanks,
-Kame
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 4/5] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()
2011-11-23 13:34 ` Johannes Weiner
@ 2011-11-23 13:34 ` Johannes Weiner
-1 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-11-23 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Rik van Riel, Minchan Kim, Michal Hocko,
Christoph Hellwig, Wu Fengguang, Dave Chinner, Jan Kara,
Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
From: Johannes Weiner <jweiner@redhat.com>
Tell the page allocator that pages allocated through
grab_cache_page_write_begin() are expected to become dirty soon.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
mm/filemap.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..5344dec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2354,8 +2354,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags)
{
int status;
+ gfp_t gfp_mask;
struct page *page;
gfp_t gfp_notmask = 0;
+
+ gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;
if (flags & AOP_FLAG_NOFS)
gfp_notmask = __GFP_FS;
repeat:
@@ -2363,7 +2366,7 @@ repeat:
if (page)
goto found;
- page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
+ page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
if (!page)
return NULL;
status = add_to_page_cache_lru(page, mapping, index,
--
1.7.6.4
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [patch 5/5] Btrfs: pass __GFP_WRITE for buffered write page allocations
2011-11-23 13:34 ` Johannes Weiner
@ 2011-11-23 13:34 ` Johannes Weiner
-1 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-11-23 13:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Rik van Riel, Minchan Kim, Michal Hocko,
Christoph Hellwig, Wu Fengguang, Dave Chinner, Jan Kara,
Shaohua Li, linux-mm, linux-fsdevel, linux-kernel
From: Johannes Weiner <jweiner@redhat.com>
Tell the page allocator that pages allocated for a buffered write are
expected to become dirty soon.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
fs/btrfs/file.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index dafdfa0..d673f4a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1081,7 +1081,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
again:
for (i = 0; i < num_pages; i++) {
pages[i] = find_or_create_page(inode->i_mapping, index + i,
- mask);
+ mask | __GFP_WRITE);
if (!pages[i]) {
faili = i - 1;
err = -ENOMEM;
--
1.7.6.4
^ permalink raw reply related [flat|nested] 28+ messages in thread