linux-kernel.vger.kernel.org archive mirror
* extra free kbytes tunable
@ 2013-02-12  2:01 dormando
  2013-02-15 22:21 ` Seiji Aguchi
  0 siblings, 1 reply; 21+ messages in thread
From: dormando @ 2013-02-12  2:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Randy Dunlap, Satoru Moriya, linux-kernel, linux-mm, lwoodman,
	Seiji Aguchi, akpm, hughd

Hi,

As discussed in this thread:
http://marc.info/?l=linux-mm&m=131490523222031&w=2
(with this cleanup as well: https://lkml.org/lkml/2011/9/2/225)

A tunable was proposed to allow specifying the distance between pages_min
and the low watermark before kswapd is kicked in to free up pages. I'd
like to re-open this thread since the patch did not appear to go anywhere.

We have a server workload wherein machines with 100G+ of "free" memory
(used by page cache), scattered but frequent random IO reads from 12+
SSDs, and 5Gbps+ of internet traffic will frequently hit direct reclaim
in a few different ways.

1) It'll run into small amounts of reclaim randomly (a few hundred
thousand pages).

2) A burst of reads or traffic can cause extra pressure, which kswapd
occasionally responds to by freeing up 40g+ of the pagecache all at once
(!) while pausing the system (Argh).

3) A blip in an upstream provider or failover from a peer causes the
kernel to allocate massive amounts of memory for retransmission
queues/etc, potentially along with buffered IO reads and (some, but not
often a ton) of new allocations from an application. This paired with 2)
can cause the box to stall for 15+ seconds.

We're seeing this more in 3.4/3.5/3.6 than we did in 2.6.38. Mass
reclaims are more common in newer kernels, but reclaims still happen in
all kernels unless min_free_kbytes is raised dramatically.

I've found that setting "lowmem_reserve_ratio" to something like "1 1 32"
(thus protecting the DMA32 zone) makes 2) happen less often, and makes 1)
generally less violent.
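
To illustrate what that does (a toy sketch of the lowmem_reserve idea, not
the kernel's code; the page counts are made up): each lower zone reserves
roughly higher_zone_pages / ratio pages that allocations targeting the
higher zones may not consume, so dropping DMA32's ratio from the default
256 down to 1 makes the reserve cover nearly everything above it.

    #include <stdio.h>

    int main(void)
    {
        unsigned long higher_zone_pages = 26214400UL; /* ~100G in 4k pages above DMA32 */
        unsigned int default_ratio = 256;             /* default for DMA32 */
        unsigned int tuned_ratio = 1;                 /* "1 1 32" sets it to 1 */

        printf("default reserve: %lu pages\n", higher_zone_pages / default_ratio);
        printf("tuned reserve:   %lu pages\n", higher_zone_pages / tuned_ratio);
        return 0;
    }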

Setting min_free_kbytes to 15G or more, paired with the above, has been
the best at mitigating the issue. This is simply trying to raise the
distance between the min and low watermarks. With min_free_kbytes set to
15000000, that gives us a whopping 1.8G (!!!) of leeway before slamming
into direct reclaim.
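
For reference, the leeway comes from the per-zone watermark math in
__setup_per_zone_wmarks() (low = min + min/4 of each zone's share of
pages_min); a rough userspace sketch of that arithmetic, with a made-up
50% zone share:

    #include <stdio.h>

    int main(void)
    {
        unsigned long min_free_kbytes = 15000000UL;   /* as above */
        unsigned long zone_min = min_free_kbytes / 2; /* hypothetical zone share */
        unsigned long zone_low = zone_min + (zone_min >> 2); /* min + min/4 */

        printf("zone min: %lu kB, zone low: %lu kB, gap: %lu kB\n",
               zone_min, zone_low, zone_low - zone_min);
        return 0;
    }

The exact leeway you see depends on how pages_min is distributed across
the zones.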

So, this patch is unfortunate but wonderful at letting us reclaim 10G+ of
otherwise lost memory. Could we please revisit it?

I saw a lot of discussion on doing this automatically, or on making kswapd
more efficient about it, and I'd love to do that. Beyond making kswapd
psychic I haven't seen any better options yet.

The issue is more complex than simply having an application warn of an
impending allocation, since this can happen via read load on disk or from
kernel page allocations for the network, or a combination of the two (or
three, if you add the app back in).

It's going to get worse as we push machines with faster SSD's and bigger
networks. I'm open to any ideas on how to make kswapd more efficient in
our case, or really anything at all that works.

I have more details, but I've cut this down as much as I could for this mail.

Thanks,
-Dormando


* RE: extra free kbytes tunable
  2013-02-12  2:01 extra free kbytes tunable dormando
@ 2013-02-15 22:21 ` Seiji Aguchi
  2013-02-15 22:25   ` Rik van Riel
  2013-02-15 22:49   ` Satoru Moriya
  0 siblings, 2 replies; 21+ messages in thread
From: Seiji Aguchi @ 2013-02-15 22:21 UTC (permalink / raw)
  To: dormando, Rik van Riel, Satoru Moriya
  Cc: Randy Dunlap, linux-kernel, linux-mm, lwoodman, akpm, hughd

Rik, Satoru,

Do you have any comments?

Seiji



* Re: extra free kbytes tunable
  2013-02-15 22:21 ` Seiji Aguchi
@ 2013-02-15 22:25   ` Rik van Riel
  2013-02-17 23:48     ` [PATCH] add " dormando
  2013-02-17 23:54     ` dormando
  2013-02-15 22:49   ` Satoru Moriya
  1 sibling, 2 replies; 21+ messages in thread
From: Rik van Riel @ 2013-02-15 22:25 UTC (permalink / raw)
  To: Seiji Aguchi
  Cc: dormando, Satoru Moriya, Randy Dunlap, linux-kernel, linux-mm,
	lwoodman, akpm, hughd

On 02/15/2013 05:21 PM, Seiji Aguchi wrote:
> Rik, Satoru,
>
> Do you have any comments?

IIRC at the time the patch was rejected as too inelegant.

However, nobody else seems to have come up with a better plan, and
there are users in need of a fix for this problem.

I would still like to see a fix for the problem merged upstream.



-- 
All rights reversed


* RE: extra free kbytes tunable
  2013-02-15 22:21 ` Seiji Aguchi
  2013-02-15 22:25   ` Rik van Riel
@ 2013-02-15 22:49   ` Satoru Moriya
  1 sibling, 0 replies; 21+ messages in thread
From: Satoru Moriya @ 2013-02-15 22:49 UTC (permalink / raw)
  To: Seiji Aguchi, dormando, Rik van Riel
  Cc: Randy Dunlap, linux-kernel, linux-mm, lwoodman, akpm, hughd


On 02/15/2013 05:21 PM, Seiji Aguchi wrote:
> Rik, Satoru,
> 
> Do you have any comments?
> 
> Seiji

Hmm, this seems to be what we wanted to know in the previous thread.

Because extra_free_kbytes is quite simple and it fixes the problem,
it should be merged upstream.

Regards,
Satoru



* [PATCH] add extra free kbytes tunable
  2013-02-15 22:25   ` Rik van Riel
@ 2013-02-17 23:48     ` dormando
  2013-02-19 23:29       ` Andrew Morton
  2013-02-17 23:54     ` dormando
  1 sibling, 1 reply; 21+ messages in thread
From: dormando @ 2013-02-17 23:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, akpm, hughd

From: Rik van Riel <riel@redhat.com>

Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.

This is useful for realtime applications that make system calls
and have a bound on the amount of memory allocated in any short
time period.  In such applications, extra_free_kbytes would be
set to an amount equal to or larger than the maximum amount of
memory allocated in any burst.

It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.
---
 Documentation/sysctl/vm.txt |   16 ++++++++++++++++
 include/linux/mmzone.h      |    2 +-
 include/linux/swap.h        |    2 ++
 kernel/sysctl.c             |   11 +++++++++--
 mm/page_alloc.c             |   39 +++++++++++++++++++++++++++++----------
 5 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..5d12bbd 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_writeback_centisecs
 - drop_caches
 - extfrag_threshold
+- extra_free_kbytes
 - hugepages_treat_as_movable
 - hugetlb_shm_group
 - laptop_mode
@@ -167,6 +168,21 @@ fragmentation index is <= extfrag_threshold. The default value is 500.

 ==============================================================

+extra_free_kbytes
+
+This parameter tells the VM to keep extra free memory between the threshold
+where background reclaim (kswapd) kicks in, and the threshold where direct
+reclaim (by allocating processes) kicks in.
+
+This is useful for workloads that require low latency memory allocations
+and have a bounded burstiness in memory allocations, for example a
+realtime application that receives and transmits network traffic
+(causing in-kernel memory allocations) with a maximum total message burst
+size of 200MB may need 200MB of extra free memory to avoid direct reclaim
+related latencies.
+
+==============================================================
+
 hugepages_treat_as_movable

 This parameter is only useful when kernelcore= is specified at boot time to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 73b64a3..7f8f883 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -881,7 +881,7 @@ static inline int is_dma(struct zone *zone)

 /* These two functions are used to setup the per zone pages min values */
 struct ctl_table;
-int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
+int free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 68df9c1..66a12c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,8 @@ struct swap_list_t {
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
+extern int min_free_kbytes;
+extern int extra_free_kbytes;
 extern unsigned long dirty_balance_reserve;
 extern unsigned int nr_free_buffer_pages(void);
 extern unsigned int nr_free_pagecache_pages(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c88878d..102e9a1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -104,7 +104,6 @@ extern char core_pattern[];
 extern unsigned int core_pipe_limit;
 #endif
 extern int pid_max;
-extern int min_free_kbytes;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
 extern int percpu_pagelist_fraction;
@@ -1246,10 +1245,18 @@ static struct ctl_table vm_table[] = {
 		.data		= &min_free_kbytes,
 		.maxlen		= sizeof(min_free_kbytes),
 		.mode		= 0644,
-		.proc_handler	= min_free_kbytes_sysctl_handler,
+		.proc_handler	= free_kbytes_sysctl_handler,
 		.extra1		= &zero,
 	},
 	{
+		.procname   = "extra_free_kbytes",
+		.data       = &extra_free_kbytes,
+		.maxlen     = sizeof(extra_free_kbytes),
+		.mode       = 0644,
+		.proc_handler   = free_kbytes_sysctl_handler,
+		.extra1     = &zero,
+	},
+	{
 		.procname	= "percpu_pagelist_fraction",
 		.data		= &percpu_pagelist_fraction,
 		.maxlen		= sizeof(percpu_pagelist_fraction),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9673d96..5380d84 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -194,8 +194,21 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "Movable",
 };

+/*
+ * Try to keep at least this much lowmem free.  Do not allow normal
+ * allocations below this point, only high priority ones. Automatically
+ * tuned according to the amount of memory in the system.
+ */
 int min_free_kbytes = 1024;

+/*
+ * Extra memory for the system to try freeing between the min and
+ * low watermarks.  Useful for workloads that require low latency
+ * memory allocations in bursts larger than the normal gap between
+ * low and min.
+ */
+int extra_free_kbytes;
+
 static unsigned long __meminitdata nr_kernel_pages;
 static unsigned long __meminitdata nr_all_pages;
 static unsigned long __meminitdata dma_reserve;
@@ -5217,6 +5230,7 @@ static void setup_per_zone_lowmem_reserve(void)
 static void __setup_per_zone_wmarks(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned long pages_low = extra_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
 	unsigned long flags;
@@ -5228,11 +5242,14 @@ static void __setup_per_zone_wmarks(void)
 	}

 	for_each_zone(zone) {
-		u64 tmp;
+		u64 min, low;

 		spin_lock_irqsave(&zone->lock, flags);
-		tmp = (u64)pages_min * zone->present_pages;
-		do_div(tmp, lowmem_pages);
+		min = (u64)pages_min * zone->present_pages;
+		do_div(min, lowmem_pages);
+		low = (u64)pages_low * zone->present_pages;
+		do_div(low, vm_total_pages);
+
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -5256,11 +5273,13 @@ static void __setup_per_zone_wmarks(void)
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->watermark[WMARK_MIN] = tmp;
+			zone->watermark[WMARK_MIN] = min;
 		}

-		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
-		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
+		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) +
+					low + (min >> 2);
+		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) +
+					low + (min >> 1);

 		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
@@ -5371,11 +5390,11 @@ int __meminit init_per_zone_wmark_min(void)
 module_init(init_per_zone_wmark_min)

 /*
- * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so
- *	that we can call two helper functions whenever min_free_kbytes
- *	changes.
+ * free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so
+ * that we can call two helper functions whenever min_free_kbytes
+ * or extra_free_kbytes changes.
  */
-int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
+int free_kbytes_sysctl_handler(ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
 	proc_dointvec(table, write, buffer, length, ppos);
-- 
1.7.4.1
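
With the patch applied, the knob is an ordinary sysctl file, so setting it
is just a write to /proc/sys/vm/extra_free_kbytes; a minimal sketch (the
204800 kB value mirrors the 200MB burst example in the documentation hunk
above):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/extra_free_kbytes", "w");

        if (!f) {
            perror("extra_free_kbytes");
            return 1;
        }
        /* 200MB of headroom for a 200MB maximum allocation burst */
        fprintf(f, "%d\n", 204800);
        fclose(f);
        return 0;
    }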



* Re: extra free kbytes tunable
  2013-02-15 22:25   ` Rik van Riel
  2013-02-17 23:48     ` [PATCH] add " dormando
@ 2013-02-17 23:54     ` dormando
  1 sibling, 0 replies; 21+ messages in thread
From: dormando @ 2013-02-17 23:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, akpm, hughd



On Fri, 15 Feb 2013, Rik van Riel wrote:

> On 02/15/2013 05:21 PM, Seiji Aguchi wrote:
> > Rik, Satoru,
> >
> > Do you have any comments?
>
> IIRC at the time the patch was rejected as too inelegant.
>
> However, nobody else seems to have come up with a better plan, and
> there are users in need of a fix for this problem.
>
> I would still like to see a fix for the problem merged upstream.

I merged in the cleanups to your original patch, rebased it off of Linus'
master from a day or two ago, and re-sent it (not sure how to preserve
authorship in that case? Apologies for goofing it).

I'm willing to argue for it, or to investigate better options. I'm going to
be stuck maintaining this patch since we can't really afford to have
production hang, or to waste 12G+ of RAM per box.



* Re: [PATCH] add extra free kbytes tunable
  2013-02-17 23:48     ` [PATCH] add " dormando
@ 2013-02-19 23:29       ` Andrew Morton
  2013-02-20  5:19         ` dormando
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2013-02-19 23:29 UTC (permalink / raw)
  To: dormando
  Cc: Rik van Riel, Seiji Aguchi, Satoru Moriya, Randy Dunlap,
	linux-kernel, linux-mm, lwoodman, hughd

On Sun, 17 Feb 2013 15:48:31 -0800 (PST)
dormando <dormando@rydia.net> wrote:

> Add a userspace visible knob to tell the VM to keep an extra amount
> of memory free, by increasing the gap between each zone's min and
> low watermarks.

The problem is that adding this tunable will constrain future VM
implementations.  We will forever need to at least retain the
pseudo-file.  We will also need to make some effort to retain its
behaviour.

It would of course be better to fix things so you don't need to tweak
VM internals to get acceptable behaviour.

You said:

: We have a server workload wherein machines with 100G+ of "free" memory
: (used by page cache), scattered but frequent random IO reads from 12+
: SSDs, and 5Gbps+ of internet traffic will frequently hit direct reclaim
: in a few different ways.
: 
: 1) It'll run into small amounts of reclaim randomly (a few hundred
: thousand pages).
: 
: 2) A burst of reads or traffic can cause extra pressure, which kswapd
: occasionally responds to by freeing up 40g+ of the pagecache all at once
: (!) while pausing the system (Argh).
: 
: 3) A blip in an upstream provider or failover from a peer causes the
: kernel to allocate massive amounts of memory for retransmission
: queues/etc, potentially along with buffered IO reads and (some, but not
: often a ton) of new allocations from an application. This paired with 2)
: can cause the box to stall for 15+ seconds.

Can we prioritise these?  2) looks just awful - kswapd shouldn't just
go off and free 40G of pagecache.  Do you know what's actually in that
pagecache?  Large number of small files or small number of (very) large
files?


* Re: [PATCH] add extra free kbytes tunable
  2013-02-19 23:29       ` Andrew Morton
@ 2013-02-20  5:19         ` dormando
  2013-02-22 17:56           ` Johannes Weiner
  0 siblings, 1 reply; 21+ messages in thread
From: dormando @ 2013-02-20  5:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Seiji Aguchi, Satoru Moriya, Randy Dunlap,
	linux-kernel, linux-mm, lwoodman, hughd

>
> The problem is that adding this tunable will constrain future VM
> implementations.  We will forever need to at least retain the
> pseudo-file.  We will also need to make some effort to retain its
> behaviour.
>
> It would of course be better to fix things so you don't need to tweak
> VM internals to get acceptable behaviour.

I sympathize with this. It's presently all that keeps us afloat though.
I'll whine about it again later if nothing else pans out.

> You said:
>
> : <SNIP>
>
> Can we prioritise these?  2) looks just awful - kswapd shouldn't just
> go off and free 40G of pagecache.  Do you know what's actually in that
> pagecache?  Large number of small files or small number of (very) large
> files?

We have a handful of huge files (6-12ish, 200G+ each) that are mmap'ed and
accessed via address. Occasionally madvise(MADV_WILLNEED) is applied to the
address ranges before attempting to use them. There's a mix of other
files but nothing significant. The mmaps are read-only and writes are done
via pwrite-ish functions.
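
For illustration, the access pattern is roughly this (a minimal sketch; the
path and the hinted range are made up, and error handling is trimmed):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("/data/bigfile", O_RDONLY); /* hypothetical path */
        struct stat st;
        char *base;

        if (fd < 0) { perror("open"); return 1; }
        fstat(fd, &st);

        /* read-only shared mapping of the whole file */
        base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* hint an upcoming range, then read from it */
        madvise(base + (4UL << 20), 1UL << 20, MADV_WILLNEED);
        printf("first byte of range: %d\n", base[4UL << 20]);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }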

I could use some guidance on inspecting/tracing the problem. I've been
trying to reproduce it in a lab, and respecting to 2)'s issue I've found:

- The amount of memory freed back up is either a percentage of total
memory or a percentage of free memory (a machine with 48G of RAM will
"only" free up an extra 4-7G).

- It's most likely to happen after a fresh boot, or if "echo 3 >
drop_caches" is applied with the application down. As the cache fills, the
system seems to get itself into trouble, but becomes more stable after
that. Unfortunately 1) and 3) still apply to a stable instance.

- Protecting the DMA32 zone by writing something like "1 1 32" into
lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.

- While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
hundred thousand pages before finding anything it actually wants to
reclaim (low vmeff). I've only been able to reproduce this from a clean
start. It can take up to 3 seconds before kswapd starts actually
reclaiming pages.

- So far as I can tell we're almost exclusively using 0 order allocations.
THP is disabled.

There's not much dirty memory involved. It's not flushing out writes while
reclaiming, it just kills off a massive amount of cached memory.

We're not running the machines particularly hard... often less than 30%
CPU usage at peak.


* Re: [PATCH] add extra free kbytes tunable
  2013-02-20  5:19         ` dormando
@ 2013-02-22 17:56           ` Johannes Weiner
  2013-02-26 10:47             ` Mel Gorman
  2013-03-01  9:22             ` Simon Jeons
  0 siblings, 2 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-02-22 17:56 UTC (permalink / raw)
  To: dormando
  Cc: Andrew Morton, Rik van Riel, Seiji Aguchi, Satoru Moriya,
	Randy Dunlap, linux-kernel, linux-mm, lwoodman, hughd,
	Mel Gorman

On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
> > <SNIP>
> >
> > Can we prioritise these?  2) looks just awful - kswapd shouldn't just
> > go off and free 40G of pagecache.  Do you know what's actually in that
> > pagecache?  Large number of small files or small number of (very) large
> > files?
> 
> We have a handful of huge files (6-12ish, 200G+ each) that are mmap'ed and
> accessed via address. Occasionally madvise(MADV_WILLNEED) is applied to the
> address ranges before attempting to use them. There's a mix of other
> files but nothing significant. The mmaps are read-only and writes are done
> via pwrite-ish functions.
> 
> I could use some guidance on inspecting/tracing the problem. I've been
> trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
> 
> - The amount of memory freed back up is either a percentage of total
> memory or a percentage of free memory (a machine with 48G of RAM will
> "only" free up an extra 4-7G).
> 
> - It's most likely to happen after a fresh boot, or if "echo 3 >
> drop_caches" is applied with the application down. As the cache fills, the
> system seems to get itself into trouble, but becomes more stable after
> that. Unfortunately 1) and 3) still apply to a stable instance.
> 
> - Protecting the DMA32 zone by writing something like "1 1 32" into
> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
> 
> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
> hundred thousand pages before finding anything it actually wants to
> reclaim (low vmeff). I've only been able to reproduce this from a clean
> start. It can take up to 3 seconds before kswapd starts actually
> reclaiming pages.
> 
> - So far as I can tell we're almost exclusively using 0 order allocations.
> THP is disabled.
> 
> There's not much dirty memory involved. It's not flushing out writes while
> reclaiming, it just kills off a massive amount of cached memory.

Mapped file pages have to get scanned twice before they are reclaimed
because we don't have enough usage information after the first scan.

In your case, when you start this workload after a fresh boot or
dropping the caches, there will be 48G of mapped file pages that have
never been scanned before and that need to be looked at twice.

Unfortunately, if kswapd does not make progress (and it won't for some
time at first), it will scan more and more aggressively with
increasing scan priority.  And when the 48G of pages are finally
cycled, kswapd's scan window is a large percentage of your machine's
memory, and it will free every single page in it.

I think we should think about capping kswapd zone reclaim cycles just
as we do for direct reclaim.  It's a little ridiculous that it can run
unbounded and reclaim every page in a zone without ever checking back
against the watermark.  We still increase the scan window evenly when
we don't make forward progress, but we are more carefully inching zone
levels back toward the watermarks.
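
To make the two-scan behaviour concrete, a simplified paraphrase of the
decision (loosely modelled on page_check_references() in mm/vmscan.c; the
real function distinguishes more cases):

    enum page_references {
        PAGEREF_RECLAIM, /* reclaim the page now */
        PAGEREF_KEEP,    /* leave it for a later scan */
    };

    /* A referenced mapped file page is remembered and kept on the first
     * scan; only a later scan that finds no new references reclaims it. */
    static enum page_references check_references(int referenced_ptes,
                                                 int *page_referenced_flag)
    {
        if (referenced_ptes) {
            *page_referenced_flag = 1; /* like SetPageReferenced() */
            return PAGEREF_KEEP;
        }
        return PAGEREF_RECLAIM;
    }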

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4883eb..8a4c446 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		.may_unmap = 1,
 		.may_swap = 1,
 		/*
-		 * kswapd doesn't want to be bailed out while reclaim. because
-		 * we want to put equal scanning pressure on each zone.
+		 * Even kswapd zone scans want to be bailed out after
+		 * reclaiming a good chunk of pages.  It will just
+		 * come back if the watermarks are still not met.
 		 */
-		.nr_to_reclaim = ULONG_MAX,
+		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.order = order,
 		.target_mem_cgroup = NULL,
 	};


* Re: [PATCH] add extra free kbytes tunable
  2013-02-22 17:56           ` Johannes Weiner
@ 2013-02-26 10:47             ` Mel Gorman
  2013-02-26 15:13               ` Johannes Weiner
  2013-03-01  9:22             ` Simon Jeons
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2013-02-26 10:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dormando, Andrew Morton, Rik van Riel, Seiji Aguchi,
	Satoru Moriya, Randy Dunlap, linux-kernel, linux-mm, lwoodman,
	hughd, Mel Gorman

On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote:
> > > <SNIP>
> > >
> 
> Mapped file pages have to get scanned twice before they are reclaimed
> because we don't have enough usage information after the first scan.
> 
> In your case, when you start this workload after a fresh boot or
> dropping the caches, there will be 48G of mapped file pages that have
> never been scanned before and that need to be looked at twice.
> 
> Unfortunately, if kswapd does not make progress (and it won't for some
> time at first), it will scan more and more aggressively with
> increasing scan priority.  And when the 48G of pages are finally
> cycled, kswapd's scan window is a large percentage of your machine's
> memory, and it will free every single page in it.
> 
> I think we should think about capping kswapd zone reclaim cycles just
> as we do for direct reclaim.  It's a little ridiculous that it can run
> unbounded and reclaim every page in a zone without ever checking back
> against the watermark.  We still increase the scan window evenly when
> we don't make forward progress, but we are more carefully inching zone
> levels back toward the watermarks.
> 

While on the surface I think this will appear to work, I worry that it
will cause kswapd's priorities to continually reset even when it's under
real pressure, as opposed to "failing to reclaim because of use-once".
With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
reset after each zone scan.

                if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
                        break;

It'll fail the watermark check and restart of course, but it does mean we
would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalanced_zones
pages scanned, which will have other consequences. It'll behave differently
but not necessarily better.
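
To spell out the concern with a toy model (not the kernel code): if every
zone sweep reclaims at least SWAP_CLUSTER_MAX pages, the break below fires
on the first pass and the priority never escalates beyond DEF_PRIORITY.

    #include <stdio.h>

    #define DEF_PRIORITY     12
    #define SWAP_CLUSTER_MAX 32UL

    int main(void)
    {
        unsigned long reclaimed_per_sweep = 32; /* hypothetical full batch */
        int priority;

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
            printf("zone sweep at priority %d\n", priority);
            if (reclaimed_per_sweep >= SWAP_CLUSTER_MAX)
                break; /* mirrors the check quoted above */
        }
        return 0;
    }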

In general, IO causing anonymous workloads to stall has gotten a lot worse
over the last few kernels without us properly realising it, other than
noticing that interactivity in the presence of IO has gone down the crapper
again. Late last week I fixed up an mmtests configuration that runs
memcachetest as the primary workload while doing varying amounts of IO in
the background, and found this:

http://www.csn.ul.ie/~mel/postings/reclaim-20130221/global-dhp__parallelio-memcachetest-ext4/hydra/report.html

Snippet looks like this;
                                            3.0.56                      3.6.10                       3.7.4                   3.8.0-rc4
                                          mainline                    mainline                    mainline                    mainline
Ops memcachetest-0M             10125.00 (  0.00%)          10091.00 ( -0.34%)          11038.00 (  9.02%)          10864.00 (  7.30%)
Ops memcachetest-749M           10097.00 (  0.00%)           8546.00 (-15.36%)           8770.00 (-13.14%)           4872.00 (-51.75%)
Ops memcachetest-1623M          10161.00 (  0.00%)           3149.00 (-69.01%)           3645.00 (-64.13%)           2760.00 (-72.84%)
Ops memcachetest-2498M           8095.00 (  0.00%)           2527.00 (-68.78%)           2461.00 (-69.60%)           2282.00 (-71.81%)
Ops memcachetest-3372M           7814.00 (  0.00%)           2369.00 (-69.68%)           2396.00 (-69.34%)           2323.00 (-70.27%)
Ops memcachetest-4247M           3818.00 (  0.00%)           2366.00 (-38.03%)           2391.00 (-37.38%)           2274.00 (-40.44%)
Ops memcachetest-5121M           3852.00 (  0.00%)           2335.00 (-39.38%)           2384.00 (-38.11%)           2233.00 (-42.03%)

This is showing transactions/second -- higher is better. 3.0.56 was pretty
bad in itself because a large amount of IO in the background wrecked the
throughput. It's gotten a lot worse since then. 3.8 results have
completed, but a quick check tells me the results are no better, which is
not surprising as there were no relevant commits since 3.8-rc4.

Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-749M                     0.00 (  0.00%)          36002.00 (-99.00%)          50499.00 (-99.00%)         155135.00 (-99.00%)
Ops swapin-1623M                    8.00 (  0.00%)         176816.00 (-2210100.00%)         172010.00 (-2150025.00%)         206212.00 (-2577550.00%)
Ops swapin-2498M                26291.00 (  0.00%)         195448.00 (-643.40%)         200911.00 (-664.18%)         209180.00 (-695.63%)
Ops swapin-3372M                27787.00 (  0.00%)         179221.00 (-544.98%)         183509.00 (-560.41%)         182371.00 (-556.32%)
Ops swapin-4247M               105081.00 (  0.00%)         157617.00 (-50.00%)         158054.00 (-50.41%)         167478.00 (-59.38%)
Ops swapin-5121M                89589.00 (  0.00%)         148095.00 (-65.30%)         151012.00 (-68.56%)         159079.00 (-77.57%)

The volume of swapins indicates that we are making the wrong reclaim
decisions.

Ops majorfaults-0M                  1.00 (  0.00%)              1.00 (  0.00%)              9.00 (-800.00%)              0.00 (  0.00%)
Ops majorfaults-749M                2.00 (  0.00%)           5356.00 (-267700.00%)           7872.00 (-393500.00%)          22472.00 (-1123500.00%)
Ops majorfaults-1623M              30.00 (  0.00%)          26950.00 (-89733.33%)          25074.00 (-83480.00%)          28815.00 (-95950.00%)
Ops majorfaults-2498M            6459.00 (  0.00%)          27719.00 (-329.15%)          27904.00 (-332.02%)          29001.00 (-349.00%)
Ops majorfaults-3372M            5133.00 (  0.00%)          25565.00 (-398.05%)          26444.00 (-415.18%)          25789.00 (-402.42%)
Ops majorfaults-4247M           19822.00 (  0.00%)          22767.00 (-14.86%)          22936.00 (-15.71%)          23475.00 (-18.43%)
Ops majorfaults-5121M           17689.00 (  0.00%)          21292.00 (-20.37%)          21820.00 (-23.35%)          22234.00 (-25.69%)

Major faults are also high.

I have not had enough time to investigate this because other bugs cropped
up. I can tell you that it's not bisectable as there are multiple root
causes and it's not always reliably reproducible (with this test at least).

Unfortunately I'm also dropping offline today for a week and then I'll
have to play catchup again when I get back. It's going to be close to 2
weeks before I can start figuring out what went wrong here but I plan to
start with 3.0 and work forward and see how I get on.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] add extra free kbytes tunable
  2013-02-26 10:47             ` Mel Gorman
@ 2013-02-26 15:13               ` Johannes Weiner
  2013-02-26 16:25                 ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-02-26 15:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: dormando, Andrew Morton, Rik van Riel, Seiji Aguchi,
	Satoru Moriya, Randy Dunlap, linux-kernel, linux-mm, lwoodman,
	hughd, Mel Gorman

On Tue, Feb 26, 2013 at 10:47:31AM +0000, Mel Gorman wrote:
> On Fri, Feb 22, 2013 at 12:56:34PM -0500, Johannes Weiner wrote:
> > > > <SNIP>
> > > >
> > 
> > Mapped file pages have to get scanned twice before they are reclaimed
> > because we don't have enough usage information after the first scan.
> > 
> > In your case, when you start this workload after a fresh boot or
> > dropping the caches, there will be 48G of mapped file pages that have
> > never been scanned before and that need to be looked at twice.
> > 
> > Unfortunately, if kswapd does not make progress (and it won't for some
> > time at first), it will scan more and more aggressively with
> > increasing scan priority.  And when the 48G of pages are finally
> > cycled, kswapd's scan window is a large percentage of your machine's
> > memory, and it will free every single page in it.
> > 
> > I think we should think about capping kswapd zone reclaim cycles just
> > as we do for direct reclaim.  It's a little ridiculous that it can run
> > unbounded and reclaim every page in a zone without ever checking back
> > against the watermark.  We still increase the scan window evenly when
> > we don't make forward progress, but we are more carefully inching zone
> > levels back toward the watermarks.
> > 
> 
> While on the surface I think this will appear to work, I worry that it
> will cause kswapd's priorities to continually reset even when it's under
> real pressure, as opposed to "failing to reclaim because of use-once".
> With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
> reset after each zone scan.
> 
>                 if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
>                         break;

But we hit that check now as well...?  I.e. unless there is a hard-to-reclaim
batch and kswapd is unable to make forward progress, priority
levels will always get reset after we have scanned all zones and reclaimed
SWAP_CLUSTER_MAX or more in the process.

All I'm arguing is that, if we hit a hard-to-reclaim batch, we should
continue to increase the number of pages to scan, but still bail out
if we reclaimed a batch successfully.  It does make sense to me to
look at more pages if we encounter unreclaimable ones.  It makes less
sense to me, however, to increase the reclaim goal as well in that
case.

> It'll fail the watermark check and restart of course, but it does mean we
> would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalanced_zones
> pages scanned, which will have other consequences. It'll behave differently
> but not necessarily better.

Right, I wasn't proposing to merge the patch as is.  But I do think
it's not okay that a batch of immediately unreclaimable pages can
cause kswapd to grow its reclaim target exponentially and we should
probably think about capping it one way or another.

shrink_slab()'s action is already based on the ratio between the
number of scanned pages and the number of lru pages, so I don't see
this as a fundamental issue, although it may require some tweaking.
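
For reference, that proportionality looks roughly like this in the 3.x
shrink_slab() (paraphrased; max_pass is the object count reported by
the shrinker):

	unsigned long long delta;

	/* slab pressure scales with the fraction of the LRU
	 * that was scanned */
	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= max_pass;
	do_div(delta, lru_pages + 1);
	total_scan += delta;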

> In general, IO causing anonymous workloads to stall has gotten a lot worse
> during the last few kernels without us properly realising it other than
> interactivity in the presence of IO has gone down the crapper again. Late
> last week I fixed up an mmtests configuration that runs memcachetest
> as the primary workload while doing varying amounts of IO in the
> background and found this
> 
> http://www.csn.ul.ie/~mel/postings/reclaim-20130221/global-dhp__parallelio-memcachetest-ext4/hydra/report.html
> 
> Snippet looks like this;
>                                             3.0.56                      3.6.10                       3.7.4                   3.8.0-rc4
>                                           mainline                    mainline                    mainline                    mainline
> Ops memcachetest-0M             10125.00 (  0.00%)          10091.00 ( -0.34%)          11038.00 (  9.02%)          10864.00 (  7.30%)
> Ops memcachetest-749M           10097.00 (  0.00%)           8546.00 (-15.36%)           8770.00 (-13.14%)           4872.00 (-51.75%)
> Ops memcachetest-1623M          10161.00 (  0.00%)           3149.00 (-69.01%)           3645.00 (-64.13%)           2760.00 (-72.84%)
> Ops memcachetest-2498M           8095.00 (  0.00%)           2527.00 (-68.78%)           2461.00 (-69.60%)           2282.00 (-71.81%)
> Ops memcachetest-3372M           7814.00 (  0.00%)           2369.00 (-69.68%)           2396.00 (-69.34%)           2323.00 (-70.27%)
> Ops memcachetest-4247M           3818.00 (  0.00%)           2366.00 (-38.03%)           2391.00 (-37.38%)           2274.00 (-40.44%)
> Ops memcachetest-5121M           3852.00 (  0.00%)           2335.00 (-39.38%)           2384.00 (-38.11%)           2233.00 (-42.03%)
> 
> This is showing transactions/second -- the more the better. 3.0.56 was pretty
> bad in itself because a large amount of IO in the background wrecked the
> throughput. It's gotten a lot worse since then. 3.8 results have
> completed but a quick check tells me the results are no better which is
> not surprising as there were no relevant commits since 3.8-rc4.

That does look horrible.  What kind of background IO is that?
Mapped/unmapped?  Read/write?  Linear or clustered?  I'm guessing some
of it is write at least as there are more page writes than swap outs.

> Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-749M                     0.00 (  0.00%)          36002.00 (-99.00%)          50499.00 (-99.00%)         155135.00 (-99.00%)
> Ops swapin-1623M                    8.00 (  0.00%)         176816.00 (-2210100.00%)         172010.00 (-2150025.00%)         206212.00 (-2577550.00%)
> Ops swapin-2498M                26291.00 (  0.00%)         195448.00 (-643.40%)         200911.00 (-664.18%)         209180.00 (-695.63%)
> Ops swapin-3372M                27787.00 (  0.00%)         179221.00 (-544.98%)         183509.00 (-560.41%)         182371.00 (-556.32%)
> Ops swapin-4247M               105081.00 (  0.00%)         157617.00 (-50.00%)         158054.00 (-50.41%)         167478.00 (-59.38%)
> Ops swapin-5121M                89589.00 (  0.00%)         148095.00 (-65.30%)         151012.00 (-68.56%)         159079.00 (-77.57%)
> 
> This is indicating that we are making the wrong reclaim decisions
> because of the amount of swapins.

I would have expected e986850 "mm,vmscan: only evict file pages when
we have plenty" to make some difference.  But depending on the IO
pattern, the file pages may all just sit on the active list.
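
For context, the check that commit introduced in get_scan_count()
(shown in context and paraphrased) only spares the anon lists while the
*inactive* file list is big enough; file pages parked on the active
list don't help:

	if (global_reclaim(sc)) {
		free = zone_page_state(zone, NR_FREE_PAGES);
		if (unlikely(file + free <= high_wmark_pages(zone))) {
			/* barely any file cache: force-scan anon */
			fraction[0] = 1;
			fraction[1] = 0;
			denominator = 1;
			goto out;
		} else if (!inactive_file_is_low_global(zone)) {
			/* plenty of inactive file cache: skip anon */
			fraction[0] = 0;
			fraction[1] = 1;
			denominator = 1;
			goto out;
		}
	}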

> Ops majorfaults-0M                  1.00 (  0.00%)              1.00 (  0.00%)              9.00 (-800.00%)              0.00 (  0.00%)
> Ops majorfaults-749M                2.00 (  0.00%)           5356.00 (-267700.00%)           7872.00 (-393500.00%)          22472.00 (-1123500.00%)
> Ops majorfaults-1623M              30.00 (  0.00%)          26950.00 (-89733.33%)          25074.00 (-83480.00%)          28815.00 (-95950.00%)
> Ops majorfaults-2498M            6459.00 (  0.00%)          27719.00 (-329.15%)          27904.00 (-332.02%)          29001.00 (-349.00%)
> Ops majorfaults-3372M            5133.00 (  0.00%)          25565.00 (-398.05%)          26444.00 (-415.18%)          25789.00 (-402.42%)
> Ops majorfaults-4247M           19822.00 (  0.00%)          22767.00 (-14.86%)          22936.00 (-15.71%)          23475.00 (-18.43%)
> Ops majorfaults-5121M           17689.00 (  0.00%)          21292.00 (-20.37%)          21820.00 (-23.35%)          22234.00 (-25.69%)
> 
> Major faults are also high.
> 
> I have not had enough time to investigate this because other bugs cropped
> up. I can tell you that it's not bisectable as there are multiple root
> causes and it's not always reliably reproducible (with this test at least).
> 
> Unfortunately I'm also dropping offline today for a week and then I'll
> have to play catchup again when I get back. It's going to be close to 2
> weeks before I can start figuring out what went wrong here but I plan to
> start with 3.0 and work forward and see how I get on.

Would you have that mmtest configuration available somewhere by any
chance?  I can't see it in mmtests.git.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-02-26 15:13               ` Johannes Weiner
@ 2013-02-26 16:25                 ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2013-02-26 16:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dormando, Andrew Morton, Rik van Riel, Seiji Aguchi,
	Satoru Moriya, Randy Dunlap, linux-kernel, linux-mm, lwoodman,
	hughd, Mel Gorman

On Tue, Feb 26, 2013 at 10:13:15AM -0500, Johannes Weiner wrote:
> > > <SNIP>
> > > I think we should think about capping kswapd zone reclaim cycles just
> > > as we do for direct reclaim.  It's a little ridiculous that it can run
> > > unbounded and reclaim every page in a zone without ever checking back
> > > against the watermark.  We still increase the scan window evenly when
> > > we don't make forward progress, but we are more carefully inching zone
> > > levels back toward the watermarks.
> > > 
> > 
> > While on the surface I think this will appear to work, I worry that it
> > will cause kswapd's priorities to continually reset even when it's under
> > real pressure as opposed to "failing to reclaim because of use-once".
> > With nr_to_reclaim always at SWAP_CLUSTER_MAX, we'll hit this check and
> > reset after each zone scan.
> > 
> >                 if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> >                         break;
> 
> But we hit that check now as well...? 

Eventually yes.

> I.e. unless there is a hard to
> reclaim batch and kswapd is unable to make forward progress, priority
> levels will always get reset after we scanned all zones and reclaimed
> SWAP_CLUSTER_MAX or more in the process.
> 

The reset happens after it has reclaimed a lot of pages. I agree with
you that this is likely the wrong thing to do. I'm just pointing out
that this simple patch changes behaviour in a big way.

> All I'm arguing is that, if we hit a hard to reclaim batch we should
> continue to increase the number of pages to scan, but still bail out
> if we reclaimed a batch successfully.  It does make sense to me to
> look at more pages if we encounter unreclaimable ones.  It makes less
> sense to me, however, to increase the reclaim goal as well in that
> case.
> 

Bail out from the reclaim maybe but care should be taken to ensure we do
not hammer slab on each "bail" or reset the scanning priorities if the
watermark was not met by that batch of SWAP_CLUSTER_MAX reclaims.

We also have to think about what it means for pressure being applied
equally to each zone. We will still apply equal scanning pressure but
not necessarily reclaim pressure. Does that matter? I don't know.

> > It'll fail the watermark check and restart of course but it does mean we
> > would call shrink_slab() for every SWAP_CLUSTER_MAX*nr_unbalanced_zones
> > pages scanned which will have other consequences. It'll behave differently
> > but not necessarily better.
> 
> Right, I wasn't proposing to merge the patch as is.  But I do think
> it's not okay that a batch of immediately unreclaimable pages can
> cause kswapd to grow its reclaim target exponentially and we should
> probably think about capping it one way or another.
> 

I agree with you. MMtest results I looked at over the weekend showed
that kswapd tends to be extremely spiky: doing nothing, followed by
reclaiming an excessive amount of memory, and then going back to doing
nothing. This partially explains it.

> shrink_slab()'s action is already based on the ratio between the
> number of scanned pages and the number of lru pages, so I don't see
> this as a fundamental issue, although it may require some tweaking.
> 
> > In general, IO causing anonymous workloads to stall has gotten a lot worse
> > during the last few kernels without us properly realising it other than
> > interactivity in the presence of IO has gone down the crapper again. Late
> > last week I fixed up an mmtests configuration that runs memcachetest
> > as the primary workload while doing varying amounts of IO in the
> > background and found this
> > 
> > http://www.csn.ul.ie/~mel/postings/reclaim-20130221/global-dhp__parallelio-memcachetest-ext4/hydra/report.html
> > 
> > Snippet looks like this;
> >                                             3.0.56                      3.6.10                       3.7.4                   3.8.0-rc4
> >                                           mainline                    mainline                    mainline                    mainline
> > Ops memcachetest-0M             10125.00 (  0.00%)          10091.00 ( -0.34%)          11038.00 (  9.02%)          10864.00 (  7.30%)
> > Ops memcachetest-749M           10097.00 (  0.00%)           8546.00 (-15.36%)           8770.00 (-13.14%)           4872.00 (-51.75%)
> > Ops memcachetest-1623M          10161.00 (  0.00%)           3149.00 (-69.01%)           3645.00 (-64.13%)           2760.00 (-72.84%)
> > Ops memcachetest-2498M           8095.00 (  0.00%)           2527.00 (-68.78%)           2461.00 (-69.60%)           2282.00 (-71.81%)
> > Ops memcachetest-3372M           7814.00 (  0.00%)           2369.00 (-69.68%)           2396.00 (-69.34%)           2323.00 (-70.27%)
> > Ops memcachetest-4247M           3818.00 (  0.00%)           2366.00 (-38.03%)           2391.00 (-37.38%)           2274.00 (-40.44%)
> > Ops memcachetest-5121M           3852.00 (  0.00%)           2335.00 (-39.38%)           2384.00 (-38.11%)           2233.00 (-42.03%)
> > 
> > This is showing transactions/second -- the more the better. 3.0.56 was pretty
> > bad in itself because a large amount of IO in the background wrecked the
> > throughput. It's gotten a lot worse since then. 3.8 results have
> > completed but a quick check tells me the results are no better which is
> > not surprising as there were no relevant commits since 3.8-rc4.
> 
> That does look horrible.  What kind of background IO is that?

dd to a large file conv=fdatasync

> Mapped/unmapped? 

unmapped.

> Read/write? 

write

> Linear or clustered? 

Not sure what you mean by "clustered". It's a linear write rather than a
random write.

The objective of the test was to detect one aspect of situations like
"during backup my main application performance goes to hell". It would be
possible to generate other types of background IO as there are elements
of the config-global-dhp__io-largeread-starvation test that can be broken
out and reused.

> I'm guessing some
> of it is write at least as there are more page writes than swap outs.
> 
> > Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> > Ops swapin-749M                     0.00 (  0.00%)          36002.00 (-99.00%)          50499.00 (-99.00%)         155135.00 (-99.00%)
> > Ops swapin-1623M                    8.00 (  0.00%)         176816.00 (-2210100.00%)         172010.00 (-2150025.00%)         206212.00 (-2577550.00%)
> > Ops swapin-2498M                26291.00 (  0.00%)         195448.00 (-643.40%)         200911.00 (-664.18%)         209180.00 (-695.63%)
> > Ops swapin-3372M                27787.00 (  0.00%)         179221.00 (-544.98%)         183509.00 (-560.41%)         182371.00 (-556.32%)
> > Ops swapin-4247M               105081.00 (  0.00%)         157617.00 (-50.00%)         158054.00 (-50.41%)         167478.00 (-59.38%)
> > Ops swapin-5121M                89589.00 (  0.00%)         148095.00 (-65.30%)         151012.00 (-68.56%)         159079.00 (-77.57%)
> > 
> > This is indicating that we are making the wrong reclaim decisions
> > because of the amount of swapins.
> 
> I would have expected e986850 "mm,vmscan: only evict file pages when
> we have plenty" to make some difference.  But depending on the IO
> pattern, the file pages may all just sit on the active list.
> 

Maybe it did help and what we're seeing is a side-effect of cda73a10 (mm:
do not sleep in balance_pgdat if there's no i/o congestion) that is keeping
kswapd awake and reclaiming for longer. It would not be the first time we
removed a congestion_wait() to find that we depended on that sledgehammer.

> > Ops majorfaults-0M                  1.00 (  0.00%)              1.00 (  0.00%)              9.00 (-800.00%)              0.00 (  0.00%)
> > Ops majorfaults-749M                2.00 (  0.00%)           5356.00 (-267700.00%)           7872.00 (-393500.00%)          22472.00 (-1123500.00%)
> > Ops majorfaults-1623M              30.00 (  0.00%)          26950.00 (-89733.33%)          25074.00 (-83480.00%)          28815.00 (-95950.00%)
> > Ops majorfaults-2498M            6459.00 (  0.00%)          27719.00 (-329.15%)          27904.00 (-332.02%)          29001.00 (-349.00%)
> > Ops majorfaults-3372M            5133.00 (  0.00%)          25565.00 (-398.05%)          26444.00 (-415.18%)          25789.00 (-402.42%)
> > Ops majorfaults-4247M           19822.00 (  0.00%)          22767.00 (-14.86%)          22936.00 (-15.71%)          23475.00 (-18.43%)
> > Ops majorfaults-5121M           17689.00 (  0.00%)          21292.00 (-20.37%)          21820.00 (-23.35%)          22234.00 (-25.69%)
> > 
> > Major faults are also high.
> > 
> > I have not had enough time to investigate this because other bugs cropped
> > up. I can tell you that it's not bisectable as there are multiple root
> > causes and it's not always reliably reproducible (with this test at least).
> > 
> > Unfortunately I'm also dropping offline today for a week and then I'll
> > have to play catchup again when I get back. It's going to be close to 2
> > weeks before I can start figuring out what went wrong here but I plan to
> > start with 3.0 and work forward and see how I get on.
> 
> Would you have that mmtest configuration available somewhere by any
> chance?  I can't see it in mmtests.git.

Of course. It's configs/config-global-dhp__parallelio-memcachetest. The
stock config does not create a filesystem; the diff between what's in
mmtests and what I used is below.

The current mmtests release also does not have a module that can compare
parallelio tests, but I've pushed the support to git. If you git pull,
something like this should report something sensible:

cp configs/config-global-dhp__parallelio-memcachetest config
./run-mmtests vanilla
(build boot new kernel)
./run-mmtests patched

To get a report, do either this:
./bin/compare-kernel.sh -d work/log -b parallelio -n vanilla,patched

or

cd work/log
../../compare-kernel.sh

This is the config file diff

@@ -43,10 +43,10 @@
 #export TESTDISK_RAID_OFFSET=63
 #export TESTDISK_RAID_SIZE=250019532
 #export TESTDISK_RAID_TYPE=raid0
-#export TESTDISK_PARTITION=/dev/sda6
-#export TESTDISK_FILESYSTEM=ext3
-#export TESTDISK_MKFS_PARAM="-f -d agcount=8"
-#export TESTDISK_MOUNT_ARGS=""
+export TESTDISK_PARTITION=/dev/sda6
+export TESTDISK_FILESYSTEM=ext4
+export TESTDISK_MKFS_PARAM=
+export TESTDISK_MOUNT_ARGS=
 #
 # Test NFS disk to setup (optional)
 #export TESTDISK_NFS_MOUNT=192.168.10.7:/exports/`hostname`

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-02-22 17:56           ` Johannes Weiner
  2013-02-26 10:47             ` Mel Gorman
@ 2013-03-01  9:22             ` Simon Jeons
  2013-03-01  9:31               ` Simon Jeons
  1 sibling, 1 reply; 21+ messages in thread
From: Simon Jeons @ 2013-03-01  9:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dormando, Andrew Morton, Rik van Riel, Seiji Aguchi,
	Satoru Moriya, Randy Dunlap, linux-kernel, linux-mm, lwoodman,
	hughd, Mel Gorman

Hi Johannes,

On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
>>> The problem is that adding this tunable will constrain future VM
>>> implementations.  We will forever need to at least retain the
>>> pseudo-file.  We will also need to make some effort to retain its
>>> behaviour.
>>>
>>> It would of course be better to fix things so you don't need to tweak
>>> VM internals to get acceptable behaviour.
>> I sympathize with this. It's presently all that keeps us afloat though.
>> I'll whine about it again later if nothing else pans out.
>>
>>> You said:
>>>
>>> : We have a server workload wherein machines with 100G+ of "free" memory
>>> : (used by page cache), scattered but frequent random io reads from 12+
>>> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct reclaim
>>> : in a few different ways.
>>> :
>>> : 1) It'll run into small amounts of reclaim randomly (a few hundred
>>> : thousand).
>>> :
>>> : 2) A burst of reads or traffic can cause extra pressure, which kswapd
>>> : occasionally responds to by freeing up 40g+ of the pagecache all at once
>>> : (!) while pausing the system (Argh).
>>> :
>>> : 3) A blip in an upstream provider or failover from a peer causes the
>>> : kernel to allocate massive amounts of memory for retransmission
>>> : queues/etc, potentially along with buffered IO reads and (some, but not
>>> : often a ton) of new allocations from an application. This paired with 2)
>>> : can cause the box to stall for 15+ seconds.
>>>
>>> Can we prioritise these?  2) looks just awful - kswapd shouldn't just
>>> go off and free 40G of pagecache.  Do you know what's actually in that
>>> pagecache?  Large number of small files or small number of (very) large
>>> files?
>> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
>> accessed via address. occasionally madvise (WILLNEED) applied to the
>> address ranges before attempting to use them. There're a mix of other
>> files but nothing significant. The mmap's are READONLY and writes are done
>> via pwrite-ish functions.
>>
>> I could use some guidance on inspecting/tracing the problem. I've been
>> trying to reproduce it in a lab, and respecting to 2)'s issue I've found:
>>
>> - The amount of memory freed back up is either a percentage of total
>> memory or a percentage of free memory. (a machine with 48G of ram will
>> "only" free up an extra 4-7g)
>>
>> - It's most likely to happen after a fresh boot, or if "3 > drop_caches"
>> is applied with the application down. As it fills it seems to get itself
>> into trouble, but becomes more stable after that. Unfortunately 1) and 3)
>> still apply to a stable instance.
>>
>> - Protecting the DMA32 zone with something like "1 1 32" into
>> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>>
>> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to a few
>> hundred thousand pages before finding anything it actually wants to
>> reclaim (low vmeff). I've only been able to reproduce this from a clean
>> start. It can take up to 3 seconds before kswapd starts actually
>> reclaiming pages.
>>
>> - So far as I can tell we're almost exclusively using 0 order allocations.
>> THP is disabled.
>>
>> There's not much dirty memory involved. It's not flushing out writes while
>> reclaiming, it just kills off massive amounts of cached memory.
> Mapped file pages have to get scanned twice before they are reclaimed
> because we don't have enough usage information after the first scan.

It seems that only VM_EXEC mapped file pages are protected.
A possible issue in the page reclaim subsystem:
static inline int page_is_file_cache(struct page *page)
{
     return !PageSwapBacked(page);
}
AFAIK, PG_swapbacked is set when an anonymous page is added to the swap
cache, and cleared when it is removed from the swap cache. So anonymous
pages which are reclaimed and added to the swap cache won't have this
flag, and will then be treated as file-backed pages? Is that a bug? And
why does __add_to_swap_cache() increase NR_FILE_PAGES when a page is
successfully added to the radix tree?
>
> In your case, when you start this workload after a fresh boot or
> dropping the caches, there will be 48G of mapped file pages that have
> never been scanned before and that need to be looked at twice.
>
> Unfortunately, if kswapd does not make progress (and it won't for some
> time at first), it will scan more and more aggressively with

Why does kswapd not make progress for some time at first?

> increasing scan priority.  And when the 48G of pages are finally
> cycled, kswapd's scan window is a large percentage of your machine's
> memory, and it will free every single page in it.
>
> I think we should think about capping kswapd zone reclaim cycles just
> as we do for direct reclaim.  It's a little ridiculous that it can run
> unbounded and reclaim every page in a zone without ever checking back
> against the watermark.  We still increase the scan window evenly when
> we don't make forward progress, but we are more carefully inching zone
> levels back toward the watermarks.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c4883eb..8a4c446 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>   		.may_unmap = 1,
>   		.may_swap = 1,
>   		/*
> -		 * kswapd doesn't want to be bailed out while reclaim. because
> -		 * we want to put equal scanning pressure on each zone.
> +		 * Even kswapd zone scans want to be bailed out after
> +		 * reclaiming a good chunk of pages.  It will just
> +		 * come back if the watermarks are still not met.
>   		 */
> -		.nr_to_reclaim = ULONG_MAX,
> +		.nr_to_reclaim = SWAP_CLUSTER_MAX,
>   		.order = order,
>   		.target_mem_cgroup = NULL,
>   	};
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-03-01  9:22             ` Simon Jeons
@ 2013-03-01  9:31               ` Simon Jeons
  2013-03-01 22:33                 ` Hugh Dickins
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Jeons @ 2013-03-01  9:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dormando, Andrew Morton, Rik van Riel, Seiji Aguchi,
	Satoru Moriya, Randy Dunlap, linux-kernel, linux-mm, lwoodman,
	hughd, Mel Gorman

On 03/01/2013 05:22 PM, Simon Jeons wrote:
> Hi Johannes,
>
> On 02/23/2013 01:56 AM, Johannes Weiner wrote:
>> On Tue, Feb 19, 2013 at 09:19:27PM -0800, dormando wrote:
>>>> The problem is that adding this tunable will constrain future VM
>>>> implementations.  We will forever need to at least retain the
>>>> pseudo-file.  We will also need to make some effort to retain its
>>>> behaviour.
>>>>
>>>> It would of course be better to fix things so you don't need to tweak
>>>> VM internals to get acceptable behaviour.
>>> I sympathize with this. It's presently all that keeps us afloat though.
>>> I'll whine about it again later if nothing else pans out.
>>>
>>>> You said:
>>>>
>>>> : We have a server workload wherein machines with 100G+ of "free" 
>>>> memory
>>>> : (used by page cache), scattered but frequent random io reads from 
>>>> 12+
>>>> : SSD's, and 5gbps+ of internet traffic, will frequently hit direct 
>>>> reclaim
>>>> : in a few different ways.
>>>> :
>>>> : 1) It'll run into small amounts of reclaim randomly (a few hundred
>>>> : thousand).
>>>> :
>>>> : 2) A burst of reads or traffic can cause extra pressure, which 
>>>> kswapd
>>>> : occasionally responds to by freeing up 40g+ of the pagecache all 
>>>> at once
>>>> : (!) while pausing the system (Argh).
>>>> :
>>>> : 3) A blip in an upstream provider or failover from a peer causes the
>>>> : kernel to allocate massive amounts of memory for retransmission
>>>> : queues/etc, potentially along with buffered IO reads and (some, 
>>>> but not
>>>> : often a ton) of new allocations from an application. This paired 
>>>> with 2)
>>>> : can cause the box to stall for 15+ seconds.
>>>>
>>>> Can we prioritise these?  2) looks just awful - kswapd shouldn't just
>>>> go off and free 40G of pagecache.  Do you know what's actually in that
>>>> pagecache?  Large number of small files or small number of (very) 
>>>> large
>>>> files?
>>> We have a handful of huge files (6-12ish 200g+) that are mmap'ed and
>>> accessed via address. occasionally madvise (WILLNEED) applied to the
>>> address ranges before attempting to use them. There're a mix of other
>>> files but nothing significant. The mmap's are READONLY and writes 
>>> are done
>>> via pwrite-ish functions.
>>>
>>> I could use some guidance on inspecting/tracing the problem. I've been
>>> trying to reproduce it in a lab, and with respect to 2)'s issue I've
>>> found:
>>>
>>> - The amount of memory freed back up is either a percentage of total
>>> memory or a percentage of free memory. (a machine with 48G of ram will
>>> "only" free up an extra 4-7g)
>>>
>>> - It's most likely to happen after a fresh boot, or if "3 > 
>>> drop_caches"
>>> is applied with the application down. As it fills it seems to get 
>>> itself
>>> into trouble, but becomes more stable after that. Unfortunately 1) 
>>> and 3)
>>> still apply to a stable instance.
>>>
>>> - Protecting the DMA32 zone with something like "1 1 32" into
>>> lowmem_reserve_ratio makes the mass-reclaiming less likely to happen.
>>>
>>> - While watching "sar -B 1" I'll see kswapd wake up, and scan up to 
>>> a few
>>> hundred thousand pages before finding anything it actually wants to
>>> reclaim (low vmeff). I've only been able to reproduce this from a clean
>>> start. It can take up to 3 seconds before kswapd starts actually
>>> reclaiming pages.
>>>
>>> - So far as I can tell we're almost exclusively using 0 order 
>>> allocations.
>>> THP is disabled.
>>>
>>> There's not much dirty memory involved. It's not flushing out writes 
>>> while
>>> reclaiming, it just kills off massive amounts of cached memory.
>> Mapped file pages have to get scanned twice before they are reclaimed
>> because we don't have enough usage information after the first scan.
>
> It seems that only VM_EXEC mapped file pages are protected.
> A possible issue in the page reclaim subsystem:
> static inline int page_is_file_cache(struct page *page)
> {
>     return !PageSwapBacked(page);
> }
> AFAIK, PG_swapbacked is set when an anonymous page is added to the swap
> cache, and cleared when it is removed from the swap cache. So anonymous
> pages which are reclaimed and added to the swap cache won't have this
> flag, and will then be treated as

s/are/aren't

> file-backed pages? Is that a bug? And why does __add_to_swap_cache()
> increase NR_FILE_PAGES when a page is successfully added to the radix
> tree?
>>
>> In your case, when you start this workload after a fresh boot or
>> dropping the caches, there will be 48G of mapped file pages that have
>> never been scanned before and that need to be looked at twice.
>>
>> Unfortunately, if kswapd does not make progress (and it won't for some
>> time at first), it will scan more and more aggressively with
>
> Why does kswapd not make progress for some time at first?
>
>> increasing scan priority.  And when the 48G of pages are finally
>> cycled, kswapd's scan window is a large percentage of your machine's
>> memory, and it will free every single page in it.
>>
>> I think we should think about capping kswapd zone reclaim cycles just
>> as we do for direct reclaim.  It's a little ridiculous that it can run
>> unbounded and reclaim every page in a zone without ever checking back
>> against the watermark.  We still increase the scan window evenly when
>> we don't make forward progress, but we are more carefully inching zone
>> levels back toward the watermarks.
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c4883eb..8a4c446 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2645,10 +2645,11 @@ static unsigned long balance_pgdat(pg_data_t 
>> *pgdat, int order,
>>           .may_unmap = 1,
>>           .may_swap = 1,
>>           /*
>> -         * kswapd doesn't want to be bailed out while reclaim. because
>> -         * we want to put equal scanning pressure on each zone.
>> +         * Even kswapd zone scans want to be bailed out after
>> +         * reclaiming a good chunk of pages.  It will just
>> +         * come back if the watermarks are still not met.
>>            */
>> -        .nr_to_reclaim = ULONG_MAX,
>> +        .nr_to_reclaim = SWAP_CLUSTER_MAX,
>>           .order = order,
>>           .target_mem_cgroup = NULL,
>>       };
>>
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-03-01  9:31               ` Simon Jeons
@ 2013-03-01 22:33                 ` Hugh Dickins
  2013-03-02  0:10                   ` Simon Jeons
  0 siblings, 1 reply; 21+ messages in thread
From: Hugh Dickins @ 2013-03-01 22:33 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Johannes Weiner, dormando, Andrew Morton, Rik van Riel,
	Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, Mel Gorman

On Fri, 1 Mar 2013, Simon Jeons wrote:
> On 03/01/2013 05:22 PM, Simon Jeons wrote:
> > On 02/23/2013 01:56 AM, Johannes Weiner wrote:
> > > Mapped file pages have to get scanned twice before they are reclaimed
> > > because we don't have enough usage information after the first scan.
> > 
> > It seems that only VM_EXEC mapped file pages are protected.
> > A possible issue in the page reclaim subsystem:
> > static inline int page_is_file_cache(struct page *page)
> > {
> >     return !PageSwapBacked(page);
> > }
> > AFAIK, PG_swapbacked is set when an anonymous page is added to the swap
> > cache, and cleared when it is removed from the swap cache. So anonymous
> > pages which are reclaimed and added to the swap cache won't have this
> > flag, and will then be treated as
> 
> s/are/aren't

PG_swapbacked != PG_swapcache
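
A minimal illustration of the difference, paraphrased from the 3.x
mm/rmap.c: PG_swapbacked is set when an anonymous page is first mapped,
long before the swap cache is involved, while PG_swapcache only appears
once the page actually enters swap cache:

	void page_add_new_anon_rmap(struct page *page,
				    struct vm_area_struct *vma,
				    unsigned long address)
	{
		/* every new anon page is swap-backed from birth */
		SetPageSwapBacked(page);
		/* ... counters and anon_vma wiring elided ... */
	}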

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-03-01 22:33                 ` Hugh Dickins
@ 2013-03-02  0:10                   ` Simon Jeons
  2013-03-02  1:42                     ` Hugh Dickins
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Jeons @ 2013-03-02  0:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, dormando, Andrew Morton, Rik van Riel,
	Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, Mel Gorman

On 03/02/2013 06:33 AM, Hugh Dickins wrote:
> On Fri, 1 Mar 2013, Simon Jeons wrote:
>> On 03/01/2013 05:22 PM, Simon Jeons wrote:
>>> On 02/23/2013 01:56 AM, Johannes Weiner wrote:
>>>> Mapped file pages have to get scanned twice before they are reclaimed
>>>> because we don't have enough usage information after the first scan.
>>> It seems that only VM_EXEC mapped file pages are protected.
>>> A possible issue in the page reclaim subsystem:
>>> static inline int page_is_file_cache(struct page *page)
>>> {
>>>      return !PageSwapBacked(page);
>>> }
>>> AFAIK, PG_swapbacked is set when an anonymous page is added to the swap
>>> cache, and cleared when it is removed from the swap cache. So anonymous
>>> pages which are reclaimed and added to the swap cache won't have this
>>> flag, and will then be treated as
>> s/are/aren't
> PG_swapbacked != PG_swapcache

Oh, I see. Thanks Hugh, and thanks for your patience. :)

Why does __add_to_swap_cache() increase NR_FILE_PAGES when a page is
successfully added to the radix tree? This is an anonymous page, not a
file-backed page.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-03-02  0:10                   ` Simon Jeons
@ 2013-03-02  1:42                     ` Hugh Dickins
  2013-03-02  2:42                       ` Simon Jeons
  0 siblings, 1 reply; 21+ messages in thread
From: Hugh Dickins @ 2013-03-02  1:42 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Johannes Weiner, dormando, Andrew Morton, Rik van Riel,
	Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, Mel Gorman

On Sat, 2 Mar 2013, Simon Jeons wrote:
> 
> Why does __add_to_swap_cache() increase NR_FILE_PAGES when a page is
> successfully added to the radix tree? This is an anonymous page, not a
> file-backed page.

Right, that's hard to understand without historical background.

I think the quick answer would be that we used to (and still do) think
of file-cache and swap-cache as two halves of page-cache.  And then when
someone changed the way stats were gathered, they couldn't very well
name the stat for page-cache pages NR_PAGE_PAGES, so they called it
NR_FILE_PAGES - but it still included swap.

We have tried down the years to keep the info shown in /proc/meminfo
(for example, but it is the prime example) consistent across releases,
while adding new lines and new distinctions.

But it has often been hard to find good enough short enough names for
those new distinctions: when 2.6.28 split the LRUs between file-backed
and swap-backed, it used "anon" for swap-backed in /proc/meminfo.

So you'll find that shmem and swap are counted as file in some places
and anon in others, and it's hard to grasp which is where and why,
without remembering the history.

I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts
total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo:
so it's undoing what you observe __add_to_swap_cache() to be doing.
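
That subtraction, paraphrased from the 3.x fs/proc/meminfo.c:

	/* "Cached" takes the swap-cache pages (and buffers)
	 * back out of NR_FILE_PAGES */
	cached = global_page_state(NR_FILE_PAGES) -
			total_swapcache_pages - i.bufferram;
	if (cached < 0)
		cached = 0;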

It's quite possible that if you went through all the users of
NR_FILE_PAGES, you'd find it makes much more sense to leave out
the swap-cache pages, and just add those on where needed.

But you might find a few places where it's hard to decide whether
the swap-cache pages were ever intended to be included or not, and
hard to decide if it's safe to change those numbers now or not.

Hugh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-03-02  1:42                     ` Hugh Dickins
@ 2013-03-02  2:42                       ` Simon Jeons
  2013-03-02  3:08                         ` Hugh Dickins
  0 siblings, 1 reply; 21+ messages in thread
From: Simon Jeons @ 2013-03-02  2:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, dormando, Andrew Morton, Rik van Riel,
	Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, Mel Gorman

On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
>> Why does __add_to_swap_cache() increase NR_FILE_PAGES when a page is
>> successfully added to the radix tree? This is an anonymous page, not a
>> file-backed page.
> Right, that's hard to understand without historical background.
>
> I think the quick answer would be that we used to (and still do) think
> of file-cache and swap-cache as two halves of page-cache.  And then when

Should a shmem page be treated as file-cache or swap-cache? It is
strange, since it consists of anonymous pages and these pages back files.

> someone changed the way stats were gathered, they couldn't very well
> name the stat for page-cache pages NR_PAGE_PAGES, so they called it
> NR_FILE_PAGES - but it still included swap.
>
> We have tried down the years to keep the info shown in /proc/meminfo
> (for example, but it is the prime example) consistent across releases,
> while adding new lines and new distinctions.
>
> But it has often been hard to find good enough short enough names for
> those new distinctions: when 2.6.28 split the LRUs between file-backed
> and swap-backed, it used "anon" for swap-backed in /proc/meminfo.
>
> So you'll find that shmem and swap are counted as file in some places
> and anon in others, and it's hard to grasp which is where and why,
> without remembering the history.
>
> I notice that fs/proc/meminfo.c:meminfo_proc_show() subtracts
> total_swapcache_pages from the NR_FILE_PAGES count for /proc/meminfo:
> so it's undoing what you observe __add_to_swap_cache() to be doing.
>
> It's quite possible that if you went through all the users of
> NR_FILE_PAGES, you'd find it makes much more sense to leave out
> the swap-cache pages, and just add those on where needed.
>
> But you might find a few places where it's hard to decide whether
> the swap-cache pages were ever intended to be included or not, and
> hard to decide if it's safe to change those numbers now or not.
>
> Hugh


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-03-02  2:42                       ` Simon Jeons
@ 2013-03-02  3:08                         ` Hugh Dickins
  2013-03-02  4:06                           ` Simon Jeons
  2013-03-09  1:08                           ` Simon Jeons
  0 siblings, 2 replies; 21+ messages in thread
From: Hugh Dickins @ 2013-03-02  3:08 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Johannes Weiner, dormando, Andrew Morton, Rik van Riel,
	Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, Mel Gorman

On Sat, 2 Mar 2013, Simon Jeons wrote:
> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
> > On Sat, 2 Mar 2013, Simon Jeons wrote:
> > > Why does __add_to_swap_cache() increase NR_FILE_PAGES when a page is
> > > successfully added to the radix tree? This is an anonymous page, not
> > > a file-backed page.
> > Right, that's hard to understand without historical background.
> > 
> > I think the quick answer would be that we used to (and still do) think
> > of file-cache and swap-cache as two halves of page-cache.  And then when
> 
> Should a shmem page be treated as file-cache or swap-cache? It is strange,
> since it consists of anonymous pages and these pages back files.

A shmem page is swap-backed file-cache, and it may get transferred to or
from swap-cache: yes, it's a difficult and confusing case, as I said below.

I would never call it "anonymous", but it is counted in /proc/meminfo's
Active(anon) or Inactive(anon) rather than in (file), because "anon"
there is shorthand for "swap-backed".

> > So you'll find that shmem and swap are counted as file in some places
> > and anon in others, and it's hard to grasp which is where and why,
> > without remembering the history.

Hugh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-03-02  3:08                         ` Hugh Dickins
@ 2013-03-02  4:06                           ` Simon Jeons
  2013-03-09  1:08                           ` Simon Jeons
  1 sibling, 0 replies; 21+ messages in thread
From: Simon Jeons @ 2013-03-02  4:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, dormando, Andrew Morton, Rik van Riel,
	Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, Mel Gorman

On 03/02/2013 11:08 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
>> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
>>> On Sat, 2 Mar 2013, Simon Jeons wrote:
>>>> Why does __add_to_swap_cache() increase NR_FILE_PAGES when a page is
>>>> successfully added to the radix tree? This is an anonymous page, not
>>>> a file-backed page.
>>> Right, that's hard to understand without historical background.
>>>
>>> I think the quick answer would be that we used to (and still do) think
>>> of file-cache and swap-cache as two halves of page-cache.  And then when
>> Should a shmem page be treated as file-cache or swap-cache? It is strange,
>> since it consists of anonymous pages and these pages back files.
> A shmem page is swap-backed file-cache, and it may get transferred to or
> from swap-cache: yes, it's a difficult and confusing case, as I said below.
>
> I would never call it "anonymous", but it is counted in /proc/meminfo's
> Active(anon) or Inactive(anon) rather than in (file), because "anon"
> there is shorthand for "swap-backed".

Oh, I see. Thanks. :)

>
>>> So you'll find that shmem and swap are counted as file in some places
>>> and anon in others, and it's hard to grasp which is where and why,
>>> without remembering the history.
> Hugh


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] add extra free kbytes tunable
  2013-03-02  3:08                         ` Hugh Dickins
  2013-03-02  4:06                           ` Simon Jeons
@ 2013-03-09  1:08                           ` Simon Jeons
  1 sibling, 0 replies; 21+ messages in thread
From: Simon Jeons @ 2013-03-09  1:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, dormando, Andrew Morton, Rik van Riel,
	Seiji Aguchi, Satoru Moriya, Randy Dunlap, linux-kernel,
	linux-mm, lwoodman, Mel Gorman

Hi Hugh,
On 03/02/2013 11:08 AM, Hugh Dickins wrote:
> On Sat, 2 Mar 2013, Simon Jeons wrote:
>> On 03/02/2013 09:42 AM, Hugh Dickins wrote:
>>> On Sat, 2 Mar 2013, Simon Jeons wrote:
>>>> Why does __add_to_swap_cache() increase NR_FILE_PAGES when a page is
>>>> successfully added to the radix tree? This is an anonymous page, not
>>>> a file-backed page.
>>> Right, that's hard to understand without historical background.
>>>
>>> I think the quick answer would be that we used to (and still do) think
>>> of file-cache and swap-cache as two halves of page-cache.  And then when
>> Should a shmem page be treated as file-cache or swap-cache? It is strange,
>> since it consists of anonymous pages and these pages back files.
> A shmem page is swap-backed file-cache, and it may get transferred to or
> from swap-cache: yes, it's a difficult and confusing case, as I said below.
>
> I would never call it "anonymous", but it is counted in /proc/meminfo's
> Active(anon) or Inactive(anon) rather than in (file), because "anon"
> there is shorthand for "swap-backed".

In read_swap_cache_async:

SetPageSwapBacked(new_page);
err = __add_to_swap_cache(new_page, entry);
if (likely(!err)) {
        swap_readpage(new_page);
        return new_page;
}
ClearPageSwapBacked(new_page);  /* only reached on failure */

Why clear the PG_swapbacked flag here?

>
>>> So you'll find that shmem and swap are counted as file in some places
>>> and anon in others, and it's hard to grasp which is where and why,
>>> without remembering the history.
> Hugh


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-03-09  1:08 UTC | newest]

Thread overview: 21+ messages
-- links below jump to the message on this page --
2013-02-12  2:01 extra free kbytes tunable dormando
2013-02-15 22:21 ` Seiji Aguchi
2013-02-15 22:25   ` Rik van Riel
2013-02-17 23:48     ` [PATCH] add " dormando
2013-02-19 23:29       ` Andrew Morton
2013-02-20  5:19         ` dormando
2013-02-22 17:56           ` Johannes Weiner
2013-02-26 10:47             ` Mel Gorman
2013-02-26 15:13               ` Johannes Weiner
2013-02-26 16:25                 ` Mel Gorman
2013-03-01  9:22             ` Simon Jeons
2013-03-01  9:31               ` Simon Jeons
2013-03-01 22:33                 ` Hugh Dickins
2013-03-02  0:10                   ` Simon Jeons
2013-03-02  1:42                     ` Hugh Dickins
2013-03-02  2:42                       ` Simon Jeons
2013-03-02  3:08                         ` Hugh Dickins
2013-03-02  4:06                           ` Simon Jeons
2013-03-09  1:08                           ` Simon Jeons
2013-02-17 23:54     ` dormando
2013-02-15 22:49   ` Satoru Moriya
